
Posted to user@mesos.apache.org by Carlos Torres <ca...@RACKSPACE.COM> on 2015/07/01 20:17:52 UTC

[Question] Distributed Load Testing with Mesos and Gatling

Hi all,

Over the past few weeks, I've been thinking about leveraging Mesos to schedule distributed load tests. The Kubernetes community recently shared one way to accomplish this (here: https://cloud.google.com/solutions/distributed-load-testing-using-kubernetes).

One problem with this approach, at least for me, is that the load testing tool itself needs to coordinate the distributed scenario and combine the data. If it doesn't, the load clients will trigger at different times, and the data then has to be aggregated later by the user, an external batch job, or a script. This is not a problem for load generators like Tsung or Locust, since they already provide a distributed model and coordinate the distributed tasks, but it is more complicated for Gatling, which does not. To me, the approach the Kubernetes team suggests is really a hack that uses the 'ReplicationController' to spawn multiple replicas, which could just as easily be achieved with Marathon (or Kubernetes on Mesos).

I was thinking of building a Mesos framework that would take the input (a load simulation file) and schedule jobs across the cluster using Gatling, perhaps with dedicated resources to minimize variance. A Mesos framework would be able to provide a UI/API to accept input jobs and report the status of multiple jobs. It could also provide a way to sync/orchestrate the simulation and, finally, a way to aggregate the simulation data in one place and serve the generated HTML report.

Boiled down to its primitive parts, it would spin up multiple Gatling (Java) processes across the cluster, use something like a barrier (not sure what to use here) to wait for all processes to be ready to execute, and finally copy and rename the generated simulation logs from each Gatling process to one node/place, where a single Gatling process aggregates them and compiles the HTML report.
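
To make the copy-and-rename step concrete, here is a minimal sketch in Python. It assumes each Gatling process leaves a file named simulation.log in its own results directory, as Gatling's scaling-out notes describe. The function name and the simulation-<n>.log naming scheme are hypothetical, and how the files actually reach a single node (shared storage, scp, the Mesos sandbox) is left open.

```python
import shutil
from pathlib import Path


def collect_simulation_logs(worker_dirs, dest_dir):
    """Copy each worker's simulation.log into dest_dir under a unique name.

    worker_dirs: iterable of directories, one per Gatling process (assumed layout).
    Returns the list of files written, in worker order.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    collected = []
    for index, worker_dir in enumerate(worker_dirs):
        source = Path(worker_dir) / "simulation.log"
        if not source.is_file():
            # This worker produced no log (crashed or never started); skip it.
            continue
        # Rename on copy so logs from different workers do not overwrite each other.
        target = dest / f"simulation-{index}.log"
        shutil.copyfile(source, target)
        collected.append(target)
    return collected
```

Once the logs sit in one folder, a single Gatling process could be pointed at it in reports-only mode (the -ro option mentioned in the scaling-out notes) to produce the HTML report.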

First of all, is there anything in the Mesos community that does this already? If not, do you think this is feasible with a Mesos framework, and would you recommend this approach? Does Mesos offer any barrier-like feature to coordinate jobs, and can I somehow move files to a single node to be processed?

Finally, I've never written a non-trivial Mesos framework. How should I get started, and where can I find more documentation? I'm looking for best practices, pitfalls, etc.

Thank you for your time,
Carlos

Re: [Question] Distributed Load Testing with Mesos and Gatling

Posted by haosdent <ha...@gmail.com>.

Your idea sounds cool. As far as I know, Mesos doesn't have a distributed
test framework yet. Maybe other users or developers know of one.

If you want to learn how to write a framework, I think you could start
here: http://mesos.apache.org/documentation/latest/
The Mesos repo also has some examples that show how to write a framework:
https://github.com/apache/mesos/tree/master/src/examples



-- 
Best Regards,
Haosdent Huang

Re: COMMERCIAL:Re: [Question] Distributed Load Testing with Mesos and Gatling

Posted by Carlos Torres <ca...@RACKSPACE.COM>.

Yes, I agree. I think starting out with the scale-out approach, while naive, will be a good starting point.


I actually have this automated with Jenkins and a bunch of dedicated slaves, using the Workflow plugin. It only works kind of OK, since I can't really control their execution.


If you are interested, here's my workflow script for Jenkins: https://github.com/meteorfox/gatling-workflow/blob/master/gatling_flow.groovy


-- Carlos

________________________________
From: Joao Ribeiro <jo...@gmail.com>
Sent: Thursday, July 2, 2015 11:33 AM
To: user@mesos.apache.org
Subject: COMMERCIAL:Re: [Question] Distributed Load Testing with Mesos and Gatling

This sounds like a really cool project.

I am still a very green user of mesos and never used gatling at all but a quick search took me to http://gatling.io/docs/2.1.6/cookbook/scaling_out.html

With this it shouldn't be too difficult to create a master/slave or scheduler/executors approach where the master launches several slaves to do the work, waits for them to finish, collects the logs, and generates the report.
For better synchronisation, you could make the slaves register with ZooKeeper while the master waits for all slaves to be up and then triggers a “start test” command on all slaves simultaneously.
You could then easily time out if it takes too long to get all the slaves up, or use other, more fault-tolerant strategies, e.g. run with the slaves you got, or bump the load on each slave that is up to make up for the missing ones.

It might be a naive approach but would be a starting point in my opinion.

On 02 Jul 2015, at 18:00, CCAAT <cc...@tampabay.rr.com> wrote:

On 07/01/2015 01:17 PM, Carlos Torres wrote:
Hi all,

In the past weeks, I've been thinking in leveraging Mesos to schedule distributed load tests.

An excellent idea.

One problem, at least for me, with this approach is that the load testing tool needs to coordinate
the distributed scenario, and combine the data, if it doesn't, then the load clients will trigger at
different times, and then later an aggregation step of the data would be handled by the user, or
some external batch job, or script. This is not a problem for load generators like Tsung, or Locust,
but could be a little more complicated for Gatling, since they already provide a distributed model,
and coordinate the distributed tasks, and Gatling does not. To me, the approach the Kubernetes team
suggests is really a hack using the 'Replication Controller' to spawn multiple replicas, which could
be easily achieved using the same approach with Marathon (or Kubernetes on Mesos).

I was thinking of building a Mesos framework, that would take the input, or load simulation file,
and would schedule jobs across the cluster (perhaps with dedicated resources too minimize variance)
using Gatling.  A Mesos framework will be able to provide a UI/API to take the input jobs, and
report status of multiple jobs. It can also provide a way to sync/orchestrate the simulation, and
finally provide a way to aggregate the simulation data in one place, and serve the generated HTML
report.

Boiled down to its primitive parts, it would spin multiple Gatling (java) processes across the
cluster, use something like a barrier (not sure what to use here) to wait for all processes to
be ready to execute, and finally copy, and rename the generated simulations logs from each
Gatling process to one node/place, that is finally aggregated and compiled to HTML report by a
single Gatling process.

First of all, is there anything in the Mesos community that does this already? If not, do you
think this is feasible to accomplish with a Mesos framework, and would you recommend to go with this
approach? Does Mesos offers a barrier-like features to coordinate jobs, and can I somehow move
files to a single node to be processed?

This all sounds workable, but I do not have all the experience necessary to qualify your ideas. What I would suggest is a solution that lends itself to testing similarly configured cloud/cluster offerings, so that we, the cloud/cluster community, have a way to test and evaluate new releases, substitute component codes, forks, and even competitive offerings. A ubiquitous and robust testing semantic based on your ideas does seem to be an overwhelmingly positive idea, imho. As such, some organizational structure that allows results to be maintained and quickly compared to other 'test-runs' would greatly encourage usage.
Hopefully 'Gatling' and such have many, if not most, of the features needed to automate the evaluation of results.


Finally, I've never written a non-trivial Mesos framework, how should I go about, or find more
documentation, to get started? I'm looking for best practices, pitfalls, etc.


Thank you for your time,
Carlos

hth,
James

Re: [Question] Distributed Load Testing with Mesos and Gatling

Posted by Ben Whitehead <be...@mesosphere.io>.

Hi Carlos,

This sounds like a great idea and something that many people would like to
be able to leverage.

Mesosphere has a pretty good starting example that could be used to
bootstrap the process of authoring your Gatling Framework.  The project is
called Rendler[1] and has several different language implementations.
Rendler is essentially a distributed web page renderer.

Hope this helps,
Ben Whitehead


[1] https://github.com/mesosphere/RENDLER/tree/master/java


Re: [Question] Distributed Load Testing with Mesos and Gatling

Posted by Joao Ribeiro <jo...@gmail.com>.

This sounds like a really cool project.

I am still a very green user of Mesos and have never used Gatling at all, but a quick search took me to http://gatling.io/docs/2.1.6/cookbook/scaling_out.html

With this it shouldn't be too difficult to create a master/slave or scheduler/executors approach where the master launches several slaves to do the work, waits for them to finish, collects the logs, and generates the report.
For better synchronisation, you could make the slaves register with ZooKeeper while the master waits for all slaves to be up and then triggers a “start test” command on all slaves simultaneously.
You could then easily time out if it takes too long to get all the slaves up, or use other, more fault-tolerant strategies, e.g. run with the slaves you got, or bump the load on each slave that is up to make up for the missing ones.
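
As a rough illustration of the wait-then-start idea with a timeout, here is a small Python sketch. It is an in-process stand-in only: a threading.Condition plays the role that ZooKeeper registration would play across machines, and every name in it (ReadyBarrier, register, wait_for_workers, split_load) is made up for the example.

```python
import threading


class ReadyBarrier:
    """In-process stand-in for workers registering in a shared store such as ZooKeeper."""

    def __init__(self, expected):
        self.expected = expected          # number of workers the coordinator hopes for
        self._ready = set()
        self._cond = threading.Condition()

    def register(self, worker_id):
        """Called by a worker once it is ready to start the test."""
        with self._cond:
            self._ready.add(worker_id)
            self._cond.notify_all()

    def wait_for_workers(self, timeout):
        """Block until all expected workers registered or the timeout elapses.

        Returns the set of workers that are ready at that moment, which may be
        smaller than expected if the timeout was hit.
        """
        with self._cond:
            self._cond.wait_for(lambda: len(self._ready) >= self.expected, timeout=timeout)
            return set(self._ready)


def split_load(total_users, ready_workers):
    """Spread total_users over the workers that showed up, as evenly as possible."""
    workers = sorted(ready_workers)
    if not workers:
        return {}
    base, extra = divmod(total_users, len(workers))
    # The first `extra` workers take one additional user so the total is preserved.
    return {w: base + (1 if i < extra else 0) for i, w in enumerate(workers)}
```

If only two of three workers come up before the timeout, split_load(100, ready) gives each of the two 50 users instead of the planned share, which is the "bump the load to make up for missing slaves" strategy. Sending the actual "start test" command to each worker is not shown.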

It might be a naive approach but would be a starting point in my opinion.

> On 02 Jul 2015, at 18:00, CCAAT <cc...@tampabay.rr.com> wrote:
> 
> On 07/01/2015 01:17 PM, Carlos Torres wrote:
>> Hi all,
>> 
>> In the past weeks, I've been thinking in leveraging Mesos to schedule distributed load tests.
> 
> An excellent idea.
>> 
>> One problem, at least for me, with this approach is that the load testing tool needs to coordinate
>> the distributed scenario, and combine the data, if it doesn't, then the load clients will trigger at
>> different times, and then later an aggregation step of the data would be handled by the user, or
>> some external batch job, or script. This is not a problem for load generators like Tsung, or Locust,
>> but could be a little more complicated for Gatling, since they already provide a distributed model,
>> and coordinate the distributed tasks, and Gatling does not. To me, the approach the Kubernetes team
>> suggests is really a hack using the 'Replication Controller' to spawn multiple replicas, which could
>> be easily achieved using the same approach with Marathon (or Kubernetes on Mesos).
> 
>> I was thinking of building a Mesos framework, that would take the input, or load simulation file,
>> and would schedule jobs across the cluster (perhaps with dedicated resources too minimize variance)
>> using Gatling.  A Mesos framework will be able to provide a UI/API to take the input jobs, and
>> report status of multiple jobs. It can also provide a way to sync/orchestrate the simulation, and
>> finally provide a way to aggregate the simulation data in one place, and serve the generated HTML
>> report.
> 
>> Boiled down to its primitive parts, it would spin multiple Gatling (java) processes across the
>> cluster, use something like a barrier (not sure what to use here) to wait for all processes to
>> be ready to execute, and finally copy, and rename the generated simulations logs from each
>> Gatling process to one node/place, that is finally aggregated and compiled to HTML report by a
>> single Gatling process.
> 
>> First of all, is there anything in the Mesos community that does this already? If not, do you
>> think this is feasible to accomplish with a Mesos framework, and would you recommend to go with this
>> approach? Does Mesos offers a barrier-like features to coordinate jobs, and can I somehow move
>> files to a single node to be processed?
> 
> This all sounds workable, but, I do not have all the experiences necessary to qualify your ideas. What I would suggest is a solution that lends itself to testing similarly configured cloud/cluster offerings, so we the cloud/cluster community has a way to test and evaluate   new releases, substitute component codes, forks and even competitive offerings. A ubiquitous  and robust testing semantic based on your ideas does seem to be an overwhelmingly positive idea, imho. As such some organizational structures to allow results to be maintained and quickly compared to other 'test-runs' would greatly encourage usage.
> Hopefully 'Gatling' and such have many, if not most of the features needed to automate the evaluation of results.
> 
> 
>> Finally, I've never written a non-trivial Mesos framework, how should I go about, or find more
>> documentation, to get started? I'm looking for best practices, pitfalls, etc.
>> 
>> 
>> Thank you for your time,
>> Carlos
> 
> hth,
> James
>

Re: COMMERCIAL:Re: COMMERCIAL:Re: [Question] Distributed Load Testing with Mesos and Gatling

Posted by Carlos Torres <ca...@RACKSPACE.COM>.

________________________________________
From: CCAAT <cc...@tampabay.rr.com>
Sent: Thursday, July 2, 2015 2:03 PM
To: user@mesos.apache.org
Cc: ccaat@tampabay.rr.com
Subject: COMMERCIAL:Re: COMMERCIAL:Re: [Question] Distributed Load Testing with Mesos and Gatling

On 07/02/2015 12:10 PM, Carlos Torres wrote:
> ________________________________________
> From: CCAAT <cc...@tampabay.rr.com>
> Sent: Thursday, July 2, 2015 12:00 PM
> To: user@mesos.apache.org
> Cc: ccaat@tampabay.rr.com
> Subject: COMMERCIAL:Re: [Question] Distributed Load Testing with Mesos and Gatling
>
> On 07/01/2015 01:17 PM, Carlos Torres wrote:
>> Hi all,
>>
>> In the past weeks, I've been thinking in leveraging Mesos to schedule distributed load tests.
>
> An excellent idea.
>>
>> One problem, at least for me, with this approach is that the load testing tool needs to coordinate
>> the distributed scenario, and combine the data, if it doesn't, then the load clients will trigger at
>> different times, and then later an aggregation step of the data would be handled by the user, or
>> some external batch job, or script. This is not a problem for load generators like Tsung, or Locust,
>> but could be a little more complicated for Gatling, since they already provide a distributed model,
>> and coordinate the distributed tasks, and Gatling does not. To me, the approach the Kubernetes team
>> suggests is really a hack using the 'Replication Controller' to spawn multiple replicas, which could
>> be easily achieved using the same approach with Marathon (or Kubernetes on Mesos).
>
>> I was thinking of building a Mesos framework, that would take the input, or load simulation file,
>> and would schedule jobs across the cluster (perhaps with dedicated resources too minimize variance)
>> using Gatling.  A Mesos framework will be able to provide a UI/API to take the input jobs, and
>> report status of multiple jobs. It can also provide a way to sync/orchestrate the simulation, and
>> finally provide a way to aggregate the simulation data in one place, and serve the generated HTML
>> report.
>
>> Boiled down to its primitive parts, it would spin multiple Gatling (java) processes across the
>> cluster, use something like a barrier (not sure what to use here) to wait for all processes to
>> be ready to execute, and finally copy, and rename the generated simulations logs from each
>> Gatling process to one node/place, that is finally aggregated and compiled to HTML report by a
>> single Gatling process.
>
>> First of all, is there anything in the Mesos community that does this already? If not, do you
>> think this is feasible to accomplish with a Mesos framework, and would you recommend to go with this
>> approach? Does Mesos offers a barrier-like features to coordinate jobs, and can I somehow move
>> files to a single node to be processed?
>
> This all sounds workable, but, I do not have all the experiences
> necessary to qualify your ideas. What I would suggest is a solution that
> lends itself to testing similarly configured cloud/cluster offerings, so
> we the cloud/cluster community has a way to test and evaluate   new
> releases, substitute component codes, forks and even competitive
> offerings. A ubiquitous  and robust testing semantic based on your ideas
> does seem to be an overwhelmingly positive idea, imho. As such some
> organizational structures to allow results to be maintained and quickly
> compared to other 'test-runs' would greatly encourage usage.
> Hopefully 'Gatling' and such have many, if not most of the features
> needed to automate the evaluation of results.
>
>
>> Finally, I've never written a non-trivial Mesos framework, how should I go about, or find more
>> documentation, to get started? I'm looking for best practices, pitfalls, etc.
>>
>>
>> Thank you for your time,
>> Carlos
>
> hth,
> James
>
>
> Thanks for your feedback.
>
> I like your idea about having the ability to swap out the different components (e.g. load generators), and perhaps even providing an abstraction over the charting and data reporting mechanisms.
>
> I'll probably start with the simplest way possible, though: having the framework deploy Gatling across the cluster in a scale-out fashion and retrieve each instance's results. Once I get that working, I'll start experimenting with abstracting out certain functionality.
>
> I know Twitter has a distributed load generator called Iago that apparently works on Mesos. It'd be awesome if any of its contributors chimed in and shared what worked great, what worked OK, and what didn't.
>
>
> The few things I'm concerned about in terms of implementing such a framework in Mesos are:
>
> * Noisy neighbors, or resource isolation.
>      - Rationale: It can introduce noise into the results if a load generator competes for shared resources (e.g. network) with other tasks.
>
> * Coordination of execution
>      - Rationale: I need the ability to control the execution of groups of related tasks. Say User A submits a simulation that creates 5 load clients (tasks?), and right after that User B submits a different simulation that creates 10 load clients. Ideally, all of User A's load clients should be on independent nodes and should not share slaves with User B's load clients. If not enough slaves are available on the cluster, User B's simulation queues until they are. There might be enough "resources" to create and configure some of User B's load clients, but the simulation should block and wait until all of its load clients are up and ready.
>
> * Storage
>      - Rationale: Some load generators might need to place certain files belonging to a particular simulation in a shared storage location that is independent of other simulations. These files could be common configuration and/or the simulation logs that might need post-processing.
>
> Thanks
> Carlos Torres
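
The "coordination of execution" concern quoted above amounts to all-or-nothing placement with a queue. A toy Python sketch of that policy follows. All names are hypothetical, and it deliberately ignores how a real Mesos framework learns about nodes (resource offers), so treat it as a statement of the policy rather than a scheduler implementation.

```python
from collections import deque


class GangScheduler:
    """All-or-nothing node allocation with a FIFO queue (hypothetical names throughout)."""

    def __init__(self, nodes):
        self.free = set(nodes)   # nodes not held by any simulation
        self.running = {}        # simulation id -> set of nodes it holds exclusively
        self.pending = deque()   # queued (simulation id, number of load clients)

    def submit(self, sim_id, clients):
        """Queue a simulation needing `clients` dedicated nodes, then try to start it."""
        self.pending.append((sim_id, clients))
        self._dispatch()

    def finish(self, sim_id):
        """Release a finished simulation's nodes and start whatever now fits."""
        self.free |= self.running.pop(sim_id)
        self._dispatch()

    def _dispatch(self):
        # Strict FIFO: the head of the queue waits until ALL of its clients fit,
        # and nothing behind it is allowed to jump ahead.
        while self.pending and self.pending[0][1] <= len(self.free):
            sim_id, clients = self.pending.popleft()
            nodes = set(sorted(self.free)[:clients])
            self.free -= nodes
            self.running[sim_id] = nodes
```

With 12 nodes, submitting "A" for 5 clients and then "B" for 10 starts A immediately and leaves B queued; B starts only after finish("A") frees enough nodes, and the two never share a node. Strict FIFO keeps it simple at the cost of head-of-line blocking.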

I use 'iotop' on individual systems. It would be an interesting project
to find something like iotop to monitor the cluster/cloud. iotop +
Gatling does seem interesting. Maybe an iotop running on each system and
reporting to an aggregation mechanism that is more 'coarse grained'? I
just have not researched what exists already, what folks are willing to
share, or what is actually feasible. The good folks at AMPLab may have
something useful that has not been pushed out for consumption yet [1]. Do
not be surprised to discover that you are treading on what some consider
to be 'strategic' information: these metrics you seek to commoditize.

Not to drop names here, but I do think that Rackspace (RS) is far ahead
of their competitors in this area. They assured me that they could
robustly profile any complex codes (including big science) to recommend
the 'sweet spot' on price/performance/resources (including memory-bound
Big Science problems) for any clustering codes I choose to run in their
datacenters, including Mesos. They portray extreme confidence that the
RS guys have already solved these problems. That confidence is curious
to me, though. Sorry for the digression here, but 'benchmarking'
cluster/cloud components is long overdue for public vetting, imho. Most
make excuses for why it's too hard (the 'dog ate my homework, daddy'
type of hooey). Your ideas, robustly brought to conclusion, will separate
the men from the boys.

I'm working on building out Mesos clusters with CephFS on top of
btrfs-based (minimized) systems. Surely both btrfs and CephFS offer
mechanisms to help you characterize and monitor IO that are useful to
your efforts and ideas. This, however, is a 'work in progress' atm, as
many components are just not mature enough yet.

Good hunting!
James

[1] https://amplab.cs.berkeley.edu/      https://github.com/amplab

James,

About iotop alternatives, take a look at Performance Co-Pilot (PCP) from Red Hat [1]. It is
a system performance analysis and distributed collection framework that can very efficiently
collect metrics from your whole cluster (pull-based) on demand, close to real time (< 1 second).
It supports collecting metrics from various subsystems, including containers (i.e. cgroups) [2], perf_events,
and definitely block I/O metrics.

Netflix has this nifty dashboard called Vector [3], to which I've also contributed a few patches, that you can use to pull
metrics collected by a PCP daemon and visualize them in the browser.

I think something like PCP can be the beginning of a datacenter-wide profiler framework, a la Google [4].

Regarding benchmarking cloud providers, *shameless plug*, check out PerfKitBenchmarker [5]. I'm a contributor to the
project (the Rackspace support is still in review). It lets you run a lot of benchmarks very easily on most
popular cloud providers, and there's a companion project, called PerfKitExplorer, that lets you visualize the data.

As for whether it's strategic or not, I'm being independent here and soliciting information for my own purposes, not
Rackspace's. My goal is simple: to build a distributed load testing framework to generate load against different
web apps and services, initially supporting Gatling as the load generator.

[1] http://www.pcp.io/
[2] http://www.pcp.io/docs/lab.containers.html
[3] https://github.com/Netflix/vector/wiki/Getting-Started
[4] http://research.google.com/pubs/pub36575.html
[5] https://github.com/GoogleCloudPlatform/PerfKitBenchmarker

-- Carlos Torres

Re: COMMERCIAL:Re: [Question] Distributed Load Testing with Mesos and Gatling

Posted by CCAAT <cc...@tampabay.rr.com>.

On 07/02/2015 12:10 PM, Carlos Torres wrote:
> ________________________________________
> From: CCAAT <cc...@tampabay.rr.com>
> Sent: Thursday, July 2, 2015 12:00 PM
> To: user@mesos.apache.org
> Cc: ccaat@tampabay.rr.com
> Subject: COMMERCIAL:Re: [Question] Distributed Load Testing with Mesos and Gatling
>
> Thanks for your feedback.
>
> I like your idea about having the ability to swap out the different components (e.g. load generators) and perhaps even providing an abstraction over the charting and data reporting mechanisms.
>
> I'll probably start with the simplest way possible, though: having the framework deploy Gatling across the cluster, in a scale-out fashion, and retrieve each instance's results. Once I get that working, I'll start experimenting with abstracting out certain functionality.
>
> I know Twitter has a distributed load generator, called Iago, that apparently works on Mesos. It'd be awesome if any of its contributors could chime in and share what worked great, what worked well, and what did not.
>
>
> The few things I'm concerned about in implementing such a framework on Mesos are:
>
> * Noisy neighbors, or resource isolation.
>      - Rationale: It can introduce noise into the results if a load generator competes for shared resources (e.g. network) with other tasks.
>
> * Coordination of execution
>      - Rationale: Need the ability to control execution of groups of related tasks. User A submits a simulation that might create 5 load clients (tasks?); right after that, User B submits a different simulation that creates 10 load clients. Ideally, all of User A's load clients should be on independent nodes and should not share slaves with User B's load clients. If not enough slaves are available on the cluster, User B's simulation queues until slaves are available. There might be enough "resources" to create and configure some of User B's load clients, but the simulation will block and wait until all of its load clients are up and ready.
>
> * Storage
>      - Rationale: Some load generators might need to place certain files belonging to a particular simulation in a shared storage location, that is independent from other simulations. These files could be common configuration, and/or the simulation logs that might need post-processing.
>
> Thanks
> Carlos Torres


I use 'iotop' on individual systems. It would be an interesting project
to find something like iotop to monitor the cluster/cloud. iotop +
Gatling does seem interesting. Maybe an iotop running on each system and
reporting to an aggregation mechanism that is more 'coarse grained'? I
just have not researched what exists already, what folks are willing to
share, or what is actually feasible. The good folks at AMPLab may have
something useful that is not yet pushed out for consumption [1]. Do
not be surprised to discover that you are treading on what some consider
to be 'strategic' information: these metrics you seek to commoditize.
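
A minimal sketch of the 'coarse grained' aggregation idea, assuming each node already emits samples in a simple tuple form. The sample layout, the function name `aggregate_io`, and the 60-second bucket are all hypothetical; none of this is iotop's actual output format or any existing cluster tool's API.

```python
from collections import defaultdict
from statistics import mean


def aggregate_io(samples, bucket_seconds=60):
    """Roll per-node I/O samples up into coarse, cluster-wide time buckets.

    Each sample is a hypothetical (node, unix_time, read_kb_s, write_kb_s)
    tuple, e.g. parsed from whatever a per-node collector reports.
    """
    # bucket start time -> node -> list of (read, write) samples
    per_node = defaultdict(lambda: defaultdict(list))
    for node, ts, read_kb_s, write_kb_s in samples:
        bucket = int(ts // bucket_seconds) * bucket_seconds
        per_node[bucket][node].append((read_kb_s, write_kb_s))

    summary = {}
    for bucket in sorted(per_node):
        # Average each node over the bucket, then sum across nodes to get
        # an approximate cluster-wide throughput for that bucket.
        node_means = [
            (mean(r for r, _ in rows), mean(w for _, w in rows))
            for rows in per_node[bucket].values()
        ]
        summary[bucket] = {
            "nodes": len(node_means),
            "read_kb_s": sum(r for r, _ in node_means),
            "write_kb_s": sum(w for _, w in node_means),
        }
    return summary
```

Averaging per node first keeps a node that reports more often from dominating the cluster total.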


Not to drop names here, but I do think that Rackspace (RS) is far ahead
of their competitors in this area. They assured me that they could
robustly profile any complex (including big science) codes to recommend
the 'sweet spot' on price/performance/resources (including memory-bound
Big Science problems) for any clustering codes I choose to run in their
datacenters, including Mesos. They portray extreme confidence that the
RS guys have already solved these problems. That confidence is curious
to me, though. Sorry for the digression here, but 'benchmarking'
cluster/cloud components is long overdue for public vetting, imho. Most
make excuses for why it's too hard (the 'dog ate my homework' type of
hooey). Your ideas, robustly brought to conclusion, will separate the
men from the boys.



I'm working on building out Mesos clusters with CephFS on top of
btrfs-based (minimized) systems. Surely both btrfs and CephFS offer
mechanisms to help you characterize and monitor I/O that are useful to
your efforts and ideas. This, however, is a 'work in progress' atm, as
many components are just not mature enough yet.


Good hunting!
James


[1] https://amplab.cs.berkeley.edu/      https://github.com/amplab

Re: COMMERCIAL:Re: [Question] Distributed Load Testing with Mesos and Gatling

Posted by Carlos Torres <ca...@RACKSPACE.COM>.

________________________________________
From: CCAAT <cc...@tampabay.rr.com>
Sent: Thursday, July 2, 2015 12:00 PM
To: user@mesos.apache.org
Cc: ccaat@tampabay.rr.com
Subject: COMMERCIAL:Re: [Question] Distributed Load Testing with Mesos and Gatling

Thanks for your feedback.

I like your idea about having the ability to swap out the different components (e.g. load generators) and perhaps even providing an abstraction over the charting and data reporting mechanisms.

I'll probably start with the simplest way possible, though: having the framework deploy Gatling across the cluster, in a scale-out fashion, and retrieve each instance's results. Once I get that working, I'll start experimenting with abstracting out certain functionality.

I know Twitter has a distributed load generator, called Iago, that apparently works on Mesos. It'd be awesome if any of its contributors could chime in and share what worked great, what worked well, and what did not.


The few things I'm concerned about in implementing such a framework on Mesos are:

* Noisy neighbors, or resource isolation.
    - Rationale: It can introduce noise into the results if a load generator competes for shared resources (e.g. network) with other tasks.

* Coordination of execution
    - Rationale: Need the ability to control execution of groups of related tasks. User A submits a simulation that might create 5 load clients (tasks?); right after that, User B submits a different simulation that creates 10 load clients. Ideally, all of User A's load clients should be on independent nodes and should not share slaves with User B's load clients. If not enough slaves are available on the cluster, User B's simulation queues until slaves are available. There might be enough "resources" to create and configure some of User B's load clients, but the simulation will block and wait until all of its load clients are up and ready.

* Storage 
    - Rationale: Some load generators might need to place certain files belonging to a particular simulation in a shared storage location, that is independent from other simulations. These files could be common configuration, and/or the simulation logs that might need post-processing.
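
The "coordination of execution" concern above amounts to all-or-nothing (gang) admission. Here is a rough sketch of that queueing rule, assuming one load client per node and strict first-come-first-served ordering. The class `GangAdmission` and its `submit`/`finish` methods are hypothetical names for illustration only; they are not part of the Mesos API, and a real scheduler would make these decisions from resource offers.

```python
from collections import deque


class GangAdmission:
    """All-or-nothing placement: a simulation starts only when every one of
    its load clients can get its own node, shared with no other simulation."""

    def __init__(self, nodes):
        self.free = set(nodes)   # nodes not used by any running simulation
        self.queue = deque()     # waiting (simulation_id, clients_needed), FIFO
        self.placed = {}         # simulation_id -> set of nodes it occupies

    def submit(self, sim_id, clients):
        """Queue a simulation; return the ids that could start as a result."""
        self.queue.append((sim_id, clients))
        return self._admit()

    def finish(self, sim_id):
        """Release a finished simulation's nodes; return newly started ids."""
        self.free |= self.placed.pop(sim_id)
        return self._admit()

    def _admit(self):
        started = []
        # Strict FIFO: the head of the queue blocks everything behind it
        # until enough free nodes exist for all of its clients at once.
        while self.queue and self.queue[0][1] <= len(self.free):
            sim_id, clients = self.queue.popleft()
            self.placed[sim_id] = {self.free.pop() for _ in range(clients)}
            started.append(sim_id)
        return started
```

With 12 nodes, User A's 5-client simulation starts immediately, while User B's 10-client simulation waits in the queue until A finishes, even though 7 nodes are free in the meantime. Note that a request larger than the whole cluster would block the queue forever in this sketch.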

Thanks
Carlos Torres

Re: [Question] Distributed Load Testing with Mesos and Gatling

Posted by CCAAT <cc...@tampabay.rr.com>.

On 07/01/2015 01:17 PM, Carlos Torres wrote:
> Hi all,
>
> In the past weeks, I've been thinking of leveraging Mesos to schedule distributed load tests.

An excellent idea.
>
> One problem, at least for me, with this approach is that the load testing tool needs to coordinate
> the distributed scenario and combine the data. If it doesn't, the load clients will trigger at
> different times, and the aggregation of the data has to be handled later by the user, or by
> some external batch job or script. This is not a problem for load generators like Tsung or Locust,
> since they already provide a distributed model and coordinate the distributed tasks, but it could
> be a little more complicated for Gatling, which does not. To me, the approach the Kubernetes team
> suggests is really a hack using the 'Replication Controller' to spawn multiple replicas, which could
> be easily achieved using the same approach with Marathon (or Kubernetes on Mesos).

> I was thinking of building a Mesos framework, that would take the input, or load simulation file,
> and would schedule jobs across the cluster (perhaps with dedicated resources to minimize variance)
> using Gatling.  A Mesos framework will be able to provide a UI/API to take the input jobs, and
> report status of multiple jobs. It can also provide a way to sync/orchestrate the simulation, and
> finally provide a way to aggregate the simulation data in one place, and serve the generated HTML
> report.

> Boiled down to its primitive parts, it would spin multiple Gatling (java) processes across the
> cluster, use something like a barrier (not sure what to use here) to wait for all processes to
> be ready to execute, and finally copy and rename the generated simulation logs from each
> Gatling process to one node/place, that is finally aggregated and compiled to HTML report by a
> single Gatling process.

> First of all, is there anything in the Mesos community that does this already? If not, do you
> think this is feasible to accomplish with a Mesos framework, and would you recommend going with this
> approach? Does Mesos offer barrier-like features to coordinate jobs, and can I somehow move
> files to a single node to be processed?

This all sounds workable, but I do not have all the experience
necessary to qualify your ideas. What I would suggest is a solution that
lends itself to testing similarly configured cloud/cluster offerings, so
that we, the cloud/cluster community, have a way to test and evaluate new
releases, substitute component codes, forks and even competitive
offerings. A ubiquitous and robust testing semantic based on your ideas
does seem to be an overwhelmingly positive idea, imho. As such, some
organizational structure that allows results to be maintained and quickly
compared to other 'test-runs' would greatly encourage usage.
Hopefully 'Gatling' and such have many, if not most, of the features
needed to automate the evaluation of results.
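
On the barrier question quoted above, here is a single-machine sketch of the intended behaviour: no load client starts until all of them are ready. It uses Python's `threading.Barrier`, with threads standing in for the Gatling processes. That is an assumption for illustration only; across real cluster nodes the same pattern would need a distributed coordinator (ZooKeeper, for instance, documents barrier recipes). The function name `run_clients` and the startup delays are hypothetical.

```python
import threading
import time


def run_clients(n_clients, startup_delays):
    """Start n_clients workers that become ready at different times, and
    return the moment each one actually began its (simulated) load run."""
    barrier = threading.Barrier(n_clients)
    start_times = [None] * n_clients

    def client(i):
        time.sleep(startup_delays[i])         # stand-in for uneven startup
        barrier.wait()                        # block until every client is ready
        start_times[i] = time.monotonic()     # the load run would begin here

    threads = [threading.Thread(target=client, args=(i,)) for i in range(n_clients)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return start_times
```

Even though the clients become ready at different times, they all pass the barrier together once the slowest one arrives, so their start times end up close to each other.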


> Finally, I've never written a non-trivial Mesos framework. How should I go about it, and where can
> I find more documentation to get started? I'm looking for best practices, pitfalls, etc.
>
>
> Thank you for your time,
> Carlos

hth,
James