Posted to user@mesos.apache.org by vincent gromakowski <vi...@gmail.com> on 2015/12/30 18:56:37 UTC

mesos, big data and service discovery

I am currently using Mesos as a big data backend for Spark, Cassandra,
Kafka and Elasticsearch, but I cannot find a good overall design for
service discovery. Let me explain.
Generally, service discovery is handled by an HAProxy instance on each
node, which redirects traffic from service ports to the actual assigned
network ports. I am not using it yet because the cluster is quite small and
I don't need to deploy many services, but I am thinking about a future
design that will allow me to scale.
The problem with HAProxy handling all network traffic is that I am afraid
it will break data locality, which is so important for performance in the
big data world.
For example, when Spark connects to Elasticsearch, it discovers the
Elasticsearch topology and tries to launch tasks next to the Elasticsearch
shards. If HAProxy intercepts the network flows, what would the result be?
Will HAProxy masquerade the Elasticsearch IPs/ports? Is it the same for
Kafka and Cassandra?

I assume it depends on each connector, but it is very hard to find any
information. Thanks for your help if you have any experience with this.
Regards

Re: mesos, big data and service discovery

Posted by John Omernik <jo...@omernik.com>.

So, no, I don't have Elasticsearch behind HAProxy. For each instance of
Elasticsearch I have, I specify the ports to use (a range). Now, Spark
can't do service discovery of Elasticsearch the way Elasticsearch itself
can, so that could be a challenge. That said, each ES node can be connected
to directly, so perhaps you could register each node and use Mesos-DNS with
static ports. Another option is to have your Spark app do a little bit of
service discovery of its own. Perhaps specify a range of ports that
Elasticsearch COULD be running on and some nodes it could be running on,
and have Spark guess and check. Since it's your app, you can put in
whatever logic you want. I guess what I am saying is that there is nothing
built into Spark or ES to do what you want, but between Mesos-DNS and your
ability to customize code in Spark, there should be a clever approach that
fits your environment.
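
The "guess and check" idea above could look roughly like the sketch below. It
is only an illustration, not something from Spark or Elasticsearch: the
function names (tcp_is_open, find_endpoints), the host names and the port
range are all assumptions, and an open TCP port does not prove that
Elasticsearch is the process listening on it.

```python
import socket
from itertools import product
from typing import Callable, Iterable, List, Tuple


def tcp_is_open(host: str, port: int, timeout: float = 0.5) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unresolvable host names.
        return False


def find_endpoints(
    hosts: Iterable[str],
    ports: Iterable[int],
    is_open: Callable[[str, int], bool] = tcp_is_open,
) -> List[Tuple[str, int]]:
    """Guess and check: return every (host, port) pair that accepts a connection."""
    return [(h, p) for h, p in product(hosts, ports) if is_open(h, p)]


if __name__ == "__main__":
    # Hypothetical node names and port range; replace with your own.
    candidates = find_endpoints(["node1", "node2"], range(9200, 9210))
    print(",".join(f"{host}:{port}" for host, port in candidates))
```

The resulting host:port list could then be handed to the application as its
list of nodes to contact directly, bypassing the proxy. A stricter check
would also issue an HTTP request and inspect the response before trusting
a port.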



On Wed, Dec 30, 2015 at 12:10 PM, vincent gromakowski <
vincent.gromakowski@gmail.com> wrote:

> Can you confirm my understanding? Spark will connect to Elasticsearch
> through the service port (that is, through HAProxy) and will then get the
> direct IPs/ports for the topology?

Re: mesos, big data and service discovery

Posted by vincent gromakowski <vi...@gmail.com>.

Can you confirm my understanding? Spark will connect to Elasticsearch
through the service port (that is, through HAProxy) and will then get the
direct IPs/ports for the topology?

2015-12-30 19:06 GMT+01:00 John Omernik <jo...@omernik.com>:

> I would say that service discovery is only for those services that don't
> have a built-in method for discovery. When I run Elasticsearch, I specify
> the port range it can start in and let it run. If a port is taken, it
> tries a different one (I am using the Elasticsearch for YARN package
> running on Apache Myriad). Since I know which nodes and what port ranges
> to use, I just add that to my Elasticsearch config, and thus HAProxy is
> not intercepting that traffic. If I have a front end running in Flask that
> connects to the ES back end, then I would use Mesos-DNS with HAProxy to
> solve that problem. In addition, Spark as a framework does its own service
> discovery, so HAProxy wouldn't be getting in between Spark nodes; the same
> goes for Kafka (I haven't played with Cassandra yet).
>
> There is some work being done on IP per container which will help with this
> as well, but all in all, I've found that as long as I am somewhat smart
> about my frameworks, I can manage them (my cluster isn't huge either). As
> things grow, I am hoping to grow into IP per container.
>
> John

Re: mesos, big data and service discovery

Posted by John Omernik <jo...@omernik.com>.

I would say that service discovery is only for those services that don't
have a built-in method for discovery. When I run Elasticsearch, I specify
the port range it can start in and let it run. If a port is taken, it
tries a different one (I am using the Elasticsearch for YARN package
running on Apache Myriad). Since I know which nodes and what port ranges
to use, I just add that to my Elasticsearch config, and thus HAProxy is
not intercepting that traffic. If I have a front end running in Flask that
connects to the ES back end, then I would use Mesos-DNS with HAProxy to
solve that problem. In addition, Spark as a framework does its own service
discovery, so HAProxy wouldn't be getting in between Spark nodes; the same
goes for Kafka (I haven't played with Cassandra yet).

There is some work being done on IP per container which will help with this
as well, but all in all, I've found that as long as I am somewhat smart
about my frameworks, I can manage them (my cluster isn't huge either). As
things grow, I am hoping to grow into IP per container.

John


Re: mesos, big data and service discovery

Posted by vincent gromakowski <vi...@gmail.com>.

Good idea for getting data locality with non-distributed apps, but the Spark
driver will distribute the info to the workers, so it may result in all
workers connecting to the instance on the same node as the driver.
I will do some tests...
On Dec 31, 2015 at 1:26 AM, "Shuai Lin" <li...@gmail.com> wrote:

> What about specifying all non-local instances as "backup" in haproxy.cfg?
> This way haproxy would only direct traffic to the local instance as long as
> the local instance is alive.
>
> For example, if you plan to use the haproxy-marathon-bridge script, you
> can modify this line to achieve that:
> https://github.com/mesosphere/marathon/blob/8b3ce8844dcc53055345914ef11019789dd843cf/bin/haproxy-marathon-bridge#L162
> .

Re: mesos, big data and service discovery

Posted by Shuai Lin <li...@gmail.com>.

What about specifying all non-local instances as "backup" in haproxy.cfg?
This way haproxy would only direct traffic to the local instance as long as
the local instance is alive.

For example, if you plan to use the haproxy-marathon-bridge script, you can
modify this line to achieve that:
https://github.com/mesosphere/marathon/blob/8b3ce8844dcc53055345914ef11019789dd843cf/bin/haproxy-marathon-bridge#L162
.
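
As a rough sketch of the "backup" idea above (not the actual
haproxy-marathon-bridge code), the per-node generation of server lines might
look like this. The function name backend_server_lines, the app-N server
names and the addresses are assumptions for illustration; `check` and
`backup` are standard options on an HAProxy `server` line.

```python
from typing import Iterable, List, Tuple


def backend_server_lines(
    servers: Iterable[Tuple[str, int]],
    local_host: str,
) -> List[str]:
    """Build HAProxy `server` lines, marking every non-local instance as backup."""
    lines = []
    for index, (host, port) in enumerate(servers):
        line = f"  server app-{index} {host}:{port} check"
        if host != local_host:
            # Non-local instances only receive traffic when no
            # non-backup (local) server is available.
            line += " backup"
        lines.append(line)
    return lines


if __name__ == "__main__":
    # Hypothetical task addresses as they might be reported by Marathon.
    tasks = [("10.0.0.1", 31000), ("10.0.0.2", 31005)]
    print("\n".join(backend_server_lines(tasks, local_host="10.0.0.1")))
```

Each node would render its own haproxy.cfg with its own address as
local_host. Note that by default HAProxy sends traffic only to the first
available backup server when the local one is down; `option allbackups`
spreads it across all of them.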

