Posted to user@spark.apache.org by Mich Talebzadeh <mi...@gmail.com> on 2019/09/23 19:04:53 UTC

Google Cloud and Spark in Docker: considerations for real-time streaming data

Hi,

We have designed an efficient trading system that runs on a cluster of
three nodes in Google Cloud Platform (GCP).

Each node runs one Zookeeper and one Kafka broker, so in total we have
three Zookeeper containers and three Kafka broker containers. All these
nodes communicate with each other using internal IP addresses, as opposed
to the ephemeral external IP addresses that change if a compute node is
restarted.

The trading system consumes a streaming topic whose messages are generated
in JSON format from one of the compute nodes. For all containers we
specified the --net=host option, so Docker uses the host's network stack
for the container: the network configuration of the container is the same
as that of the host, and the container shares the service ports available
to the host. Since each host runs only its own containers, this port
mapping between container and host works fine.
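
For illustration only, a minimal sketch of how the Zookeeper and Kafka
containers could be started this way, assuming the Confluent
cp-zookeeper/cp-kafka images; the image names, ports and internal IP
addresses below are placeholders rather than our exact configuration:

# Share the host's network stack (--net=host) and advertise the node's
# internal GCP IP, so brokers and clients talk over the stable internal
# addresses rather than ephemeral external ones.
docker run -d --name zookeeper --net=host \
  -e ZOOKEEPER_CLIENT_PORT=2181 \
  confluentinc/cp-zookeeper

docker run -d --name kafka-broker --net=host \
  -e KAFKA_BROKER_ID=1 \
  -e KAFKA_ZOOKEEPER_CONNECT=10.128.0.2:2181,10.128.0.3:2181,10.128.0.4:2181 \
  -e KAFKA_ADVERTISED_LISTENERS=PLAINTEXT://10.128.0.2:9092 \
  confluentinc/cp-kafka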

A high-level design is shown below.

[image: image.png]
Unlike a typical Lambda architecture, we have replaced the batch layer,
which would traditionally use a data warehouse such as HBase or Hive,
with a flash-optimized Aerospike database that offers very low latency.
For redundancy we created a two-node Aerospike cluster. All incoming
ticker (trade) data from the Kafka topic are stored in one Aerospike set
using the Aerospike-Kafka connector. For the speed layer, we used Spark
SQL with Scala and functional programming to look for high-value trades
as they stream in; if the price is desirable (according to the decision
engine), a notification is sent to the real-time dashboard and these
high-value events are also stored in another Aerospike set (table).
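
To make the speed layer concrete, below is a minimal sketch in Scala of
the kind of Structured Streaming job involved. The topic name, JSON
schema, bootstrap servers and price threshold are assumptions for
illustration, and the foreachBatch body stands in for the real sinks
(the second Aerospike set and the dashboard notification):

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

object HighValueTrades {
  def main(args: Array[String]): Unit = {

    val spark = SparkSession.builder()
      .appName("HighValueTrades")
      .getOrCreate()

    import spark.implicits._

    // Assumed shape of the incoming JSON ticker messages
    val tickerSchema = new StructType()
      .add("ticker", StringType)
      .add("timeissued", TimestampType)
      .add("price", DoubleType)

    // Read the ticker topic from the Kafka brokers (internal IPs are placeholders)
    val trades = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers",
        "10.128.0.2:9092,10.128.0.3:9092,10.128.0.4:9092")
      .option("subscribe", "trades")  // topic name is an assumption
      .load()
      .select(from_json($"value".cast("string"), tickerSchema).as("t"))
      .select("t.*")

    // Placeholder threshold standing in for the decision engine
    val highValue = trades.filter($"price" > 100.0)

    val query = highValue.writeStream
      .outputMode("append")
      .foreachBatch { (batch: DataFrame, batchId: Long) =>
        // In the real system each micro-batch is written to the second
        // Aerospike set and pushed to the dashboard; show() is a stand-in.
        batch.show(false)
      }
      .start()

    query.awaitTermination()
  }
}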

We installed the Spark binaries (Spark version 2.4.3). We did not use
Docker for Spark. We started testing in local mode (a single JVM) as a
smoke test, which proved that the design worked. However, there are
significant challenges that need to be addressed when running a
distributed application like Spark in a multi-host environment. We opted
for running Spark in standalone mode. One node runs the Spark master, and
the workers run on all three nodes. Each node is an n1-standard-4
(4 vCPUs, 15 GB memory). In summary, one node runs the Spark master plus
Spark workers plus one Zookeeper container and one Kafka broker
container. The remaining two nodes each run one Zookeeper container, one
Kafka broker container, one Aerospike container, and Spark workers.
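
For reference, submitting such a job to the standalone master might look
like the sketch below; the master host, resource settings, class and jar
names are placeholders rather than our actual values:

spark-submit \
  --master spark://<master-internal-ip>:7077 \
  --deploy-mode client \
  --executor-memory 8G \
  --total-executor-cores 8 \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.3 \
  --class HighValueTrades \
  high-value-trades.jar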

We thought about this at length and decided not to run Spark in Docker.
Most of the literature seems to cover the concepts rather than real-life
dockerisation of Spark. So my question is: how would this design be
different or better if we decided to run Spark in Docker containers?

Thanks, and apologies for this long-winded text.

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.