Posted to user@spark.apache.org by Mich Talebzadeh <mi...@gmail.com> on 2016/10/10 22:40:53 UTC

Design considerations for a trading system

Hi,

I have been working on a Lambda Architecture for trading systems.

I think I have completed the dry runs for testing the modules.

For the batch layer the criterion is a day's lag (one-day-old data). This is
acceptable for users who come from a BI background using Tableau, but I
think we can do better.

*For the batch layer* the design is as follows:


   1. Trade prices are collected by Kafka every 2 seconds, in batches of 20
   prices per security (ticker). Each CSV row carries a UUID, ticker,
   timestamp and price, comma separated (a sketch of this row format
   appears after this list)
   2. That is 600 prices a minute, 36K an hour and 864K a day
   3. Flume is used to put these "raw" prices on HDFS (source -> Kafka ->
   Flume -> HDFS)
   4. Every 15 minutes (24x7) a cron job picks up these files, concatenates
   the small files into one large file of prices and inserts it into an
   HBase table. The load is done via ImportTsv and is pretty fast (under 2
   minutes). Reading the individual files one by one is terribly slow with
   ImportTsv, which uses map-reduce (see the consolidation sketch after the
   summary below)
   5. As an additional test we also have a cron that picks up these
   individual raw files every 15 minutes and puts them into a Hive table
   (this will be decommissioned eventually)
   6. We have views built in Hive and Phoenix on top of the HBase tables.
   Phoenix works OK in Zeppelin (albeit with SQL not as rich as Hive's).
   We can also use Spark on the Hive tables in Zeppelin, and Tableau with
   Hive and Spark plus Tableau Server's data caching facilities
   7. It is my understanding that the new release of Hive will ship with
   LLAP (Live Long and Process), its in-memory caching and execution layer
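As a concrete illustration of the row format in point 1, here is a minimal
Scala sketch of parsing one raw tick. The case class and the treatment of
the timestamp as a plain string are my assumptions; the actual files are
just comma-separated text:

  import java.util.UUID
  import scala.util.Try

  // One raw tick as described in point 1: UUID, ticker, timestamp, price
  case class PriceTick(id: UUID, ticker: String, timestamp: String, price: Double)

  def parseTick(line: String): Option[PriceTick] =
    line.split(",", -1) match {
      case Array(id, ticker, ts, price) =>
        Try(PriceTick(UUID.fromString(id), ticker, ts, price.toDouble)).toOption
      case _ => None // malformed row, skip it
    }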

So in summary the batch layer currently offers all data with a 15-minute lag.
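Going back to point 4 above, one way to express the 15-minute consolidation
step in Scala (rather than the cron shell script we actually use) is the
Hadoop FileUtil.copyMerge API. The paths here are placeholders for
illustration; the merged file is then bulk-loaded with ImportTsv:

  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

  val conf = new Configuration()
  val fs = FileSystem.get(conf)

  // Merge the small 2-second files into one large file for ImportTsv
  FileUtil.copyMerge(
    fs, new Path("/data/prices/raw"),        // directory of small raw files (placeholder)
    fs, new Path("/data/prices/merged.csv"), // single large output file (placeholder)
    false,                                   // keep the source files
    conf,
    null)                                    // nothing appended between files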

In designing this layer I had to take into account the ease of use and our
users' familiarity with Tableau and (eventually) Zeppelin. In addition,
certain power users can use Spark functional programming in Zeppelin as well.

The Tableau users tend to do analytics at the macro level. In other words,
what they want is speedy queries, not necessarily the latest data.

*For the speed layer* the design is as follows:


   1. Kafka feeds data into Spark Streaming (SS)
   2. Within each RDD, the price on every row is analysed
   3. Calculations are done on the prices and, if they satisfy the alert
   criteria, a message is sent to an alert (currently a simple beep). The
   row is also displayed for those particular prices to alert the trader
   4. The details of these trades are flushed to another HBase table in
   real time, impressively fast (a sketch of this streaming loop appears
   below the sample output)

Mon Oct 10 22:37:52 BST 2016, Price on S16 hit 99.25724
Mon Oct 10 22:37:56 BST 2016, Price on S05 hit 99.5992
So these are the basics of the speed layer.
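For concreteness, here is a minimal Scala sketch of this speed-layer loop,
assuming the spark-streaming-kafka 0.8 direct stream API. The broker
address, topic name and alert threshold are placeholders, and the HBase
write is left as a comment since the table layout is site-specific:

  import kafka.serializer.StringDecoder
  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}
  import org.apache.spark.streaming.kafka.KafkaUtils

  val sparkConf = new SparkConf().setAppName("PriceAlerts")
  val ssc = new StreamingContext(sparkConf, Seconds(2)) // matches the 2-second ticks

  val kafkaParams = Map("metadata.broker.list" -> "localhost:9092") // placeholder broker
  val topics = Set("prices")                                        // placeholder topic

  val lines = KafkaUtils
    .createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)
    .map(_._2) // the CSV payload: UUID,ticker,timestamp,price

  lines.foreachRDD { rdd =>
    // collect() is fine at this volume (10 rows/sec) and lets alerts print on the driver
    rdd.collect().foreach { row =>
      val fields = row.split(",")
      if (fields.length == 4 && fields(3).toDouble > 99.0) { // placeholder alert criterion
        println(s"${fields(2)}, Price on ${fields(1)} hit ${fields(3)}")
        // ... and flush the row to the real-time HBase alerts table here
      }
    }
  }

  ssc.start()
  ssc.awaitTermination()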

The serving layer will combine data from both the batch and speed layers.
We are in the process of building a real-time dashboard as well.

After some thought I decided that extracting HBase data through the Hive or
Phoenix SQL skins, together with Spark, provides what we need. I have not
yet managed to use Phoenix on Spark, and direct HBase support for Spark 2
is not available yet; even if it were, an SQL skin is a better fit for
visualisation tools (see the sketch below).
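As a sketch of what the serving layer looks like through the Hive skin with
Spark 2: the view names batch_prices and speed_prices, and the column names,
are hypothetical stand-ins for the Hive views over the HBase tables
described above:

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder()
    .appName("ServingLayer")
    .enableHiveSupport() // the views over HBase live in the Hive metastore
    .getOrCreate()

  // Union the 15-minute-lagged batch data with the latest speed-layer rows
  val served = spark.sql(
    """SELECT ticker, ts, price FROM batch_prices
      |UNION ALL
      |SELECT ticker, ts, price FROM speed_prices""".stripMargin)

  served.createOrReplaceTempView("all_prices") // one logical view for the dashboard
  spark.sql("SELECT ticker, max(price) AS high FROM all_prices GROUP BY ticker").show()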

Sorry about this long monologue. Appreciate any feedback.

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.