Posted to user@spark.apache.org by Mich Talebzadeh <mi...@gmail.com> on 2016/09/30 16:17:37 UTC

Design considerations for batch and speed layers

I have designed this prototype for a risk business. Here I would like to
discuss issues with the batch layer. *Apologies for being long-winded.*

*Business objective*

Reduce risk in the credit business while making better credit and trading
decisions. Specifically, to identify risk trends within certain years of
trading data. For example, measure the risk exposure in a given portfolio by
industry, region, credit rating and other parameters. At the macroscopic
level, analyze data across market sectors, over a given time horizon, to
assess risk changes.


*Deliverable*
Enable real time and batch analysis of risk data

*Batch technology stack used*
Kafka -> ZooKeeper, Flume, HDFS (raw data), Hive, cron, Spark as the query
tool, Zeppelin

*Test volumes for POC*
1 message queue (CSV format), 100 stock prices streaming in every 2 seconds,
i.e. 180K prices per hour and 4 million+ per day



   1. Prices to Kafka -> ZooKeeper -> Flume -> HDFS
   2. HDFS daily partition for that day's data
   3. Hive external table pointing at the partitioned HDFS location
   4. Hive managed table (ORC, partitioned by date) populated every 15
   minutes via cron from the Hive external table. This is purely a Hive
   job. The Hive table is populated with insert/overwrite for that day's
   partition to avoid boundary value/missing data issues (see the sketch
   after this list)
   5. Typical batch ingestion time (Hive table populated from HDFS files) ~
   2 minutes
   6. Data in the Hive table has a 15 minute latency
   7. Zeppelin to be used as the UI with Spark
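
For illustration, steps 3 and 4 look roughly as follows. Table names, columns
and HDFS paths (prices_external, prices_orc, /data/prices, ticker/price/event_ts)
are made up for this sketch and are not the real schema; the refresh actually
runs as a pure Hive job via cron, but the same statements are shown through
spark.sql in a Spark 2.x shell so the example is self-contained.

// Step 3: external table over the raw CSV files that Flume lands on HDFS
// (illustrative names and paths, not the real schema)
spark.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS prices_external (
    ticker STRING, price DOUBLE, event_ts TIMESTAMP)
  PARTITIONED BY (trade_date STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  STORED AS TEXTFILE
  LOCATION 'hdfs:///data/prices'
""")
// Register any newly landed daily partitions with the metastore
spark.sql("MSCK REPAIR TABLE prices_external")

// Step 4: managed ORC table that users query
spark.sql("""
  CREATE TABLE IF NOT EXISTS prices_orc (
    ticker STRING, price DOUBLE, event_ts TIMESTAMP)
  PARTITIONED BY (trade_date STRING)
  STORED AS ORC
""")

// 15-minute refresh: overwrite only today's partition so reruns are idempotent
val today = java.time.LocalDate.now.toString
spark.sql(s"""
  INSERT OVERWRITE TABLE prices_orc PARTITION (trade_date = '$today')
  SELECT ticker, price, event_ts
  FROM prices_external
  WHERE trade_date = '$today'
""")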


Zeppelin will use Spark SQL (on the Spark Thrift Server) and the Spark shell.
Within the Spark shell, users can access the batch tables in Hive, *or* they
can read the raw data on HDFS directly, which gives them *real time
access* (not to be confused with the speed layer). Using a typical query with
Spark, reading the last 15 minutes of real time data (T-15 to now) from the
raw files takes about 1 minute. Running the same query (my typical query, not
a user query) against the Hive tables with Spark takes about 6 seconds. Both
access paths are sketched below.
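
For reference, the two access paths look roughly like this in the Spark shell
(schema and paths are the same made-up assumptions as in the sketch above):

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

// Illustrative schema for the raw CSV prices
val schema = StructType(Seq(
  StructField("ticker", StringType),
  StructField("price", DoubleType),
  StructField("event_ts", TimestampType)))

// "Real time" path: read today's raw CSV files straight off HDFS
// (slower, but sees rows the 15-minute Hive refresh has not picked up yet)
val raw = spark.read.schema(schema)
  .csv("hdfs:///data/prices/trade_date=" + java.time.LocalDate.now)

raw.filter(expr("event_ts >= current_timestamp() - INTERVAL 15 MINUTES"))
  .groupBy("ticker")
  .agg(max("price"), min("price"), avg("price"))
  .show()

// Batch path: the same query against the ORC table (faster, but up to 15
// minutes behind)
spark.table("prices_orc")
  .filter(expr("event_ts >= current_timestamp() - INTERVAL 15 MINUTES"))
  .groupBy("ticker")
  .agg(max("price"), min("price"), avg("price"))
  .show()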

However, there are some design concerns:


   1. Zeppelin starts slowing down towards the end of the day and sometimes
   throws a broken pipe message. I resolve this by restarting the Zeppelin
   daemon. Potentially a show stopper.
   2. As the volume of data increases throughout the day, performance
   becomes an issue.
   3. Every 15 minutes when the cron job starts, the Hive insert/overwrite
   can potentially conflict with user queries coming from Zeppelin/Spark. I
   am sure that with exclusive writes Hive will block all users from
   accessing these tables (at partition level) until the insert overwrite is
   done. This can be improved by better partitioning of the Hive tables (see
   the sketch after this list) or by relaxing the ingestion interval to half
   an hour or an hour, at the cost of more lag. I tried Parquet tables in
   Hive but saw no real difference in performance. I have also thought of
   replacing Hive with HBase etc., but that brings new complications in as
   well without necessarily solving the issue.
   4. I am not convinced this design can scale up easily with 5 times the
   volume of data.
   5. We will also get real time data from RDBMS tables (Oracle, Sybase,
   MSSQL) using replication technologies such as SAP Replication Server.
   These currently deliver change log data to Hive tables, so there is a
   compatibility issue here as well.
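
On point 3, one option (an idea only, not what is in place today) is to
partition the managed table by date and hour, so each 15-minute
insert/overwrite rewrites only the current hour's partition and queries over
earlier hours are not blocked. Roughly, using the same made-up schema as above:

// Possible mitigation for concern 3: partition by (date, hour) so each
// 15-minute refresh rewrites only the current hour's partition
spark.sql("""
  CREATE TABLE IF NOT EXISTS prices_orc_hourly (
    ticker STRING, price DOUBLE, event_ts TIMESTAMP)
  PARTITIONED BY (trade_date STRING, trade_hour INT)
  STORED AS ORC
""")

val now  = java.time.LocalDateTime.now
val day  = now.toLocalDate.toString
val hour = now.getHour

// Only the current hour's partition is overwritten on each run
spark.sql(s"""
  INSERT OVERWRITE TABLE prices_orc_hourly
  PARTITION (trade_date = '$day', trade_hour = $hour)
  SELECT ticker, price, event_ts
  FROM prices_external
  WHERE trade_date = '$day' AND hour(event_ts) = $hour
""")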


So I am sure some members can add useful ideas :)

Thanks

Mich




LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.

Re: Design considerations for batch and speed layers

Posted by Rodrick Brown <ro...@orchardplatform.com>.
We process millions of records using Kafka, Elasticsearch, Accumulo, Mesos,
Spark & Vertica.

There is a pattern for this type of pipeline today called SMACK; more about it
here --
http://www.slideshare.net/akirillov/data-processing-platforms-architectures-with-spark-mesos-akka-cassandra-and-kafka


On Fri, Sep 30, 2016 at 4:55 PM, Ashok Kumar <as...@yahoo.com.invalid>
wrote:

> Can one design a fast pipeline with Kafka, Spark Streaming and HBase or
> something similar?


-- 

[image: Orchard Platform] <http://www.orchardplatform.com/>

*Rodrick Brown */ *DevOPs*

9174456839 / rodrick@orchardplatform.com

Orchard Platform
101 5th Avenue, 4th Floor, New York, NY


Re: Design considerations for batch and speed layers

Posted by Ashok Kumar <as...@yahoo.com.INVALID>.
Can one design a fast pipeline with Kafka, Spark Streaming and HBase or something similar?


 
