Posted to user@spark.apache.org by sparkie <le...@gmail.com> on 2016/06/08 04:02:52 UTC

Apache design pattern approaches

Hi, I have been working through some example tutorials for Apache Spark in an
attempt to establish how I would solve the following scenario (see data
examples in the Appendix):

I have 1 billion+ rows that have a key value (i.e. driver ID) and a number of
relevant attributes (product class, date/time) that I need to evaluate using
certain business rules/algorithms. These rules are applied to grouped data
(i.e. perform the business rules on driver ID 1, then perform the same rules
on driver ID 2, etc.); typical business rules include backward- and
forward-looking checks (see sample below) within a grouped dataset.
Importantly, I need to process the grouped data (driver IDs 1, 2, 3, 4, ...)
concurrently. An example of the business rules (sketched in code after the
list):
 
For each data grouping/set (i.e. driver ID = 1, ordered chronologically by
date):

- the first row is always an 'initiate' = ROW ID 1
- the product class value occurs elsewhere in the group (backward or forward
  looking) = 'DUPLICATE' = ROW ID 2
- the product changed within the same product class, backward looking only
  (e.g. A -> A1) = 'SWAP' = ROW ID 3
- the product has not previously occurred = 'ADD' = ROW ID 4
- the product class value occurs elsewhere in the group (backward or forward
  looking) = 'DUPLICATE' = ROW ID 5
- the product class value occurs elsewhere in the group (backward or forward
  looking) = 'DUPLICATE' = ROW ID 6
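
To make the rules concrete, here is a rough sketch of the per-group logic in
plain Scala (Scala because that is what the tutorials use; the Record class
and classify function are my own placeholders, not Spark API, and I have
reduced the SWAP test to a naive prefix check that happens to match the
A -> A1 example):

    import java.time.LocalDate

    case class Record(rowId: Long, driverId: Long,
                      productClass: String, date: LocalDate)

    // Classify one driver's rows, assumed already sorted chronologically.
    // "Backward or forward looking" means a rule may inspect the whole
    // group, not just the rows seen so far.
    def classify(group: Seq[Record]): Seq[(Record, String)] =
      group.zipWithIndex.map { case (rec, i) =>
        val before = group.take(i)
        val others = before ++ group.drop(i + 1)
        val result =
          if (i == 0) "INITIATE"                  // first row in the group
          else if (others.exists(_.productClass == rec.productClass))
            "DUPLICATE"                           // occurs earlier OR later
          else if (before.exists(p => rec.productClass.startsWith(p.productClass)))
            "SWAP"                                // e.g. A -> A1, backward only
          else "ADD"                              // brand new product class
        (rec, result)
      }

What I cannot work out is how to run classify (or its equivalent) over every
driver group concurrently at cluster scale, which leads to my questions
below.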

Questions:

1. Should I use DataFrames to 'pull' the source data? If so, do I do a
group-by and order-by as part of the SQL query? (See the first sketch after
these questions.)

2. How do I then split the grouped data (i.e. driver ID key/value pairs) so
that the groups can be parallelized for concurrent processing (ideally the
number of parallel datasets/grouped data should run at maximum cluster
capacity)? Do I need to do some sort of mapPartitions? (See the second
sketch.)

3. Depending on the answers to (1) and (2): how does each grouped dataset
(DataFrame, RDD, or Dataset) perform these rule-based checks (i.e. backward-
and forward-looking checks)? In other words, how is this achieved in Spark?
(See the third sketch.)
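
For (1), this is roughly what I have pieced together so far (the
SparkSession builder is Spark 2.x style, and the file path and column names
are placeholders matching the Appendix, so please correct me if this is the
wrong starting point):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("driver-rules").getOrCreate()

    // Placeholder path; the real source is the 1B+ row table.
    val events = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/data/driver_events.csv")

    events.createOrReplaceTempView("events")

    // Is the SQL query the right place to cluster and order the rows,
    // or should that happen later, per group?
    val ordered = spark.sql(
      "SELECT rowId, driverId, productClass, date FROM events " +
      "ORDER BY driverId, date")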
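
For (2), the two options I have found are repartitioning the DataFrame by
the key, or dropping to an RDD and grouping by key; I do not know which (if
either) is idiomatic for walking one group at a time inside mapPartitions:

    import org.apache.spark.sql.functions.col

    // Option A: co-locate each driver's rows, then sort within partitions
    // so mapPartitions can walk each group in date order.
    val byDriver = events
      .repartition(col("driverId"))
      .sortWithinPartitions(col("driverId"), col("date"))

    // Option B: pair by key and group, so each driver's rows arrive as a
    // single Iterable per key (I gather this can be memory-hungry for
    // large groups).
    val grouped = events.rdd
      .map(row => (row.getAs[Int]("driverId"), row))
      .groupByKey()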
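
For (3), window functions look like the closest fit for the backward and
forward looks, but I am unsure how to express all four rules and their
precedence; this fragment only covers the DUPLICATE case (column names again
match the Appendix):

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{col, count, lag, lead, when}

    // Ordered window within a driver's group, for one-step looks.
    val byDate = Window.partitionBy("driverId").orderBy("date")

    // Unordered window over (driver, product class): counting occurrences
    // anywhere in the group gives a backward AND forward looking test.
    val byClass = Window.partitionBy("driverId", "productClass")

    val flagged = events
      .withColumn("prevClass", lag(col("productClass"), 1).over(byDate))
      .withColumn("nextClass", lead(col("productClass"), 1).over(byDate))
      .withColumn("classCount", count("*").over(byClass))
      .withColumn("result", when(col("classCount") > 1, "DUPLICATE"))

Is chaining when(...).otherwise(...) per rule the normal pattern here, or is
mapPartitions over sorted groups a better fit for rules like SWAP?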

P.S. I have a solid Java background but am a complete Apache Spark novice,
so your help would be really appreciated.



Appendix
Input/Output
 
ROWID, Driver ID, product class, date,   RESULT
1,     1,         A,             1/1/16, INITIATE
2,     1,         A,             2/2/16, DUPLICATE
3,     1,         A1,            3/4/16, SWAP
4,     1,         B,             2/5/16, ADD
5,     1,         C,             1/1/16, DUPLICATE
6,     1,         C,             2/2/16, DUPLICATE
7,     2,         A,             2/2/16, INITIATE
8,     2,         B,             3/4/16, ADD
9,     2,         A,             2/5/16, DUPLICATE



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Apache-design-pattern-approaches-tp27109.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
