Posted to user@spark.apache.org by Harish Butani <rh...@gmail.com> on 2017/04/13 02:40:49 UTC
Re: Design patterns involving Spark

BTW, we now support OLAP functionality natively in Spark, without the need
for Druid, through our Spark Native BI platform (SNAP):
https://www.linkedin.com/pulse/integrated-business-intelligence-big-data-stacks-harish-butani

- We provide SQL commands to create a star schema, create an OLAP index, and
insert into an OLAP index, so you can be up and running very quickly in a
Spark environment.
- Query acceleration is provided through an OLAP Index FileFormat and Query
Optimizer extensions (just like spark-druid-olap).
- We have also posted details on a BI Benchmark
<https://www.linkedin.com/pulse/integrated-business-intelligence-big-data-stacks-harish-butani>
to quantify query acceleration and cost.
- We haven't looked at integration with Spark Streaming yet, but since we have
a FileFormat, it should be possible to integrate. Please ping me if this is of
interest.

regards,
Harish.


On Mon, Aug 29, 2016 at 7:19 PM, Chanh Le <gi...@gmail.com> wrote:

> Hi everyone,
>
> It seems a lot of people are using Druid for real-time dashboards.
> I'm wondering about using Druid as the main storage engine, because Druid
> can store the raw data and can also integrate with Spark (in theory).
> In that case, do we need two separate stores: Druid (which stores its
> segments in HDFS) and HDFS?
> BTW, did anyone try this one:
> https://github.com/SparklineData/spark-druid-olap?
>
>
> Regards,
> Chanh
>
>
> On Aug 30, 2016, at 3:23 AM, Mich Talebzadeh <mi...@gmail.com>
> wrote:
>
> Thanks Bhaarat and everyone.
>
> This is an updated version of the same diagram
>
> <LambdaArchitecture.png>
> 
> The frequency of recent data is defined by the window length in Spark
> Streaming. It can vary from 0.5 seconds to an hour. (I don't think we can
> push Spark's granularity below 0.5 seconds in anger.) For some
> applications, like credit card transactions and fraud detection, data is
> stored in real time by Spark in HBase tables. The HBase tables will be on
> HDFS as well. The same Spark Streaming job will write asynchronously to
> Hive tables on HDFS.
> One school of thought is to never write to Hive from Spark, but to write
> straight to HBase and then read the HBase tables into Hive periodically.
>
> Now the third component is the Serving Layer, which can combine data from
> the current (HBase) and the historical (Hive tables) stores to give the
> user visual analytics. Those visual analytics can be a real-time dashboard
> on top of the Serving Layer. That Serving Layer could be an in-memory NoSQL
> offering, or data from HBase (red box) combined with Hive tables.
>
> I am not aware of any industrial-strength real-time dashboard. The idea
> is that one uses such a dashboard in real time. Dashboard in this sense
> means a general-purpose API to a data store of some type, like the one on
> the Serving Layer, that provides visual analytics in real time on demand,
> combining real-time data and aggregate views. As usual, the devil is in
> the detail.
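
The serving-layer idea above (historical aggregates plus fresh rows) can be
sketched in a few lines. This is only an illustration in plain Python, not
Spark, HBase or Hive code: the function name, the dict standing in for a
Hive aggregate, the list standing in for HBase rows, and the cutoff
timestamp are all hypothetical.

```python
def serve_total(key, batch_view, realtime_rows, batch_cutoff):
    """Answer one dashboard query by combining two stores.

    batch_view    -- dict: key -> total of everything before batch_cutoff
                     (stand-in for an aggregate held in a Hive table)
    realtime_rows -- iterable of (key, timestamp, amount)
                     (stand-in for recent rows held in HBase)
    batch_cutoff  -- timestamp up to which the batch view is complete
    """
    historical = batch_view.get(key, 0)
    # Only count fresh rows the batch view has not already absorbed.
    recent = sum(
        amount
        for row_key, ts, amount in realtime_rows
        if row_key == key and ts >= batch_cutoff
    )
    return historical + recent


batch_view = {"card-1": 100, "card-2": 40}
realtime_rows = [
    ("card-1", 1000, 5),
    ("card-1", 1005, 7),
    ("card-2", 990, 3),   # older than the cutoff, already in the batch view
    ("card-2", 1001, 1),
]

print(serve_total("card-1", batch_view, realtime_rows, batch_cutoff=1000))
```

The cutoff is what stops a row from being counted twice once the periodic
batch load has caught up with it.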
>
>
>
> Let me know your thoughts. Anyway, this is a first-cut pattern.
>
> 
>
> Dr Mich Talebzadeh
>
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
> http://talebzadehmich.wordpress.com
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 29 August 2016 at 18:53, Bhaarat Sharma <bh...@gmail.com> wrote:
>
>> Hi Mich
>>
>> This is really helpful. I'm trying to wrap my head around the last
>> diagram you shared (the one with Kafka). In this diagram Spark Streaming is
>> pushing data to HDFS and NoSQL. However, I'm confused by the "Real Time
>> Queries, Dashboards" annotation. Based on this diagram, will real-time
>> queries be running on Spark or HBase?
>>
>> PS: My intention was not to steer the conversation away from what Ashok
>> asked but I found the diagrams shared by Mich very insightful.
>>
>> On Sun, Aug 28, 2016 at 7:18 PM, Mich Talebzadeh <
>> mich.talebzadeh@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> In terms of positioning, Spark is really the first Big Data platform to
>>> integrate batch, streaming and interactive computations in a unified
>>> framework. What this boils down to is that, whichever way one looks
>>> at it, there is somewhere that Spark can make a contribution. In general,
>>> there are a few design patterns common to Big Data:
>>>
>>>
>>>
>>>    - *ETL & Batch*
>>>
>>> The first one is the most common, with established tools like Sqoop and
>>> Talend for ETL and HDFS for storage of some kind. Spark can be used as the
>>> execution engine for Hive at the storage level, which actually makes it
>>> a truly vendor-independent processing engine (BTW, Impala, Tez and LLAP
>>> are offered by vendors). Personally, I use Spark at the ETL layer by
>>> extracting data from sources through plug-ins (JDBC and others) and storing
>>> it on HDFS in some form.
>>>
>>>
>>>
>>>    - *Batch, real time plus Analytics*
>>>
>>> In this pattern you have data arriving in real time and you want to query
>>> it in real time through a real-time dashboard. HDFS is not ideal for
>>> updating data in real time, nor for random access to data. The source could
>>> be all sorts of web servers, which need a Flume agent. At the storage
>>> layer we are probably looking at something like HBase. The crucial point
>>> is that saved data needs to be ready for queries immediately. The
>>> dashboards require HBase APIs. The analytics can be done through Hive,
>>> again running on the Spark engine. Again, note here that we should ideally
>>> process batch and real time separately.
>>>
>>>
>>>
>>>    - *Real time / Streaming*
>>>
>>> This is most relevant to Spark, as we are moving to near real time, where
>>> Spark excels. We need to capture the incoming events (logs, sensor data,
>>> pricing, emails) through interfaces like Kafka, message queues etc., and
>>> process these events with minimum latency. Again, Spark is a very good
>>> candidate here with its Spark Streaming and micro-batching capabilities.
>>> There are others, like Storm, Flink etc., that are event based, but you
>>> don't hear much about them. Again, for a streaming architecture you need to
>>> sync data in real time using something like HBase, Cassandra (?) and others
>>> as the real-time store, or HDFS, Hive etc. as forever storage.
>>>
>>>
>>> In general there is also the *Lambda Architecture*, which is
>>> designed for streaming analytics. The streaming data ends up in both the
>>> batch layer and the speed layer. The batch layer is used to answer batch
>>> queries. On the other hand, the speed layer is used to handle fast/real-time
>>> queries. This model is really cool, as Spark Streaming can feed both the
>>> batch layer and the speed layer.
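
As a rough illustration of one stream feeding both layers, here is a toy
sketch in plain Python. It does not use the Spark Streaming API; the class,
its method names and the sample events are all made up for the example, and
a real job would write to HDFS/Hive and HBase instead of in-memory objects.

```python
from collections import Counter


class LambdaPipeline:
    """Toy stand-in for a stream job that feeds both Lambda layers."""

    def __init__(self):
        self.master_dataset = []     # batch layer: append-only raw events
        self.speed_view = Counter()  # speed layer: incrementally updated totals

    def on_micro_batch(self, events):
        # Write 1: keep the raw events so batch views can be recomputed later.
        self.master_dataset.extend(events)
        # Write 2: fold the same events into the low-latency view right away.
        for key, amount in events:
            self.speed_view[key] += amount

    def recompute_batch_view(self):
        # Batch queries are answered from a view rebuilt over all raw data.
        view = Counter()
        for key, amount in self.master_dataset:
            view[key] += amount
        return view


pipeline = LambdaPipeline()
pipeline.on_micro_batch([("card-1", 20), ("card-2", 5)])
pipeline.on_micro_batch([("card-1", 7)])

print(pipeline.speed_view["card-1"])
print(pipeline.recompute_batch_view()["card-1"])
```

In this simplified version the speed view is never trimmed, so both views
agree; in practice the speed layer would only hold what the latest batch run
has not yet covered.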
>>>
>>>
>>> At a high level this looks like this, from
>>> http://lambda-architecture.net/
>>>
>>> <image.png>
>>>
>>>
>>>
>>>
>>>
>>> My favourite would be something like below with Spark playing a major
>>> role
>>>
>>>
>>> <LambdaArchitecture.png>
>>> 
>>>
>>> HTH
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> On 28 August 2016 at 19:43, Sivakumaran S <si...@me.com> wrote:
>>>
>>>> Spark fits best for processing. But depending on the use case, you
>>>> could expand the scope of Spark to moving data using the native connectors.
>>>> The only thing that Spark is not is storage. Connectors are available for
>>>> most storage options though.
>>>>
>>>> Regards,
>>>>
>>>> Sivakumaran S
>>>>
>>>>
>>>>
>>>> On 28-Aug-2016, at 6:04 PM, Ashok Kumar <ashok34668@yahoo.com.INVALID
>>>> <as...@yahoo.com.invalid>> wrote:
>>>>
>>>> Hi,
>>>>
>>>> There are design patterns that use Spark extensively. I am new to this
>>>> area, so I would appreciate it if someone could explain where Spark fits
>>>> in, especially within faster or streaming use cases.
>>>>
>>>> What are the best practices involving Spark? Is it always best to
>>>> deploy it as the processing engine?
>>>>
>>>> For example, when we have a pattern
>>>>
>>>> Input Data -> Data in Motion -> Processing -> Storage
>>>>
>>>> where does Spark best fit in?
>>>>
>>>> Thanking you
>>>>
>>>>
>>>
>>
>
>