Posted to user@spark.apache.org by Ali Akhtar <al...@gmail.com> on 2016/09/29 13:54:50 UTC

Architecture recommendations for a tricky use case

I have a somewhat tricky use case, and I'm looking for ideas.

I have 5-6 Kafka producers, reading various APIs, and writing their raw
data into Kafka.

I need to:

- Do ETL on the data, and standardize it.

- Store the standardized data somewhere (HBase / Cassandra / Raw HDFS /
ElasticSearch / Postgres)

- Query this data to generate reports / analytics (There will be a web UI
which will be the front-end to the data, and will show the reports)

Java is being used as the backend language for everything (backend of the
web UI, as well as the ETL layer)

I'm considering:

- Using raw Kafka consumers, or Spark Streaming, as the ETL layer (receive
raw data from Kafka, standardize & store it)

- Using Cassandra, HBase, or raw HDFS, for storing the standardized data,
and to allow queries

- In the backend of the web UI, I could either use Spark to run queries
across the data (mostly filters), or directly run queries against Cassandra
/ HBase

I'd appreciate some thoughts / suggestions on which of these alternatives I
should go with (e.g., using raw Kafka consumers vs. Spark for ETL, which
persistent data store to use, and how to query that data store in the
backend of the web UI, for displaying the reports).
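
For concreteness, the raw-consumer version of the ETL layer would be
something like this minimal Java sketch (the topic name and the
standardize()/store() calls are placeholders for whatever the ETL logic
turns out to be):

import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("group.id", "etl-group");
props.put("enable.auto.commit", "false");
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Arrays.asList("raw-events"));  // hypothetical topic name

while (true) {
    ConsumerRecords<String, String> records = consumer.poll(1000);
    for (ConsumerRecord<String, String> record : records) {
        // standardize() and store() stand in for the real ETL logic
        store(standardize(record.value()));
    }
    consumer.commitSync();  // commit offsets only after the batch is stored
}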


Thanks.

Re: Architecture recommendations for a tricky use case

Posted by Deepak Sharma <de...@gmail.com>.
For the UI, you need a DB such as Cassandra that is designed around the
queries you will run.
Ingest the data into Spark Streaming (the speed layer) and write it to HDFS
(for the batch layer).
Now you have data at rest as well as in motion (real time).
From Spark Streaming itself, do further processing and write the final
result to Cassandra or another NoSQL DB.
The UI can then pick the data up from the DB.
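
A rough Java sketch of that flow, assuming Spark 2.x with the
spark-streaming-kafka-0-10 direct API and the DataStax
spark-cassandra-connector (the topic, keyspace/table names, the Event POJO
and standardize() are made-up placeholders):

import java.util.*;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.*;
import org.apache.spark.streaming.kafka010.*;
import static com.datastax.spark.connector.japi.CassandraJavaUtil.*;

SparkConf conf = new SparkConf().setAppName("kafka-etl");
JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));

Map<String, Object> kafkaParams = new HashMap<>();
kafkaParams.put("bootstrap.servers", "localhost:9092");
kafkaParams.put("group.id", "etl-group");
kafkaParams.put("key.deserializer", StringDeserializer.class);
kafkaParams.put("value.deserializer", StringDeserializer.class);

JavaInputDStream<ConsumerRecord<String, String>> stream =
    KafkaUtils.createDirectStream(jssc,
        LocationStrategies.PreferConsistent(),
        ConsumerStrategies.<String, String>Subscribe(
            Collections.singletonList("raw-events"), kafkaParams));

stream.foreachRDD(rdd -> {
    // batch layer: keep the raw data at rest on HDFS
    rdd.map(ConsumerRecord::value)
       .saveAsTextFile("hdfs:///data/raw/" + System.currentTimeMillis());
    // speed/serving layer: standardize and write the result to Cassandra
    JavaRDD<Event> events = rdd.map(r -> standardize(r.value()));
    javaFunctions(events)
        .writerBuilder("reports", "events", mapToRow(Event.class))
        .saveToCassandra();
});

jssc.start();
jssc.awaitTermination();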

Thanks
Deepak

On Thu, Sep 29, 2016 at 8:00 PM, Alonso Isidoro Roman <al...@gmail.com>
wrote:

> "Using Spark to query the data in the backend of the web UI?"
>
> Dont do that. I would recommend that spark streaming process stores data
> into some nosql or sql database and the web ui to query data from that
> database.
>
> Alonso Isidoro Roman
> [image: https://]about.me/alonso.isidoro.roman
>
> <https://about.me/alonso.isidoro.roman?promo=email_sig&utm_source=email_sig&utm_medium=email_sig&utm_campaign=external_links>
>
> 2016-09-29 16:15 GMT+02:00 Ali Akhtar <al...@gmail.com>:
>
>> The web UI is actually the speed layer, it needs to be able to query the
>> data online, and show the results in real-time.
>>
>> It also needs a custom front-end, so a system like Tableau can't be used,
>> it must have a custom backend + front-end.
>>
>> Thanks for the recommendation of Flume. Do you think this will work:
>>
>> - Spark Streaming to read data from Kafka
>> - Storing the data on HDFS using Flume
>> - Using Spark to query the data in the backend of the web UI?
>>
>>
>>
>> On Thu, Sep 29, 2016 at 7:08 PM, Mich Talebzadeh <
>> mich.talebzadeh@gmail.com> wrote:
>>
>>> You need a batch layer and a speed layer. Data from Kafka can be stored
>>> on HDFS using flume.
>>>
>>> -  Query this data to generate reports / analytics (There will be a web
>>> UI which will be the front-end to the data, and will show the reports)
>>>
>>> This is basically batch layer and you need something like Tableau or
>>> Zeppelin to query data
>>>
>>> You will also need spark streaming to query data online for speed layer.
>>> That data could be stored in some transient fabric like ignite or even
>>> druid.
>>>
>>> HTH
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>> On 29 September 2016 at 15:01, Ali Akhtar <al...@gmail.com> wrote:
>>>
>>>> It needs to be able to scale to a very large amount of data, yes.
>>>>
>>>> On Thu, Sep 29, 2016 at 7:00 PM, Deepak Sharma <de...@gmail.com>
>>>> wrote:
>>>>
>>>>> What is the message inflow ?
>>>>> If it's really high , definitely spark will be of great use .
>>>>>
>>>>> Thanks
>>>>> Deepak
>>>>>
>>>>> On Sep 29, 2016 19:24, "Ali Akhtar" <al...@gmail.com> wrote:
>>>>>
>>>>>> I have a somewhat tricky use case, and I'm looking for ideas.
>>>>>>
>>>>>> I have 5-6 Kafka producers, reading various APIs, and writing their
>>>>>> raw data into Kafka.
>>>>>>
>>>>>> I need to:
>>>>>>
>>>>>> - Do ETL on the data, and standardize it.
>>>>>>
>>>>>> - Store the standardized data somewhere (HBase / Cassandra / Raw HDFS
>>>>>> / ElasticSearch / Postgres)
>>>>>>
>>>>>> - Query this data to generate reports / analytics (There will be a
>>>>>> web UI which will be the front-end to the data, and will show the reports)
>>>>>>
>>>>>> Java is being used as the backend language for everything (backend of
>>>>>> the web UI, as well as the ETL layer)
>>>>>>
>>>>>> I'm considering:
>>>>>>
>>>>>> - Using raw Kafka consumers, or Spark Streaming, as the ETL layer
>>>>>> (receive raw data from Kafka, standardize & store it)
>>>>>>
>>>>>> - Using Cassandra, HBase, or raw HDFS, for storing the standardized
>>>>>> data, and to allow queries
>>>>>>
>>>>>> - In the backend of the web UI, I could either use Spark to run
>>>>>> queries across the data (mostly filters), or directly run queries against
>>>>>> Cassandra / HBase
>>>>>>
>>>>>> I'd appreciate some thoughts / suggestions on which of these
>>>>>> alternatives I should go with (e.g, using raw Kafka consumers vs Spark for
>>>>>> ETL, which persistent data store to use, and how to query that data store
>>>>>> in the backend of the web UI, for displaying the reports).
>>>>>>
>>>>>>
>>>>>> Thanks.
>>>>>>
>>>>>
>>>>
>>>
>>
>


-- 
Thanks
Deepak
www.bigdatabig.com
www.keosha.net

Re: Architecture recommendations for a tricky use case

Posted by Alonso Isidoro Roman <al...@gmail.com>.
"Using Spark to query the data in the backend of the web UI?"

Dont do that. I would recommend that spark streaming process stores data
into some nosql or sql database and the web ui to query data from that
database.
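
On the web UI side that is then just an ordinary query per report. A
minimal Java/JDBC sketch, assuming Postgres and a hypothetical events
table (userId, start and end come from the report request):

import java.sql.*;

try (Connection conn = DriverManager.getConnection(
         "jdbc:postgresql://localhost:5432/reports", "app", "secret");
     PreparedStatement ps = conn.prepareStatement(
         "SELECT service, ts, value FROM events " +
         "WHERE user_id = ? AND ts BETWEEN ? AND ?")) {
    ps.setLong(1, userId);
    ps.setTimestamp(2, Timestamp.valueOf(start));
    ps.setTimestamp(3, Timestamp.valueOf(end));
    try (ResultSet rs = ps.executeQuery()) {
        while (rs.next()) {
            // map each row into whatever model the front-end renders
        }
    }
}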

Alonso Isidoro Roman
about.me/alonso.isidoro.roman

2016-09-29 16:15 GMT+02:00 Ali Akhtar <al...@gmail.com>:

> The web UI is actually the speed layer, it needs to be able to query the
> data online, and show the results in real-time.
>
> It also needs a custom front-end, so a system like Tableau can't be used,
> it must have a custom backend + front-end.
>
> Thanks for the recommendation of Flume. Do you think this will work:
>
> - Spark Streaming to read data from Kafka
> - Storing the data on HDFS using Flume
> - Using Spark to query the data in the backend of the web UI?
>
>
>
> On Thu, Sep 29, 2016 at 7:08 PM, Mich Talebzadeh <
> mich.talebzadeh@gmail.com> wrote:
>
>> You need a batch layer and a speed layer. Data from Kafka can be stored
>> on HDFS using flume.
>>
>> -  Query this data to generate reports / analytics (There will be a web
>> UI which will be the front-end to the data, and will show the reports)
>>
>> This is basically batch layer and you need something like Tableau or
>> Zeppelin to query data
>>
>> You will also need spark streaming to query data online for speed layer.
>> That data could be stored in some transient fabric like ignite or even
>> druid.
>>
>> HTH
>>
>>
>>
>>
>>
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 29 September 2016 at 15:01, Ali Akhtar <al...@gmail.com> wrote:
>>
>>> It needs to be able to scale to a very large amount of data, yes.
>>>
>>> On Thu, Sep 29, 2016 at 7:00 PM, Deepak Sharma <de...@gmail.com>
>>> wrote:
>>>
>>>> What is the message inflow ?
>>>> If it's really high , definitely spark will be of great use .
>>>>
>>>> Thanks
>>>> Deepak
>>>>
>>>> On Sep 29, 2016 19:24, "Ali Akhtar" <al...@gmail.com> wrote:
>>>>
>>>>> I have a somewhat tricky use case, and I'm looking for ideas.
>>>>>
>>>>> I have 5-6 Kafka producers, reading various APIs, and writing their
>>>>> raw data into Kafka.
>>>>>
>>>>> I need to:
>>>>>
>>>>> - Do ETL on the data, and standardize it.
>>>>>
>>>>> - Store the standardized data somewhere (HBase / Cassandra / Raw HDFS
>>>>> / ElasticSearch / Postgres)
>>>>>
>>>>> - Query this data to generate reports / analytics (There will be a web
>>>>> UI which will be the front-end to the data, and will show the reports)
>>>>>
>>>>> Java is being used as the backend language for everything (backend of
>>>>> the web UI, as well as the ETL layer)
>>>>>
>>>>> I'm considering:
>>>>>
>>>>> - Using raw Kafka consumers, or Spark Streaming, as the ETL layer
>>>>> (receive raw data from Kafka, standardize & store it)
>>>>>
>>>>> - Using Cassandra, HBase, or raw HDFS, for storing the standardized
>>>>> data, and to allow queries
>>>>>
>>>>> - In the backend of the web UI, I could either use Spark to run
>>>>> queries across the data (mostly filters), or directly run queries against
>>>>> Cassandra / HBase
>>>>>
>>>>> I'd appreciate some thoughts / suggestions on which of these
>>>>> alternatives I should go with (e.g, using raw Kafka consumers vs Spark for
>>>>> ETL, which persistent data store to use, and how to query that data store
>>>>> in the backend of the web UI, for displaying the reports).
>>>>>
>>>>>
>>>>> Thanks.
>>>>>
>>>>
>>>
>>
>

RE: Architecture recommendations for a tricky use case

Posted by "Tauzell, Dave" <Da...@surescripts.com>.
Spark Streaming needs to store the output somewhere.  Cassandra is a possible target for that.

-Dave

-----Original Message-----
From: Ali Akhtar [mailto:ali.rac200@gmail.com]
Sent: Thursday, September 29, 2016 9:16 AM
Cc: users@kafka.apache.org; spark users
Subject: Re: Architecture recommendations for a tricky use case

The web UI is actually the speed layer, it needs to be able to query the data online, and show the results in real-time.

It also needs a custom front-end, so a system like Tableau can't be used, it must have a custom backend + front-end.

Thanks for the recommendation of Flume. Do you think this will work:

- Spark Streaming to read data from Kafka
- Storing the data on HDFS using Flume
- Using Spark to query the data in the backend of the web UI?



On Thu, Sep 29, 2016 at 7:08 PM, Mich Talebzadeh <mi...@gmail.com>
wrote:

> You need a batch layer and a speed layer. Data from Kafka can be
> stored on HDFS using flume.
>
> -  Query this data to generate reports / analytics (There will be a
> web UI which will be the front-end to the data, and will show the
> reports)
>
> This is basically batch layer and you need something like Tableau or
> Zeppelin to query data
>
> You will also need spark streaming to query data online for speed layer.
> That data could be stored in some transient fabric like ignite or even
> druid.
>
> HTH
>
>
>
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn *
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCC
> dOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPC
> CdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for
> any loss, damage or destruction of data or any other property which
> may arise from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising
> from such loss, damage or destruction.
>
>
>
> On 29 September 2016 at 15:01, Ali Akhtar <al...@gmail.com> wrote:
>
>> It needs to be able to scale to a very large amount of data, yes.
>>
>> On Thu, Sep 29, 2016 at 7:00 PM, Deepak Sharma
>> <de...@gmail.com>
>> wrote:
>>
>>> What is the message inflow ?
>>> If it's really high , definitely spark will be of great use .
>>>
>>> Thanks
>>> Deepak
>>>
>>> On Sep 29, 2016 19:24, "Ali Akhtar" <al...@gmail.com> wrote:
>>>
>>>> I have a somewhat tricky use case, and I'm looking for ideas.
>>>>
>>>> I have 5-6 Kafka producers, reading various APIs, and writing their
>>>> raw data into Kafka.
>>>>
>>>> I need to:
>>>>
>>>> - Do ETL on the data, and standardize it.
>>>>
>>>> - Store the standardized data somewhere (HBase / Cassandra / Raw
>>>> HDFS / ElasticSearch / Postgres)
>>>>
>>>> - Query this data to generate reports / analytics (There will be a
>>>> web UI which will be the front-end to the data, and will show the
>>>> reports)
>>>>
>>>> Java is being used as the backend language for everything (backend
>>>> of the web UI, as well as the ETL layer)
>>>>
>>>> I'm considering:
>>>>
>>>> - Using raw Kafka consumers, or Spark Streaming, as the ETL layer
>>>> (receive raw data from Kafka, standardize & store it)
>>>>
>>>> - Using Cassandra, HBase, or raw HDFS, for storing the standardized
>>>> data, and to allow queries
>>>>
>>>> - In the backend of the web UI, I could either use Spark to run
>>>> queries across the data (mostly filters), or directly run queries
>>>> against Cassandra / HBase
>>>>
>>>> I'd appreciate some thoughts / suggestions on which of these
>>>> alternatives I should go with (e.g, using raw Kafka consumers vs
>>>> Spark for ETL, which persistent data store to use, and how to query
>>>> that data store in the backend of the web UI, for displaying the reports).
>>>>
>>>>
>>>> Thanks.
>>>>
>>>
>>
>
This e-mail and any files transmitted with it are confidential, may contain sensitive information, and are intended solely for the use of the individual or entity to whom they are addressed. If you have received this e-mail in error, please notify the sender by reply e-mail immediately and destroy all copies of the e-mail and any attachments.

Re: Architecture recommendations for a tricky use case

Posted by Andrew Stevenson <an...@datamountaineer.com>.
- Kafka Connect for ingress (“E”)

- Kafka Streams, Flink, or Spark Streaming for “T”: read from and write
back to Kafka. Keep the sources of data for your processing engine small
(separation of concerns: why should Spark care about where your upstream
sources are, for example?)

- Kafka Connect for egress (“L”) to a datastore of your choice: Kudu, HDFS,
Cassandra, ReThinkDB, HBase, Postgres, etc.

- The REST Proxy from Confluent, or
https://github.com/datamountaineer/stream-reactor/tree/master/kafka-socket-streamer,
for UIs on real-time streams



https://github.com/datamountaineer/stream-reactor
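
To illustrate, the egress “L” step is then configuration rather than code.
A sink config might look roughly like this (shown for Confluent's HDFS sink
connector; the connector name, topic and URL are made up), run with the
connect-standalone or connect-distributed scripts that ship with Kafka:

name=hdfs-sink
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
tasks.max=2
topics=standardized-events
hdfs.url=hdfs://namenode:8020
flush.size=1000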





On 29/09/16 17:11, "Cody Koeninger" <co...@koeninger.org> wrote:

How are you going to handle etl failures?  Do you care about lost /
duplicated data?  Are your writes idempotent?

Absent any other information about the problem, I'd stay away from
cassandra/flume/hdfs/hbase/whatever, and use a spark direct stream
feeding postgres.

On Thu, Sep 29, 2016 at 10:04 AM, Ali Akhtar <al...@gmail.com> wrote:
> Is there an advantage to that vs directly consuming from Kafka? Nothing is
> being done to the data except some light ETL and then storing it in
> Cassandra
>
> On Thu, Sep 29, 2016 at 7:58 PM, Deepak Sharma <de...@gmail.com>
> wrote:
>>
>> Its better you use spark's direct stream to ingest from kafka.
>>
>> On Thu, Sep 29, 2016 at 8:24 PM, Ali Akhtar <al...@gmail.com> wrote:
>>>
>>> I don't think I need a different speed storage and batch storage. Just
>>> taking in raw data from Kafka, standardizing, and storing it somewhere where
>>> the web UI can query it, seems like it will be enough.
>>>
>>> I'm thinking about:
>>>
>>> - Reading data from Kafka via Spark Streaming
>>> - Standardizing, then storing it in Cassandra
>>> - Querying Cassandra from the web ui
>>>
>>> That seems like it will work. My question now is whether to use Spark
>>> Streaming to read Kafka, or use Kafka consumers directly.
>>>
>>>
>>> On Thu, Sep 29, 2016 at 7:41 PM, Mich Talebzadeh
>>> <mi...@gmail.com> wrote:
>>>>
>>>> - Spark Streaming to read data from Kafka
>>>> - Storing the data on HDFS using Flume
>>>>
>>>> You don't need Spark streaming to read data from Kafka and store on
>>>> HDFS. It is a waste of resources.
>>>>
>>>> Couple Flume to use Kafka as source and HDFS as sink directly
>>>>
>>>> KafkaAgent.sources = kafka-sources
>>>> KafkaAgent.sinks.hdfs-sinks.type = hdfs
>>>>
>>>> That will be for your batch layer. To analyse you can directly read from
>>>> hdfs files with Spark or simply store data in a database of your choice via
>>>> cron or something. Do not mix your batch layer with speed layer.
>>>>
>>>> Your speed layer will ingest the same data directly from Kafka into
>>>> spark streaming and that will be  online or near real time (defined by your
>>>> window).
>>>>
>>>> Then you have a a serving layer to present data from both speed  (the
>>>> one from SS) and batch layer.
>>>>
>>>> HTH
>>>>
>>>>
>>>>
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>>
>>>>
>>>> LinkedIn
>>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>
>>>>
>>>>
>>>> http://talebzadehmich.wordpress.com
>>>>
>>>>
>>>> Disclaimer: Use it at your own risk. Any and all responsibility for any
>>>> loss, damage or destruction of data or any other property which may arise
>>>> from relying on this email's technical content is explicitly disclaimed. The
>>>> author will in no case be liable for any monetary damages arising from such
>>>> loss, damage or destruction.
>>>>
>>>>
>>>>
>>>>
>>>> On 29 September 2016 at 15:15, Ali Akhtar <al...@gmail.com> wrote:
>>>>>
>>>>> The web UI is actually the speed layer, it needs to be able to query
>>>>> the data online, and show the results in real-time.
>>>>>
>>>>> It also needs a custom front-end, so a system like Tableau can't be
>>>>> used, it must have a custom backend + front-end.
>>>>>
>>>>> Thanks for the recommendation of Flume. Do you think this will work:
>>>>>
>>>>> - Spark Streaming to read data from Kafka
>>>>> - Storing the data on HDFS using Flume
>>>>> - Using Spark to query the data in the backend of the web UI?
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Sep 29, 2016 at 7:08 PM, Mich Talebzadeh
>>>>> <mi...@gmail.com> wrote:
>>>>>>
>>>>>> You need a batch layer and a speed layer. Data from Kafka can be
>>>>>> stored on HDFS using flume.
>>>>>>
>>>>>> -  Query this data to generate reports / analytics (There will be a
>>>>>> web UI which will be the front-end to the data, and will show the reports)
>>>>>>
>>>>>> This is basically batch layer and you need something like Tableau or
>>>>>> Zeppelin to query data
>>>>>>
>>>>>> You will also need spark streaming to query data online for speed
>>>>>> layer. That data could be stored in some transient fabric like ignite or
>>>>>> even druid.
>>>>>>
>>>>>> HTH
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> Dr Mich Talebzadeh
>>>>>>
>>>>>>
>>>>>>
>>>>>> LinkedIn
>>>>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>>
>>>>>>
>>>>>>
>>>>>> http://talebzadehmich.wordpress.com
>>>>>>
>>>>>>
>>>>>> Disclaimer: Use it at your own risk. Any and all responsibility for
>>>>>> any loss, damage or destruction of data or any other property which may
>>>>>> arise from relying on this email's technical content is explicitly
>>>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>>>> arising from such loss, damage or destruction.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 29 September 2016 at 15:01, Ali Akhtar <al...@gmail.com>
>>>>>> wrote:
>>>>>>>
>>>>>>> It needs to be able to scale to a very large amount of data, yes.
>>>>>>>
>>>>>>> On Thu, Sep 29, 2016 at 7:00 PM, Deepak Sharma
>>>>>>> <de...@gmail.com> wrote:
>>>>>>>>
>>>>>>>> What is the message inflow ?
>>>>>>>> If it's really high , definitely spark will be of great use .
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Deepak
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sep 29, 2016 19:24, "Ali Akhtar" <al...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> I have a somewhat tricky use case, and I'm looking for ideas.
>>>>>>>>>
>>>>>>>>> I have 5-6 Kafka producers, reading various APIs, and writing their
>>>>>>>>> raw data into Kafka.
>>>>>>>>>
>>>>>>>>> I need to:
>>>>>>>>>
>>>>>>>>> - Do ETL on the data, and standardize it.
>>>>>>>>>
>>>>>>>>> - Store the standardized data somewhere (HBase / Cassandra / Raw
>>>>>>>>> HDFS / ElasticSearch / Postgres)
>>>>>>>>>
>>>>>>>>> - Query this data to generate reports / analytics (There will be a
>>>>>>>>> web UI which will be the front-end to the data, and will show the reports)
>>>>>>>>>
>>>>>>>>> Java is being used as the backend language for everything (backend
>>>>>>>>> of the web UI, as well as the ETL layer)
>>>>>>>>>
>>>>>>>>> I'm considering:
>>>>>>>>>
>>>>>>>>> - Using raw Kafka consumers, or Spark Streaming, as the ETL layer
>>>>>>>>> (receive raw data from Kafka, standardize & store it)
>>>>>>>>>
>>>>>>>>> - Using Cassandra, HBase, or raw HDFS, for storing the standardized
>>>>>>>>> data, and to allow queries
>>>>>>>>>
>>>>>>>>> - In the backend of the web UI, I could either use Spark to run
>>>>>>>>> queries across the data (mostly filters), or directly run queries against
>>>>>>>>> Cassandra / HBase
>>>>>>>>>
>>>>>>>>> I'd appreciate some thoughts / suggestions on which of these
>>>>>>>>> alternatives I should go with (e.g, using raw Kafka consumers vs Spark for
>>>>>>>>> ETL, which persistent data store to use, and how to query that data store in
>>>>>>>>> the backend of the web UI, for displaying the reports).
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks.
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>>
>>
>> --
>> Thanks
>> Deepak
>> www.bigdatabig.com
>> www.keosha.net
>
>

Re: Architecture recommendations for a tricky use case

Posted by Gwen Shapira <gw...@confluent.io>.
The original post made no mention of throughput or latency or
correctness requirements, so pretty much any data store will fit the
bill... discussions of "what is better" degrade fast when there are no
concrete standards to choose between.

Who cares about anything when we don't know what we need? :)

On Thu, Sep 29, 2016 at 9:23 AM, Cody Koeninger <co...@koeninger.org> wrote:
>> I still don't understand why writing to a transactional database with locking and concurrency (read and writes) through JDBC will be fast for this sort of data ingestion.
>
> Who cares about fast if your data is wrong?  And it's still plenty fast enough
>
> https://youtu.be/NVl9_6J1G60?list=WL&t=1819
>
> https://www.citusdata.com/blog/2016/09/22/announcing-citus-mx/
>
>
>
> On Thu, Sep 29, 2016 at 11:16 AM, Mich Talebzadeh
> <mi...@gmail.com> wrote:
>> The way I see this, there are three things involved:
>>
>> 1. Data ingestion from source to Kafka
>> 2. Data conversion and storage (ETL/ELT)
>> 3. Presentation
>>
>> Item 2 is the one that needs to be designed correctly. I presume raw data
>> has to conform to some form of MDM that requires schema mapping etc before
>> putting into persistent storage (DB, HDFS etc). Which one to choose depends
>> on your volume of ingestion and your cluster size and complexity of data
>> conversion. Then your users will use some form of UI (Tableau, QlikView,
>> Zeppelin, direct SQL) to query data one way or other. Your users can
>> directly use a UI like Tableau that offers built-in analytics on SQL (Spark
>> SQL offers the same). Your mileage varies according to your needs.
>>
>> I still don't understand why writing to a transactional database with
>> locking and concurrency (read and writes) through JDBC will be fast for this
>> sort of data ingestion. If you ask me if I wanted to choose an RDBMS to
>> write to as my sink,I would use Oracle which offers the best locking and
>> concurrency among RDBMs and also handles key value pairs as well (assuming
>> that is what you want). In addition, it can be used as a Data Warehouse as
>> well.
>>
>> HTH
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn
>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> Disclaimer: Use it at your own risk. Any and all responsibility for any
>> loss, damage or destruction of data or any other property which may arise
>> from relying on this email's technical content is explicitly disclaimed. The
>> author will in no case be liable for any monetary damages arising from such
>> loss, damage or destruction.
>>
>>
>>
>>
>> On 29 September 2016 at 16:49, Ali Akhtar <al...@gmail.com> wrote:
>>>
>>> The business use case is to read a user's data from a variety of different
>>> services through their API, and then allowing the user to query that data,
>>> on a per service basis, as well as an aggregate across all services.
>>>
>>> The way I'm considering doing it, is to do some basic ETL (drop all the
>>> unnecessary fields, rename some fields into something more manageable, etc)
>>> and then store the data in Cassandra / Postgres.
>>>
>>> Then, when the user wants to view a particular report, query the
>>> respective table in Cassandra / Postgres. (select .. from data where user =
>>> ? and date between <start> and <end> and some_field = ?)
>>>
>>> How will Spark Streaming help w/ aggregation? Couldn't the data be queried
>>> from Cassandra / Postgres via the Kafka consumer and aggregated that way?
>>>
>>> On Thu, Sep 29, 2016 at 8:43 PM, Cody Koeninger <co...@koeninger.org>
>>> wrote:
>>>>
>>>> No, direct stream in and of itself won't ensure an end-to-end
>>>> guarantee, because it doesn't know anything about your output actions.
>>>>
>>>> You still need to do some work.  The point is having easy access to
>>>> offsets for batches on a per-partition basis makes it easier to do
>>>> that work, especially in conjunction with aggregation.
>>>>
>>>> On Thu, Sep 29, 2016 at 10:40 AM, Deepak Sharma <de...@gmail.com>
>>>> wrote:
>>>> > If you use spark direct streams , it ensure end to end guarantee for
>>>> > messages.
>>>> >
>>>> >
>>>> > On Thu, Sep 29, 2016 at 9:05 PM, Ali Akhtar <al...@gmail.com>
>>>> > wrote:
>>>> >>
>>>> >> My concern with Postgres / Cassandra is only scalability. I will look
>>>> >> further into Postgres horizontal scaling, thanks.
>>>> >>
>>>> >> Writes could be idempotent if done as upserts, otherwise updates will
>>>> >> be
>>>> >> idempotent but not inserts.
>>>> >>
>>>> >> Data should not be lost. The system should be as fault tolerant as
>>>> >> possible.
>>>> >>
>>>> >> What's the advantage of using Spark for reading Kafka instead of
>>>> >> direct
>>>> >> Kafka consumers?
>>>> >>
>>>> >> On Thu, Sep 29, 2016 at 8:28 PM, Cody Koeninger <co...@koeninger.org>
>>>> >> wrote:
>>>> >>>
>>>> >>> I wouldn't give up the flexibility and maturity of a relational
>>>> >>> database, unless you have a very specific use case.  I'm not trashing
>>>> >>> cassandra, I've used cassandra, but if all I know is that you're
>>>> >>> doing
>>>> >>> analytics, I wouldn't want to give up the ability to easily do ad-hoc
>>>> >>> aggregations without a lot of forethought.  If you're worried about
>>>> >>> scaling, there are several options for horizontally scaling Postgres
>>>> >>> in particular.  One of the current best from what I've worked with is
>>>> >>> Citus.
>>>> >>>
>>>> >>> On Thu, Sep 29, 2016 at 10:15 AM, Deepak Sharma
>>>> >>> <de...@gmail.com>
>>>> >>> wrote:
>>>> >>> > Hi Cody
>>>> >>> > Spark direct stream is just fine for this use case.
>>>> >>> > But why postgres and not cassandra?
>>>> >>> > Is there anything specific here that i may not be aware?
>>>> >>> >
>>>> >>> > Thanks
>>>> >>> > Deepak
>>>> >>> >
>>>> >>> > On Thu, Sep 29, 2016 at 8:41 PM, Cody Koeninger
>>>> >>> > <co...@koeninger.org>
>>>> >>> > wrote:
>>>> >>> >>
>>>> >>> >> How are you going to handle etl failures?  Do you care about lost
>>>> >>> >> /
>>>> >>> >> duplicated data?  Are your writes idempotent?
>>>> >>> >>
>>>> >>> >> Absent any other information about the problem, I'd stay away from
>>>> >>> >> cassandra/flume/hdfs/hbase/whatever, and use a spark direct stream
>>>> >>> >> feeding postgres.
>>>> >>> >>
>>>> >>> >> On Thu, Sep 29, 2016 at 10:04 AM, Ali Akhtar
>>>> >>> >> <al...@gmail.com>
>>>> >>> >> wrote:
>>>> >>> >> > Is there an advantage to that vs directly consuming from Kafka?
>>>> >>> >> > Nothing
>>>> >>> >> > is
>>>> >>> >> > being done to the data except some light ETL and then storing it
>>>> >>> >> > in
>>>> >>> >> > Cassandra
>>>> >>> >> >
>>>> >>> >> > On Thu, Sep 29, 2016 at 7:58 PM, Deepak Sharma
>>>> >>> >> > <de...@gmail.com>
>>>> >>> >> > wrote:
>>>> >>> >> >>
>>>> >>> >> >> Its better you use spark's direct stream to ingest from kafka.
>>>> >>> >> >>
>>>> >>> >> >> On Thu, Sep 29, 2016 at 8:24 PM, Ali Akhtar
>>>> >>> >> >> <al...@gmail.com>
>>>> >>> >> >> wrote:
>>>> >>> >> >>>
>>>> >>> >> >>> I don't think I need a different speed storage and batch
>>>> >>> >> >>> storage.
>>>> >>> >> >>> Just
>>>> >>> >> >>> taking in raw data from Kafka, standardizing, and storing it
>>>> >>> >> >>> somewhere
>>>> >>> >> >>> where
>>>> >>> >> >>> the web UI can query it, seems like it will be enough.
>>>> >>> >> >>>
>>>> >>> >> >>> I'm thinking about:
>>>> >>> >> >>>
>>>> >>> >> >>> - Reading data from Kafka via Spark Streaming
>>>> >>> >> >>> - Standardizing, then storing it in Cassandra
>>>> >>> >> >>> - Querying Cassandra from the web ui
>>>> >>> >> >>>
>>>> >>> >> >>> That seems like it will work. My question now is whether to
>>>> >>> >> >>> use
>>>> >>> >> >>> Spark
>>>> >>> >> >>> Streaming to read Kafka, or use Kafka consumers directly.
>>>> >>> >> >>>
>>>> >>> >> >>>
>>>> >>> >> >>> On Thu, Sep 29, 2016 at 7:41 PM, Mich Talebzadeh
>>>> >>> >> >>> <mi...@gmail.com> wrote:
>>>> >>> >> >>>>
>>>> >>> >> >>>> - Spark Streaming to read data from Kafka
>>>> >>> >> >>>> - Storing the data on HDFS using Flume
>>>> >>> >> >>>>
>>>> >>> >> >>>> You don't need Spark streaming to read data from Kafka and
>>>> >>> >> >>>> store
>>>> >>> >> >>>> on
>>>> >>> >> >>>> HDFS. It is a waste of resources.
>>>> >>> >> >>>>
>>>> >>> >> >>>> Couple Flume to use Kafka as source and HDFS as sink directly
>>>> >>> >> >>>>
>>>> >>> >> >>>> KafkaAgent.sources = kafka-sources
>>>> >>> >> >>>> KafkaAgent.sinks.hdfs-sinks.type = hdfs
>>>> >>> >> >>>>
>>>> >>> >> >>>> That will be for your batch layer. To analyse you can
>>>> >>> >> >>>> directly
>>>> >>> >> >>>> read
>>>> >>> >> >>>> from
>>>> >>> >> >>>> hdfs files with Spark or simply store data in a database of
>>>> >>> >> >>>> your
>>>> >>> >> >>>> choice via
>>>> >>> >> >>>> cron or something. Do not mix your batch layer with speed
>>>> >>> >> >>>> layer.
>>>> >>> >> >>>>
>>>> >>> >> >>>> Your speed layer will ingest the same data directly from
>>>> >>> >> >>>> Kafka
>>>> >>> >> >>>> into
>>>> >>> >> >>>> spark streaming and that will be  online or near real time
>>>> >>> >> >>>> (defined
>>>> >>> >> >>>> by your
>>>> >>> >> >>>> window).
>>>> >>> >> >>>>
>>>> >>> >> >>>> Then you have a a serving layer to present data from both
>>>> >>> >> >>>> speed
>>>> >>> >> >>>> (the
>>>> >>> >> >>>> one from SS) and batch layer.
>>>> >>> >> >>>>
>>>> >>> >> >>>> HTH
>>>> >>> >> >>>>
>>>> >>> >> >>>>
>>>> >>> >> >>>>
>>>> >>> >> >>>>
>>>> >>> >> >>>> Dr Mich Talebzadeh
>>>> >>> >> >>>>
>>>> >>> >> >>>>
>>>> >>> >> >>>>
>>>> >>> >> >>>> LinkedIn
>>>> >>> >> >>>>
>>>> >>> >> >>>>
>>>> >>> >> >>>>
>>>> >>> >> >>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>> >>> >> >>>>
>>>> >>> >> >>>>
>>>> >>> >> >>>>
>>>> >>> >> >>>> http://talebzadehmich.wordpress.com
>>>> >>> >> >>>>
>>>> >>> >> >>>>
>>>> >>> >> >>>> Disclaimer: Use it at your own risk. Any and all
>>>> >>> >> >>>> responsibility
>>>> >>> >> >>>> for
>>>> >>> >> >>>> any
>>>> >>> >> >>>> loss, damage or destruction of data or any other property
>>>> >>> >> >>>> which
>>>> >>> >> >>>> may
>>>> >>> >> >>>> arise
>>>> >>> >> >>>> from relying on this email's technical content is explicitly
>>>> >>> >> >>>> disclaimed. The
>>>> >>> >> >>>> author will in no case be liable for any monetary damages
>>>> >>> >> >>>> arising
>>>> >>> >> >>>> from such
>>>> >>> >> >>>> loss, damage or destruction.
>>>> >>> >> >>>>
>>>> >>> >> >>>>
>>>> >>> >> >>>>
>>>> >>> >> >>>>
>>>> >>> >> >>>> On 29 September 2016 at 15:15, Ali Akhtar
>>>> >>> >> >>>> <al...@gmail.com>
>>>> >>> >> >>>> wrote:
>>>> >>> >> >>>>>
>>>> >>> >> >>>>> The web UI is actually the speed layer, it needs to be able
>>>> >>> >> >>>>> to
>>>> >>> >> >>>>> query
>>>> >>> >> >>>>> the data online, and show the results in real-time.
>>>> >>> >> >>>>>
>>>> >>> >> >>>>> It also needs a custom front-end, so a system like Tableau
>>>> >>> >> >>>>> can't
>>>> >>> >> >>>>> be
>>>> >>> >> >>>>> used, it must have a custom backend + front-end.
>>>> >>> >> >>>>>
>>>> >>> >> >>>>> Thanks for the recommendation of Flume. Do you think this
>>>> >>> >> >>>>> will
>>>> >>> >> >>>>> work:
>>>> >>> >> >>>>>
>>>> >>> >> >>>>> - Spark Streaming to read data from Kafka
>>>> >>> >> >>>>> - Storing the data on HDFS using Flume
>>>> >>> >> >>>>> - Using Spark to query the data in the backend of the web
>>>> >>> >> >>>>> UI?
>>>> >>> >> >>>>>
>>>> >>> >> >>>>>
>>>> >>> >> >>>>>
>>>> >>> >> >>>>> On Thu, Sep 29, 2016 at 7:08 PM, Mich Talebzadeh
>>>> >>> >> >>>>> <mi...@gmail.com> wrote:
>>>> >>> >> >>>>>>
>>>> >>> >> >>>>>> You need a batch layer and a speed layer. Data from Kafka
>>>> >>> >> >>>>>> can
>>>> >>> >> >>>>>> be
>>>> >>> >> >>>>>> stored on HDFS using flume.
>>>> >>> >> >>>>>>
>>>> >>> >> >>>>>> -  Query this data to generate reports / analytics (There
>>>> >>> >> >>>>>> will
>>>> >>> >> >>>>>> be a
>>>> >>> >> >>>>>> web UI which will be the front-end to the data, and will
>>>> >>> >> >>>>>> show
>>>> >>> >> >>>>>> the
>>>> >>> >> >>>>>> reports)
>>>> >>> >> >>>>>>
>>>> >>> >> >>>>>> This is basically batch layer and you need something like
>>>> >>> >> >>>>>> Tableau
>>>> >>> >> >>>>>> or
>>>> >>> >> >>>>>> Zeppelin to query data
>>>> >>> >> >>>>>>
>>>> >>> >> >>>>>> You will also need spark streaming to query data online for
>>>> >>> >> >>>>>> speed
>>>> >>> >> >>>>>> layer. That data could be stored in some transient fabric
>>>> >>> >> >>>>>> like
>>>> >>> >> >>>>>> ignite or
>>>> >>> >> >>>>>> even druid.
>>>> >>> >> >>>>>>
>>>> >>> >> >>>>>> HTH
>>>> >>> >> >>>>>>
>>>> >>> >> >>>>>>
>>>> >>> >> >>>>>>
>>>> >>> >> >>>>>>
>>>> >>> >> >>>>>>
>>>> >>> >> >>>>>>
>>>> >>> >> >>>>>>
>>>> >>> >> >>>>>>
>>>> >>> >> >>>>>> Dr Mich Talebzadeh
>>>> >>> >> >>>>>>
>>>> >>> >> >>>>>>
>>>> >>> >> >>>>>>
>>>> >>> >> >>>>>> LinkedIn
>>>> >>> >> >>>>>>
>>>> >>> >> >>>>>>
>>>> >>> >> >>>>>>
>>>> >>> >> >>>>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>> >>> >> >>>>>>
>>>> >>> >> >>>>>>
>>>> >>> >> >>>>>>
>>>> >>> >> >>>>>> http://talebzadehmich.wordpress.com
>>>> >>> >> >>>>>>
>>>> >>> >> >>>>>>
>>>> >>> >> >>>>>> Disclaimer: Use it at your own risk. Any and all
>>>> >>> >> >>>>>> responsibility
>>>> >>> >> >>>>>> for
>>>> >>> >> >>>>>> any loss, damage or destruction of data or any other
>>>> >>> >> >>>>>> property
>>>> >>> >> >>>>>> which
>>>> >>> >> >>>>>> may
>>>> >>> >> >>>>>> arise from relying on this email's technical content is
>>>> >>> >> >>>>>> explicitly
>>>> >>> >> >>>>>> disclaimed. The author will in no case be liable for any
>>>> >>> >> >>>>>> monetary
>>>> >>> >> >>>>>> damages
>>>> >>> >> >>>>>> arising from such loss, damage or destruction.
>>>> >>> >> >>>>>>
>>>> >>> >> >>>>>>
>>>> >>> >> >>>>>>
>>>> >>> >> >>>>>>
>>>> >>> >> >>>>>> On 29 September 2016 at 15:01, Ali Akhtar
>>>> >>> >> >>>>>> <al...@gmail.com>
>>>> >>> >> >>>>>> wrote:
>>>> >>> >> >>>>>>>
>>>> >>> >> >>>>>>> It needs to be able to scale to a very large amount of
>>>> >>> >> >>>>>>> data,
>>>> >>> >> >>>>>>> yes.
>>>> >>> >> >>>>>>>
>>>> >>> >> >>>>>>> On Thu, Sep 29, 2016 at 7:00 PM, Deepak Sharma
>>>> >>> >> >>>>>>> <de...@gmail.com> wrote:
>>>> >>> >> >>>>>>>>
>>>> >>> >> >>>>>>>> What is the message inflow ?
>>>> >>> >> >>>>>>>> If it's really high , definitely spark will be of great
>>>> >>> >> >>>>>>>> use .
>>>> >>> >> >>>>>>>>
>>>> >>> >> >>>>>>>> Thanks
>>>> >>> >> >>>>>>>> Deepak
>>>> >>> >> >>>>>>>>
>>>> >>> >> >>>>>>>>
>>>> >>> >> >>>>>>>> On Sep 29, 2016 19:24, "Ali Akhtar"
>>>> >>> >> >>>>>>>> <al...@gmail.com>
>>>> >>> >> >>>>>>>> wrote:
>>>> >>> >> >>>>>>>>>
>>>> >>> >> >>>>>>>>> I have a somewhat tricky use case, and I'm looking for
>>>> >>> >> >>>>>>>>> ideas.
>>>> >>> >> >>>>>>>>>
>>>> >>> >> >>>>>>>>> I have 5-6 Kafka producers, reading various APIs, and
>>>> >>> >> >>>>>>>>> writing
>>>> >>> >> >>>>>>>>> their
>>>> >>> >> >>>>>>>>> raw data into Kafka.
>>>> >>> >> >>>>>>>>>
>>>> >>> >> >>>>>>>>> I need to:
>>>> >>> >> >>>>>>>>>
>>>> >>> >> >>>>>>>>> - Do ETL on the data, and standardize it.
>>>> >>> >> >>>>>>>>>
>>>> >>> >> >>>>>>>>> - Store the standardized data somewhere (HBase /
>>>> >>> >> >>>>>>>>> Cassandra /
>>>> >>> >> >>>>>>>>> Raw
>>>> >>> >> >>>>>>>>> HDFS / ElasticSearch / Postgres)
>>>> >>> >> >>>>>>>>>
>>>> >>> >> >>>>>>>>> - Query this data to generate reports / analytics (There
>>>> >>> >> >>>>>>>>> will be
>>>> >>> >> >>>>>>>>> a
>>>> >>> >> >>>>>>>>> web UI which will be the front-end to the data, and will
>>>> >>> >> >>>>>>>>> show
>>>> >>> >> >>>>>>>>> the reports)
>>>> >>> >> >>>>>>>>>
>>>> >>> >> >>>>>>>>> Java is being used as the backend language for
>>>> >>> >> >>>>>>>>> everything
>>>> >>> >> >>>>>>>>> (backend
>>>> >>> >> >>>>>>>>> of the web UI, as well as the ETL layer)
>>>> >>> >> >>>>>>>>>
>>>> >>> >> >>>>>>>>> I'm considering:
>>>> >>> >> >>>>>>>>>
>>>> >>> >> >>>>>>>>> - Using raw Kafka consumers, or Spark Streaming, as the
>>>> >>> >> >>>>>>>>> ETL
>>>> >>> >> >>>>>>>>> layer
>>>> >>> >> >>>>>>>>> (receive raw data from Kafka, standardize & store it)
>>>> >>> >> >>>>>>>>>
>>>> >>> >> >>>>>>>>> - Using Cassandra, HBase, or raw HDFS, for storing the
>>>> >>> >> >>>>>>>>> standardized
>>>> >>> >> >>>>>>>>> data, and to allow queries
>>>> >>> >> >>>>>>>>>
>>>> >>> >> >>>>>>>>> - In the backend of the web UI, I could either use Spark
>>>> >>> >> >>>>>>>>> to
>>>> >>> >> >>>>>>>>> run
>>>> >>> >> >>>>>>>>> queries across the data (mostly filters), or directly
>>>> >>> >> >>>>>>>>> run
>>>> >>> >> >>>>>>>>> queries against
>>>> >>> >> >>>>>>>>> Cassandra / HBase
>>>> >>> >> >>>>>>>>>
>>>> >>> >> >>>>>>>>> I'd appreciate some thoughts / suggestions on which of
>>>> >>> >> >>>>>>>>> these
>>>> >>> >> >>>>>>>>> alternatives I should go with (e.g, using raw Kafka
>>>> >>> >> >>>>>>>>> consumers vs
>>>> >>> >> >>>>>>>>> Spark for
>>>> >>> >> >>>>>>>>> ETL, which persistent data store to use, and how to
>>>> >>> >> >>>>>>>>> query
>>>> >>> >> >>>>>>>>> that
>>>> >>> >> >>>>>>>>> data store in
>>>> >>> >> >>>>>>>>> the backend of the web UI, for displaying the reports).
>>>> >>> >> >>>>>>>>>
>>>> >>> >> >>>>>>>>>
>>>> >>> >> >>>>>>>>> Thanks.
>>>> >>> >> >>>>>>>
>>>> >>> >> >>>>>>>
>>>> >>> >> >>>>>>
>>>> >>> >> >>>>>
>>>> >>> >> >>>>
>>>> >>> >> >>>
>>>> >>> >> >>
>>>> >>> >> >>
>>>> >>> >> >>
>>>> >>> >> >> --
>>>> >>> >> >> Thanks
>>>> >>> >> >> Deepak
>>>> >>> >> >> www.bigdatabig.com
>>>> >>> >> >> www.keosha.net
>>>> >>> >> >
>>>> >>> >> >
>>>> >>> >>
>>>> >>> >>
>>>> >>> >> ---------------------------------------------------------------------
>>>> >>> >> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>>>> >>> >>
>>>> >>> >
>>>> >>> >
>>>> >>> >
>>>> >>> > --
>>>> >>> > Thanks
>>>> >>> > Deepak
>>>> >>> > www.bigdatabig.com
>>>> >>> > www.keosha.net
>>>> >>
>>>> >>
>>>> >
>>>> >
>>>> >
>>>> > --
>>>> > Thanks
>>>> > Deepak
>>>> > www.bigdatabig.com
>>>> > www.keosha.net
>>>
>>>
>>



-- 
Gwen Shapira
Product Manager | Confluent
650.450.2760 | @gwenshap
Follow us: Twitter | blog

Re: Architecture recommendations for a tricky use case

Posted by Cody Koeninger <co...@koeninger.org>.
> I still don't understand why writing to a transactional database with locking and concurrency (read and writes) through JDBC will be fast for this sort of data ingestion.

Who cares about fast if your data is wrong?  And it's still plenty fast enough

https://youtu.be/NVl9_6J1G60?list=WL&t=1819

https://www.citusdata.com/blog/2016/09/22/announcing-citus-mx/
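
And idempotent can be as simple as an upsert. A sketch in Java/JDBC using
Postgres 9.5+ ON CONFLICT, with a hypothetical events table keyed on
(user_id, event_id); conn, the java.sql imports, and the bind values come
from the surrounding code:

String upsert =
    "INSERT INTO events (user_id, event_id, payload) VALUES (?, ?, ?) " +
    "ON CONFLICT (user_id, event_id) DO UPDATE SET payload = EXCLUDED.payload";
try (PreparedStatement ps = conn.prepareStatement(upsert)) {
    ps.setLong(1, userId);
    ps.setString(2, eventId);  // e.g. topic-partition-offset, so replays rewrite the same row
    ps.setString(3, payload);
    ps.executeUpdate();
}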



On Thu, Sep 29, 2016 at 11:16 AM, Mich Talebzadeh
<mi...@gmail.com> wrote:
> The way I see this, there are three things involved:
>
> 1. Data ingestion from source to Kafka
> 2. Data conversion and storage (ETL/ELT)
> 3. Presentation
>
> Item 2 is the one that needs to be designed correctly. I presume raw data
> has to conform to some form of MDM that requires schema mapping etc before
> putting into persistent storage (DB, HDFS etc). Which one to choose depends
> on your volume of ingestion and your cluster size and complexity of data
> conversion. Then your users will use some form of UI (Tableau, QlikView,
> Zeppelin, direct SQL) to query data one way or other. Your users can
> directly use a UI like Tableau that offers built-in analytics on SQL (Spark
> SQL offers the same). Your mileage varies according to your needs.
>
> I still don't understand why writing to a transactional database with
> locking and concurrency (read and writes) through JDBC will be fast for this
> sort of data ingestion. If you ask me if I wanted to choose an RDBMS to
> write to as my sink,I would use Oracle which offers the best locking and
> concurrency among RDBMs and also handles key value pairs as well (assuming
> that is what you want). In addition, it can be used as a Data Warehouse as
> well.
>
> HTH
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> Disclaimer: Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed. The
> author will in no case be liable for any monetary damages arising from such
> loss, damage or destruction.
>
>
>
>
> On 29 September 2016 at 16:49, Ali Akhtar <al...@gmail.com> wrote:
>>
>> The business use case is to read a user's data from a variety of different
>> services through their API, and then allowing the user to query that data,
>> on a per service basis, as well as an aggregate across all services.
>>
>> The way I'm considering doing it, is to do some basic ETL (drop all the
>> unnecessary fields, rename some fields into something more manageable, etc)
>> and then store the data in Cassandra / Postgres.
>>
>> Then, when the user wants to view a particular report, query the
>> respective table in Cassandra / Postgres. (select .. from data where user =
>> ? and date between <start> and <end> and some_field = ?)
>>
>> How will Spark Streaming help w/ aggregation? Couldn't the data be queried
>> from Cassandra / Postgres via the Kafka consumer and aggregated that way?
>>
>> On Thu, Sep 29, 2016 at 8:43 PM, Cody Koeninger <co...@koeninger.org>
>> wrote:
>>>
>>> No, direct stream in and of itself won't ensure an end-to-end
>>> guarantee, because it doesn't know anything about your output actions.
>>>
>>> You still need to do some work.  The point is having easy access to
>>> offsets for batches on a per-partition basis makes it easier to do
>>> that work, especially in conjunction with aggregation.
>>>
>>> On Thu, Sep 29, 2016 at 10:40 AM, Deepak Sharma <de...@gmail.com>
>>> wrote:
>>> > If you use spark direct streams , it ensure end to end guarantee for
>>> > messages.
>>> >
>>> >
>>> > On Thu, Sep 29, 2016 at 9:05 PM, Ali Akhtar <al...@gmail.com>
>>> > wrote:
>>> >>
>>> >> My concern with Postgres / Cassandra is only scalability. I will look
>>> >> further into Postgres horizontal scaling, thanks.
>>> >>
>>> >> Writes could be idempotent if done as upserts, otherwise updates will
>>> >> be
>>> >> idempotent but not inserts.
>>> >>
>>> >> Data should not be lost. The system should be as fault tolerant as
>>> >> possible.
>>> >>
>>> >> What's the advantage of using Spark for reading Kafka instead of
>>> >> direct
>>> >> Kafka consumers?
>>> >>
>>> >> On Thu, Sep 29, 2016 at 8:28 PM, Cody Koeninger <co...@koeninger.org>
>>> >> wrote:
>>> >>>
>>> >>> I wouldn't give up the flexibility and maturity of a relational
>>> >>> database, unless you have a very specific use case.  I'm not trashing
>>> >>> cassandra, I've used cassandra, but if all I know is that you're
>>> >>> doing
>>> >>> analytics, I wouldn't want to give up the ability to easily do ad-hoc
>>> >>> aggregations without a lot of forethought.  If you're worried about
>>> >>> scaling, there are several options for horizontally scaling Postgres
>>> >>> in particular.  One of the current best from what I've worked with is
>>> >>> Citus.
>>> >>>
>>> >>> On Thu, Sep 29, 2016 at 10:15 AM, Deepak Sharma
>>> >>> <de...@gmail.com>
>>> >>> wrote:
>>> >>> > Hi Cody
>>> >>> > Spark direct stream is just fine for this use case.
>>> >>> > But why postgres and not cassandra?
>>> >>> > Is there anything specific here that i may not be aware?
>>> >>> >
>>> >>> > Thanks
>>> >>> > Deepak
>>> >>> >
>>> >>> > On Thu, Sep 29, 2016 at 8:41 PM, Cody Koeninger
>>> >>> > <co...@koeninger.org>
>>> >>> > wrote:
>>> >>> >>
>>> >>> >> How are you going to handle etl failures?  Do you care about lost
>>> >>> >> /
>>> >>> >> duplicated data?  Are your writes idempotent?
>>> >>> >>
>>> >>> >> Absent any other information about the problem, I'd stay away from
>>> >>> >> cassandra/flume/hdfs/hbase/whatever, and use a spark direct stream
>>> >>> >> feeding postgres.
>>> >>> >>
>>> >>> >> On Thu, Sep 29, 2016 at 10:04 AM, Ali Akhtar
>>> >>> >> <al...@gmail.com>
>>> >>> >> wrote:
>>> >>> >> > Is there an advantage to that vs directly consuming from Kafka?
>>> >>> >> > Nothing
>>> >>> >> > is
>>> >>> >> > being done to the data except some light ETL and then storing it
>>> >>> >> > in
>>> >>> >> > Cassandra
>>> >>> >> >
>>> >>> >> > On Thu, Sep 29, 2016 at 7:58 PM, Deepak Sharma
>>> >>> >> > <de...@gmail.com>
>>> >>> >> > wrote:
>>> >>> >> >>
>>> >>> >> >> It's better you use Spark's direct stream to ingest from Kafka.
>>> >>> >> >>
>>> >>> >> >> On Thu, Sep 29, 2016 at 8:24 PM, Ali Akhtar
>>> >>> >> >> <al...@gmail.com>
>>> >>> >> >> wrote:
>>> >>> >> >>>
>>> >>> >> >>> I don't think I need a different speed storage and batch
>>> >>> >> >>> storage.
>>> >>> >> >>> Just
>>> >>> >> >>> taking in raw data from Kafka, standardizing, and storing it
>>> >>> >> >>> somewhere
>>> >>> >> >>> where
>>> >>> >> >>> the web UI can query it, seems like it will be enough.
>>> >>> >> >>>
>>> >>> >> >>> I'm thinking about:
>>> >>> >> >>>
>>> >>> >> >>> - Reading data from Kafka via Spark Streaming
>>> >>> >> >>> - Standardizing, then storing it in Cassandra
>>> >>> >> >>> - Querying Cassandra from the web ui
>>> >>> >> >>>
>>> >>> >> >>> That seems like it will work. My question now is whether to
>>> >>> >> >>> use
>>> >>> >> >>> Spark
>>> >>> >> >>> Streaming to read Kafka, or use Kafka consumers directly.
>>> >>> >> >>>
>>> >>> >> >>>
>>> >>> >> >>> On Thu, Sep 29, 2016 at 7:41 PM, Mich Talebzadeh
>>> >>> >> >>> <mi...@gmail.com> wrote:
>>> >>> >> >>>>
>>> >>> >> >>>> - Spark Streaming to read data from Kafka
>>> >>> >> >>>> - Storing the data on HDFS using Flume
>>> >>> >> >>>>
>>> >>> >> >>>> You don't need Spark streaming to read data from Kafka and
>>> >>> >> >>>> store
>>> >>> >> >>>> on
>>> >>> >> >>>> HDFS. It is a waste of resources.
>>> >>> >> >>>>
>>> >>> >> >>>> Couple Flume to use Kafka as source and HDFS as sink directly
>>> >>> >> >>>>
>>> >>> >> >>>> KafkaAgent.sources = kafka-sources
>>> >>> >> >>>> KafkaAgent.sinks = hdfs-sinks
>>> >>> >> >>>> KafkaAgent.sinks.hdfs-sinks.type = hdfs
>>> >>> >> >>>>
>>> >>> >> >>>> That will be for your batch layer. To analyse you can
>>> >>> >> >>>> directly
>>> >>> >> >>>> read
>>> >>> >> >>>> from
>>> >>> >> >>>> hdfs files with Spark or simply store data in a database of
>>> >>> >> >>>> your
>>> >>> >> >>>> choice via
>>> >>> >> >>>> cron or something. Do not mix your batch layer with speed
>>> >>> >> >>>> layer.
>>> >>> >> >>>>
>>> >>> >> >>>> Your speed layer will ingest the same data directly from
>>> >>> >> >>>> Kafka
>>> >>> >> >>>> into
>>> >>> >> >>>> spark streaming and that will be  online or near real time
>>> >>> >> >>>> (defined
>>> >>> >> >>>> by your
>>> >>> >> >>>> window).
>>> >>> >> >>>>
>>> >>> >> >>>> Then you have a serving layer to present data from both
>>> >>> >> >>>> speed
>>> >>> >> >>>> (the
>>> >>> >> >>>> one from SS) and batch layer.
>>> >>> >> >>>>
>>> >>> >> >>>> HTH
>>> >>> >> >>>>
>>> >>> >> >>>>
>>> >>> >> >>>>
>>> >>> >> >>>>
>>> >>> >> >>>> Dr Mich Talebzadeh
>>> >>> >> >>>>
>>> >>> >> >>>>
>>> >>> >> >>>>
>>> >>> >> >>>> LinkedIn
>>> >>> >> >>>>
>>> >>> >> >>>>
>>> >>> >> >>>>
>>> >>> >> >>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> >>> >> >>>>
>>> >>> >> >>>>
>>> >>> >> >>>>
>>> >>> >> >>>> http://talebzadehmich.wordpress.com
>>> >>> >> >>>>
>>> >>> >> >>>>
>>> >>> >> >>>> Disclaimer: Use it at your own risk. Any and all
>>> >>> >> >>>> responsibility
>>> >>> >> >>>> for
>>> >>> >> >>>> any
>>> >>> >> >>>> loss, damage or destruction of data or any other property
>>> >>> >> >>>> which
>>> >>> >> >>>> may
>>> >>> >> >>>> arise
>>> >>> >> >>>> from relying on this email's technical content is explicitly
>>> >>> >> >>>> disclaimed. The
>>> >>> >> >>>> author will in no case be liable for any monetary damages
>>> >>> >> >>>> arising
>>> >>> >> >>>> from such
>>> >>> >> >>>> loss, damage or destruction.
>>> >>> >> >>>>
>>> >>> >> >>>>
>>> >>> >> >>>>
>>> >>> >> >>>>
>>> >>> >> >>>> On 29 September 2016 at 15:15, Ali Akhtar
>>> >>> >> >>>> <al...@gmail.com>
>>> >>> >> >>>> wrote:
>>> >>> >> >>>>>
>>> >>> >> >>>>> The web UI is actually the speed layer, it needs to be able
>>> >>> >> >>>>> to
>>> >>> >> >>>>> query
>>> >>> >> >>>>> the data online, and show the results in real-time.
>>> >>> >> >>>>>
>>> >>> >> >>>>> It also needs a custom front-end, so a system like Tableau
>>> >>> >> >>>>> can't
>>> >>> >> >>>>> be
>>> >>> >> >>>>> used, it must have a custom backend + front-end.
>>> >>> >> >>>>>
>>> >>> >> >>>>> Thanks for the recommendation of Flume. Do you think this
>>> >>> >> >>>>> will
>>> >>> >> >>>>> work:
>>> >>> >> >>>>>
>>> >>> >> >>>>> - Spark Streaming to read data from Kafka
>>> >>> >> >>>>> - Storing the data on HDFS using Flume
>>> >>> >> >>>>> - Using Spark to query the data in the backend of the web
>>> >>> >> >>>>> UI?
>>> >>> >> >>>>>
>>> >>> >> >>>>>
>>> >>> >> >>>>>
>>> >>> >> >>>>> On Thu, Sep 29, 2016 at 7:08 PM, Mich Talebzadeh
>>> >>> >> >>>>> <mi...@gmail.com> wrote:
>>> >>> >> >>>>>>
>>> >>> >> >>>>>> You need a batch layer and a speed layer. Data from Kafka
>>> >>> >> >>>>>> can
>>> >>> >> >>>>>> be
>>> >>> >> >>>>>> stored on HDFS using flume.
>>> >>> >> >>>>>>
>>> >>> >> >>>>>> -  Query this data to generate reports / analytics (There
>>> >>> >> >>>>>> will
>>> >>> >> >>>>>> be a
>>> >>> >> >>>>>> web UI which will be the front-end to the data, and will
>>> >>> >> >>>>>> show
>>> >>> >> >>>>>> the
>>> >>> >> >>>>>> reports)
>>> >>> >> >>>>>>
>>> >>> >> >>>>>> This is basically batch layer and you need something like
>>> >>> >> >>>>>> Tableau
>>> >>> >> >>>>>> or
>>> >>> >> >>>>>> Zeppelin to query data
>>> >>> >> >>>>>>
>>> >>> >> >>>>>> You will also need spark streaming to query data online for
>>> >>> >> >>>>>> speed
>>> >>> >> >>>>>> layer. That data could be stored in some transient fabric
>>> >>> >> >>>>>> like
>>> >>> >> >>>>>> ignite or
>>> >>> >> >>>>>> even druid.
>>> >>> >> >>>>>>
>>> >>> >> >>>>>> HTH
>>> >>> >> >>>>>>
>>> >>> >> >>>>>>
>>> >>> >> >>>>>>
>>> >>> >> >>>>>>
>>> >>> >> >>>>>>
>>> >>> >> >>>>>>
>>> >>> >> >>>>>>
>>> >>> >> >>>>>>
>>> >>> >> >>>>>> Dr Mich Talebzadeh
>>> >>> >> >>>>>>
>>> >>> >> >>>>>>
>>> >>> >> >>>>>>
>>> >>> >> >>>>>> LinkedIn
>>> >>> >> >>>>>>
>>> >>> >> >>>>>>
>>> >>> >> >>>>>>
>>> >>> >> >>>>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> >>> >> >>>>>>
>>> >>> >> >>>>>>
>>> >>> >> >>>>>>
>>> >>> >> >>>>>> http://talebzadehmich.wordpress.com
>>> >>> >> >>>>>>
>>> >>> >> >>>>>>
>>> >>> >> >>>>>> Disclaimer: Use it at your own risk. Any and all
>>> >>> >> >>>>>> responsibility
>>> >>> >> >>>>>> for
>>> >>> >> >>>>>> any loss, damage or destruction of data or any other
>>> >>> >> >>>>>> property
>>> >>> >> >>>>>> which
>>> >>> >> >>>>>> may
>>> >>> >> >>>>>> arise from relying on this email's technical content is
>>> >>> >> >>>>>> explicitly
>>> >>> >> >>>>>> disclaimed. The author will in no case be liable for any
>>> >>> >> >>>>>> monetary
>>> >>> >> >>>>>> damages
>>> >>> >> >>>>>> arising from such loss, damage or destruction.
>>> >>> >> >>>>>>
>>> >>> >> >>>>>>
>>> >>> >> >>>>>>
>>> >>> >> >>>>>>
>>> >>> >> >>>>>> On 29 September 2016 at 15:01, Ali Akhtar
>>> >>> >> >>>>>> <al...@gmail.com>
>>> >>> >> >>>>>> wrote:
>>> >>> >> >>>>>>>
>>> >>> >> >>>>>>> It needs to be able to scale to a very large amount of
>>> >>> >> >>>>>>> data,
>>> >>> >> >>>>>>> yes.
>>> >>> >> >>>>>>>
>>> >>> >> >>>>>>> On Thu, Sep 29, 2016 at 7:00 PM, Deepak Sharma
>>> >>> >> >>>>>>> <de...@gmail.com> wrote:
>>> >>> >> >>>>>>>>
>>> >>> >> >>>>>>>> What is the message inflow?
>>> >>> >> >>>>>>>> If it's really high, definitely Spark will be of great
>>> >>> >> >>>>>>>> use.
>>> >>> >> >>>>>>>>
>>> >>> >> >>>>>>>> Thanks
>>> >>> >> >>>>>>>> Deepak
>>> >>> >> >>>>>>>>
>>> >>> >> >>>>>>>>
>>> >>> >> >>>>>>>> On Sep 29, 2016 19:24, "Ali Akhtar"
>>> >>> >> >>>>>>>> <al...@gmail.com>
>>> >>> >> >>>>>>>> wrote:
>>> >>> >> >>>>>>>>>
>>> >>> >> >>>>>>>>> I have a somewhat tricky use case, and I'm looking for
>>> >>> >> >>>>>>>>> ideas.
>>> >>> >> >>>>>>>>>
>>> >>> >> >>>>>>>>> I have 5-6 Kafka producers, reading various APIs, and
>>> >>> >> >>>>>>>>> writing
>>> >>> >> >>>>>>>>> their
>>> >>> >> >>>>>>>>> raw data into Kafka.
>>> >>> >> >>>>>>>>>
>>> >>> >> >>>>>>>>> I need to:
>>> >>> >> >>>>>>>>>
>>> >>> >> >>>>>>>>> - Do ETL on the data, and standardize it.
>>> >>> >> >>>>>>>>>
>>> >>> >> >>>>>>>>> - Store the standardized data somewhere (HBase /
>>> >>> >> >>>>>>>>> Cassandra /
>>> >>> >> >>>>>>>>> Raw
>>> >>> >> >>>>>>>>> HDFS / ElasticSearch / Postgres)
>>> >>> >> >>>>>>>>>
>>> >>> >> >>>>>>>>> - Query this data to generate reports / analytics (There
>>> >>> >> >>>>>>>>> will be
>>> >>> >> >>>>>>>>> a
>>> >>> >> >>>>>>>>> web UI which will be the front-end to the data, and will
>>> >>> >> >>>>>>>>> show
>>> >>> >> >>>>>>>>> the reports)
>>> >>> >> >>>>>>>>>
>>> >>> >> >>>>>>>>> Java is being used as the backend language for
>>> >>> >> >>>>>>>>> everything
>>> >>> >> >>>>>>>>> (backend
>>> >>> >> >>>>>>>>> of the web UI, as well as the ETL layer)
>>> >>> >> >>>>>>>>>
>>> >>> >> >>>>>>>>> I'm considering:
>>> >>> >> >>>>>>>>>
>>> >>> >> >>>>>>>>> - Using raw Kafka consumers, or Spark Streaming, as the
>>> >>> >> >>>>>>>>> ETL
>>> >>> >> >>>>>>>>> layer
>>> >>> >> >>>>>>>>> (receive raw data from Kafka, standardize & store it)
>>> >>> >> >>>>>>>>>
>>> >>> >> >>>>>>>>> - Using Cassandra, HBase, or raw HDFS, for storing the
>>> >>> >> >>>>>>>>> standardized
>>> >>> >> >>>>>>>>> data, and to allow queries
>>> >>> >> >>>>>>>>>
>>> >>> >> >>>>>>>>> - In the backend of the web UI, I could either use Spark
>>> >>> >> >>>>>>>>> to
>>> >>> >> >>>>>>>>> run
>>> >>> >> >>>>>>>>> queries across the data (mostly filters), or directly
>>> >>> >> >>>>>>>>> run
>>> >>> >> >>>>>>>>> queries against
>>> >>> >> >>>>>>>>> Cassandra / HBase
>>> >>> >> >>>>>>>>>
>>> >>> >> >>>>>>>>> I'd appreciate some thoughts / suggestions on which of
>>> >>> >> >>>>>>>>> these
>>> >>> >> >>>>>>>>> alternatives I should go with (e.g, using raw Kafka
>>> >>> >> >>>>>>>>> consumers vs
>>> >>> >> >>>>>>>>> Spark for
>>> >>> >> >>>>>>>>> ETL, which persistent data store to use, and how to
>>> >>> >> >>>>>>>>> query
>>> >>> >> >>>>>>>>> that
>>> >>> >> >>>>>>>>> data store in
>>> >>> >> >>>>>>>>> the backend of the web UI, for displaying the reports).
>>> >>> >> >>>>>>>>>
>>> >>> >> >>>>>>>>>
>>> >>> >> >>>>>>>>> Thanks.
>>> >>> >> >>>>>>>
>>> >>> >> >>>>>>>
>>> >>> >> >>>>>>
>>> >>> >> >>>>>
>>> >>> >> >>>>
>>> >>> >> >>>
>>> >>> >> >>
>>> >>> >> >>
>>> >>> >> >>
>>> >>> >> >> --
>>> >>> >> >> Thanks
>>> >>> >> >> Deepak
>>> >>> >> >> www.bigdatabig.com
>>> >>> >> >> www.keosha.net
>>> >>> >> >
>>> >>> >> >
>>> >>> >>
>>> >>> >>
>>> >>> >> ---------------------------------------------------------------------
>>> >>> >> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>>> >>> >>
>>> >>> >
>>> >>> >
>>> >>> >
>>> >>> > --
>>> >>> > Thanks
>>> >>> > Deepak
>>> >>> > www.bigdatabig.com
>>> >>> > www.keosha.net
>>> >>
>>> >>
>>> >
>>> >
>>> >
>>> > --
>>> > Thanks
>>> > Deepak
>>> > www.bigdatabig.com
>>> > www.keosha.net
>>
>>
>

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Re: Architecture recommendations for a tricky use case

Posted by Cody Koeninger <co...@koeninger.org>.
> I still don't understand why writing to a transactional database with locking and concurrency (reads and writes) through JDBC will be fast for this sort of data ingestion.

Who cares about fast if your data is wrong?  And it's still plenty fast enough.

https://youtu.be/NVl9_6J1G60?list=WL&t=1819

https://www.citusdata.com/blog/2016/09/22/announcing-citus-mx/
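
To make "a spark direct stream feeding postgres" concrete, here is a minimal
Java sketch using Spark's kafka-0-10 integration. The topic, table, column
names and connection details are all hypothetical; the ON CONFLICT upsert is
what makes the write idempotent, so a replayed micro-batch rewrites rows
instead of duplicating them.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

public class DirectStreamToPostgres {

    // Upsert keyed on (user_id, event_id); assumes a unique constraint on
    // that pair, so replaying a micro-batch rewrites rather than duplicates.
    private static final String UPSERT =
        "INSERT INTO events (user_id, event_id, payload) VALUES (?, ?, ?) "
      + "ON CONFLICT (user_id, event_id) DO UPDATE SET payload = EXCLUDED.payload";

    public static void main(String[] args) throws Exception {
        JavaStreamingContext jssc = new JavaStreamingContext(
            new SparkConf().setAppName("kafka-etl"), Durations.seconds(5));

        Map<String, Object> kafkaParams = new HashMap<>();
        kafkaParams.put("bootstrap.servers", "localhost:9092");
        kafkaParams.put("key.deserializer", StringDeserializer.class);
        kafkaParams.put("value.deserializer", StringDeserializer.class);
        kafkaParams.put("group.id", "etl");
        kafkaParams.put("enable.auto.commit", false);

        JavaInputDStream<ConsumerRecord<String, String>> stream =
            KafkaUtils.createDirectStream(
                jssc,
                LocationStrategies.PreferConsistent(),
                ConsumerStrategies.<String, String>Subscribe(
                    Collections.singletonList("raw-events"), kafkaParams));

        stream.foreachRDD(rdd -> rdd.foreachPartition(partition -> {
            // One connection per partition, not per record.
            try (Connection c = DriverManager.getConnection(
                     "jdbc:postgresql://localhost:5432/reports", "app", "secret");
                 PreparedStatement ps = c.prepareStatement(UPSERT)) {
                while (partition.hasNext()) {
                    ConsumerRecord<String, String> r = partition.next();
                    ps.setLong(1, Long.parseLong(r.key()));   // hypothetical keying
                    ps.setString(2, r.topic() + "-" + r.partition() + "-" + r.offset());
                    ps.setString(3, r.value());               // light ETL would go here
                    ps.executeUpdate();
                }
            }
        }));

        jssc.start();
        jssc.awaitTermination();
    }
}

This gives at-least-once delivery; storing each batch's Kafka offsets in the
same transaction as the rows is the usual way to tighten that further, which
is where the direct stream's easy per-partition offset access pays off.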



On Thu, Sep 29, 2016 at 11:16 AM, Mich Talebzadeh
<mi...@gmail.com> wrote:
> The way I see this, there are three things involved.
>
> Data ingestion through source to Kafka
> Data conversion and Storage ETL/ELT
> Presentation
>
> Item 2 is the one that needs to be designed correctly. I presume raw data
> has to conform to some form of MDM that requires schema mapping etc. before
> putting into persistent storage (DB, HDFS etc.). Which one to choose depends
> on your volume of ingestion and your cluster size and complexity of data
> conversion. Then your users will use some form of UI (Tableau, QlikView,
> Zeppelin, direct SQL) to query data one way or another. Your users can
> directly use a UI like Tableau that offers built-in analytics on SQL (Spark
> SQL offers the same). Your mileage varies according to your needs.
>
> I still don't understand why writing to a transactional database with
> locking and concurrency (reads and writes) through JDBC will be fast for
> this sort of data ingestion. If you asked me to choose an RDBMS to write to
> as my sink, I would use Oracle, which offers the best locking and
> concurrency among RDBMSs and also handles key-value pairs (assuming that is
> what you want). In addition, it can be used as a Data Warehouse as well.
>
> HTH
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> Disclaimer: Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed. The
> author will in no case be liable for any monetary damages arising from such
> loss, damage or destruction.
>
>
>
>
> On 29 September 2016 at 16:49, Ali Akhtar <al...@gmail.com> wrote:
>>
>> The business use case is to read a user's data from a variety of different
>> services through their APIs, and then allow the user to query that data,
>> on a per-service basis, as well as an aggregate across all services.
>>
>> The way I'm considering doing it, is to do some basic ETL (drop all the
>> unnecessary fields, rename some fields into something more manageable, etc)
>> and then store the data in Cassandra / Postgres.
>>
>> Then, when the user wants to view a particular report, query the
>> respective table in Cassandra / Postgres. (select .. from data where user =
>> ? and date between <start> and <end> and some_field = ?)
>>
>> How will Spark Streaming help w/ aggregation? Couldn't the data be queried
>> from Cassandra / Postgres via the Kafka consumer and aggregated that way?
>>
>> On Thu, Sep 29, 2016 at 8:43 PM, Cody Koeninger <co...@koeninger.org>
>> wrote:
>>>
>>> No, direct stream in and of itself won't ensure an end-to-end
>>> guarantee, because it doesn't know anything about your output actions.
>>>
>>> You still need to do some work.  The point is having easy access to
>>> offsets for batches on a per-partition basis makes it easier to do
>>> that work, especially in conjunction with aggregation.
>>>
>>> On Thu, Sep 29, 2016 at 10:40 AM, Deepak Sharma <de...@gmail.com>
>>> wrote:
>> > If you use Spark direct streams, it ensures an end-to-end guarantee
>> > for messages.
>>> >
>>> >
>>> > On Thu, Sep 29, 2016 at 9:05 PM, Ali Akhtar <al...@gmail.com>
>>> > wrote:
>>> >>
>>> >> My concern with Postgres / Cassandra is only scalability. I will look
>>> >> further into Postgres horizontal scaling, thanks.
>>> >>
>>> >> Writes could be idempotent if done as upserts, otherwise updates will
>>> >> be
>>> >> idempotent but not inserts.
>>> >>
>>> >> Data should not be lost. The system should be as fault tolerant as
>>> >> possible.
>>> >>
>>> >> What's the advantage of using Spark for reading Kafka instead of
>>> >> direct
>>> >> Kafka consumers?
>>> >>
>>> >> On Thu, Sep 29, 2016 at 8:28 PM, Cody Koeninger <co...@koeninger.org>
>>> >> wrote:
>>> >>>
>>> >>> I wouldn't give up the flexibility and maturity of a relational
>>> >>> database, unless you have a very specific use case.  I'm not trashing
>>> >>> cassandra, I've used cassandra, but if all I know is that you're
>>> >>> doing
>>> >>> analytics, I wouldn't want to give up the ability to easily do ad-hoc
>>> >>> aggregations without a lot of forethought.  If you're worried about
>>> >>> scaling, there are several options for horizontally scaling Postgres
>>> >>> in particular.  One of the current best from what I've worked with is
>>> >>> Citus.
>>> >>>
>>> >>> On Thu, Sep 29, 2016 at 10:15 AM, Deepak Sharma
>>> >>> <de...@gmail.com>
>>> >>> wrote:
>>> >>> > Hi Cody
>>> >>> > Spark direct stream is just fine for this use case.
>>> >>> > But why postgres and not cassandra?
>> >>> > Is there anything specific here that I may not be aware of?
>>> >>> >
>>> >>> > Thanks
>>> >>> > Deepak
>>> >>> >
>>> >>> > On Thu, Sep 29, 2016 at 8:41 PM, Cody Koeninger
>>> >>> > <co...@koeninger.org>
>>> >>> > wrote:
>>> >>> >>
>>> >>> >> How are you going to handle etl failures?  Do you care about lost
>>> >>> >> /
>>> >>> >> duplicated data?  Are your writes idempotent?
>>> >>> >>
>>> >>> >> Absent any other information about the problem, I'd stay away from
>>> >>> >> cassandra/flume/hdfs/hbase/whatever, and use a spark direct stream
>>> >>> >> feeding postgres.
>>> >>> >>
>>> >>> >> On Thu, Sep 29, 2016 at 10:04 AM, Ali Akhtar
>>> >>> >> <al...@gmail.com>
>>> >>> >> wrote:
>>> >>> >> > Is there an advantage to that vs directly consuming from Kafka?
>>> >>> >> > Nothing
>>> >>> >> > is
>>> >>> >> > being done to the data except some light ETL and then storing it
>>> >>> >> > in
>>> >>> >> > Cassandra
>>> >>> >> >
>>> >>> >> > On Thu, Sep 29, 2016 at 7:58 PM, Deepak Sharma
>>> >>> >> > <de...@gmail.com>
>>> >>> >> > wrote:
>>> >>> >> >>
>> >>> >> >> It's better you use Spark's direct stream to ingest from Kafka.
>>> >>> >> >>
>>> >>> >> >> On Thu, Sep 29, 2016 at 8:24 PM, Ali Akhtar
>>> >>> >> >> <al...@gmail.com>
>>> >>> >> >> wrote:
>>> >>> >> >>>
>>> >>> >> >>> I don't think I need a different speed storage and batch
>>> >>> >> >>> storage.
>>> >>> >> >>> Just
>>> >>> >> >>> taking in raw data from Kafka, standardizing, and storing it
>>> >>> >> >>> somewhere
>>> >>> >> >>> where
>>> >>> >> >>> the web UI can query it, seems like it will be enough.
>>> >>> >> >>>
>>> >>> >> >>> I'm thinking about:
>>> >>> >> >>>
>>> >>> >> >>> - Reading data from Kafka via Spark Streaming
>>> >>> >> >>> - Standardizing, then storing it in Cassandra
>>> >>> >> >>> - Querying Cassandra from the web ui
>>> >>> >> >>>
>>> >>> >> >>> That seems like it will work. My question now is whether to
>>> >>> >> >>> use
>>> >>> >> >>> Spark
>>> >>> >> >>> Streaming to read Kafka, or use Kafka consumers directly.
>>> >>> >> >>>
>>> >>> >> >>>
>>> >>> >> >>> On Thu, Sep 29, 2016 at 7:41 PM, Mich Talebzadeh
>>> >>> >> >>> <mi...@gmail.com> wrote:
>>> >>> >> >>>>
>>> >>> >> >>>> - Spark Streaming to read data from Kafka
>>> >>> >> >>>> - Storing the data on HDFS using Flume
>>> >>> >> >>>>
>>> >>> >> >>>> You don't need Spark streaming to read data from Kafka and
>>> >>> >> >>>> store
>>> >>> >> >>>> on
>>> >>> >> >>>> HDFS. It is a waste of resources.
>>> >>> >> >>>>
>>> >>> >> >>>> Couple Flume to use Kafka as source and HDFS as sink directly
>>> >>> >> >>>>
>> >>> >> >>>> KafkaAgent.sources = kafka-sources
>> >>> >> >>>> KafkaAgent.sinks = hdfs-sinks
>> >>> >> >>>> KafkaAgent.sinks.hdfs-sinks.type = hdfs
>>> >>> >> >>>>
>>> >>> >> >>>> That will be for your batch layer. To analyse you can
>>> >>> >> >>>> directly
>>> >>> >> >>>> read
>>> >>> >> >>>> from
>>> >>> >> >>>> hdfs files with Spark or simply store data in a database of
>>> >>> >> >>>> your
>>> >>> >> >>>> choice via
>>> >>> >> >>>> cron or something. Do not mix your batch layer with speed
>>> >>> >> >>>> layer.
>>> >>> >> >>>>
>>> >>> >> >>>> Your speed layer will ingest the same data directly from
>>> >>> >> >>>> Kafka
>>> >>> >> >>>> into
>>> >>> >> >>>> spark streaming and that will be  online or near real time
>>> >>> >> >>>> (defined
>>> >>> >> >>>> by your
>>> >>> >> >>>> window).
>>> >>> >> >>>>
>> >>> >> >>>> Then you have a serving layer to present data from both
>>> >>> >> >>>> speed
>>> >>> >> >>>> (the
>>> >>> >> >>>> one from SS) and batch layer.
>>> >>> >> >>>>
>>> >>> >> >>>> HTH
>>> >>> >> >>>>
>>> >>> >> >>>>
>>> >>> >> >>>>
>>> >>> >> >>>>
>>> >>> >> >>>> Dr Mich Talebzadeh
>>> >>> >> >>>>
>>> >>> >> >>>>
>>> >>> >> >>>>
>>> >>> >> >>>> LinkedIn
>>> >>> >> >>>>
>>> >>> >> >>>>
>>> >>> >> >>>>
>>> >>> >> >>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> >>> >> >>>>
>>> >>> >> >>>>
>>> >>> >> >>>>
>>> >>> >> >>>> http://talebzadehmich.wordpress.com
>>> >>> >> >>>>
>>> >>> >> >>>>
>>> >>> >> >>>> Disclaimer: Use it at your own risk. Any and all
>>> >>> >> >>>> responsibility
>>> >>> >> >>>> for
>>> >>> >> >>>> any
>>> >>> >> >>>> loss, damage or destruction of data or any other property
>>> >>> >> >>>> which
>>> >>> >> >>>> may
>>> >>> >> >>>> arise
>>> >>> >> >>>> from relying on this email's technical content is explicitly
>>> >>> >> >>>> disclaimed. The
>>> >>> >> >>>> author will in no case be liable for any monetary damages
>>> >>> >> >>>> arising
>>> >>> >> >>>> from such
>>> >>> >> >>>> loss, damage or destruction.
>>> >>> >> >>>>
>>> >>> >> >>>>
>>> >>> >> >>>>
>>> >>> >> >>>>
>>> >>> >> >>>> On 29 September 2016 at 15:15, Ali Akhtar
>>> >>> >> >>>> <al...@gmail.com>
>>> >>> >> >>>> wrote:
>>> >>> >> >>>>>
>>> >>> >> >>>>> The web UI is actually the speed layer, it needs to be able
>>> >>> >> >>>>> to
>>> >>> >> >>>>> query
>>> >>> >> >>>>> the data online, and show the results in real-time.
>>> >>> >> >>>>>
>>> >>> >> >>>>> It also needs a custom front-end, so a system like Tableau
>>> >>> >> >>>>> can't
>>> >>> >> >>>>> be
>>> >>> >> >>>>> used, it must have a custom backend + front-end.
>>> >>> >> >>>>>
>>> >>> >> >>>>> Thanks for the recommendation of Flume. Do you think this
>>> >>> >> >>>>> will
>>> >>> >> >>>>> work:
>>> >>> >> >>>>>
>>> >>> >> >>>>> - Spark Streaming to read data from Kafka
>>> >>> >> >>>>> - Storing the data on HDFS using Flume
>>> >>> >> >>>>> - Using Spark to query the data in the backend of the web
>>> >>> >> >>>>> UI?
>>> >>> >> >>>>>
>>> >>> >> >>>>>
>>> >>> >> >>>>>
>>> >>> >> >>>>> On Thu, Sep 29, 2016 at 7:08 PM, Mich Talebzadeh
>>> >>> >> >>>>> <mi...@gmail.com> wrote:
>>> >>> >> >>>>>>
>>> >>> >> >>>>>> You need a batch layer and a speed layer. Data from Kafka
>>> >>> >> >>>>>> can
>>> >>> >> >>>>>> be
>>> >>> >> >>>>>> stored on HDFS using flume.
>>> >>> >> >>>>>>
>>> >>> >> >>>>>> -  Query this data to generate reports / analytics (There
>>> >>> >> >>>>>> will
>>> >>> >> >>>>>> be a
>>> >>> >> >>>>>> web UI which will be the front-end to the data, and will
>>> >>> >> >>>>>> show
>>> >>> >> >>>>>> the
>>> >>> >> >>>>>> reports)
>>> >>> >> >>>>>>
>>> >>> >> >>>>>> This is basically batch layer and you need something like
>>> >>> >> >>>>>> Tableau
>>> >>> >> >>>>>> or
>>> >>> >> >>>>>> Zeppelin to query data
>>> >>> >> >>>>>>
>>> >>> >> >>>>>> You will also need spark streaming to query data online for
>>> >>> >> >>>>>> speed
>>> >>> >> >>>>>> layer. That data could be stored in some transient fabric
>>> >>> >> >>>>>> like
>>> >>> >> >>>>>> ignite or
>>> >>> >> >>>>>> even druid.
>>> >>> >> >>>>>>
>>> >>> >> >>>>>> HTH
>>> >>> >> >>>>>>
>>> >>> >> >>>>>>
>>> >>> >> >>>>>>
>>> >>> >> >>>>>>
>>> >>> >> >>>>>>
>>> >>> >> >>>>>>
>>> >>> >> >>>>>>
>>> >>> >> >>>>>>
>>> >>> >> >>>>>> Dr Mich Talebzadeh
>>> >>> >> >>>>>>
>>> >>> >> >>>>>>
>>> >>> >> >>>>>>
>>> >>> >> >>>>>> LinkedIn
>>> >>> >> >>>>>>
>>> >>> >> >>>>>>
>>> >>> >> >>>>>>
>>> >>> >> >>>>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> >>> >> >>>>>>
>>> >>> >> >>>>>>
>>> >>> >> >>>>>>
>>> >>> >> >>>>>> http://talebzadehmich.wordpress.com
>>> >>> >> >>>>>>
>>> >>> >> >>>>>>
>>> >>> >> >>>>>> Disclaimer: Use it at your own risk. Any and all
>>> >>> >> >>>>>> responsibility
>>> >>> >> >>>>>> for
>>> >>> >> >>>>>> any loss, damage or destruction of data or any other
>>> >>> >> >>>>>> property
>>> >>> >> >>>>>> which
>>> >>> >> >>>>>> may
>>> >>> >> >>>>>> arise from relying on this email's technical content is
>>> >>> >> >>>>>> explicitly
>>> >>> >> >>>>>> disclaimed. The author will in no case be liable for any
>>> >>> >> >>>>>> monetary
>>> >>> >> >>>>>> damages
>>> >>> >> >>>>>> arising from such loss, damage or destruction.
>>> >>> >> >>>>>>
>>> >>> >> >>>>>>
>>> >>> >> >>>>>>
>>> >>> >> >>>>>>
>>> >>> >> >>>>>> On 29 September 2016 at 15:01, Ali Akhtar
>>> >>> >> >>>>>> <al...@gmail.com>
>>> >>> >> >>>>>> wrote:
>>> >>> >> >>>>>>>
>>> >>> >> >>>>>>> It needs to be able to scale to a very large amount of
>>> >>> >> >>>>>>> data,
>>> >>> >> >>>>>>> yes.
>>> >>> >> >>>>>>>
>>> >>> >> >>>>>>> On Thu, Sep 29, 2016 at 7:00 PM, Deepak Sharma
>>> >>> >> >>>>>>> <de...@gmail.com> wrote:
>>> >>> >> >>>>>>>>
>> >>> >> >>>>>>>> What is the message inflow?
>> >>> >> >>>>>>>> If it's really high, definitely Spark will be of great
>> >>> >> >>>>>>>> use.
>>> >>> >> >>>>>>>>
>>> >>> >> >>>>>>>> Thanks
>>> >>> >> >>>>>>>> Deepak
>>> >>> >> >>>>>>>>
>>> >>> >> >>>>>>>>
>>> >>> >> >>>>>>>> On Sep 29, 2016 19:24, "Ali Akhtar"
>>> >>> >> >>>>>>>> <al...@gmail.com>
>>> >>> >> >>>>>>>> wrote:
>>> >>> >> >>>>>>>>>
>>> >>> >> >>>>>>>>> I have a somewhat tricky use case, and I'm looking for
>>> >>> >> >>>>>>>>> ideas.
>>> >>> >> >>>>>>>>>
>>> >>> >> >>>>>>>>> I have 5-6 Kafka producers, reading various APIs, and
>>> >>> >> >>>>>>>>> writing
>>> >>> >> >>>>>>>>> their
>>> >>> >> >>>>>>>>> raw data into Kafka.
>>> >>> >> >>>>>>>>>
>>> >>> >> >>>>>>>>> I need to:
>>> >>> >> >>>>>>>>>
>>> >>> >> >>>>>>>>> - Do ETL on the data, and standardize it.
>>> >>> >> >>>>>>>>>
>>> >>> >> >>>>>>>>> - Store the standardized data somewhere (HBase /
>>> >>> >> >>>>>>>>> Cassandra /
>>> >>> >> >>>>>>>>> Raw
>>> >>> >> >>>>>>>>> HDFS / ElasticSearch / Postgres)
>>> >>> >> >>>>>>>>>
>>> >>> >> >>>>>>>>> - Query this data to generate reports / analytics (There
>>> >>> >> >>>>>>>>> will be
>>> >>> >> >>>>>>>>> a
>>> >>> >> >>>>>>>>> web UI which will be the front-end to the data, and will
>>> >>> >> >>>>>>>>> show
>>> >>> >> >>>>>>>>> the reports)
>>> >>> >> >>>>>>>>>
>>> >>> >> >>>>>>>>> Java is being used as the backend language for
>>> >>> >> >>>>>>>>> everything
>>> >>> >> >>>>>>>>> (backend
>>> >>> >> >>>>>>>>> of the web UI, as well as the ETL layer)
>>> >>> >> >>>>>>>>>
>>> >>> >> >>>>>>>>> I'm considering:
>>> >>> >> >>>>>>>>>
>>> >>> >> >>>>>>>>> - Using raw Kafka consumers, or Spark Streaming, as the
>>> >>> >> >>>>>>>>> ETL
>>> >>> >> >>>>>>>>> layer
>>> >>> >> >>>>>>>>> (receive raw data from Kafka, standardize & store it)
>>> >>> >> >>>>>>>>>
>>> >>> >> >>>>>>>>> - Using Cassandra, HBase, or raw HDFS, for storing the
>>> >>> >> >>>>>>>>> standardized
>>> >>> >> >>>>>>>>> data, and to allow queries
>>> >>> >> >>>>>>>>>
>>> >>> >> >>>>>>>>> - In the backend of the web UI, I could either use Spark
>>> >>> >> >>>>>>>>> to
>>> >>> >> >>>>>>>>> run
>>> >>> >> >>>>>>>>> queries across the data (mostly filters), or directly
>>> >>> >> >>>>>>>>> run
>>> >>> >> >>>>>>>>> queries against
>>> >>> >> >>>>>>>>> Cassandra / HBase
>>> >>> >> >>>>>>>>>
>>> >>> >> >>>>>>>>> I'd appreciate some thoughts / suggestions on which of
>>> >>> >> >>>>>>>>> these
>>> >>> >> >>>>>>>>> alternatives I should go with (e.g, using raw Kafka
>>> >>> >> >>>>>>>>> consumers vs
>>> >>> >> >>>>>>>>> Spark for
>>> >>> >> >>>>>>>>> ETL, which persistent data store to use, and how to
>>> >>> >> >>>>>>>>> query
>>> >>> >> >>>>>>>>> that
>>> >>> >> >>>>>>>>> data store in
>>> >>> >> >>>>>>>>> the backend of the web UI, for displaying the reports).
>>> >>> >> >>>>>>>>>
>>> >>> >> >>>>>>>>>
>>> >>> >> >>>>>>>>> Thanks.
>>> >>> >> >>>>>>>
>>> >>> >> >>>>>>>
>>> >>> >> >>>>>>
>>> >>> >> >>>>>
>>> >>> >> >>>>
>>> >>> >> >>>
>>> >>> >> >>
>>> >>> >> >>
>>> >>> >> >>
>>> >>> >> >> --
>>> >>> >> >> Thanks
>>> >>> >> >> Deepak
>>> >>> >> >> www.bigdatabig.com
>>> >>> >> >> www.keosha.net
>>> >>> >> >
>>> >>> >> >
>>> >>> >>
>>> >>> >>
>>> >>> >> ---------------------------------------------------------------------
>>> >>> >> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>>> >>> >>
>>> >>> >
>>> >>> >
>>> >>> >
>>> >>> > --
>>> >>> > Thanks
>>> >>> > Deepak
>>> >>> > www.bigdatabig.com
>>> >>> > www.keosha.net
>>> >>
>>> >>
>>> >
>>> >
>>> >
>>> > --
>>> > Thanks
>>> > Deepak
>>> > www.bigdatabig.com
>>> > www.keosha.net
>>
>>
>

Re: Architecture recommendations for a tricky use case

Posted by Mich Talebzadeh <mi...@gmail.com>.
The way I see this, there are three things involved.


   1. Data ingestion through source to Kafka
   2. Data conversion and Storage ETL/ELT
   3. Presentation

Item 2 is the one that needs to be designed correctly. I presume raw data
has to conform to some form of MDM that requires schema mapping etc. before
putting into persistent storage (DB, HDFS etc.). Which one to choose depends
on your volume of ingestion and your cluster size and complexity of data
conversion. Then your users will use some form of UI (Tableau, QlikView,
Zeppelin, direct SQL) to query data one way or another. Your users can
directly use a UI like Tableau that offers built-in analytics on SQL (Spark
SQL offers the same). Your mileage varies according to your needs.
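
As a small illustration of that schema-mapping step, in Java (the backend
language here) with Jackson; the raw and standardized field names are
hypothetical, and each of the 5-6 source services would need its own mapping:

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ObjectNode;

// Sketch of standardizing one service's raw record onto a common schema:
// rename the fields we keep, drop everything else.
public class Standardizer {
    private static final ObjectMapper MAPPER = new ObjectMapper();

    public static String standardize(String rawJson) throws Exception {
        JsonNode in = MAPPER.readTree(rawJson);
        ObjectNode out = MAPPER.createObjectNode();
        out.put("user_id", in.path("uid").asLong());        // renamed field
        out.put("service", "serviceA");                     // constant per source
        out.put("occurred_at", in.path("created_ts").asText());
        out.set("payload", in.path("metrics"));             // keep one sub-tree
        return MAPPER.writeValueAsString(out);              // raw extras are gone
    }
}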

I still don't understand why writing to a transactional database with
locking and concurrency (reads and writes) through JDBC will be fast for
this sort of data ingestion. If you asked me to choose an RDBMS to write to
as my sink, I would use Oracle, which offers the best locking and
concurrency among RDBMSs and also handles key-value pairs (assuming that is
what you want). In addition, it can be used as a Data Warehouse as well.
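
For completeness: if JDBC throughput is the concern, the standard mitigation
is to write each micro-batch inside one explicit transaction instead of
committing row by row. A minimal sketch, with hypothetical table and
connection details:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.List;

// Sketch: pay locking and network round-trip costs once per batch, not once
// per row, by batching with autocommit off.
public class BatchedWriter {
    public static void write(List<String[]> rows) throws Exception {
        try (Connection c = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/reports", "app", "secret")) {
            c.setAutoCommit(false);               // one commit per micro-batch
            try (PreparedStatement ps = c.prepareStatement(
                    "INSERT INTO raw_events (event_id, payload) VALUES (?, ?)")) {
                for (String[] row : rows) {
                    ps.setString(1, row[0]);
                    ps.setString(2, row[1]);
                    ps.addBatch();
                }
                ps.executeBatch();
            }
            c.commit();
        }
    }
}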

HTH



Dr Mich Talebzadeh



LinkedIn https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 29 September 2016 at 16:49, Ali Akhtar <al...@gmail.com> wrote:

> The business use case is to read a user's data from a variety of different
> services through their APIs, and then allow the user to query that data,
> on a per-service basis, as well as an aggregate across all services.
>
> The way I'm considering doing it, is to do some basic ETL (drop all the
> unnecessary fields, rename some fields into something more manageable, etc)
> and then store the data in Cassandra / Postgres.
>
> Then, when the user wants to view a particular report, query the
> respective table in Cassandra / Postgres. (select .. from data where user =
> ? and date between <start> and <end> and some_field = ?)
>
> How will Spark Streaming help w/ aggregation? Couldn't the data be queried
> from Cassandra / Postgres via the Kafka consumer and aggregated that way?
>
> On Thu, Sep 29, 2016 at 8:43 PM, Cody Koeninger <co...@koeninger.org>
> wrote:
>
>> No, direct stream in and of itself won't ensure an end-to-end
>> guarantee, because it doesn't know anything about your output actions.
>>
>> You still need to do some work.  The point is having easy access to
>> offsets for batches on a per-partition basis makes it easier to do
>> that work, especially in conjunction with aggregation.
>>
>> On Thu, Sep 29, 2016 at 10:40 AM, Deepak Sharma <de...@gmail.com>
>> wrote:
>> > If you use Spark direct streams, it ensures an end-to-end guarantee
>> > for messages.
>> >
>> >
>> > On Thu, Sep 29, 2016 at 9:05 PM, Ali Akhtar <al...@gmail.com>
>> wrote:
>> >>
>> >> My concern with Postgres / Cassandra is only scalability. I will look
>> >> further into Postgres horizontal scaling, thanks.
>> >>
>> >> Writes could be idempotent if done as upserts, otherwise updates will
>> be
>> >> idempotent but not inserts.
>> >>
>> >> Data should not be lost. The system should be as fault tolerant as
>> >> possible.
>> >>
>> >> What's the advantage of using Spark for reading Kafka instead of direct
>> >> Kafka consumers?
>> >>
>> >> On Thu, Sep 29, 2016 at 8:28 PM, Cody Koeninger <co...@koeninger.org>
>> >> wrote:
>> >>>
>> >>> I wouldn't give up the flexibility and maturity of a relational
>> >>> database, unless you have a very specific use case.  I'm not trashing
>> >>> cassandra, I've used cassandra, but if all I know is that you're doing
>> >>> analytics, I wouldn't want to give up the ability to easily do ad-hoc
>> >>> aggregations without a lot of forethought.  If you're worried about
>> >>> scaling, there are several options for horizontally scaling Postgres
>> >>> in particular.  One of the current best from what I've worked with is
>> >>> Citus.
>> >>>
>> >>> On Thu, Sep 29, 2016 at 10:15 AM, Deepak Sharma <
>> deepakmca05@gmail.com>
>> >>> wrote:
>> >>> > Hi Cody
>> >>> > Spark direct stream is just fine for this use case.
>> >>> > But why postgres and not cassandra?
>> >>> > Is there anything specific here that I may not be aware of?
>> >>> >
>> >>> > Thanks
>> >>> > Deepak
>> >>> >
>> >>> > On Thu, Sep 29, 2016 at 8:41 PM, Cody Koeninger <cody@koeninger.org
>> >
>> >>> > wrote:
>> >>> >>
>> >>> >> How are you going to handle etl failures?  Do you care about lost /
>> >>> >> duplicated data?  Are your writes idempotent?
>> >>> >>
>> >>> >> Absent any other information about the problem, I'd stay away from
>> >>> >> cassandra/flume/hdfs/hbase/whatever, and use a spark direct stream
>> >>> >> feeding postgres.
>> >>> >>
>> >>> >> On Thu, Sep 29, 2016 at 10:04 AM, Ali Akhtar <ali.rac200@gmail.com
>> >
>> >>> >> wrote:
>> >>> >> > Is there an advantage to that vs directly consuming from Kafka?
>> >>> >> > Nothing
>> >>> >> > is
>> >>> >> > being done to the data except some light ETL and then storing it
>> in
>> >>> >> > Cassandra
>> >>> >> >
>> >>> >> > On Thu, Sep 29, 2016 at 7:58 PM, Deepak Sharma
>> >>> >> > <de...@gmail.com>
>> >>> >> > wrote:
>> >>> >> >>
>> >>> >> >> It's better you use Spark's direct stream to ingest from Kafka.
>> >>> >> >>
>> >>> >> >> On Thu, Sep 29, 2016 at 8:24 PM, Ali Akhtar <
>> ali.rac200@gmail.com>
>> >>> >> >> wrote:
>> >>> >> >>>
>> >>> >> >>> I don't think I need a different speed storage and batch
>> storage.
>> >>> >> >>> Just
>> >>> >> >>> taking in raw data from Kafka, standardizing, and storing it
>> >>> >> >>> somewhere
>> >>> >> >>> where
>> >>> >> >>> the web UI can query it, seems like it will be enough.
>> >>> >> >>>
>> >>> >> >>> I'm thinking about:
>> >>> >> >>>
>> >>> >> >>> - Reading data from Kafka via Spark Streaming
>> >>> >> >>> - Standardizing, then storing it in Cassandra
>> >>> >> >>> - Querying Cassandra from the web ui
>> >>> >> >>>
>> >>> >> >>> That seems like it will work. My question now is whether to use
>> >>> >> >>> Spark
>> >>> >> >>> Streaming to read Kafka, or use Kafka consumers directly.
>> >>> >> >>>
>> >>> >> >>>
>> >>> >> >>> On Thu, Sep 29, 2016 at 7:41 PM, Mich Talebzadeh
>> >>> >> >>> <mi...@gmail.com> wrote:
>> >>> >> >>>>
>> >>> >> >>>> - Spark Streaming to read data from Kafka
>> >>> >> >>>> - Storing the data on HDFS using Flume
>> >>> >> >>>>
>> >>> >> >>>> You don't need Spark streaming to read data from Kafka and
>> store
>> >>> >> >>>> on
>> >>> >> >>>> HDFS. It is a waste of resources.
>> >>> >> >>>>
>> >>> >> >>>> Couple Flume to use Kafka as source and HDFS as sink directly
>> >>> >> >>>>
>> >>> >> >>>> KafkaAgent.sources = kafka-sources
>> >>> >> >>>> KafkaAgent.sinks = hdfs-sinks
>> >>> >> >>>> KafkaAgent.sinks.hdfs-sinks.type = hdfs
>> >>> >> >>>>
>> >>> >> >>>> That will be for your batch layer. To analyse you can directly
>> >>> >> >>>> read
>> >>> >> >>>> from
>> >>> >> >>>> hdfs files with Spark or simply store data in a database of
>> your
>> >>> >> >>>> choice via
>> >>> >> >>>> cron or something. Do not mix your batch layer with speed
>> layer.
>> >>> >> >>>>
>> >>> >> >>>> Your speed layer will ingest the same data directly from Kafka
>> >>> >> >>>> into
>> >>> >> >>>> spark streaming and that will be  online or near real time
>> >>> >> >>>> (defined
>> >>> >> >>>> by your
>> >>> >> >>>> window).
>> >>> >> >>>>
>> >>> >> >>>> Then you have a serving layer to present data from both
>> speed
>> >>> >> >>>> (the
>> >>> >> >>>> one from SS) and batch layer.
>> >>> >> >>>>
>> >>> >> >>>> HTH
>> >>> >> >>>>
>> >>> >> >>>>
>> >>> >> >>>>
>> >>> >> >>>>
>> >>> >> >>>> Dr Mich Talebzadeh
>> >>> >> >>>>
>> >>> >> >>>>
>> >>> >> >>>>
>> >>> >> >>>> LinkedIn
>> >>> >> >>>>
>> >>> >> >>>>
>> >>> >> >>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJ
>> d6zP6AcPCCdOABUrV8Pw
>> >>> >> >>>>
>> >>> >> >>>>
>> >>> >> >>>>
>> >>> >> >>>> http://talebzadehmich.wordpress.com
>> >>> >> >>>>
>> >>> >> >>>>
>> >>> >> >>>> Disclaimer: Use it at your own risk. Any and all
>> responsibility
>> >>> >> >>>> for
>> >>> >> >>>> any
>> >>> >> >>>> loss, damage or destruction of data or any other property
>> which
>> >>> >> >>>> may
>> >>> >> >>>> arise
>> >>> >> >>>> from relying on this email's technical content is explicitly
>> >>> >> >>>> disclaimed. The
>> >>> >> >>>> author will in no case be liable for any monetary damages
>> arising
>> >>> >> >>>> from such
>> >>> >> >>>> loss, damage or destruction.
>> >>> >> >>>>
>> >>> >> >>>>
>> >>> >> >>>>
>> >>> >> >>>>
>> >>> >> >>>> On 29 September 2016 at 15:15, Ali Akhtar <
>> ali.rac200@gmail.com>
>> >>> >> >>>> wrote:
>> >>> >> >>>>>
>> >>> >> >>>>> The web UI is actually the speed layer, it needs to be able
>> to
>> >>> >> >>>>> query
>> >>> >> >>>>> the data online, and show the results in real-time.
>> >>> >> >>>>>
>> >>> >> >>>>> It also needs a custom front-end, so a system like Tableau
>> can't
>> >>> >> >>>>> be
>> >>> >> >>>>> used, it must have a custom backend + front-end.
>> >>> >> >>>>>
>> >>> >> >>>>> Thanks for the recommendation of Flume. Do you think this
>> will
>> >>> >> >>>>> work:
>> >>> >> >>>>>
>> >>> >> >>>>> - Spark Streaming to read data from Kafka
>> >>> >> >>>>> - Storing the data on HDFS using Flume
>> >>> >> >>>>> - Using Spark to query the data in the backend of the web UI?
>> >>> >> >>>>>
>> >>> >> >>>>>
>> >>> >> >>>>>
>> >>> >> >>>>> On Thu, Sep 29, 2016 at 7:08 PM, Mich Talebzadeh
>> >>> >> >>>>> <mi...@gmail.com> wrote:
>> >>> >> >>>>>>
>> >>> >> >>>>>> You need a batch layer and a speed layer. Data from Kafka
>> can
>> >>> >> >>>>>> be
>> >>> >> >>>>>> stored on HDFS using flume.
>> >>> >> >>>>>>
>> >>> >> >>>>>> -  Query this data to generate reports / analytics (There
>> will
>> >>> >> >>>>>> be a
>> >>> >> >>>>>> web UI which will be the front-end to the data, and will
>> show
>> >>> >> >>>>>> the
>> >>> >> >>>>>> reports)
>> >>> >> >>>>>>
>> >>> >> >>>>>> This is basically batch layer and you need something like
>> >>> >> >>>>>> Tableau
>> >>> >> >>>>>> or
>> >>> >> >>>>>> Zeppelin to query data
>> >>> >> >>>>>>
>> >>> >> >>>>>> You will also need spark streaming to query data online for
>> >>> >> >>>>>> speed
>> >>> >> >>>>>> layer. That data could be stored in some transient fabric
>> like
>> >>> >> >>>>>> ignite or
>> >>> >> >>>>>> even druid.
>> >>> >> >>>>>>
>> >>> >> >>>>>> HTH
>> >>> >> >>>>>>
>> >>> >> >>>>>>
>> >>> >> >>>>>>
>> >>> >> >>>>>>
>> >>> >> >>>>>>
>> >>> >> >>>>>>
>> >>> >> >>>>>>
>> >>> >> >>>>>>
>> >>> >> >>>>>> Dr Mich Talebzadeh
>> >>> >> >>>>>>
>> >>> >> >>>>>>
>> >>> >> >>>>>>
>> >>> >> >>>>>> LinkedIn
>> >>> >> >>>>>>
>> >>> >> >>>>>>
>> >>> >> >>>>>> https://www.linkedin.com/profi
>> le/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> >>> >> >>>>>>
>> >>> >> >>>>>>
>> >>> >> >>>>>>
>> >>> >> >>>>>> http://talebzadehmich.wordpress.com
>> >>> >> >>>>>>
>> >>> >> >>>>>>
>> >>> >> >>>>>> Disclaimer: Use it at your own risk. Any and all
>> responsibility
>> >>> >> >>>>>> for
>> >>> >> >>>>>> any loss, damage or destruction of data or any other
>> property
>> >>> >> >>>>>> which
>> >>> >> >>>>>> may
>> >>> >> >>>>>> arise from relying on this email's technical content is
>> >>> >> >>>>>> explicitly
>> >>> >> >>>>>> disclaimed. The author will in no case be liable for any
>> >>> >> >>>>>> monetary
>> >>> >> >>>>>> damages
>> >>> >> >>>>>> arising from such loss, damage or destruction.
>> >>> >> >>>>>>
>> >>> >> >>>>>>
>> >>> >> >>>>>>
>> >>> >> >>>>>>
>> >>> >> >>>>>> On 29 September 2016 at 15:01, Ali Akhtar
>> >>> >> >>>>>> <al...@gmail.com>
>> >>> >> >>>>>> wrote:
>> >>> >> >>>>>>>
>> >>> >> >>>>>>> It needs to be able to scale to a very large amount of
>> data,
>> >>> >> >>>>>>> yes.
>> >>> >> >>>>>>>
>> >>> >> >>>>>>> On Thu, Sep 29, 2016 at 7:00 PM, Deepak Sharma
>> >>> >> >>>>>>> <de...@gmail.com> wrote:
>> >>> >> >>>>>>>>
>> >>> >> >>>>>>>> What is the message inflow?
>> >>> >> >>>>>>>> If it's really high, definitely Spark will be of great
>> >>> >> >>>>>>>> use.
>> >>> >> >>>>>>>>
>> >>> >> >>>>>>>> Thanks
>> >>> >> >>>>>>>> Deepak
>> >>> >> >>>>>>>>
>> >>> >> >>>>>>>>
>> >>> >> >>>>>>>> On Sep 29, 2016 19:24, "Ali Akhtar" <ali.rac200@gmail.com
>> >
>> >>> >> >>>>>>>> wrote:
>> >>> >> >>>>>>>>>
>> >>> >> >>>>>>>>> I have a somewhat tricky use case, and I'm looking for
>> >>> >> >>>>>>>>> ideas.
>> >>> >> >>>>>>>>>
>> >>> >> >>>>>>>>> I have 5-6 Kafka producers, reading various APIs, and
>> >>> >> >>>>>>>>> writing
>> >>> >> >>>>>>>>> their
>> >>> >> >>>>>>>>> raw data into Kafka.
>> >>> >> >>>>>>>>>
>> >>> >> >>>>>>>>> I need to:
>> >>> >> >>>>>>>>>
>> >>> >> >>>>>>>>> - Do ETL on the data, and standardize it.
>> >>> >> >>>>>>>>>
>> >>> >> >>>>>>>>> - Store the standardized data somewhere (HBase /
>> Cassandra /
>> >>> >> >>>>>>>>> Raw
>> >>> >> >>>>>>>>> HDFS / ElasticSearch / Postgres)
>> >>> >> >>>>>>>>>
>> >>> >> >>>>>>>>> - Query this data to generate reports / analytics (There
>> >>> >> >>>>>>>>> will be
>> >>> >> >>>>>>>>> a
>> >>> >> >>>>>>>>> web UI which will be the front-end to the data, and will
>> >>> >> >>>>>>>>> show
>> >>> >> >>>>>>>>> the reports)
>> >>> >> >>>>>>>>>
>> >>> >> >>>>>>>>> Java is being used as the backend language for everything
>> >>> >> >>>>>>>>> (backend
>> >>> >> >>>>>>>>> of the web UI, as well as the ETL layer)
>> >>> >> >>>>>>>>>
>> >>> >> >>>>>>>>> I'm considering:
>> >>> >> >>>>>>>>>
>> >>> >> >>>>>>>>> - Using raw Kafka consumers, or Spark Streaming, as the
>> ETL
>> >>> >> >>>>>>>>> layer
>> >>> >> >>>>>>>>> (receive raw data from Kafka, standardize & store it)
>> >>> >> >>>>>>>>>
>> >>> >> >>>>>>>>> - Using Cassandra, HBase, or raw HDFS, for storing the
>> >>> >> >>>>>>>>> standardized
>> >>> >> >>>>>>>>> data, and to allow queries
>> >>> >> >>>>>>>>>
>> >>> >> >>>>>>>>> - In the backend of the web UI, I could either use Spark
>> to
>> >>> >> >>>>>>>>> run
>> >>> >> >>>>>>>>> queries across the data (mostly filters), or directly run
>> >>> >> >>>>>>>>> queries against
>> >>> >> >>>>>>>>> Cassandra / HBase
>> >>> >> >>>>>>>>>
>> >>> >> >>>>>>>>> I'd appreciate some thoughts / suggestions on which of
>> these
>> >>> >> >>>>>>>>> alternatives I should go with (e.g, using raw Kafka
>> >>> >> >>>>>>>>> consumers vs
>> >>> >> >>>>>>>>> Spark for
>> >>> >> >>>>>>>>> ETL, which persistent data store to use, and how to query
>> >>> >> >>>>>>>>> that
>> >>> >> >>>>>>>>> data store in
>> >>> >> >>>>>>>>> the backend of the web UI, for displaying the reports).
>> >>> >> >>>>>>>>>
>> >>> >> >>>>>>>>>
>> >>> >> >>>>>>>>> Thanks.
>> >>> >> >>>>>>>
>> >>> >> >>>>>>>
>> >>> >> >>>>>>
>> >>> >> >>>>>
>> >>> >> >>>>
>> >>> >> >>>
>> >>> >> >>
>> >>> >> >>
>> >>> >> >>
>> >>> >> >> --
>> >>> >> >> Thanks
>> >>> >> >> Deepak
>> >>> >> >> www.bigdatabig.com
>> >>> >> >> www.keosha.net
>> >>> >> >
>> >>> >> >
>> >>> >>
>> >>> >> ------------------------------------------------------------
>> ---------
>> >>> >> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>> >>> >>
>> >>> >
>> >>> >
>> >>> >
>> >>> > --
>> >>> > Thanks
>> >>> > Deepak
>> >>> > www.bigdatabig.com
>> >>> > www.keosha.net
>> >>
>> >>
>> >
>> >
>> >
>> > --
>> > Thanks
>> > Deepak
>> > www.bigdatabig.com
>> > www.keosha.net
>>
>
>

Re: Architecture recommendations for a tricky use case

Posted by Ali Akhtar <al...@gmail.com>.
The business use case is to read a user's data from a variety of different
services through their APIs, and then allow the user to query that data on
a per-service basis, as well as in aggregate across all services.

The way I'm considering doing it is to do some basic ETL (drop all the
unnecessary fields, rename some fields to something more manageable, etc.)
and then store the data in Cassandra / Postgres.
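
As a rough sketch, that ETL step with a plain Kafka consumer could look
like the following. The topic name, "data" table, column names, and JSON
payload shape are assumptions for illustration only, and the upsert
syntax needs Postgres 9.5+ with a unique key on (user_id, event_date):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.Timestamp;
import java.util.Collections;
import java.util.Properties;

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class EtlWorker {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092"); // placeholder
        props.put("group.id", "etl-workers");
        props.put("enable.auto.commit", "false");
        props.put("key.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");

        ObjectMapper mapper = new ObjectMapper();
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
             Connection conn = DriverManager.getConnection(
                 "jdbc:postgresql://localhost:5432/reports", "user", "pass")) {
            consumer.subscribe(Collections.singletonList("raw-events"));
            // Upsert keeps the write idempotent if a batch is replayed
            PreparedStatement upsert = conn.prepareStatement(
                "INSERT INTO data (user_id, event_date, some_field) "
                + "VALUES (?, ?, ?) ON CONFLICT (user_id, event_date) "
                + "DO UPDATE SET some_field = EXCLUDED.some_field");
            while (true) {
                for (ConsumerRecord<String, String> rec : consumer.poll(1000)) {
                    JsonNode raw = mapper.readTree(rec.value());
                    // Standardize: keep only the needed fields, under stable names
                    upsert.setString(1, raw.get("userId").asText());
                    upsert.setTimestamp(2, new Timestamp(raw.get("ts").asLong()));
                    upsert.setString(3, raw.get("someField").asText());
                    upsert.executeUpdate();
                }
                consumer.commitSync(); // commit offsets only after rows are stored
            }
        }
    }
}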

Then, when the user wants to view a particular report, query the respective
table in Cassandra / Postgres. (select .. from data where user = ? and date
between <start> and <end> and some_field = ?)
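
Rendered as a minimal JDBC sketch (the "data" table and column names are
placeholders carried over from the pseudo-query above):

import java.sql.Connection;
import java.sql.Date;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

public class ReportDao {
    // Per-service report lookup; an aggregate report would group over service
    public List<String> report(Connection conn, String userId, Date start,
                               Date end, String someValue) throws SQLException {
        List<String> rows = new ArrayList<>();
        try (PreparedStatement ps = conn.prepareStatement(
                "SELECT event_date, some_field FROM data WHERE user_id = ? "
                + "AND event_date BETWEEN ? AND ? AND some_field = ?")) {
            ps.setString(1, userId);
            ps.setDate(2, start);
            ps.setDate(3, end);
            ps.setString(4, someValue);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    rows.add(rs.getDate("event_date") + " "
                        + rs.getString("some_field"));
                }
            }
        }
        return rows;
    }
}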

How will Spark Streaming help w/ aggregation? Couldn't the data be queried
from Cassandra / Postgres via the Kafka consumer and aggregated that way?

On Thu, Sep 29, 2016 at 8:43 PM, Cody Koeninger <co...@koeninger.org> wrote:

> No, direct stream in and of itself won't ensure an end-to-end
> guarantee, because it doesn't know anything about your output actions.
>
> You still need to do some work.  The point is that having easy access
> to offsets for each batch, on a per-partition basis, makes it easier to
> do that work, especially in conjunction with aggregation.
>
> On Thu, Sep 29, 2016 at 10:40 AM, Deepak Sharma <de...@gmail.com>
> wrote:
> > If you use spark direct streams , it ensure end to end guarantee for
> > messages.
> >
> >
> > On Thu, Sep 29, 2016 at 9:05 PM, Ali Akhtar <al...@gmail.com>
> wrote:
> >>
> >> My concern with Postgres / Cassandra is only scalability. I will look
> >> further into Postgres horizontal scaling, thanks.
> >>
> >> Writes could be idempotent if done as upserts, otherwise updates will be
> >> idempotent but not inserts.
> >>
> >> Data should not be lost. The system should be as fault tolerant as
> >> possible.
> >>
> >> What's the advantage of using Spark for reading Kafka instead of direct
> >> Kafka consumers?
> >>
> >> On Thu, Sep 29, 2016 at 8:28 PM, Cody Koeninger <co...@koeninger.org>
> >> wrote:
> >>>
> >>> I wouldn't give up the flexibility and maturity of a relational
> >>> database, unless you have a very specific use case.  I'm not trashing
> >>> cassandra, I've used cassandra, but if all I know is that you're doing
> >>> analytics, I wouldn't want to give up the ability to easily do ad-hoc
> >>> aggregations without a lot of forethought.  If you're worried about
> >>> scaling, there are several options for horizontally scaling Postgres
> >>> in particular.  One of the current best from what I've worked with is
> >>> Citus.
> >>>
> >>> On Thu, Sep 29, 2016 at 10:15 AM, Deepak Sharma <deepakmca05@gmail.com
> >
> >>> wrote:
> >>> > Hi Cody
> >>> > Spark direct stream is just fine for this use case.
> >>> > But why postgres and not cassandra?
> >>> > Is there anything specific here that i may not be aware?
> >>> >
> >>> > Thanks
> >>> > Deepak
> >>> >
> >>> > On Thu, Sep 29, 2016 at 8:41 PM, Cody Koeninger <co...@koeninger.org>
> >>> > wrote:
> >>> >>
> >>> >> How are you going to handle etl failures?  Do you care about lost /
> >>> >> duplicated data?  Are your writes idempotent?
> >>> >>
> >>> >> Absent any other information about the problem, I'd stay away from
> >>> >> cassandra/flume/hdfs/hbase/whatever, and use a spark direct stream
> >>> >> feeding postgres.
> >>> >>
> >>> >> On Thu, Sep 29, 2016 at 10:04 AM, Ali Akhtar <al...@gmail.com>
> >>> >> wrote:
> >>> >> > Is there an advantage to that vs directly consuming from Kafka?
> >>> >> > Nothing
> >>> >> > is
> >>> >> > being done to the data except some light ETL and then storing it
> in
> >>> >> > Cassandra
> >>> >> >
> >>> >> > On Thu, Sep 29, 2016 at 7:58 PM, Deepak Sharma
> >>> >> > <de...@gmail.com>
> >>> >> > wrote:
> >>> >> >>
> >>> >> >> Its better you use spark's direct stream to ingest from kafka.
> >>> >> >>
> >>> >> >> On Thu, Sep 29, 2016 at 8:24 PM, Ali Akhtar <
> ali.rac200@gmail.com>
> >>> >> >> wrote:
> >>> >> >>>
> >>> >> >>> I don't think I need a different speed storage and batch
> storage.
> >>> >> >>> Just
> >>> >> >>> taking in raw data from Kafka, standardizing, and storing it
> >>> >> >>> somewhere
> >>> >> >>> where
> >>> >> >>> the web UI can query it, seems like it will be enough.
> >>> >> >>>
> >>> >> >>> I'm thinking about:
> >>> >> >>>
> >>> >> >>> - Reading data from Kafka via Spark Streaming
> >>> >> >>> - Standardizing, then storing it in Cassandra
> >>> >> >>> - Querying Cassandra from the web ui
> >>> >> >>>
> >>> >> >>> That seems like it will work. My question now is whether to use
> >>> >> >>> Spark
> >>> >> >>> Streaming to read Kafka, or use Kafka consumers directly.
> >>> >> >>>
> >>> >> >>>
> >>> >> >>> On Thu, Sep 29, 2016 at 7:41 PM, Mich Talebzadeh
> >>> >> >>> <mi...@gmail.com> wrote:
> >>> >> >>>>
> >>> >> >>>> - Spark Streaming to read data from Kafka
> >>> >> >>>> - Storing the data on HDFS using Flume
> >>> >> >>>>
> >>> >> >>>> You don't need Spark streaming to read data from Kafka and
> store
> >>> >> >>>> on
> >>> >> >>>> HDFS. It is a waste of resources.
> >>> >> >>>>
> >>> >> >>>> Couple Flume to use Kafka as source and HDFS as sink directly
> >>> >> >>>>
> >>> >> >>>> KafkaAgent.sources = kafka-sources
> >>> >> >>>> KafkaAgent.sinks.hdfs-sinks.type = hdfs
> >>> >> >>>>
> >>> >> >>>> That will be for your batch layer. To analyse you can directly
> >>> >> >>>> read
> >>> >> >>>> from
> >>> >> >>>> hdfs files with Spark or simply store data in a database of
> your
> >>> >> >>>> choice via
> >>> >> >>>> cron or something. Do not mix your batch layer with speed
> layer.
> >>> >> >>>>
> >>> >> >>>> Your speed layer will ingest the same data directly from Kafka
> >>> >> >>>> into
> >>> >> >>>> spark streaming and that will be  online or near real time
> >>> >> >>>> (defined
> >>> >> >>>> by your
> >>> >> >>>> window).
> >>> >> >>>>
> >>> >> >>>> Then you have a a serving layer to present data from both speed
> >>> >> >>>> (the
> >>> >> >>>> one from SS) and batch layer.
> >>> >> >>>>
> >>> >> >>>> HTH
> >>> >> >>>>
> >>> >> >>>>
> >>> >> >>>>
> >>> >> >>>>
> >>> >> >>>> Dr Mich Talebzadeh
> >>> >> >>>>
> >>> >> >>>>
> >>> >> >>>>
> >>> >> >>>> LinkedIn
> >>> >> >>>>
> >>> >> >>>>
> >>> >> >>>> https://www.linkedin.com/profile/view?id=
> AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> >>> >> >>>>
> >>> >> >>>>
> >>> >> >>>>
> >>> >> >>>> http://talebzadehmich.wordpress.com
> >>> >> >>>>
> >>> >> >>>>
> >>> >> >>>> Disclaimer: Use it at your own risk. Any and all responsibility
> >>> >> >>>> for
> >>> >> >>>> any
> >>> >> >>>> loss, damage or destruction of data or any other property which
> >>> >> >>>> may
> >>> >> >>>> arise
> >>> >> >>>> from relying on this email's technical content is explicitly
> >>> >> >>>> disclaimed. The
> >>> >> >>>> author will in no case be liable for any monetary damages
> arising
> >>> >> >>>> from such
> >>> >> >>>> loss, damage or destruction.
> >>> >> >>>>
> >>> >> >>>>
> >>> >> >>>>
> >>> >> >>>>
> >>> >> >>>> On 29 September 2016 at 15:15, Ali Akhtar <
> ali.rac200@gmail.com>
> >>> >> >>>> wrote:
> >>> >> >>>>>
> >>> >> >>>>> The web UI is actually the speed layer, it needs to be able to
> >>> >> >>>>> query
> >>> >> >>>>> the data online, and show the results in real-time.
> >>> >> >>>>>
> >>> >> >>>>> It also needs a custom front-end, so a system like Tableau
> can't
> >>> >> >>>>> be
> >>> >> >>>>> used, it must have a custom backend + front-end.
> >>> >> >>>>>
> >>> >> >>>>> Thanks for the recommendation of Flume. Do you think this will
> >>> >> >>>>> work:
> >>> >> >>>>>
> >>> >> >>>>> - Spark Streaming to read data from Kafka
> >>> >> >>>>> - Storing the data on HDFS using Flume
> >>> >> >>>>> - Using Spark to query the data in the backend of the web UI?
> >>> >> >>>>>
> >>> >> >>>>>
> >>> >> >>>>>
> >>> >> >>>>> On Thu, Sep 29, 2016 at 7:08 PM, Mich Talebzadeh
> >>> >> >>>>> <mi...@gmail.com> wrote:
> >>> >> >>>>>>
> >>> >> >>>>>> You need a batch layer and a speed layer. Data from Kafka can
> >>> >> >>>>>> be
> >>> >> >>>>>> stored on HDFS using flume.
> >>> >> >>>>>>
> >>> >> >>>>>> -  Query this data to generate reports / analytics (There
> will
> >>> >> >>>>>> be a
> >>> >> >>>>>> web UI which will be the front-end to the data, and will show
> >>> >> >>>>>> the
> >>> >> >>>>>> reports)
> >>> >> >>>>>>
> >>> >> >>>>>> This is basically batch layer and you need something like
> >>> >> >>>>>> Tableau
> >>> >> >>>>>> or
> >>> >> >>>>>> Zeppelin to query data
> >>> >> >>>>>>
> >>> >> >>>>>> You will also need spark streaming to query data online for
> >>> >> >>>>>> speed
> >>> >> >>>>>> layer. That data could be stored in some transient fabric
> like
> >>> >> >>>>>> ignite or
> >>> >> >>>>>> even druid.
> >>> >> >>>>>>
> >>> >> >>>>>> HTH
> >>> >> >>>>>>
> >>> >> >>>>>>
> >>> >> >>>>>>
> >>> >> >>>>>>
> >>> >> >>>>>>
> >>> >> >>>>>>
> >>> >> >>>>>>
> >>> >> >>>>>>
> >>> >> >>>>>> Dr Mich Talebzadeh
> >>> >> >>>>>>
> >>> >> >>>>>>
> >>> >> >>>>>>
> >>> >> >>>>>> LinkedIn
> >>> >> >>>>>>
> >>> >> >>>>>>
> >>> >> >>>>>> https://www.linkedin.com/profile/view?id=
> AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> >>> >> >>>>>>
> >>> >> >>>>>>
> >>> >> >>>>>>
> >>> >> >>>>>> http://talebzadehmich.wordpress.com
> >>> >> >>>>>>
> >>> >> >>>>>>
> >>> >> >>>>>> Disclaimer: Use it at your own risk. Any and all
> responsibility
> >>> >> >>>>>> for
> >>> >> >>>>>> any loss, damage or destruction of data or any other property
> >>> >> >>>>>> which
> >>> >> >>>>>> may
> >>> >> >>>>>> arise from relying on this email's technical content is
> >>> >> >>>>>> explicitly
> >>> >> >>>>>> disclaimed. The author will in no case be liable for any
> >>> >> >>>>>> monetary
> >>> >> >>>>>> damages
> >>> >> >>>>>> arising from such loss, damage or destruction.
> >>> >> >>>>>>
> >>> >> >>>>>>
> >>> >> >>>>>>
> >>> >> >>>>>>
> >>> >> >>>>>> On 29 September 2016 at 15:01, Ali Akhtar
> >>> >> >>>>>> <al...@gmail.com>
> >>> >> >>>>>> wrote:
> >>> >> >>>>>>>
> >>> >> >>>>>>> It needs to be able to scale to a very large amount of data,
> >>> >> >>>>>>> yes.
> >>> >> >>>>>>>
> >>> >> >>>>>>> On Thu, Sep 29, 2016 at 7:00 PM, Deepak Sharma
> >>> >> >>>>>>> <de...@gmail.com> wrote:
> >>> >> >>>>>>>>
> >>> >> >>>>>>>> What is the message inflow ?
> >>> >> >>>>>>>> If it's really high , definitely spark will be of great
> use .
> >>> >> >>>>>>>>
> >>> >> >>>>>>>> Thanks
> >>> >> >>>>>>>> Deepak
> >>> >> >>>>>>>>
> >>> >> >>>>>>>>
> >>> >> >>>>>>>> On Sep 29, 2016 19:24, "Ali Akhtar" <al...@gmail.com>
> >>> >> >>>>>>>> wrote:
> >>> >> >>>>>>>>>
> >>> >> >>>>>>>>> I have a somewhat tricky use case, and I'm looking for
> >>> >> >>>>>>>>> ideas.
> >>> >> >>>>>>>>>
> >>> >> >>>>>>>>> I have 5-6 Kafka producers, reading various APIs, and
> >>> >> >>>>>>>>> writing
> >>> >> >>>>>>>>> their
> >>> >> >>>>>>>>> raw data into Kafka.
> >>> >> >>>>>>>>>
> >>> >> >>>>>>>>> I need to:
> >>> >> >>>>>>>>>
> >>> >> >>>>>>>>> - Do ETL on the data, and standardize it.
> >>> >> >>>>>>>>>
> >>> >> >>>>>>>>> - Store the standardized data somewhere (HBase /
> Cassandra /
> >>> >> >>>>>>>>> Raw
> >>> >> >>>>>>>>> HDFS / ElasticSearch / Postgres)
> >>> >> >>>>>>>>>
> >>> >> >>>>>>>>> - Query this data to generate reports / analytics (There
> >>> >> >>>>>>>>> will be
> >>> >> >>>>>>>>> a
> >>> >> >>>>>>>>> web UI which will be the front-end to the data, and will
> >>> >> >>>>>>>>> show
> >>> >> >>>>>>>>> the reports)
> >>> >> >>>>>>>>>
> >>> >> >>>>>>>>> Java is being used as the backend language for everything
> >>> >> >>>>>>>>> (backend
> >>> >> >>>>>>>>> of the web UI, as well as the ETL layer)
> >>> >> >>>>>>>>>
> >>> >> >>>>>>>>> I'm considering:
> >>> >> >>>>>>>>>
> >>> >> >>>>>>>>> - Using raw Kafka consumers, or Spark Streaming, as the
> ETL
> >>> >> >>>>>>>>> layer
> >>> >> >>>>>>>>> (receive raw data from Kafka, standardize & store it)
> >>> >> >>>>>>>>>
> >>> >> >>>>>>>>> - Using Cassandra, HBase, or raw HDFS, for storing the
> >>> >> >>>>>>>>> standardized
> >>> >> >>>>>>>>> data, and to allow queries
> >>> >> >>>>>>>>>
> >>> >> >>>>>>>>> - In the backend of the web UI, I could either use Spark
> to
> >>> >> >>>>>>>>> run
> >>> >> >>>>>>>>> queries across the data (mostly filters), or directly run
> >>> >> >>>>>>>>> queries against
> >>> >> >>>>>>>>> Cassandra / HBase
> >>> >> >>>>>>>>>
> >>> >> >>>>>>>>> I'd appreciate some thoughts / suggestions on which of
> these
> >>> >> >>>>>>>>> alternatives I should go with (e.g, using raw Kafka
> >>> >> >>>>>>>>> consumers vs
> >>> >> >>>>>>>>> Spark for
> >>> >> >>>>>>>>> ETL, which persistent data store to use, and how to query
> >>> >> >>>>>>>>> that
> >>> >> >>>>>>>>> data store in
> >>> >> >>>>>>>>> the backend of the web UI, for displaying the reports).
> >>> >> >>>>>>>>>
> >>> >> >>>>>>>>>
> >>> >> >>>>>>>>> Thanks.
> >>> >> >>>>>>>
> >>> >> >>>>>>>
> >>> >> >>>>>>
> >>> >> >>>>>
> >>> >> >>>>
> >>> >> >>>
> >>> >> >>
> >>> >> >>
> >>> >> >>
> >>> >> >> --
> >>> >> >> Thanks
> >>> >> >> Deepak
> >>> >> >> www.bigdatabig.com
> >>> >> >> www.keosha.net
> >>> >> >
> >>> >> >
> >>> >>
> >>> >> ------------------------------------------------------------
> ---------
> >>> >> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
> >>> >>
> >>> >
> >>> >
> >>> >
> >>> > --
> >>> > Thanks
> >>> > Deepak
> >>> > www.bigdatabig.com
> >>> > www.keosha.net
> >>
> >>
> >
> >
> >
> > --
> > Thanks
> > Deepak
> > www.bigdatabig.com
> > www.keosha.net
>

Re: Architecture recommendations for a tricky use case

Posted by Cody Koeninger <co...@koeninger.org>.
No, direct stream in and of itself won't ensure an end-to-end
guarantee, because it doesn't know anything about your output actions.

You still need to do some work.  The point is that having easy access to
offsets for each batch, on a per-partition basis, makes it easier to do
that work, especially in conjunction with aggregation.
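
A minimal Java sketch of that offset access, against the Kafka 0.8
direct stream API (spark-streaming-kafka, Spark 1.6+). The broker and
topic below are placeholders, and the 0.10 integration uses slightly
different classes:

import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

import kafka.serializer.StringDecoder;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaPairInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka.HasOffsetRanges;
import org.apache.spark.streaming.kafka.KafkaUtils;
import org.apache.spark.streaming.kafka.OffsetRange;

public class DirectStreamOffsets {
    public static void main(String[] args) throws Exception {
        SparkConf conf = new SparkConf().setAppName("direct-stream-offsets");
        JavaStreamingContext jssc =
            new JavaStreamingContext(conf, Durations.seconds(30));

        Map<String, String> kafkaParams = new HashMap<>();
        kafkaParams.put("metadata.broker.list", "broker1:9092"); // placeholder
        Set<String> topics = new HashSet<>(Arrays.asList("raw-events"));

        JavaPairInputDStream<String, String> stream = KafkaUtils.createDirectStream(
            jssc, String.class, String.class,
            StringDecoder.class, StringDecoder.class, kafkaParams, topics);

        stream.foreachRDD(rdd -> {
            // Each RDD partition maps 1:1 to a Kafka topic-partition, so the
            // exact offset range of this batch is known up front.
            OffsetRange[] offsets = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
            for (OffsetRange o : offsets) {
                System.out.println(o.topic() + " " + o.partition() + " "
                    + o.fromOffset() + " -> " + o.untilOffset());
            }
            // Storing these offsets in the same transaction as the results
            // is the "work" that makes the output effectively exactly-once.
        });

        jssc.start();
        jssc.awaitTermination();
    }
}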

On Thu, Sep 29, 2016 at 10:40 AM, Deepak Sharma <de...@gmail.com> wrote:
> If you use spark direct streams , it ensure end to end guarantee for
> messages.
>
>
> On Thu, Sep 29, 2016 at 9:05 PM, Ali Akhtar <al...@gmail.com> wrote:
>>
>> My concern with Postgres / Cassandra is only scalability. I will look
>> further into Postgres horizontal scaling, thanks.
>>
>> Writes could be idempotent if done as upserts, otherwise updates will be
>> idempotent but not inserts.
>>
>> Data should not be lost. The system should be as fault tolerant as
>> possible.
>>
>> What's the advantage of using Spark for reading Kafka instead of direct
>> Kafka consumers?
>>
>> On Thu, Sep 29, 2016 at 8:28 PM, Cody Koeninger <co...@koeninger.org>
>> wrote:
>>>
>>> I wouldn't give up the flexibility and maturity of a relational
>>> database, unless you have a very specific use case.  I'm not trashing
>>> cassandra, I've used cassandra, but if all I know is that you're doing
>>> analytics, I wouldn't want to give up the ability to easily do ad-hoc
>>> aggregations without a lot of forethought.  If you're worried about
>>> scaling, there are several options for horizontally scaling Postgres
>>> in particular.  One of the current best from what I've worked with is
>>> Citus.
>>>
>>> On Thu, Sep 29, 2016 at 10:15 AM, Deepak Sharma <de...@gmail.com>
>>> wrote:
>>> > Hi Cody
>>> > Spark direct stream is just fine for this use case.
>>> > But why postgres and not cassandra?
>>> > Is there anything specific here that i may not be aware?
>>> >
>>> > Thanks
>>> > Deepak
>>> >
>>> > On Thu, Sep 29, 2016 at 8:41 PM, Cody Koeninger <co...@koeninger.org>
>>> > wrote:
>>> >>
>>> >> How are you going to handle etl failures?  Do you care about lost /
>>> >> duplicated data?  Are your writes idempotent?
>>> >>
>>> >> Absent any other information about the problem, I'd stay away from
>>> >> cassandra/flume/hdfs/hbase/whatever, and use a spark direct stream
>>> >> feeding postgres.
>>> >>
>>> >> On Thu, Sep 29, 2016 at 10:04 AM, Ali Akhtar <al...@gmail.com>
>>> >> wrote:
>>> >> > Is there an advantage to that vs directly consuming from Kafka?
>>> >> > Nothing
>>> >> > is
>>> >> > being done to the data except some light ETL and then storing it in
>>> >> > Cassandra
>>> >> >
>>> >> > On Thu, Sep 29, 2016 at 7:58 PM, Deepak Sharma
>>> >> > <de...@gmail.com>
>>> >> > wrote:
>>> >> >>
>>> >> >> Its better you use spark's direct stream to ingest from kafka.
>>> >> >>
>>> >> >> On Thu, Sep 29, 2016 at 8:24 PM, Ali Akhtar <al...@gmail.com>
>>> >> >> wrote:
>>> >> >>>
>>> >> >>> I don't think I need a different speed storage and batch storage.
>>> >> >>> Just
>>> >> >>> taking in raw data from Kafka, standardizing, and storing it
>>> >> >>> somewhere
>>> >> >>> where
>>> >> >>> the web UI can query it, seems like it will be enough.
>>> >> >>>
>>> >> >>> I'm thinking about:
>>> >> >>>
>>> >> >>> - Reading data from Kafka via Spark Streaming
>>> >> >>> - Standardizing, then storing it in Cassandra
>>> >> >>> - Querying Cassandra from the web ui
>>> >> >>>
>>> >> >>> That seems like it will work. My question now is whether to use
>>> >> >>> Spark
>>> >> >>> Streaming to read Kafka, or use Kafka consumers directly.
>>> >> >>>
>>> >> >>>
>>> >> >>> On Thu, Sep 29, 2016 at 7:41 PM, Mich Talebzadeh
>>> >> >>> <mi...@gmail.com> wrote:
>>> >> >>>>
>>> >> >>>> - Spark Streaming to read data from Kafka
>>> >> >>>> - Storing the data on HDFS using Flume
>>> >> >>>>
>>> >> >>>> You don't need Spark streaming to read data from Kafka and store
>>> >> >>>> on
>>> >> >>>> HDFS. It is a waste of resources.
>>> >> >>>>
>>> >> >>>> Couple Flume to use Kafka as source and HDFS as sink directly
>>> >> >>>>
>>> >> >>>> KafkaAgent.sources = kafka-sources
>>> >> >>>> KafkaAgent.sinks.hdfs-sinks.type = hdfs
>>> >> >>>>
>>> >> >>>> That will be for your batch layer. To analyse you can directly
>>> >> >>>> read
>>> >> >>>> from
>>> >> >>>> hdfs files with Spark or simply store data in a database of your
>>> >> >>>> choice via
>>> >> >>>> cron or something. Do not mix your batch layer with speed layer.
>>> >> >>>>
>>> >> >>>> Your speed layer will ingest the same data directly from Kafka
>>> >> >>>> into
>>> >> >>>> spark streaming and that will be  online or near real time
>>> >> >>>> (defined
>>> >> >>>> by your
>>> >> >>>> window).
>>> >> >>>>
>>> >> >>>> Then you have a a serving layer to present data from both speed
>>> >> >>>> (the
>>> >> >>>> one from SS) and batch layer.
>>> >> >>>>
>>> >> >>>> HTH
>>> >> >>>>
>>> >> >>>>
>>> >> >>>>
>>> >> >>>>
>>> >> >>>> Dr Mich Talebzadeh
>>> >> >>>>
>>> >> >>>>
>>> >> >>>>
>>> >> >>>> LinkedIn
>>> >> >>>>
>>> >> >>>>
>>> >> >>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> >> >>>>
>>> >> >>>>
>>> >> >>>>
>>> >> >>>> http://talebzadehmich.wordpress.com
>>> >> >>>>
>>> >> >>>>
>>> >> >>>> Disclaimer: Use it at your own risk. Any and all responsibility
>>> >> >>>> for
>>> >> >>>> any
>>> >> >>>> loss, damage or destruction of data or any other property which
>>> >> >>>> may
>>> >> >>>> arise
>>> >> >>>> from relying on this email's technical content is explicitly
>>> >> >>>> disclaimed. The
>>> >> >>>> author will in no case be liable for any monetary damages arising
>>> >> >>>> from such
>>> >> >>>> loss, damage or destruction.
>>> >> >>>>
>>> >> >>>>
>>> >> >>>>
>>> >> >>>>
>>> >> >>>> On 29 September 2016 at 15:15, Ali Akhtar <al...@gmail.com>
>>> >> >>>> wrote:
>>> >> >>>>>
>>> >> >>>>> The web UI is actually the speed layer, it needs to be able to
>>> >> >>>>> query
>>> >> >>>>> the data online, and show the results in real-time.
>>> >> >>>>>
>>> >> >>>>> It also needs a custom front-end, so a system like Tableau can't
>>> >> >>>>> be
>>> >> >>>>> used, it must have a custom backend + front-end.
>>> >> >>>>>
>>> >> >>>>> Thanks for the recommendation of Flume. Do you think this will
>>> >> >>>>> work:
>>> >> >>>>>
>>> >> >>>>> - Spark Streaming to read data from Kafka
>>> >> >>>>> - Storing the data on HDFS using Flume
>>> >> >>>>> - Using Spark to query the data in the backend of the web UI?
>>> >> >>>>>
>>> >> >>>>>
>>> >> >>>>>
>>> >> >>>>> On Thu, Sep 29, 2016 at 7:08 PM, Mich Talebzadeh
>>> >> >>>>> <mi...@gmail.com> wrote:
>>> >> >>>>>>
>>> >> >>>>>> You need a batch layer and a speed layer. Data from Kafka can
>>> >> >>>>>> be
>>> >> >>>>>> stored on HDFS using flume.
>>> >> >>>>>>
>>> >> >>>>>> -  Query this data to generate reports / analytics (There will
>>> >> >>>>>> be a
>>> >> >>>>>> web UI which will be the front-end to the data, and will show
>>> >> >>>>>> the
>>> >> >>>>>> reports)
>>> >> >>>>>>
>>> >> >>>>>> This is basically batch layer and you need something like
>>> >> >>>>>> Tableau
>>> >> >>>>>> or
>>> >> >>>>>> Zeppelin to query data
>>> >> >>>>>>
>>> >> >>>>>> You will also need spark streaming to query data online for
>>> >> >>>>>> speed
>>> >> >>>>>> layer. That data could be stored in some transient fabric like
>>> >> >>>>>> ignite or
>>> >> >>>>>> even druid.
>>> >> >>>>>>
>>> >> >>>>>> HTH
>>> >> >>>>>>
>>> >> >>>>>>
>>> >> >>>>>>
>>> >> >>>>>>
>>> >> >>>>>>
>>> >> >>>>>>
>>> >> >>>>>>
>>> >> >>>>>>
>>> >> >>>>>> Dr Mich Talebzadeh
>>> >> >>>>>>
>>> >> >>>>>>
>>> >> >>>>>>
>>> >> >>>>>> LinkedIn
>>> >> >>>>>>
>>> >> >>>>>>
>>> >> >>>>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> >> >>>>>>
>>> >> >>>>>>
>>> >> >>>>>>
>>> >> >>>>>> http://talebzadehmich.wordpress.com
>>> >> >>>>>>
>>> >> >>>>>>
>>> >> >>>>>> Disclaimer: Use it at your own risk. Any and all responsibility
>>> >> >>>>>> for
>>> >> >>>>>> any loss, damage or destruction of data or any other property
>>> >> >>>>>> which
>>> >> >>>>>> may
>>> >> >>>>>> arise from relying on this email's technical content is
>>> >> >>>>>> explicitly
>>> >> >>>>>> disclaimed. The author will in no case be liable for any
>>> >> >>>>>> monetary
>>> >> >>>>>> damages
>>> >> >>>>>> arising from such loss, damage or destruction.
>>> >> >>>>>>
>>> >> >>>>>>
>>> >> >>>>>>
>>> >> >>>>>>
>>> >> >>>>>> On 29 September 2016 at 15:01, Ali Akhtar
>>> >> >>>>>> <al...@gmail.com>
>>> >> >>>>>> wrote:
>>> >> >>>>>>>
>>> >> >>>>>>> It needs to be able to scale to a very large amount of data,
>>> >> >>>>>>> yes.
>>> >> >>>>>>>
>>> >> >>>>>>> On Thu, Sep 29, 2016 at 7:00 PM, Deepak Sharma
>>> >> >>>>>>> <de...@gmail.com> wrote:
>>> >> >>>>>>>>
>>> >> >>>>>>>> What is the message inflow ?
>>> >> >>>>>>>> If it's really high , definitely spark will be of great use .
>>> >> >>>>>>>>
>>> >> >>>>>>>> Thanks
>>> >> >>>>>>>> Deepak
>>> >> >>>>>>>>
>>> >> >>>>>>>>
>>> >> >>>>>>>> On Sep 29, 2016 19:24, "Ali Akhtar" <al...@gmail.com>
>>> >> >>>>>>>> wrote:
>>> >> >>>>>>>>>
>>> >> >>>>>>>>> I have a somewhat tricky use case, and I'm looking for
>>> >> >>>>>>>>> ideas.
>>> >> >>>>>>>>>
>>> >> >>>>>>>>> I have 5-6 Kafka producers, reading various APIs, and
>>> >> >>>>>>>>> writing
>>> >> >>>>>>>>> their
>>> >> >>>>>>>>> raw data into Kafka.
>>> >> >>>>>>>>>
>>> >> >>>>>>>>> I need to:
>>> >> >>>>>>>>>
>>> >> >>>>>>>>> - Do ETL on the data, and standardize it.
>>> >> >>>>>>>>>
>>> >> >>>>>>>>> - Store the standardized data somewhere (HBase / Cassandra /
>>> >> >>>>>>>>> Raw
>>> >> >>>>>>>>> HDFS / ElasticSearch / Postgres)
>>> >> >>>>>>>>>
>>> >> >>>>>>>>> - Query this data to generate reports / analytics (There
>>> >> >>>>>>>>> will be
>>> >> >>>>>>>>> a
>>> >> >>>>>>>>> web UI which will be the front-end to the data, and will
>>> >> >>>>>>>>> show
>>> >> >>>>>>>>> the reports)
>>> >> >>>>>>>>>
>>> >> >>>>>>>>> Java is being used as the backend language for everything
>>> >> >>>>>>>>> (backend
>>> >> >>>>>>>>> of the web UI, as well as the ETL layer)
>>> >> >>>>>>>>>
>>> >> >>>>>>>>> I'm considering:
>>> >> >>>>>>>>>
>>> >> >>>>>>>>> - Using raw Kafka consumers, or Spark Streaming, as the ETL
>>> >> >>>>>>>>> layer
>>> >> >>>>>>>>> (receive raw data from Kafka, standardize & store it)
>>> >> >>>>>>>>>
>>> >> >>>>>>>>> - Using Cassandra, HBase, or raw HDFS, for storing the
>>> >> >>>>>>>>> standardized
>>> >> >>>>>>>>> data, and to allow queries
>>> >> >>>>>>>>>
>>> >> >>>>>>>>> - In the backend of the web UI, I could either use Spark to
>>> >> >>>>>>>>> run
>>> >> >>>>>>>>> queries across the data (mostly filters), or directly run
>>> >> >>>>>>>>> queries against
>>> >> >>>>>>>>> Cassandra / HBase
>>> >> >>>>>>>>>
>>> >> >>>>>>>>> I'd appreciate some thoughts / suggestions on which of these
>>> >> >>>>>>>>> alternatives I should go with (e.g, using raw Kafka
>>> >> >>>>>>>>> consumers vs
>>> >> >>>>>>>>> Spark for
>>> >> >>>>>>>>> ETL, which persistent data store to use, and how to query
>>> >> >>>>>>>>> that
>>> >> >>>>>>>>> data store in
>>> >> >>>>>>>>> the backend of the web UI, for displaying the reports).
>>> >> >>>>>>>>>
>>> >> >>>>>>>>>
>>> >> >>>>>>>>> Thanks.
>>> >> >>>>>>>
>>> >> >>>>>>>
>>> >> >>>>>>
>>> >> >>>>>
>>> >> >>>>
>>> >> >>>
>>> >> >>
>>> >> >>
>>> >> >>
>>> >> >> --
>>> >> >> Thanks
>>> >> >> Deepak
>>> >> >> www.bigdatabig.com
>>> >> >> www.keosha.net
>>> >> >
>>> >> >
>>> >>
>>> >> ---------------------------------------------------------------------
>>> >> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>>> >>
>>> >
>>> >
>>> >
>>> > --
>>> > Thanks
>>> > Deepak
>>> > www.bigdatabig.com
>>> > www.keosha.net
>>
>>
>
>
>
> --
> Thanks
> Deepak
> www.bigdatabig.com
> www.keosha.net

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Re: Architecture recommendations for a tricky use case

Posted by Mich Talebzadeh <mi...@gmail.com>.
Hi Ali,

What is the business use case for this?

Dr Mich Talebzadeh



LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 29 September 2016 at 16:40, Deepak Sharma <de...@gmail.com> wrote:

> If you use spark direct streams , it ensure end to end guarantee for
> messages.
>
>
> On Thu, Sep 29, 2016 at 9:05 PM, Ali Akhtar <al...@gmail.com> wrote:
>
>> My concern with Postgres / Cassandra is only scalability. I will look
>> further into Postgres horizontal scaling, thanks.
>>
>> Writes could be idempotent if done as upserts, otherwise updates will be
>> idempotent but not inserts.
>>
>> Data should not be lost. The system should be as fault tolerant as
>> possible.
>>
>> What's the advantage of using Spark for reading Kafka instead of direct
>> Kafka consumers?
>>
>> On Thu, Sep 29, 2016 at 8:28 PM, Cody Koeninger <co...@koeninger.org>
>> wrote:
>>
>>> I wouldn't give up the flexibility and maturity of a relational
>>> database, unless you have a very specific use case.  I'm not trashing
>>> cassandra, I've used cassandra, but if all I know is that you're doing
>>> analytics, I wouldn't want to give up the ability to easily do ad-hoc
>>> aggregations without a lot of forethought.  If you're worried about
>>> scaling, there are several options for horizontally scaling Postgres
>>> in particular.  One of the current best from what I've worked with is
>>> Citus.
>>>
>>> On Thu, Sep 29, 2016 at 10:15 AM, Deepak Sharma <de...@gmail.com>
>>> wrote:
>>> > Hi Cody
>>> > Spark direct stream is just fine for this use case.
>>> > But why postgres and not cassandra?
>>> > Is there anything specific here that i may not be aware?
>>> >
>>> > Thanks
>>> > Deepak
>>> >
>>> > On Thu, Sep 29, 2016 at 8:41 PM, Cody Koeninger <co...@koeninger.org>
>>> wrote:
>>> >>
>>> >> How are you going to handle etl failures?  Do you care about lost /
>>> >> duplicated data?  Are your writes idempotent?
>>> >>
>>> >> Absent any other information about the problem, I'd stay away from
>>> >> cassandra/flume/hdfs/hbase/whatever, and use a spark direct stream
>>> >> feeding postgres.
>>> >>
>>> >> On Thu, Sep 29, 2016 at 10:04 AM, Ali Akhtar <al...@gmail.com>
>>> wrote:
>>> >> > Is there an advantage to that vs directly consuming from Kafka?
>>> Nothing
>>> >> > is
>>> >> > being done to the data except some light ETL and then storing it in
>>> >> > Cassandra
>>> >> >
>>> >> > On Thu, Sep 29, 2016 at 7:58 PM, Deepak Sharma <
>>> deepakmca05@gmail.com>
>>> >> > wrote:
>>> >> >>
>>> >> >> Its better you use spark's direct stream to ingest from kafka.
>>> >> >>
>>> >> >> On Thu, Sep 29, 2016 at 8:24 PM, Ali Akhtar <al...@gmail.com>
>>> >> >> wrote:
>>> >> >>>
>>> >> >>> I don't think I need a different speed storage and batch storage.
>>> Just
>>> >> >>> taking in raw data from Kafka, standardizing, and storing it
>>> somewhere
>>> >> >>> where
>>> >> >>> the web UI can query it, seems like it will be enough.
>>> >> >>>
>>> >> >>> I'm thinking about:
>>> >> >>>
>>> >> >>> - Reading data from Kafka via Spark Streaming
>>> >> >>> - Standardizing, then storing it in Cassandra
>>> >> >>> - Querying Cassandra from the web ui
>>> >> >>>
>>> >> >>> That seems like it will work. My question now is whether to use
>>> Spark
>>> >> >>> Streaming to read Kafka, or use Kafka consumers directly.
>>> >> >>>
>>> >> >>>
>>> >> >>> On Thu, Sep 29, 2016 at 7:41 PM, Mich Talebzadeh
>>> >> >>> <mi...@gmail.com> wrote:
>>> >> >>>>
>>> >> >>>> - Spark Streaming to read data from Kafka
>>> >> >>>> - Storing the data on HDFS using Flume
>>> >> >>>>
>>> >> >>>> You don't need Spark streaming to read data from Kafka and store
>>> on
>>> >> >>>> HDFS. It is a waste of resources.
>>> >> >>>>
>>> >> >>>> Couple Flume to use Kafka as source and HDFS as sink directly
>>> >> >>>>
>>> >> >>>> KafkaAgent.sources = kafka-sources
>>> >> >>>> KafkaAgent.sinks.hdfs-sinks.type = hdfs
>>> >> >>>>
>>> >> >>>> That will be for your batch layer. To analyse you can directly
>>> read
>>> >> >>>> from
>>> >> >>>> hdfs files with Spark or simply store data in a database of your
>>> >> >>>> choice via
>>> >> >>>> cron or something. Do not mix your batch layer with speed layer.
>>> >> >>>>
>>> >> >>>> Your speed layer will ingest the same data directly from Kafka
>>> into
>>> >> >>>> spark streaming and that will be  online or near real time
>>> (defined
>>> >> >>>> by your
>>> >> >>>> window).
>>> >> >>>>
>>> >> >>>> Then you have a a serving layer to present data from both speed
>>> (the
>>> >> >>>> one from SS) and batch layer.
>>> >> >>>>
>>> >> >>>> HTH
>>> >> >>>>
>>> >> >>>>
>>> >> >>>>
>>> >> >>>>
>>> >> >>>> Dr Mich Talebzadeh
>>> >> >>>>
>>> >> >>>>
>>> >> >>>>
>>> >> >>>> LinkedIn
>>> >> >>>>
>>> >> >>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJ
>>> d6zP6AcPCCdOABUrV8Pw
>>> >> >>>>
>>> >> >>>>
>>> >> >>>>
>>> >> >>>> http://talebzadehmich.wordpress.com
>>> >> >>>>
>>> >> >>>>
>>> >> >>>> Disclaimer: Use it at your own risk. Any and all responsibility
>>> for
>>> >> >>>> any
>>> >> >>>> loss, damage or destruction of data or any other property which
>>> may
>>> >> >>>> arise
>>> >> >>>> from relying on this email's technical content is explicitly
>>> >> >>>> disclaimed. The
>>> >> >>>> author will in no case be liable for any monetary damages arising
>>> >> >>>> from such
>>> >> >>>> loss, damage or destruction.
>>> >> >>>>
>>> >> >>>>
>>> >> >>>>
>>> >> >>>>
>>> >> >>>> On 29 September 2016 at 15:15, Ali Akhtar <al...@gmail.com>
>>> >> >>>> wrote:
>>> >> >>>>>
>>> >> >>>>> The web UI is actually the speed layer, it needs to be able to
>>> query
>>> >> >>>>> the data online, and show the results in real-time.
>>> >> >>>>>
>>> >> >>>>> It also needs a custom front-end, so a system like Tableau
>>> can't be
>>> >> >>>>> used, it must have a custom backend + front-end.
>>> >> >>>>>
>>> >> >>>>> Thanks for the recommendation of Flume. Do you think this will
>>> work:
>>> >> >>>>>
>>> >> >>>>> - Spark Streaming to read data from Kafka
>>> >> >>>>> - Storing the data on HDFS using Flume
>>> >> >>>>> - Using Spark to query the data in the backend of the web UI?
>>> >> >>>>>
>>> >> >>>>>
>>> >> >>>>>
>>> >> >>>>> On Thu, Sep 29, 2016 at 7:08 PM, Mich Talebzadeh
>>> >> >>>>> <mi...@gmail.com> wrote:
>>> >> >>>>>>
>>> >> >>>>>> You need a batch layer and a speed layer. Data from Kafka can
>>> be
>>> >> >>>>>> stored on HDFS using flume.
>>> >> >>>>>>
>>> >> >>>>>> -  Query this data to generate reports / analytics (There will
>>> be a
>>> >> >>>>>> web UI which will be the front-end to the data, and will show
>>> the
>>> >> >>>>>> reports)
>>> >> >>>>>>
>>> >> >>>>>> This is basically batch layer and you need something like
>>> Tableau
>>> >> >>>>>> or
>>> >> >>>>>> Zeppelin to query data
>>> >> >>>>>>
>>> >> >>>>>> You will also need spark streaming to query data online for
>>> speed
>>> >> >>>>>> layer. That data could be stored in some transient fabric like
>>> >> >>>>>> ignite or
>>> >> >>>>>> even druid.
>>> >> >>>>>>
>>> >> >>>>>> HTH
>>> >> >>>>>>
>>> >> >>>>>>
>>> >> >>>>>>
>>> >> >>>>>>
>>> >> >>>>>>
>>> >> >>>>>>
>>> >> >>>>>>
>>> >> >>>>>>
>>> >> >>>>>> Dr Mich Talebzadeh
>>> >> >>>>>>
>>> >> >>>>>>
>>> >> >>>>>>
>>> >> >>>>>> LinkedIn
>>> >> >>>>>>
>>> >> >>>>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJ
>>> d6zP6AcPCCdOABUrV8Pw
>>> >> >>>>>>
>>> >> >>>>>>
>>> >> >>>>>>
>>> >> >>>>>> http://talebzadehmich.wordpress.com
>>> >> >>>>>>
>>> >> >>>>>>
>>> >> >>>>>> Disclaimer: Use it at your own risk. Any and all
>>> responsibility for
>>> >> >>>>>> any loss, damage or destruction of data or any other property
>>> which
>>> >> >>>>>> may
>>> >> >>>>>> arise from relying on this email's technical content is
>>> explicitly
>>> >> >>>>>> disclaimed. The author will in no case be liable for any
>>> monetary
>>> >> >>>>>> damages
>>> >> >>>>>> arising from such loss, damage or destruction.
>>> >> >>>>>>
>>> >> >>>>>>
>>> >> >>>>>>
>>> >> >>>>>>
>>> >> >>>>>> On 29 September 2016 at 15:01, Ali Akhtar <
>>> ali.rac200@gmail.com>
>>> >> >>>>>> wrote:
>>> >> >>>>>>>
>>> >> >>>>>>> It needs to be able to scale to a very large amount of data,
>>> yes.
>>> >> >>>>>>>
>>> >> >>>>>>> On Thu, Sep 29, 2016 at 7:00 PM, Deepak Sharma
>>> >> >>>>>>> <de...@gmail.com> wrote:
>>> >> >>>>>>>>
>>> >> >>>>>>>> What is the message inflow ?
>>> >> >>>>>>>> If it's really high , definitely spark will be of great use .
>>> >> >>>>>>>>
>>> >> >>>>>>>> Thanks
>>> >> >>>>>>>> Deepak
>>> >> >>>>>>>>
>>> >> >>>>>>>>
>>> >> >>>>>>>> On Sep 29, 2016 19:24, "Ali Akhtar" <al...@gmail.com>
>>> wrote:
>>> >> >>>>>>>>>
>>> >> >>>>>>>>> I have a somewhat tricky use case, and I'm looking for
>>> ideas.
>>> >> >>>>>>>>>
>>> >> >>>>>>>>> I have 5-6 Kafka producers, reading various APIs, and
>>> writing
>>> >> >>>>>>>>> their
>>> >> >>>>>>>>> raw data into Kafka.
>>> >> >>>>>>>>>
>>> >> >>>>>>>>> I need to:
>>> >> >>>>>>>>>
>>> >> >>>>>>>>> - Do ETL on the data, and standardize it.
>>> >> >>>>>>>>>
>>> >> >>>>>>>>> - Store the standardized data somewhere (HBase / Cassandra
>>> / Raw
>>> >> >>>>>>>>> HDFS / ElasticSearch / Postgres)
>>> >> >>>>>>>>>
>>> >> >>>>>>>>> - Query this data to generate reports / analytics (There
>>> will be
>>> >> >>>>>>>>> a
>>> >> >>>>>>>>> web UI which will be the front-end to the data, and will
>>> show
>>> >> >>>>>>>>> the reports)
>>> >> >>>>>>>>>
>>> >> >>>>>>>>> Java is being used as the backend language for everything
>>> >> >>>>>>>>> (backend
>>> >> >>>>>>>>> of the web UI, as well as the ETL layer)
>>> >> >>>>>>>>>
>>> >> >>>>>>>>> I'm considering:
>>> >> >>>>>>>>>
>>> >> >>>>>>>>> - Using raw Kafka consumers, or Spark Streaming, as the ETL
>>> >> >>>>>>>>> layer
>>> >> >>>>>>>>> (receive raw data from Kafka, standardize & store it)
>>> >> >>>>>>>>>
>>> >> >>>>>>>>> - Using Cassandra, HBase, or raw HDFS, for storing the
>>> >> >>>>>>>>> standardized
>>> >> >>>>>>>>> data, and to allow queries
>>> >> >>>>>>>>>
>>> >> >>>>>>>>> - In the backend of the web UI, I could either use Spark to
>>> run
>>> >> >>>>>>>>> queries across the data (mostly filters), or directly run
>>> >> >>>>>>>>> queries against
>>> >> >>>>>>>>> Cassandra / HBase
>>> >> >>>>>>>>>
>>> >> >>>>>>>>> I'd appreciate some thoughts / suggestions on which of these
>>> >> >>>>>>>>> alternatives I should go with (e.g, using raw Kafka
>>> consumers vs
>>> >> >>>>>>>>> Spark for
>>> >> >>>>>>>>> ETL, which persistent data store to use, and how to query
>>> that
>>> >> >>>>>>>>> data store in
>>> >> >>>>>>>>> the backend of the web UI, for displaying the reports).
>>> >> >>>>>>>>>
>>> >> >>>>>>>>>
>>> >> >>>>>>>>> Thanks.
>>> >> >>>>>>>
>>> >> >>>>>>>
>>> >> >>>>>>
>>> >> >>>>>
>>> >> >>>>
>>> >> >>>
>>> >> >>
>>> >> >>
>>> >> >>
>>> >> >> --
>>> >> >> Thanks
>>> >> >> Deepak
>>> >> >> www.bigdatabig.com
>>> >> >> www.keosha.net
>>> >> >
>>> >> >
>>> >>
>>> >> ---------------------------------------------------------------------
>>> >> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>>> >>
>>> >
>>> >
>>> >
>>> > --
>>> > Thanks
>>> > Deepak
>>> > www.bigdatabig.com
>>> > www.keosha.net
>>>
>>
>>
>
>
> --
> Thanks
> Deepak
> www.bigdatabig.com
> www.keosha.net
>

>>> >> >>>>> <mi...@gmail.com> wrote:
>>> >> >>>>>>
>>> >> >>>>>> You need a batch layer and a speed layer. Data from Kafka can
>>> be
>>> >> >>>>>> stored on HDFS using flume.
>>> >> >>>>>>
>>> >> >>>>>> -  Query this data to generate reports / analytics (There will
>>> be a
>>> >> >>>>>> web UI which will be the front-end to the data, and will show
>>> the
>>> >> >>>>>> reports)
>>> >> >>>>>>
>>> >> >>>>>> This is basically batch layer and you need something like
>>> Tableau
>>> >> >>>>>> or
>>> >> >>>>>> Zeppelin to query data
>>> >> >>>>>>
>>> >> >>>>>> You will also need spark streaming to query data online for
>>> speed
>>> >> >>>>>> layer. That data could be stored in some transient fabric like
>>> >> >>>>>> ignite or
>>> >> >>>>>> even druid.
>>> >> >>>>>>
>>> >> >>>>>> HTH
>>> >> >>>>>>
>>> >> >>>>>>
>>> >> >>>>>>
>>> >> >>>>>>
>>> >> >>>>>>
>>> >> >>>>>>
>>> >> >>>>>>
>>> >> >>>>>>
>>> >> >>>>>> Dr Mich Talebzadeh
>>> >> >>>>>>
>>> >> >>>>>>
>>> >> >>>>>>
>>> >> >>>>>> LinkedIn
>>> >> >>>>>>
>>> >> >>>>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJ
>>> d6zP6AcPCCdOABUrV8Pw
>>> >> >>>>>>
>>> >> >>>>>>
>>> >> >>>>>>
>>> >> >>>>>> http://talebzadehmich.wordpress.com
>>> >> >>>>>>
>>> >> >>>>>>
>>> >> >>>>>> Disclaimer: Use it at your own risk. Any and all
>>> responsibility for
>>> >> >>>>>> any loss, damage or destruction of data or any other property
>>> which
>>> >> >>>>>> may
>>> >> >>>>>> arise from relying on this email's technical content is
>>> explicitly
>>> >> >>>>>> disclaimed. The author will in no case be liable for any
>>> monetary
>>> >> >>>>>> damages
>>> >> >>>>>> arising from such loss, damage or destruction.
>>> >> >>>>>>
>>> >> >>>>>>
>>> >> >>>>>>
>>> >> >>>>>>
>>> >> >>>>>> On 29 September 2016 at 15:01, Ali Akhtar <
>>> ali.rac200@gmail.com>
>>> >> >>>>>> wrote:
>>> >> >>>>>>>
>>> >> >>>>>>> It needs to be able to scale to a very large amount of data,
>>> yes.
>>> >> >>>>>>>
>>> >> >>>>>>> On Thu, Sep 29, 2016 at 7:00 PM, Deepak Sharma
>>> >> >>>>>>> <de...@gmail.com> wrote:
>>> >> >>>>>>>>
>>> >> >>>>>>>> What is the message inflow ?
>>> >> >>>>>>>> If it's really high , definitely spark will be of great use .
>>> >> >>>>>>>>
>>> >> >>>>>>>> Thanks
>>> >> >>>>>>>> Deepak
>>> >> >>>>>>>>
>>> >> >>>>>>>>
>>> >> >>>>>>>> On Sep 29, 2016 19:24, "Ali Akhtar" <al...@gmail.com>
>>> wrote:
>>> >> >>>>>>>>>
>>> >> >>>>>>>>> I have a somewhat tricky use case, and I'm looking for
>>> ideas.
>>> >> >>>>>>>>>
>>> >> >>>>>>>>> I have 5-6 Kafka producers, reading various APIs, and
>>> writing
>>> >> >>>>>>>>> their
>>> >> >>>>>>>>> raw data into Kafka.
>>> >> >>>>>>>>>
>>> >> >>>>>>>>> I need to:
>>> >> >>>>>>>>>
>>> >> >>>>>>>>> - Do ETL on the data, and standardize it.
>>> >> >>>>>>>>>
>>> >> >>>>>>>>> - Store the standardized data somewhere (HBase / Cassandra
>>> / Raw
>>> >> >>>>>>>>> HDFS / ElasticSearch / Postgres)
>>> >> >>>>>>>>>
>>> >> >>>>>>>>> - Query this data to generate reports / analytics (There
>>> will be
>>> >> >>>>>>>>> a
>>> >> >>>>>>>>> web UI which will be the front-end to the data, and will
>>> show
>>> >> >>>>>>>>> the reports)
>>> >> >>>>>>>>>
>>> >> >>>>>>>>> Java is being used as the backend language for everything
>>> >> >>>>>>>>> (backend
>>> >> >>>>>>>>> of the web UI, as well as the ETL layer)
>>> >> >>>>>>>>>
>>> >> >>>>>>>>> I'm considering:
>>> >> >>>>>>>>>
>>> >> >>>>>>>>> - Using raw Kafka consumers, or Spark Streaming, as the ETL
>>> >> >>>>>>>>> layer
>>> >> >>>>>>>>> (receive raw data from Kafka, standardize & store it)
>>> >> >>>>>>>>>
>>> >> >>>>>>>>> - Using Cassandra, HBase, or raw HDFS, for storing the
>>> >> >>>>>>>>> standardized
>>> >> >>>>>>>>> data, and to allow queries
>>> >> >>>>>>>>>
>>> >> >>>>>>>>> - In the backend of the web UI, I could either use Spark to
>>> run
>>> >> >>>>>>>>> queries across the data (mostly filters), or directly run
>>> >> >>>>>>>>> queries against
>>> >> >>>>>>>>> Cassandra / HBase
>>> >> >>>>>>>>>
>>> >> >>>>>>>>> I'd appreciate some thoughts / suggestions on which of these
>>> >> >>>>>>>>> alternatives I should go with (e.g, using raw Kafka
>>> consumers vs
>>> >> >>>>>>>>> Spark for
>>> >> >>>>>>>>> ETL, which persistent data store to use, and how to query
>>> that
>>> >> >>>>>>>>> data store in
>>> >> >>>>>>>>> the backend of the web UI, for displaying the reports).
>>> >> >>>>>>>>>
>>> >> >>>>>>>>>
>>> >> >>>>>>>>> Thanks.
>>> >> >>>>>>>
>>> >> >>>>>>>
>>> >> >>>>>>
>>> >> >>>>>
>>> >> >>>>
>>> >> >>>
>>> >> >>
>>> >> >>
>>> >> >>
>>> >> >> --
>>> >> >> Thanks
>>> >> >> Deepak
>>> >> >> www.bigdatabig.com
>>> >> >> www.keosha.net
>>> >> >
>>> >> >
>>> >>
>>> >> ---------------------------------------------------------------------
>>> >> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>>> >>
>>> >
>>> >
>>> >
>>> > --
>>> > Thanks
>>> > Deepak
>>> > www.bigdatabig.com
>>> > www.keosha.net
>>>
>>
>>
>
>
> --
> Thanks
> Deepak
> www.bigdatabig.com
> www.keosha.net
>

Re: Architecture recommendations for a tricky use case

Posted by Cody Koeninger <co...@koeninger.org>.
No, direct stream in and of itself won't ensure an end-to-end
guarantee, because it doesn't know anything about your output actions.

You still need to do some work.  The point is that having easy access
to the offsets for each batch, on a per-partition basis, makes it
easier to do that work, especially in conjunction with aggregation.
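
For illustration, a minimal Java sketch of that offset access, written
against the spark-streaming-kafka-0-10 integration. The broker address,
group id and topic name are invented placeholders, and the actual
standardize/write step is elided:

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.CanCommitOffsets;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.HasOffsetRanges;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;
import org.apache.spark.streaming.kafka010.OffsetRange;

public class DirectStreamEtl {
  public static void main(String[] args) throws InterruptedException {
    JavaStreamingContext jssc = new JavaStreamingContext(
        new SparkConf().setAppName("etl"), Durations.seconds(5));

    Map<String, Object> kafkaParams = new HashMap<>();
    kafkaParams.put("bootstrap.servers", "localhost:9092"); // placeholder
    kafkaParams.put("key.deserializer", StringDeserializer.class);
    kafkaParams.put("value.deserializer", StringDeserializer.class);
    kafkaParams.put("group.id", "etl-group");                // placeholder
    kafkaParams.put("enable.auto.commit", false);            // commit manually below

    JavaInputDStream<ConsumerRecord<String, String>> stream =
        KafkaUtils.createDirectStream(
            jssc,
            LocationStrategies.PreferConsistent(),
            ConsumerStrategies.<String, String>Subscribe(
                Collections.singletonList("raw-events"), kafkaParams));

    stream.foreachRDD(rdd -> {
      // The "easy access": the exact per-partition offset ranges that
      // make up this batch.
      OffsetRange[] ranges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();

      // Standardize / aggregate rdd and write the results here. With a
      // transactional sink, store `ranges` in the same transaction so a
      // replayed batch can be detected and skipped.

      // Commit offsets back to Kafka only once the writes have succeeded.
      ((CanCommitOffsets) stream.inputDStream()).commitAsync(ranges);
    });

    jssc.start();
    jssc.awaitTermination();
  }
}

Committing only after the output succeeds is the extra work referred to
above; the stream gives you the offsets, but the delivery semantics come
from how you tie them to your writes.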

On Thu, Sep 29, 2016 at 10:40 AM, Deepak Sharma <de...@gmail.com> wrote:
> If you use spark direct streams , it ensure end to end guarantee for
> messages.
>
>
> On Thu, Sep 29, 2016 at 9:05 PM, Ali Akhtar <al...@gmail.com> wrote:
>>
>> My concern with Postgres / Cassandra is only scalability. I will look
>> further into Postgres horizontal scaling, thanks.
>>
>> Writes could be idempotent if done as upserts, otherwise updates will be
>> idempotent but not inserts.
>>
>> Data should not be lost. The system should be as fault tolerant as
>> possible.
>>
>> What's the advantage of using Spark for reading Kafka instead of direct
>> Kafka consumers?
>>
>> On Thu, Sep 29, 2016 at 8:28 PM, Cody Koeninger <co...@koeninger.org>
>> wrote:
>>>
>>> I wouldn't give up the flexibility and maturity of a relational
>>> database, unless you have a very specific use case.  I'm not trashing
>>> cassandra, I've used cassandra, but if all I know is that you're doing
>>> analytics, I wouldn't want to give up the ability to easily do ad-hoc
>>> aggregations without a lot of forethought.  If you're worried about
>>> scaling, there are several options for horizontally scaling Postgres
>>> in particular.  One of the current best from what I've worked with is
>>> Citus.
>>>
>>> On Thu, Sep 29, 2016 at 10:15 AM, Deepak Sharma <de...@gmail.com>
>>> wrote:
>>> > Hi Cody
>>> > Spark direct stream is just fine for this use case.
>>> > But why postgres and not cassandra?
>>> > Is there anything specific here that i may not be aware?
>>> >
>>> > Thanks
>>> > Deepak
>>> >
>>> > On Thu, Sep 29, 2016 at 8:41 PM, Cody Koeninger <co...@koeninger.org>
>>> > wrote:
>>> >>
>>> >> How are you going to handle etl failures?  Do you care about lost /
>>> >> duplicated data?  Are your writes idempotent?
>>> >>
>>> >> Absent any other information about the problem, I'd stay away from
>>> >> cassandra/flume/hdfs/hbase/whatever, and use a spark direct stream
>>> >> feeding postgres.
>>> >>
>>> >> On Thu, Sep 29, 2016 at 10:04 AM, Ali Akhtar <al...@gmail.com>
>>> >> wrote:
>>> >> > Is there an advantage to that vs directly consuming from Kafka?
>>> >> > Nothing
>>> >> > is
>>> >> > being done to the data except some light ETL and then storing it in
>>> >> > Cassandra
>>> >> >
>>> >> > On Thu, Sep 29, 2016 at 7:58 PM, Deepak Sharma
>>> >> > <de...@gmail.com>
>>> >> > wrote:
>>> >> >>
>>> >> >> Its better you use spark's direct stream to ingest from kafka.
>>> >> >>
>>> >> >> On Thu, Sep 29, 2016 at 8:24 PM, Ali Akhtar <al...@gmail.com>
>>> >> >> wrote:
>>> >> >>>
>>> >> >>> I don't think I need a different speed storage and batch storage.
>>> >> >>> Just
>>> >> >>> taking in raw data from Kafka, standardizing, and storing it
>>> >> >>> somewhere
>>> >> >>> where
>>> >> >>> the web UI can query it, seems like it will be enough.
>>> >> >>>
>>> >> >>> I'm thinking about:
>>> >> >>>
>>> >> >>> - Reading data from Kafka via Spark Streaming
>>> >> >>> - Standardizing, then storing it in Cassandra
>>> >> >>> - Querying Cassandra from the web ui
>>> >> >>>
>>> >> >>> That seems like it will work. My question now is whether to use
>>> >> >>> Spark
>>> >> >>> Streaming to read Kafka, or use Kafka consumers directly.
>>> >> >>>
>>> >> >>>
>>> >> >>> On Thu, Sep 29, 2016 at 7:41 PM, Mich Talebzadeh
>>> >> >>> <mi...@gmail.com> wrote:
>>> >> >>>>
>>> >> >>>> - Spark Streaming to read data from Kafka
>>> >> >>>> - Storing the data on HDFS using Flume
>>> >> >>>>
>>> >> >>>> You don't need Spark streaming to read data from Kafka and store
>>> >> >>>> on
>>> >> >>>> HDFS. It is a waste of resources.
>>> >> >>>>
>>> >> >>>> Couple Flume to use Kafka as source and HDFS as sink directly
>>> >> >>>>
>>> >> >>>> KafkaAgent.sources = kafka-sources
>>> >> >>>> KafkaAgent.sinks.hdfs-sinks.type = hdfs
>>> >> >>>>
>>> >> >>>> That will be for your batch layer. To analyse you can directly
>>> >> >>>> read
>>> >> >>>> from
>>> >> >>>> hdfs files with Spark or simply store data in a database of your
>>> >> >>>> choice via
>>> >> >>>> cron or something. Do not mix your batch layer with speed layer.
>>> >> >>>>
>>> >> >>>> Your speed layer will ingest the same data directly from Kafka
>>> >> >>>> into
>>> >> >>>> spark streaming and that will be  online or near real time
>>> >> >>>> (defined
>>> >> >>>> by your
>>> >> >>>> window).
>>> >> >>>>
>>> >> >>>> Then you have a a serving layer to present data from both speed
>>> >> >>>> (the
>>> >> >>>> one from SS) and batch layer.
>>> >> >>>>
>>> >> >>>> HTH
>>> >> >>>>
>>> >> >>>>
>>> >> >>>>
>>> >> >>>>
>>> >> >>>> Dr Mich Talebzadeh
>>> >> >>>>
>>> >> >>>>
>>> >> >>>>
>>> >> >>>> LinkedIn
>>> >> >>>>
>>> >> >>>>
>>> >> >>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> >> >>>>
>>> >> >>>>
>>> >> >>>>
>>> >> >>>> http://talebzadehmich.wordpress.com
>>> >> >>>>
>>> >> >>>>
>>> >> >>>> Disclaimer: Use it at your own risk. Any and all responsibility
>>> >> >>>> for
>>> >> >>>> any
>>> >> >>>> loss, damage or destruction of data or any other property which
>>> >> >>>> may
>>> >> >>>> arise
>>> >> >>>> from relying on this email's technical content is explicitly
>>> >> >>>> disclaimed. The
>>> >> >>>> author will in no case be liable for any monetary damages arising
>>> >> >>>> from such
>>> >> >>>> loss, damage or destruction.
>>> >> >>>>
>>> >> >>>>
>>> >> >>>>
>>> >> >>>>
>>> >> >>>> On 29 September 2016 at 15:15, Ali Akhtar <al...@gmail.com>
>>> >> >>>> wrote:
>>> >> >>>>>
>>> >> >>>>> The web UI is actually the speed layer, it needs to be able to
>>> >> >>>>> query
>>> >> >>>>> the data online, and show the results in real-time.
>>> >> >>>>>
>>> >> >>>>> It also needs a custom front-end, so a system like Tableau can't
>>> >> >>>>> be
>>> >> >>>>> used, it must have a custom backend + front-end.
>>> >> >>>>>
>>> >> >>>>> Thanks for the recommendation of Flume. Do you think this will
>>> >> >>>>> work:
>>> >> >>>>>
>>> >> >>>>> - Spark Streaming to read data from Kafka
>>> >> >>>>> - Storing the data on HDFS using Flume
>>> >> >>>>> - Using Spark to query the data in the backend of the web UI?
>>> >> >>>>>
>>> >> >>>>>
>>> >> >>>>>
>>> >> >>>>> On Thu, Sep 29, 2016 at 7:08 PM, Mich Talebzadeh
>>> >> >>>>> <mi...@gmail.com> wrote:
>>> >> >>>>>>
>>> >> >>>>>> You need a batch layer and a speed layer. Data from Kafka can
>>> >> >>>>>> be
>>> >> >>>>>> stored on HDFS using flume.
>>> >> >>>>>>
>>> >> >>>>>> -  Query this data to generate reports / analytics (There will
>>> >> >>>>>> be a
>>> >> >>>>>> web UI which will be the front-end to the data, and will show
>>> >> >>>>>> the
>>> >> >>>>>> reports)
>>> >> >>>>>>
>>> >> >>>>>> This is basically batch layer and you need something like
>>> >> >>>>>> Tableau
>>> >> >>>>>> or
>>> >> >>>>>> Zeppelin to query data
>>> >> >>>>>>
>>> >> >>>>>> You will also need spark streaming to query data online for
>>> >> >>>>>> speed
>>> >> >>>>>> layer. That data could be stored in some transient fabric like
>>> >> >>>>>> ignite or
>>> >> >>>>>> even druid.
>>> >> >>>>>>
>>> >> >>>>>> HTH
>>> >> >>>>>>
>>> >> >>>>>>
>>> >> >>>>>>
>>> >> >>>>>>
>>> >> >>>>>>
>>> >> >>>>>>
>>> >> >>>>>>
>>> >> >>>>>>
>>> >> >>>>>> Dr Mich Talebzadeh
>>> >> >>>>>>
>>> >> >>>>>>
>>> >> >>>>>>
>>> >> >>>>>> LinkedIn
>>> >> >>>>>>
>>> >> >>>>>>
>>> >> >>>>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> >> >>>>>>
>>> >> >>>>>>
>>> >> >>>>>>
>>> >> >>>>>> http://talebzadehmich.wordpress.com
>>> >> >>>>>>
>>> >> >>>>>>
>>> >> >>>>>> Disclaimer: Use it at your own risk. Any and all responsibility
>>> >> >>>>>> for
>>> >> >>>>>> any loss, damage or destruction of data or any other property
>>> >> >>>>>> which
>>> >> >>>>>> may
>>> >> >>>>>> arise from relying on this email's technical content is
>>> >> >>>>>> explicitly
>>> >> >>>>>> disclaimed. The author will in no case be liable for any
>>> >> >>>>>> monetary
>>> >> >>>>>> damages
>>> >> >>>>>> arising from such loss, damage or destruction.
>>> >> >>>>>>
>>> >> >>>>>>
>>> >> >>>>>>
>>> >> >>>>>>
>>> >> >>>>>> On 29 September 2016 at 15:01, Ali Akhtar
>>> >> >>>>>> <al...@gmail.com>
>>> >> >>>>>> wrote:
>>> >> >>>>>>>
>>> >> >>>>>>> It needs to be able to scale to a very large amount of data,
>>> >> >>>>>>> yes.
>>> >> >>>>>>>
>>> >> >>>>>>> On Thu, Sep 29, 2016 at 7:00 PM, Deepak Sharma
>>> >> >>>>>>> <de...@gmail.com> wrote:
>>> >> >>>>>>>>
>>> >> >>>>>>>> What is the message inflow ?
>>> >> >>>>>>>> If it's really high , definitely spark will be of great use .
>>> >> >>>>>>>>
>>> >> >>>>>>>> Thanks
>>> >> >>>>>>>> Deepak
>>> >> >>>>>>>>
>>> >> >>>>>>>>
>>> >> >>>>>>>> On Sep 29, 2016 19:24, "Ali Akhtar" <al...@gmail.com>
>>> >> >>>>>>>> wrote:
>>> >> >>>>>>>>>
>>> >> >>>>>>>>> I have a somewhat tricky use case, and I'm looking for
>>> >> >>>>>>>>> ideas.
>>> >> >>>>>>>>>
>>> >> >>>>>>>>> I have 5-6 Kafka producers, reading various APIs, and
>>> >> >>>>>>>>> writing
>>> >> >>>>>>>>> their
>>> >> >>>>>>>>> raw data into Kafka.
>>> >> >>>>>>>>>
>>> >> >>>>>>>>> I need to:
>>> >> >>>>>>>>>
>>> >> >>>>>>>>> - Do ETL on the data, and standardize it.
>>> >> >>>>>>>>>
>>> >> >>>>>>>>> - Store the standardized data somewhere (HBase / Cassandra /
>>> >> >>>>>>>>> Raw
>>> >> >>>>>>>>> HDFS / ElasticSearch / Postgres)
>>> >> >>>>>>>>>
>>> >> >>>>>>>>> - Query this data to generate reports / analytics (There
>>> >> >>>>>>>>> will be
>>> >> >>>>>>>>> a
>>> >> >>>>>>>>> web UI which will be the front-end to the data, and will
>>> >> >>>>>>>>> show
>>> >> >>>>>>>>> the reports)
>>> >> >>>>>>>>>
>>> >> >>>>>>>>> Java is being used as the backend language for everything
>>> >> >>>>>>>>> (backend
>>> >> >>>>>>>>> of the web UI, as well as the ETL layer)
>>> >> >>>>>>>>>
>>> >> >>>>>>>>> I'm considering:
>>> >> >>>>>>>>>
>>> >> >>>>>>>>> - Using raw Kafka consumers, or Spark Streaming, as the ETL
>>> >> >>>>>>>>> layer
>>> >> >>>>>>>>> (receive raw data from Kafka, standardize & store it)
>>> >> >>>>>>>>>
>>> >> >>>>>>>>> - Using Cassandra, HBase, or raw HDFS, for storing the
>>> >> >>>>>>>>> standardized
>>> >> >>>>>>>>> data, and to allow queries
>>> >> >>>>>>>>>
>>> >> >>>>>>>>> - In the backend of the web UI, I could either use Spark to
>>> >> >>>>>>>>> run
>>> >> >>>>>>>>> queries across the data (mostly filters), or directly run
>>> >> >>>>>>>>> queries against
>>> >> >>>>>>>>> Cassandra / HBase
>>> >> >>>>>>>>>
>>> >> >>>>>>>>> I'd appreciate some thoughts / suggestions on which of these
>>> >> >>>>>>>>> alternatives I should go with (e.g, using raw Kafka
>>> >> >>>>>>>>> consumers vs
>>> >> >>>>>>>>> Spark for
>>> >> >>>>>>>>> ETL, which persistent data store to use, and how to query
>>> >> >>>>>>>>> that
>>> >> >>>>>>>>> data store in
>>> >> >>>>>>>>> the backend of the web UI, for displaying the reports).
>>> >> >>>>>>>>>
>>> >> >>>>>>>>>
>>> >> >>>>>>>>> Thanks.
>>> >> >>>>>>>
>>> >> >>>>>>>
>>> >> >>>>>>
>>> >> >>>>>
>>> >> >>>>
>>> >> >>>
>>> >> >>
>>> >> >>
>>> >> >>
>>> >> >> --
>>> >> >> Thanks
>>> >> >> Deepak
>>> >> >> www.bigdatabig.com
>>> >> >> www.keosha.net
>>> >> >
>>> >> >
>>> >>
>>> >> ---------------------------------------------------------------------
>>> >> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>>> >>
>>> >
>>> >
>>> >
>>> > --
>>> > Thanks
>>> > Deepak
>>> > www.bigdatabig.com
>>> > www.keosha.net
>>
>>
>
>
>
> --
> Thanks
> Deepak
> www.bigdatabig.com
> www.keosha.net

Re: Architecture recommendations for a tricky use case

Posted by Deepak Sharma <de...@gmail.com>.
If you use a Spark direct stream, it ensures an end-to-end guarantee for
messages.


On Thu, Sep 29, 2016 at 9:05 PM, Ali Akhtar <al...@gmail.com> wrote:

> My concern with Postgres / Cassandra is only scalability. I will look
> further into Postgres horizontal scaling, thanks.
>
> Writes could be idempotent if done as upserts, otherwise updates will be
> idempotent but not inserts.
>
> Data should not be lost. The system should be as fault tolerant as
> possible.
>
> What's the advantage of using Spark for reading Kafka instead of direct
> Kafka consumers?
>
> On Thu, Sep 29, 2016 at 8:28 PM, Cody Koeninger <co...@koeninger.org>
> wrote:
>
>> I wouldn't give up the flexibility and maturity of a relational
>> database, unless you have a very specific use case.  I'm not trashing
>> cassandra, I've used cassandra, but if all I know is that you're doing
>> analytics, I wouldn't want to give up the ability to easily do ad-hoc
>> aggregations without a lot of forethought.  If you're worried about
>> scaling, there are several options for horizontally scaling Postgres
>> in particular.  One of the current best from what I've worked with is
>> Citus.
>>
>> On Thu, Sep 29, 2016 at 10:15 AM, Deepak Sharma <de...@gmail.com>
>> wrote:
>> > Hi Cody
>> > Spark direct stream is just fine for this use case.
>> > But why postgres and not cassandra?
>> > Is there anything specific here that i may not be aware?
>> >
>> > Thanks
>> > Deepak
>> >
>> > On Thu, Sep 29, 2016 at 8:41 PM, Cody Koeninger <co...@koeninger.org>
>> wrote:
>> >>
>> >> How are you going to handle etl failures?  Do you care about lost /
>> >> duplicated data?  Are your writes idempotent?
>> >>
>> >> Absent any other information about the problem, I'd stay away from
>> >> cassandra/flume/hdfs/hbase/whatever, and use a spark direct stream
>> >> feeding postgres.
>> >>
>> >> On Thu, Sep 29, 2016 at 10:04 AM, Ali Akhtar <al...@gmail.com>
>> wrote:
>> >> > Is there an advantage to that vs directly consuming from Kafka?
>> Nothing
>> >> > is
>> >> > being done to the data except some light ETL and then storing it in
>> >> > Cassandra
>> >> >
>> >> > On Thu, Sep 29, 2016 at 7:58 PM, Deepak Sharma <
>> deepakmca05@gmail.com>
>> >> > wrote:
>> >> >>
>> >> >> Its better you use spark's direct stream to ingest from kafka.
>> >> >>
>> >> >> On Thu, Sep 29, 2016 at 8:24 PM, Ali Akhtar <al...@gmail.com>
>> >> >> wrote:
>> >> >>>
>> >> >>> I don't think I need a different speed storage and batch storage.
>> Just
>> >> >>> taking in raw data from Kafka, standardizing, and storing it
>> somewhere
>> >> >>> where
>> >> >>> the web UI can query it, seems like it will be enough.
>> >> >>>
>> >> >>> I'm thinking about:
>> >> >>>
>> >> >>> - Reading data from Kafka via Spark Streaming
>> >> >>> - Standardizing, then storing it in Cassandra
>> >> >>> - Querying Cassandra from the web ui
>> >> >>>
>> >> >>> That seems like it will work. My question now is whether to use
>> Spark
>> >> >>> Streaming to read Kafka, or use Kafka consumers directly.
>> >> >>>
>> >> >>>
>> >> >>> On Thu, Sep 29, 2016 at 7:41 PM, Mich Talebzadeh
>> >> >>> <mi...@gmail.com> wrote:
>> >> >>>>
>> >> >>>> - Spark Streaming to read data from Kafka
>> >> >>>> - Storing the data on HDFS using Flume
>> >> >>>>
>> >> >>>> You don't need Spark streaming to read data from Kafka and store
>> on
>> >> >>>> HDFS. It is a waste of resources.
>> >> >>>>
>> >> >>>> Couple Flume to use Kafka as source and HDFS as sink directly
>> >> >>>>
>> >> >>>> KafkaAgent.sources = kafka-sources
>> >> >>>> KafkaAgent.sinks.hdfs-sinks.type = hdfs
>> >> >>>>
>> >> >>>> That will be for your batch layer. To analyse you can directly
>> read
>> >> >>>> from
>> >> >>>> hdfs files with Spark or simply store data in a database of your
>> >> >>>> choice via
>> >> >>>> cron or something. Do not mix your batch layer with speed layer.
>> >> >>>>
>> >> >>>> Your speed layer will ingest the same data directly from Kafka
>> into
>> >> >>>> spark streaming and that will be  online or near real time
>> (defined
>> >> >>>> by your
>> >> >>>> window).
>> >> >>>>
>> >> >>>> Then you have a a serving layer to present data from both speed
>> (the
>> >> >>>> one from SS) and batch layer.
>> >> >>>>
>> >> >>>> HTH
>> >> >>>>
>> >> >>>>
>> >> >>>>
>> >> >>>>
>> >> >>>> Dr Mich Talebzadeh
>> >> >>>>
>> >> >>>>
>> >> >>>>
>> >> >>>> LinkedIn
>> >> >>>>
>> >> >>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJ
>> d6zP6AcPCCdOABUrV8Pw
>> >> >>>>
>> >> >>>>
>> >> >>>>
>> >> >>>> http://talebzadehmich.wordpress.com
>> >> >>>>
>> >> >>>>
>> >> >>>> Disclaimer: Use it at your own risk. Any and all responsibility
>> for
>> >> >>>> any
>> >> >>>> loss, damage or destruction of data or any other property which
>> may
>> >> >>>> arise
>> >> >>>> from relying on this email's technical content is explicitly
>> >> >>>> disclaimed. The
>> >> >>>> author will in no case be liable for any monetary damages arising
>> >> >>>> from such
>> >> >>>> loss, damage or destruction.
>> >> >>>>
>> >> >>>>
>> >> >>>>
>> >> >>>>
>> >> >>>> On 29 September 2016 at 15:15, Ali Akhtar <al...@gmail.com>
>> >> >>>> wrote:
>> >> >>>>>
>> >> >>>>> The web UI is actually the speed layer, it needs to be able to
>> query
>> >> >>>>> the data online, and show the results in real-time.
>> >> >>>>>
>> >> >>>>> It also needs a custom front-end, so a system like Tableau can't
>> be
>> >> >>>>> used, it must have a custom backend + front-end.
>> >> >>>>>
>> >> >>>>> Thanks for the recommendation of Flume. Do you think this will
>> work:
>> >> >>>>>
>> >> >>>>> - Spark Streaming to read data from Kafka
>> >> >>>>> - Storing the data on HDFS using Flume
>> >> >>>>> - Using Spark to query the data in the backend of the web UI?
>> >> >>>>>
>> >> >>>>>
>> >> >>>>>
>> >> >>>>> On Thu, Sep 29, 2016 at 7:08 PM, Mich Talebzadeh
>> >> >>>>> <mi...@gmail.com> wrote:
>> >> >>>>>>
>> >> >>>>>> You need a batch layer and a speed layer. Data from Kafka can be
>> >> >>>>>> stored on HDFS using flume.
>> >> >>>>>>
>> >> >>>>>> -  Query this data to generate reports / analytics (There will
>> be a
>> >> >>>>>> web UI which will be the front-end to the data, and will show
>> the
>> >> >>>>>> reports)
>> >> >>>>>>
>> >> >>>>>> This is basically batch layer and you need something like
>> Tableau
>> >> >>>>>> or
>> >> >>>>>> Zeppelin to query data
>> >> >>>>>>
>> >> >>>>>> You will also need spark streaming to query data online for
>> speed
>> >> >>>>>> layer. That data could be stored in some transient fabric like
>> >> >>>>>> ignite or
>> >> >>>>>> even druid.
>> >> >>>>>>
>> >> >>>>>> HTH
>> >> >>>>>>
>> >> >>>>>>
>> >> >>>>>>
>> >> >>>>>>
>> >> >>>>>>
>> >> >>>>>>
>> >> >>>>>>
>> >> >>>>>>
>> >> >>>>>> Dr Mich Talebzadeh
>> >> >>>>>>
>> >> >>>>>>
>> >> >>>>>>
>> >> >>>>>> LinkedIn
>> >> >>>>>>
>> >> >>>>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJ
>> d6zP6AcPCCdOABUrV8Pw
>> >> >>>>>>
>> >> >>>>>>
>> >> >>>>>>
>> >> >>>>>> http://talebzadehmich.wordpress.com
>> >> >>>>>>
>> >> >>>>>>
>> >> >>>>>> Disclaimer: Use it at your own risk. Any and all responsibility
>> for
>> >> >>>>>> any loss, damage or destruction of data or any other property
>> which
>> >> >>>>>> may
>> >> >>>>>> arise from relying on this email's technical content is
>> explicitly
>> >> >>>>>> disclaimed. The author will in no case be liable for any
>> monetary
>> >> >>>>>> damages
>> >> >>>>>> arising from such loss, damage or destruction.
>> >> >>>>>>
>> >> >>>>>>
>> >> >>>>>>
>> >> >>>>>>
>> >> >>>>>> On 29 September 2016 at 15:01, Ali Akhtar <ali.rac200@gmail.com
>> >
>> >> >>>>>> wrote:
>> >> >>>>>>>
>> >> >>>>>>> It needs to be able to scale to a very large amount of data,
>> yes.
>> >> >>>>>>>
>> >> >>>>>>> On Thu, Sep 29, 2016 at 7:00 PM, Deepak Sharma
>> >> >>>>>>> <de...@gmail.com> wrote:
>> >> >>>>>>>>
>> >> >>>>>>>> What is the message inflow ?
>> >> >>>>>>>> If it's really high , definitely spark will be of great use .
>> >> >>>>>>>>
>> >> >>>>>>>> Thanks
>> >> >>>>>>>> Deepak
>> >> >>>>>>>>
>> >> >>>>>>>>
>> >> >>>>>>>> On Sep 29, 2016 19:24, "Ali Akhtar" <al...@gmail.com>
>> wrote:
>> >> >>>>>>>>>
>> >> >>>>>>>>> I have a somewhat tricky use case, and I'm looking for ideas.
>> >> >>>>>>>>>
>> >> >>>>>>>>> I have 5-6 Kafka producers, reading various APIs, and writing
>> >> >>>>>>>>> their
>> >> >>>>>>>>> raw data into Kafka.
>> >> >>>>>>>>>
>> >> >>>>>>>>> I need to:
>> >> >>>>>>>>>
>> >> >>>>>>>>> - Do ETL on the data, and standardize it.
>> >> >>>>>>>>>
>> >> >>>>>>>>> - Store the standardized data somewhere (HBase / Cassandra /
>> Raw
>> >> >>>>>>>>> HDFS / ElasticSearch / Postgres)
>> >> >>>>>>>>>
>> >> >>>>>>>>> - Query this data to generate reports / analytics (There
>> will be
>> >> >>>>>>>>> a
>> >> >>>>>>>>> web UI which will be the front-end to the data, and will show
>> >> >>>>>>>>> the reports)
>> >> >>>>>>>>>
>> >> >>>>>>>>> Java is being used as the backend language for everything
>> >> >>>>>>>>> (backend
>> >> >>>>>>>>> of the web UI, as well as the ETL layer)
>> >> >>>>>>>>>
>> >> >>>>>>>>> I'm considering:
>> >> >>>>>>>>>
>> >> >>>>>>>>> - Using raw Kafka consumers, or Spark Streaming, as the ETL
>> >> >>>>>>>>> layer
>> >> >>>>>>>>> (receive raw data from Kafka, standardize & store it)
>> >> >>>>>>>>>
>> >> >>>>>>>>> - Using Cassandra, HBase, or raw HDFS, for storing the
>> >> >>>>>>>>> standardized
>> >> >>>>>>>>> data, and to allow queries
>> >> >>>>>>>>>
>> >> >>>>>>>>> - In the backend of the web UI, I could either use Spark to
>> run
>> >> >>>>>>>>> queries across the data (mostly filters), or directly run
>> >> >>>>>>>>> queries against
>> >> >>>>>>>>> Cassandra / HBase
>> >> >>>>>>>>>
>> >> >>>>>>>>> I'd appreciate some thoughts / suggestions on which of these
>> >> >>>>>>>>> alternatives I should go with (e.g, using raw Kafka
>> consumers vs
>> >> >>>>>>>>> Spark for
>> >> >>>>>>>>> ETL, which persistent data store to use, and how to query
>> that
>> >> >>>>>>>>> data store in
>> >> >>>>>>>>> the backend of the web UI, for displaying the reports).
>> >> >>>>>>>>>
>> >> >>>>>>>>>
>> >> >>>>>>>>> Thanks.
>> >> >>>>>>>
>> >> >>>>>>>
>> >> >>>>>>
>> >> >>>>>
>> >> >>>>
>> >> >>>
>> >> >>
>> >> >>
>> >> >>
>> >> >> --
>> >> >> Thanks
>> >> >> Deepak
>> >> >> www.bigdatabig.com
>> >> >> www.keosha.net
>> >> >
>> >> >
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>> >>
>> >
>> >
>> >
>> > --
>> > Thanks
>> > Deepak
>> > www.bigdatabig.com
>> > www.keosha.net
>>
>
>


-- 
Thanks
Deepak
www.bigdatabig.com
www.keosha.net

Re: Architecture recommendations for a tricky use case

Posted by Cody Koeninger <co...@koeninger.org>.
If you're doing any kind of pre-aggregation during ETL, spark direct
stream will let you more easily get the delivery semantics you need,
especially if you're using a transactional data store.

If you're literally just copying individual uniquely keyed items from
kafka to a key-value store, use kafka consumers, sure.
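
As a rough sketch of that second case in plain Java (a consumer loop
doing idempotent upserts): the broker address, topic, table and
connection details are invented for illustration, and the upsert syntax
assumes Postgres 9.5+:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class KafkaToPostgres {
  public static void main(String[] args) throws Exception {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");  // placeholder
    props.put("group.id", "etl-copy");                 // placeholder
    props.put("enable.auto.commit", "false");          // commit after the write, not before
    props.put("key.deserializer",
        "org.apache.kafka.common.serialization.StringDeserializer");
    props.put("value.deserializer",
        "org.apache.kafka.common.serialization.StringDeserializer");

    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
         Connection db = DriverManager.getConnection(
             "jdbc:postgresql://localhost/analytics", "etl", "secret")) {
      consumer.subscribe(Collections.singletonList("raw-events")); // placeholder topic

      // Keyed on the Kafka record key, so replaying records after a crash
      // just rewrites the same rows: the idempotent upserts mentioned above.
      PreparedStatement upsert = db.prepareStatement(
          "INSERT INTO events (id, payload) VALUES (?, ?) "
              + "ON CONFLICT (id) DO UPDATE SET payload = EXCLUDED.payload");

      while (true) {
        ConsumerRecords<String, String> records = consumer.poll(1000);
        for (ConsumerRecord<String, String> r : records) {
          upsert.setString(1, r.key());
          upsert.setString(2, r.value()); // light standardization would go here
          upsert.executeUpdate();
        }
        consumer.commitSync();
      }
    }
  }
}

At-least-once delivery plus a uniquely keyed upsert gives you writes
that are safe to replay, without needing Spark in the path at all.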

On Thu, Sep 29, 2016 at 10:35 AM, Ali Akhtar <al...@gmail.com> wrote:
> My concern with Postgres / Cassandra is only scalability. I will look
> further into Postgres horizontal scaling, thanks.
>
> Writes could be idempotent if done as upserts, otherwise updates will be
> idempotent but not inserts.
>
> Data should not be lost. The system should be as fault tolerant as possible.
>
> What's the advantage of using Spark for reading Kafka instead of direct
> Kafka consumers?
>
> On Thu, Sep 29, 2016 at 8:28 PM, Cody Koeninger <co...@koeninger.org> wrote:
>>
>> I wouldn't give up the flexibility and maturity of a relational
>> database, unless you have a very specific use case.  I'm not trashing
>> cassandra, I've used cassandra, but if all I know is that you're doing
>> analytics, I wouldn't want to give up the ability to easily do ad-hoc
>> aggregations without a lot of forethought.  If you're worried about
>> scaling, there are several options for horizontally scaling Postgres
>> in particular.  One of the current best from what I've worked with is
>> Citus.
>>
>> On Thu, Sep 29, 2016 at 10:15 AM, Deepak Sharma <de...@gmail.com>
>> wrote:
>> > Hi Cody
>> > Spark direct stream is just fine for this use case.
>> > But why postgres and not cassandra?
>> > Is there anything specific here that i may not be aware?
>> >
>> > Thanks
>> > Deepak
>> >
>> > On Thu, Sep 29, 2016 at 8:41 PM, Cody Koeninger <co...@koeninger.org>
>> > wrote:
>> >>
>> >> How are you going to handle etl failures?  Do you care about lost /
>> >> duplicated data?  Are your writes idempotent?
>> >>
>> >> Absent any other information about the problem, I'd stay away from
>> >> cassandra/flume/hdfs/hbase/whatever, and use a spark direct stream
>> >> feeding postgres.
>> >>
>> >> On Thu, Sep 29, 2016 at 10:04 AM, Ali Akhtar <al...@gmail.com>
>> >> wrote:
>> >> > Is there an advantage to that vs directly consuming from Kafka?
>> >> > Nothing
>> >> > is
>> >> > being done to the data except some light ETL and then storing it in
>> >> > Cassandra
>> >> >
>> >> > On Thu, Sep 29, 2016 at 7:58 PM, Deepak Sharma
>> >> > <de...@gmail.com>
>> >> > wrote:
>> >> >>
>> >> >> Its better you use spark's direct stream to ingest from kafka.
>> >> >>
>> >> >> On Thu, Sep 29, 2016 at 8:24 PM, Ali Akhtar <al...@gmail.com>
>> >> >> wrote:
>> >> >>>
>> >> >>> I don't think I need a different speed storage and batch storage.
>> >> >>> Just
>> >> >>> taking in raw data from Kafka, standardizing, and storing it
>> >> >>> somewhere
>> >> >>> where
>> >> >>> the web UI can query it, seems like it will be enough.
>> >> >>>
>> >> >>> I'm thinking about:
>> >> >>>
>> >> >>> - Reading data from Kafka via Spark Streaming
>> >> >>> - Standardizing, then storing it in Cassandra
>> >> >>> - Querying Cassandra from the web ui
>> >> >>>
>> >> >>> That seems like it will work. My question now is whether to use
>> >> >>> Spark
>> >> >>> Streaming to read Kafka, or use Kafka consumers directly.
>> >> >>>
>> >> >>>
>> >> >>> On Thu, Sep 29, 2016 at 7:41 PM, Mich Talebzadeh
>> >> >>> <mi...@gmail.com> wrote:
>> >> >>>>
>> >> >>>> - Spark Streaming to read data from Kafka
>> >> >>>> - Storing the data on HDFS using Flume
>> >> >>>>
>> >> >>>> You don't need Spark streaming to read data from Kafka and store
>> >> >>>> on
>> >> >>>> HDFS. It is a waste of resources.
>> >> >>>>
>> >> >>>> Couple Flume to use Kafka as source and HDFS as sink directly
>> >> >>>>
>> >> >>>> KafkaAgent.sources = kafka-sources
>> >> >>>> KafkaAgent.sinks.hdfs-sinks.type = hdfs
>> >> >>>>
>> >> >>>> That will be for your batch layer. To analyse you can directly
>> >> >>>> read
>> >> >>>> from
>> >> >>>> hdfs files with Spark or simply store data in a database of your
>> >> >>>> choice via
>> >> >>>> cron or something. Do not mix your batch layer with speed layer.
>> >> >>>>
>> >> >>>> Your speed layer will ingest the same data directly from Kafka
>> >> >>>> into
>> >> >>>> spark streaming and that will be  online or near real time
>> >> >>>> (defined
>> >> >>>> by your
>> >> >>>> window).
>> >> >>>>
>> >> >>>> Then you have a a serving layer to present data from both speed
>> >> >>>> (the
>> >> >>>> one from SS) and batch layer.
>> >> >>>>
>> >> >>>> HTH
>> >> >>>>
>> >> >>>>
>> >> >>>>
>> >> >>>>
>> >> >>>> Dr Mich Talebzadeh
>> >> >>>>
>> >> >>>>
>> >> >>>>
>> >> >>>> LinkedIn
>> >> >>>>
>> >> >>>>
>> >> >>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> >> >>>>
>> >> >>>>
>> >> >>>>
>> >> >>>> http://talebzadehmich.wordpress.com
>> >> >>>>
>> >> >>>>
>> >> >>>> Disclaimer: Use it at your own risk. Any and all responsibility
>> >> >>>> for
>> >> >>>> any
>> >> >>>> loss, damage or destruction of data or any other property which
>> >> >>>> may
>> >> >>>> arise
>> >> >>>> from relying on this email's technical content is explicitly
>> >> >>>> disclaimed. The
>> >> >>>> author will in no case be liable for any monetary damages arising
>> >> >>>> from such
>> >> >>>> loss, damage or destruction.
>> >> >>>>
>> >> >>>>
>> >> >>>>
>> >> >>>>
>> >> >>>> On 29 September 2016 at 15:15, Ali Akhtar <al...@gmail.com>
>> >> >>>> wrote:
>> >> >>>>>
>> >> >>>>> The web UI is actually the speed layer, it needs to be able to
>> >> >>>>> query
>> >> >>>>> the data online, and show the results in real-time.
>> >> >>>>>
>> >> >>>>> It also needs a custom front-end, so a system like Tableau can't
>> >> >>>>> be
>> >> >>>>> used, it must have a custom backend + front-end.
>> >> >>>>>
>> >> >>>>> Thanks for the recommendation of Flume. Do you think this will
>> >> >>>>> work:
>> >> >>>>>
>> >> >>>>> - Spark Streaming to read data from Kafka
>> >> >>>>> - Storing the data on HDFS using Flume
>> >> >>>>> - Using Spark to query the data in the backend of the web UI?
>> >> >>>>>
>> >> >>>>>
>> >> >>>>>
>> >> >>>>> On Thu, Sep 29, 2016 at 7:08 PM, Mich Talebzadeh
>> >> >>>>> <mi...@gmail.com> wrote:
>> >> >>>>>>
>> >> >>>>>> You need a batch layer and a speed layer. Data from Kafka can be
>> >> >>>>>> stored on HDFS using flume.
>> >> >>>>>>
>> >> >>>>>> -  Query this data to generate reports / analytics (There will
>> >> >>>>>> be a
>> >> >>>>>> web UI which will be the front-end to the data, and will show
>> >> >>>>>> the
>> >> >>>>>> reports)
>> >> >>>>>>
>> >> >>>>>> This is basically batch layer and you need something like
>> >> >>>>>> Tableau
>> >> >>>>>> or
>> >> >>>>>> Zeppelin to query data
>> >> >>>>>>
>> >> >>>>>> You will also need spark streaming to query data online for
>> >> >>>>>> speed
>> >> >>>>>> layer. That data could be stored in some transient fabric like
>> >> >>>>>> ignite or
>> >> >>>>>> even druid.
>> >> >>>>>>
>> >> >>>>>> HTH
>> >> >>>>>>
>> >> >>>>>>
>> >> >>>>>>
>> >> >>>>>>
>> >> >>>>>>
>> >> >>>>>>
>> >> >>>>>>
>> >> >>>>>>
>> >> >>>>>> Dr Mich Talebzadeh
>> >> >>>>>>
>> >> >>>>>>
>> >> >>>>>>
>> >> >>>>>> LinkedIn
>> >> >>>>>>
>> >> >>>>>>
>> >> >>>>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> >> >>>>>>
>> >> >>>>>>
>> >> >>>>>>
>> >> >>>>>> http://talebzadehmich.wordpress.com
>> >> >>>>>>
>> >> >>>>>>
>> >> >>>>>> Disclaimer: Use it at your own risk. Any and all responsibility
>> >> >>>>>> for
>> >> >>>>>> any loss, damage or destruction of data or any other property
>> >> >>>>>> which
>> >> >>>>>> may
>> >> >>>>>> arise from relying on this email's technical content is
>> >> >>>>>> explicitly
>> >> >>>>>> disclaimed. The author will in no case be liable for any
>> >> >>>>>> monetary
>> >> >>>>>> damages
>> >> >>>>>> arising from such loss, damage or destruction.
>> >> >>>>>>
>> >> >>>>>>
>> >> >>>>>>
>> >> >>>>>>
>> >> >>>>>> On 29 September 2016 at 15:01, Ali Akhtar <al...@gmail.com>
>> >> >>>>>> wrote:
>> >> >>>>>>>
>> >> >>>>>>> It needs to be able to scale to a very large amount of data,
>> >> >>>>>>> yes.
>> >> >>>>>>>
>> >> >>>>>>> On Thu, Sep 29, 2016 at 7:00 PM, Deepak Sharma
>> >> >>>>>>> <de...@gmail.com> wrote:
>> >> >>>>>>>>
>> >> >>>>>>>> What is the message inflow ?
>> >> >>>>>>>> If it's really high , definitely spark will be of great use .
>> >> >>>>>>>>
>> >> >>>>>>>> Thanks
>> >> >>>>>>>> Deepak
>> >> >>>>>>>>
>> >> >>>>>>>>
>> >> >>>>>>>> On Sep 29, 2016 19:24, "Ali Akhtar" <al...@gmail.com>
>> >> >>>>>>>> wrote:
>> >> >>>>>>>>>
>> >> >>>>>>>>> I have a somewhat tricky use case, and I'm looking for ideas.
>> >> >>>>>>>>>
>> >> >>>>>>>>> I have 5-6 Kafka producers, reading various APIs, and writing
>> >> >>>>>>>>> their
>> >> >>>>>>>>> raw data into Kafka.
>> >> >>>>>>>>>
>> >> >>>>>>>>> I need to:
>> >> >>>>>>>>>
>> >> >>>>>>>>> - Do ETL on the data, and standardize it.
>> >> >>>>>>>>>
>> >> >>>>>>>>> - Store the standardized data somewhere (HBase / Cassandra /
>> >> >>>>>>>>> Raw
>> >> >>>>>>>>> HDFS / ElasticSearch / Postgres)
>> >> >>>>>>>>>
>> >> >>>>>>>>> - Query this data to generate reports / analytics (There will
>> >> >>>>>>>>> be
>> >> >>>>>>>>> a
>> >> >>>>>>>>> web UI which will be the front-end to the data, and will show
>> >> >>>>>>>>> the reports)
>> >> >>>>>>>>>
>> >> >>>>>>>>> Java is being used as the backend language for everything
>> >> >>>>>>>>> (backend
>> >> >>>>>>>>> of the web UI, as well as the ETL layer)
>> >> >>>>>>>>>
>> >> >>>>>>>>> I'm considering:
>> >> >>>>>>>>>
>> >> >>>>>>>>> - Using raw Kafka consumers, or Spark Streaming, as the ETL
>> >> >>>>>>>>> layer
>> >> >>>>>>>>> (receive raw data from Kafka, standardize & store it)
>> >> >>>>>>>>>
>> >> >>>>>>>>> - Using Cassandra, HBase, or raw HDFS, for storing the
>> >> >>>>>>>>> standardized
>> >> >>>>>>>>> data, and to allow queries
>> >> >>>>>>>>>
>> >> >>>>>>>>> - In the backend of the web UI, I could either use Spark to
>> >> >>>>>>>>> run
>> >> >>>>>>>>> queries across the data (mostly filters), or directly run
>> >> >>>>>>>>> queries against
>> >> >>>>>>>>> Cassandra / HBase
>> >> >>>>>>>>>
>> >> >>>>>>>>> I'd appreciate some thoughts / suggestions on which of these
>> >> >>>>>>>>> alternatives I should go with (e.g, using raw Kafka consumers
>> >> >>>>>>>>> vs
>> >> >>>>>>>>> Spark for
>> >> >>>>>>>>> ETL, which persistent data store to use, and how to query
>> >> >>>>>>>>> that
>> >> >>>>>>>>> data store in
>> >> >>>>>>>>> the backend of the web UI, for displaying the reports).
>> >> >>>>>>>>>
>> >> >>>>>>>>>
>> >> >>>>>>>>> Thanks.
>> >> >>>>>>>
>> >> >>>>>>>
>> >> >>>>>>
>> >> >>>>>
>> >> >>>>
>> >> >>>
>> >> >>
>> >> >>
>> >> >>
>> >> >> --
>> >> >> Thanks
>> >> >> Deepak
>> >> >> www.bigdatabig.com
>> >> >> www.keosha.net
>> >> >
>> >> >
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>> >>
>> >
>> >
>> >
>> > --
>> > Thanks
>> > Deepak
>> > www.bigdatabig.com
>> > www.keosha.net
>
>

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Re: Architecture recommendations for a tricky use case

Posted by Cody Koeninger <co...@koeninger.org>.
If you're doing any kind of pre-aggregation during ETL, spark direct
stream will let you more easily get the delivery semantics you need,
especially if you're using a transactional data store.

If you're literally just copying individual uniquely keyed items from
kafka to a key-value store, use kafka consumers, sure.

On Thu, Sep 29, 2016 at 10:35 AM, Ali Akhtar <al...@gmail.com> wrote:
> My concern with Postgres / Cassandra is only scalability. I will look
> further into Postgres horizontal scaling, thanks.
>
> Writes could be idempotent if done as upserts, otherwise updates will be
> idempotent but not inserts.
>
> Data should not be lost. The system should be as fault tolerant as possible.
>
> What's the advantage of using Spark for reading Kafka instead of direct
> Kafka consumers?
>
> On Thu, Sep 29, 2016 at 8:28 PM, Cody Koeninger <co...@koeninger.org> wrote:
>>
>> I wouldn't give up the flexibility and maturity of a relational
>> database, unless you have a very specific use case.  I'm not trashing
>> cassandra, I've used cassandra, but if all I know is that you're doing
>> analytics, I wouldn't want to give up the ability to easily do ad-hoc
>> aggregations without a lot of forethought.  If you're worried about
>> scaling, there are several options for horizontally scaling Postgres
>> in particular.  One of the current best from what I've worked with is
>> Citus.
>>
>> On Thu, Sep 29, 2016 at 10:15 AM, Deepak Sharma <de...@gmail.com>
>> wrote:
>> > Hi Cody
>> > Spark direct stream is just fine for this use case.
>> > But why postgres and not cassandra?
>> > Is there anything specific here that i may not be aware?
>> >
>> > Thanks
>> > Deepak
>> >
>> > On Thu, Sep 29, 2016 at 8:41 PM, Cody Koeninger <co...@koeninger.org>
>> > wrote:
>> >>
>> >> How are you going to handle etl failures?  Do you care about lost /
>> >> duplicated data?  Are your writes idempotent?
>> >>
>> >> Absent any other information about the problem, I'd stay away from
>> >> cassandra/flume/hdfs/hbase/whatever, and use a spark direct stream
>> >> feeding postgres.
>> >>
>> >> On Thu, Sep 29, 2016 at 10:04 AM, Ali Akhtar <al...@gmail.com>
>> >> wrote:
>> >> > Is there an advantage to that vs directly consuming from Kafka?
>> >> > Nothing
>> >> > is
>> >> > being done to the data except some light ETL and then storing it in
>> >> > Cassandra
>> >> >
>> >> > On Thu, Sep 29, 2016 at 7:58 PM, Deepak Sharma
>> >> > <de...@gmail.com>
>> >> > wrote:
>> >> >>
>> >> >> Its better you use spark's direct stream to ingest from kafka.
>> >> >>
>> >> >> On Thu, Sep 29, 2016 at 8:24 PM, Ali Akhtar <al...@gmail.com>
>> >> >> wrote:
>> >> >>>
>> >> >>> I don't think I need a different speed storage and batch storage.
>> >> >>> Just
>> >> >>> taking in raw data from Kafka, standardizing, and storing it
>> >> >>> somewhere
>> >> >>> where
>> >> >>> the web UI can query it, seems like it will be enough.
>> >> >>>
>> >> >>> I'm thinking about:
>> >> >>>
>> >> >>> - Reading data from Kafka via Spark Streaming
>> >> >>> - Standardizing, then storing it in Cassandra
>> >> >>> - Querying Cassandra from the web ui
>> >> >>>
>> >> >>> That seems like it will work. My question now is whether to use
>> >> >>> Spark
>> >> >>> Streaming to read Kafka, or use Kafka consumers directly.
>> >> >>>
>> >> >>>
>> >> >>> On Thu, Sep 29, 2016 at 7:41 PM, Mich Talebzadeh
>> >> >>> <mi...@gmail.com> wrote:
>> >> >>>>
>> >> >>>> - Spark Streaming to read data from Kafka
>> >> >>>> - Storing the data on HDFS using Flume
>> >> >>>>
>> >> >>>> You don't need Spark streaming to read data from Kafka and store
>> >> >>>> on
>> >> >>>> HDFS. It is a waste of resources.
>> >> >>>>
>> >> >>>> Couple Flume to use Kafka as source and HDFS as sink directly
>> >> >>>>
>> >> >>>> KafkaAgent.sources = kafka-sources
>> >> >>>> KafkaAgent.sinks.hdfs-sinks.type = hdfs
>> >> >>>>
>> >> >>>> That will be for your batch layer. To analyse you can directly
>> >> >>>> read
>> >> >>>> from
>> >> >>>> hdfs files with Spark or simply store data in a database of your
>> >> >>>> choice via
>> >> >>>> cron or something. Do not mix your batch layer with speed layer.
>> >> >>>>
>> >> >>>> Your speed layer will ingest the same data directly from Kafka
>> >> >>>> into
>> >> >>>> spark streaming and that will be  online or near real time
>> >> >>>> (defined
>> >> >>>> by your
>> >> >>>> window).
>> >> >>>>
>> >> >>>> Then you have a serving layer to present data from both speed
>> >> >>>> (the
>> >> >>>> one from SS) and batch layer.
>> >> >>>>
>> >> >>>> HTH
>> >> >>>>
>> >> >>>>
>> >> >>>>
>> >> >>>>
>> >> >>>> Dr Mich Talebzadeh
>> >> >>>>
>> >> >>>>
>> >> >>>>
>> >> >>>> LinkedIn
>> >> >>>>
>> >> >>>>
>> >> >>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> >> >>>>
>> >> >>>>
>> >> >>>>
>> >> >>>> http://talebzadehmich.wordpress.com
>> >> >>>>
>> >> >>>>
>> >> >>>> Disclaimer: Use it at your own risk. Any and all responsibility
>> >> >>>> for
>> >> >>>> any
>> >> >>>> loss, damage or destruction of data or any other property which
>> >> >>>> may
>> >> >>>> arise
>> >> >>>> from relying on this email's technical content is explicitly
>> >> >>>> disclaimed. The
>> >> >>>> author will in no case be liable for any monetary damages arising
>> >> >>>> from such
>> >> >>>> loss, damage or destruction.
>> >> >>>>
>> >> >>>>
>> >> >>>>
>> >> >>>>
>> >> >>>> On 29 September 2016 at 15:15, Ali Akhtar <al...@gmail.com>
>> >> >>>> wrote:
>> >> >>>>>
>> >> >>>>> The web UI is actually the speed layer, it needs to be able to
>> >> >>>>> query
>> >> >>>>> the data online, and show the results in real-time.
>> >> >>>>>
>> >> >>>>> It also needs a custom front-end, so a system like Tableau can't
>> >> >>>>> be
>> >> >>>>> used, it must have a custom backend + front-end.
>> >> >>>>>
>> >> >>>>> Thanks for the recommendation of Flume. Do you think this will
>> >> >>>>> work:
>> >> >>>>>
>> >> >>>>> - Spark Streaming to read data from Kafka
>> >> >>>>> - Storing the data on HDFS using Flume
>> >> >>>>> - Using Spark to query the data in the backend of the web UI?
>> >> >>>>>
>> >> >>>>>
>> >> >>>>>
>> >> >>>>> On Thu, Sep 29, 2016 at 7:08 PM, Mich Talebzadeh
>> >> >>>>> <mi...@gmail.com> wrote:
>> >> >>>>>>
>> >> >>>>>> You need a batch layer and a speed layer. Data from Kafka can be
>> >> >>>>>> stored on HDFS using flume.
>> >> >>>>>>
>> >> >>>>>> -  Query this data to generate reports / analytics (There will
>> >> >>>>>> be a
>> >> >>>>>> web UI which will be the front-end to the data, and will show
>> >> >>>>>> the
>> >> >>>>>> reports)
>> >> >>>>>>
>> >> >>>>>> This is basically batch layer and you need something like
>> >> >>>>>> Tableau
>> >> >>>>>> or
>> >> >>>>>> Zeppelin to query data
>> >> >>>>>>
>> >> >>>>>> You will also need spark streaming to query data online for
>> >> >>>>>> speed
>> >> >>>>>> layer. That data could be stored in some transient fabric like
>> >> >>>>>> ignite or
>> >> >>>>>> even druid.
>> >> >>>>>>
>> >> >>>>>> HTH
>> >> >>>>>>
>> >> >>>>>>
>> >> >>>>>>
>> >> >>>>>>
>> >> >>>>>>
>> >> >>>>>>
>> >> >>>>>>
>> >> >>>>>>
>> >> >>>>>> Dr Mich Talebzadeh
>> >> >>>>>>
>> >> >>>>>>
>> >> >>>>>>
>> >> >>>>>> LinkedIn
>> >> >>>>>>
>> >> >>>>>>
>> >> >>>>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> >> >>>>>>
>> >> >>>>>>
>> >> >>>>>>
>> >> >>>>>> http://talebzadehmich.wordpress.com
>> >> >>>>>>
>> >> >>>>>>
>> >> >>>>>> Disclaimer: Use it at your own risk. Any and all responsibility
>> >> >>>>>> for
>> >> >>>>>> any loss, damage or destruction of data or any other property
>> >> >>>>>> which
>> >> >>>>>> may
>> >> >>>>>> arise from relying on this email's technical content is
>> >> >>>>>> explicitly
>> >> >>>>>> disclaimed. The author will in no case be liable for any
>> >> >>>>>> monetary
>> >> >>>>>> damages
>> >> >>>>>> arising from such loss, damage or destruction.
>> >> >>>>>>
>> >> >>>>>>
>> >> >>>>>>
>> >> >>>>>>
>> >> >>>>>> On 29 September 2016 at 15:01, Ali Akhtar <al...@gmail.com>
>> >> >>>>>> wrote:
>> >> >>>>>>>
>> >> >>>>>>> It needs to be able to scale to a very large amount of data,
>> >> >>>>>>> yes.
>> >> >>>>>>>
>> >> >>>>>>> On Thu, Sep 29, 2016 at 7:00 PM, Deepak Sharma
>> >> >>>>>>> <de...@gmail.com> wrote:
>> >> >>>>>>>>
>> >> >>>>>>>> What is the message inflow ?
>> >> >>>>>>>> If it's really high , definitely spark will be of great use .
>> >> >>>>>>>>
>> >> >>>>>>>> Thanks
>> >> >>>>>>>> Deepak
>> >> >>>>>>>>
>> >> >>>>>>>>
>> >> >>>>>>>> On Sep 29, 2016 19:24, "Ali Akhtar" <al...@gmail.com>
>> >> >>>>>>>> wrote:
>> >> >>>>>>>>>
>> >> >>>>>>>>> I have a somewhat tricky use case, and I'm looking for ideas.
>> >> >>>>>>>>>
>> >> >>>>>>>>> I have 5-6 Kafka producers, reading various APIs, and writing
>> >> >>>>>>>>> their
>> >> >>>>>>>>> raw data into Kafka.
>> >> >>>>>>>>>
>> >> >>>>>>>>> I need to:
>> >> >>>>>>>>>
>> >> >>>>>>>>> - Do ETL on the data, and standardize it.
>> >> >>>>>>>>>
>> >> >>>>>>>>> - Store the standardized data somewhere (HBase / Cassandra /
>> >> >>>>>>>>> Raw
>> >> >>>>>>>>> HDFS / ElasticSearch / Postgres)
>> >> >>>>>>>>>
>> >> >>>>>>>>> - Query this data to generate reports / analytics (There will
>> >> >>>>>>>>> be
>> >> >>>>>>>>> a
>> >> >>>>>>>>> web UI which will be the front-end to the data, and will show
>> >> >>>>>>>>> the reports)
>> >> >>>>>>>>>
>> >> >>>>>>>>> Java is being used as the backend language for everything
>> >> >>>>>>>>> (backend
>> >> >>>>>>>>> of the web UI, as well as the ETL layer)
>> >> >>>>>>>>>
>> >> >>>>>>>>> I'm considering:
>> >> >>>>>>>>>
>> >> >>>>>>>>> - Using raw Kafka consumers, or Spark Streaming, as the ETL
>> >> >>>>>>>>> layer
>> >> >>>>>>>>> (receive raw data from Kafka, standardize & store it)
>> >> >>>>>>>>>
>> >> >>>>>>>>> - Using Cassandra, HBase, or raw HDFS, for storing the
>> >> >>>>>>>>> standardized
>> >> >>>>>>>>> data, and to allow queries
>> >> >>>>>>>>>
>> >> >>>>>>>>> - In the backend of the web UI, I could either use Spark to
>> >> >>>>>>>>> run
>> >> >>>>>>>>> queries across the data (mostly filters), or directly run
>> >> >>>>>>>>> queries against
>> >> >>>>>>>>> Cassandra / HBase
>> >> >>>>>>>>>
>> >> >>>>>>>>> I'd appreciate some thoughts / suggestions on which of these
>> >> >>>>>>>>> alternatives I should go with (e.g, using raw Kafka consumers
>> >> >>>>>>>>> vs
>> >> >>>>>>>>> Spark for
>> >> >>>>>>>>> ETL, which persistent data store to use, and how to query
>> >> >>>>>>>>> that
>> >> >>>>>>>>> data store in
>> >> >>>>>>>>> the backend of the web UI, for displaying the reports).
>> >> >>>>>>>>>
>> >> >>>>>>>>>
>> >> >>>>>>>>> Thanks.
>> >> >>>>>>>
>> >> >>>>>>>
>> >> >>>>>>
>> >> >>>>>
>> >> >>>>
>> >> >>>
>> >> >>
>> >> >>
>> >> >>
>> >> >> --
>> >> >> Thanks
>> >> >> Deepak
>> >> >> www.bigdatabig.com
>> >> >> www.keosha.net
>> >> >
>> >> >
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>> >>
>> >
>> >
>> >
>> > --
>> > Thanks
>> > Deepak
>> > www.bigdatabig.com
>> > www.keosha.net
>
>

Re: Architecture recommendations for a tricky use case

Posted by Ali Akhtar <al...@gmail.com>.
My concern with Postgres / Cassandra is only scalability. I will look
further into Postgres horizontal scaling, thanks.

Writes could be made idempotent if done as upserts; otherwise updates
will be idempotent, but inserts won't be.
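
For example, with Postgres 9.5+ an upsert keyed on a natural unique id
from the source API makes replays harmless. A minimal sketch; the
events(event_id, payload) table and the UpsertWriter name are
assumptions:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

final class UpsertWriter {
    // Re-running the same record rewrites the same row, so duplicate
    // deliveries from Kafka do no harm.
    static void upsert(Connection c, String eventId, String payload)
            throws SQLException {
        try (PreparedStatement ps = c.prepareStatement(
                "insert into events (event_id, payload) values (?, ?) "
                + "on conflict (event_id) do update set payload = excluded.payload")) {
            ps.setString(1, eventId);
            ps.setString(2, payload);
            ps.executeUpdate();
        }
    }
}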

Data should not be lost. The system should be as fault-tolerant as possible.

What's the advantage of using Spark for reading Kafka instead of direct
Kafka consumers?

On Thu, Sep 29, 2016 at 8:28 PM, Cody Koeninger <co...@koeninger.org> wrote:

> I wouldn't give up the flexibility and maturity of a relational
> database, unless you have a very specific use case.  I'm not trashing
> cassandra, I've used cassandra, but if all I know is that you're doing
> analytics, I wouldn't want to give up the ability to easily do ad-hoc
> aggregations without a lot of forethought.  If you're worried about
> scaling, there are several options for horizontally scaling Postgres
> in particular.  One of the current best from what I've worked with is
> Citus.
>
> On Thu, Sep 29, 2016 at 10:15 AM, Deepak Sharma <de...@gmail.com>
> wrote:
> > Hi Cody
> > Spark direct stream is just fine for this use case.
> > But why postgres and not cassandra?
> > Is there anything specific here that i may not be aware?
> >
> > Thanks
> > Deepak
> >
> > On Thu, Sep 29, 2016 at 8:41 PM, Cody Koeninger <co...@koeninger.org>
> wrote:
> >>
> >> How are you going to handle etl failures?  Do you care about lost /
> >> duplicated data?  Are your writes idempotent?
> >>
> >> Absent any other information about the problem, I'd stay away from
> >> cassandra/flume/hdfs/hbase/whatever, and use a spark direct stream
> >> feeding postgres.
> >>
> >> On Thu, Sep 29, 2016 at 10:04 AM, Ali Akhtar <al...@gmail.com>
> wrote:
> >> > Is there an advantage to that vs directly consuming from Kafka?
> Nothing
> >> > is
> >> > being done to the data except some light ETL and then storing it in
> >> > Cassandra
> >> >
> >> > On Thu, Sep 29, 2016 at 7:58 PM, Deepak Sharma <deepakmca05@gmail.com
> >
> >> > wrote:
> >> >>
> >> >> Its better you use spark's direct stream to ingest from kafka.
> >> >>
> >> >> On Thu, Sep 29, 2016 at 8:24 PM, Ali Akhtar <al...@gmail.com>
> >> >> wrote:
> >> >>>
> >> >>> I don't think I need a different speed storage and batch storage.
> Just
> >> >>> taking in raw data from Kafka, standardizing, and storing it
> somewhere
> >> >>> where
> >> >>> the web UI can query it, seems like it will be enough.
> >> >>>
> >> >>> I'm thinking about:
> >> >>>
> >> >>> - Reading data from Kafka via Spark Streaming
> >> >>> - Standardizing, then storing it in Cassandra
> >> >>> - Querying Cassandra from the web ui
> >> >>>
> >> >>> That seems like it will work. My question now is whether to use
> Spark
> >> >>> Streaming to read Kafka, or use Kafka consumers directly.
> >> >>>
> >> >>>
> >> >>> On Thu, Sep 29, 2016 at 7:41 PM, Mich Talebzadeh
> >> >>> <mi...@gmail.com> wrote:
> >> >>>>
> >> >>>> - Spark Streaming to read data from Kafka
> >> >>>> - Storing the data on HDFS using Flume
> >> >>>>
> >> >>>> You don't need Spark streaming to read data from Kafka and store on
> >> >>>> HDFS. It is a waste of resources.
> >> >>>>
> >> >>>> Couple Flume to use Kafka as source and HDFS as sink directly
> >> >>>>
> >> >>>> KafkaAgent.sources = kafka-sources
> >> >>>> KafkaAgent.sinks.hdfs-sinks.type = hdfs
> >> >>>>
> >> >>>> That will be for your batch layer. To analyse you can directly read
> >> >>>> from
> >> >>>> hdfs files with Spark or simply store data in a database of your
> >> >>>> choice via
> >> >>>> cron or something. Do not mix your batch layer with speed layer.
> >> >>>>
> >> >>>> Your speed layer will ingest the same data directly from Kafka into
> >> >>>> spark streaming and that will be  online or near real time (defined
> >> >>>> by your
> >> >>>> window).
> >> >>>>
> >> >>>> Then you have a serving layer to present data from both speed
> (the
> >> >>>> one from SS) and batch layer.
> >> >>>>
> >> >>>> HTH
> >> >>>>
> >> >>>>
> >> >>>>
> >> >>>>
> >> >>>> Dr Mich Talebzadeh
> >> >>>>
> >> >>>>
> >> >>>>
> >> >>>> LinkedIn
> >> >>>>
> >> >>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> >> >>>>
> >> >>>>
> >> >>>>
> >> >>>> http://talebzadehmich.wordpress.com
> >> >>>>
> >> >>>>
> >> >>>> Disclaimer: Use it at your own risk. Any and all responsibility for
> >> >>>> any
> >> >>>> loss, damage or destruction of data or any other property which may
> >> >>>> arise
> >> >>>> from relying on this email's technical content is explicitly
> >> >>>> disclaimed. The
> >> >>>> author will in no case be liable for any monetary damages arising
> >> >>>> from such
> >> >>>> loss, damage or destruction.
> >> >>>>
> >> >>>>
> >> >>>>
> >> >>>>
> >> >>>> On 29 September 2016 at 15:15, Ali Akhtar <al...@gmail.com>
> >> >>>> wrote:
> >> >>>>>
> >> >>>>> The web UI is actually the speed layer, it needs to be able to
> query
> >> >>>>> the data online, and show the results in real-time.
> >> >>>>>
> >> >>>>> It also needs a custom front-end, so a system like Tableau can't
> be
> >> >>>>> used, it must have a custom backend + front-end.
> >> >>>>>
> >> >>>>> Thanks for the recommendation of Flume. Do you think this will
> work:
> >> >>>>>
> >> >>>>> - Spark Streaming to read data from Kafka
> >> >>>>> - Storing the data on HDFS using Flume
> >> >>>>> - Using Spark to query the data in the backend of the web UI?
> >> >>>>>
> >> >>>>>
> >> >>>>>
> >> >>>>> On Thu, Sep 29, 2016 at 7:08 PM, Mich Talebzadeh
> >> >>>>> <mi...@gmail.com> wrote:
> >> >>>>>>
> >> >>>>>> You need a batch layer and a speed layer. Data from Kafka can be
> >> >>>>>> stored on HDFS using flume.
> >> >>>>>>
> >> >>>>>> -  Query this data to generate reports / analytics (There will
> be a
> >> >>>>>> web UI which will be the front-end to the data, and will show the
> >> >>>>>> reports)
> >> >>>>>>
> >> >>>>>> This is basically batch layer and you need something like Tableau
> >> >>>>>> or
> >> >>>>>> Zeppelin to query data
> >> >>>>>>
> >> >>>>>> You will also need spark streaming to query data online for speed
> >> >>>>>> layer. That data could be stored in some transient fabric like
> >> >>>>>> ignite or
> >> >>>>>> even druid.
> >> >>>>>>
> >> >>>>>> HTH
> >> >>>>>>
> >> >>>>>>
> >> >>>>>>
> >> >>>>>>
> >> >>>>>>
> >> >>>>>>
> >> >>>>>>
> >> >>>>>>
> >> >>>>>> Dr Mich Talebzadeh
> >> >>>>>>
> >> >>>>>>
> >> >>>>>>
> >> >>>>>> LinkedIn
> >> >>>>>>
> >> >>>>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> >> >>>>>>
> >> >>>>>>
> >> >>>>>>
> >> >>>>>> http://talebzadehmich.wordpress.com
> >> >>>>>>
> >> >>>>>>
> >> >>>>>> Disclaimer: Use it at your own risk. Any and all responsibility
> for
> >> >>>>>> any loss, damage or destruction of data or any other property
> which
> >> >>>>>> may
> >> >>>>>> arise from relying on this email's technical content is
> explicitly
> >> >>>>>> disclaimed. The author will in no case be liable for any monetary
> >> >>>>>> damages
> >> >>>>>> arising from such loss, damage or destruction.
> >> >>>>>>
> >> >>>>>>
> >> >>>>>>
> >> >>>>>>
> >> >>>>>> On 29 September 2016 at 15:01, Ali Akhtar <al...@gmail.com>
> >> >>>>>> wrote:
> >> >>>>>>>
> >> >>>>>>> It needs to be able to scale to a very large amount of data,
> yes.
> >> >>>>>>>
> >> >>>>>>> On Thu, Sep 29, 2016 at 7:00 PM, Deepak Sharma
> >> >>>>>>> <de...@gmail.com> wrote:
> >> >>>>>>>>
> >> >>>>>>>> What is the message inflow ?
> >> >>>>>>>> If it's really high , definitely spark will be of great use .
> >> >>>>>>>>
> >> >>>>>>>> Thanks
> >> >>>>>>>> Deepak
> >> >>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>> On Sep 29, 2016 19:24, "Ali Akhtar" <al...@gmail.com>
> wrote:
> >> >>>>>>>>>
> >> >>>>>>>>> I have a somewhat tricky use case, and I'm looking for ideas.
> >> >>>>>>>>>
> >> >>>>>>>>> I have 5-6 Kafka producers, reading various APIs, and writing
> >> >>>>>>>>> their
> >> >>>>>>>>> raw data into Kafka.
> >> >>>>>>>>>
> >> >>>>>>>>> I need to:
> >> >>>>>>>>>
> >> >>>>>>>>> - Do ETL on the data, and standardize it.
> >> >>>>>>>>>
> >> >>>>>>>>> - Store the standardized data somewhere (HBase / Cassandra /
> Raw
> >> >>>>>>>>> HDFS / ElasticSearch / Postgres)
> >> >>>>>>>>>
> >> >>>>>>>>> - Query this data to generate reports / analytics (There will
> be
> >> >>>>>>>>> a
> >> >>>>>>>>> web UI which will be the front-end to the data, and will show
> >> >>>>>>>>> the reports)
> >> >>>>>>>>>
> >> >>>>>>>>> Java is being used as the backend language for everything
> >> >>>>>>>>> (backend
> >> >>>>>>>>> of the web UI, as well as the ETL layer)
> >> >>>>>>>>>
> >> >>>>>>>>> I'm considering:
> >> >>>>>>>>>
> >> >>>>>>>>> - Using raw Kafka consumers, or Spark Streaming, as the ETL
> >> >>>>>>>>> layer
> >> >>>>>>>>> (receive raw data from Kafka, standardize & store it)
> >> >>>>>>>>>
> >> >>>>>>>>> - Using Cassandra, HBase, or raw HDFS, for storing the
> >> >>>>>>>>> standardized
> >> >>>>>>>>> data, and to allow queries
> >> >>>>>>>>>
> >> >>>>>>>>> - In the backend of the web UI, I could either use Spark to
> run
> >> >>>>>>>>> queries across the data (mostly filters), or directly run
> >> >>>>>>>>> queries against
> >> >>>>>>>>> Cassandra / HBase
> >> >>>>>>>>>
> >> >>>>>>>>> I'd appreciate some thoughts / suggestions on which of these
> >> >>>>>>>>> alternatives I should go with (e.g, using raw Kafka consumers
> vs
> >> >>>>>>>>> Spark for
> >> >>>>>>>>> ETL, which persistent data store to use, and how to query that
> >> >>>>>>>>> data store in
> >> >>>>>>>>> the backend of the web UI, for displaying the reports).
> >> >>>>>>>>>
> >> >>>>>>>>>
> >> >>>>>>>>> Thanks.
> >> >>>>>>>
> >> >>>>>>>
> >> >>>>>>
> >> >>>>>
> >> >>>>
> >> >>>
> >> >>
> >> >>
> >> >>
> >> >> --
> >> >> Thanks
> >> >> Deepak
> >> >> www.bigdatabig.com
> >> >> www.keosha.net
> >> >
> >> >
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
> >>
> >
> >
> >
> > --
> > Thanks
> > Deepak
> > www.bigdatabig.com
> > www.keosha.net
>

Re: Architecture recommendations for a tricky use case

Posted by Mich Talebzadeh <mi...@gmail.com>.
Yes, but these writes from Spark still have to go through JDBC, correct?

Having said that, I don't see how going through Spark Streaming to
Postgres is going to be faster than source -> Kafka -> Flume (via ZooKeeper)
-> HDFS.
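
For what it's worth, the two Flume config lines quoted further down in
this thread are only a fragment; a complete Kafka-source-to-HDFS-sink
agent also needs a channel and the bindings. A rough sketch against
Flume 1.6-era properties, where the agent, channel, topic, ZooKeeper
address, and path are all assumptions:

KafkaAgent.sources = kafka-source
KafkaAgent.channels = mem-channel
KafkaAgent.sinks = hdfs-sink

KafkaAgent.sources.kafka-source.type = org.apache.flume.source.kafka.KafkaSource
KafkaAgent.sources.kafka-source.zookeeperConnect = zk1:2181
KafkaAgent.sources.kafka-source.topic = raw-events
KafkaAgent.sources.kafka-source.channels = mem-channel

KafkaAgent.channels.mem-channel.type = memory

KafkaAgent.sinks.hdfs-sink.type = hdfs
KafkaAgent.sinks.hdfs-sink.channel = mem-channel
KafkaAgent.sinks.hdfs-sink.hdfs.path = /data/raw/%Y-%m-%d
KafkaAgent.sinks.hdfs-sink.hdfs.useLocalTimeStamp = true

With something like that in place, Spark can read the daily directories
for the batch layer.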

I believe there is direct streaming from Kafka to Hive as well, and from
Flume to HBase.

I would have thought that if one wanted to do real-time analytics with
Spark Streaming, then that would be a good fit with a real-time dashboard.

What is not so clear is the business use case for this.

HTH




Dr Mich Talebzadeh



LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 29 September 2016 at 16:28, Cody Koeninger <co...@koeninger.org> wrote:

> I wouldn't give up the flexibility and maturity of a relational
> database, unless you have a very specific use case.  I'm not trashing
> cassandra, I've used cassandra, but if all I know is that you're doing
> analytics, I wouldn't want to give up the ability to easily do ad-hoc
> aggregations without a lot of forethought.  If you're worried about
> scaling, there are several options for horizontally scaling Postgres
> in particular.  One of the current best from what I've worked with is
> Citus.
>
> On Thu, Sep 29, 2016 at 10:15 AM, Deepak Sharma <de...@gmail.com>
> wrote:
> > Hi Cody
> > Spark direct stream is just fine for this use case.
> > But why postgres and not cassandra?
> > Is there anything specific here that i may not be aware?
> >
> > Thanks
> > Deepak
> >
> > On Thu, Sep 29, 2016 at 8:41 PM, Cody Koeninger <co...@koeninger.org>
> wrote:
> >>
> >> How are you going to handle etl failures?  Do you care about lost /
> >> duplicated data?  Are your writes idempotent?
> >>
> >> Absent any other information about the problem, I'd stay away from
> >> cassandra/flume/hdfs/hbase/whatever, and use a spark direct stream
> >> feeding postgres.
> >>
> >> On Thu, Sep 29, 2016 at 10:04 AM, Ali Akhtar <al...@gmail.com>
> wrote:
> >> > Is there an advantage to that vs directly consuming from Kafka?
> Nothing
> >> > is
> >> > being done to the data except some light ETL and then storing it in
> >> > Cassandra
> >> >
> >> > On Thu, Sep 29, 2016 at 7:58 PM, Deepak Sharma <deepakmca05@gmail.com
> >
> >> > wrote:
> >> >>
> >> >> Its better you use spark's direct stream to ingest from kafka.
> >> >>
> >> >> On Thu, Sep 29, 2016 at 8:24 PM, Ali Akhtar <al...@gmail.com>
> >> >> wrote:
> >> >>>
> >> >>> I don't think I need a different speed storage and batch storage.
> Just
> >> >>> taking in raw data from Kafka, standardizing, and storing it
> somewhere
> >> >>> where
> >> >>> the web UI can query it, seems like it will be enough.
> >> >>>
> >> >>> I'm thinking about:
> >> >>>
> >> >>> - Reading data from Kafka via Spark Streaming
> >> >>> - Standardizing, then storing it in Cassandra
> >> >>> - Querying Cassandra from the web ui
> >> >>>
> >> >>> That seems like it will work. My question now is whether to use
> Spark
> >> >>> Streaming to read Kafka, or use Kafka consumers directly.
> >> >>>
> >> >>>
> >> >>> On Thu, Sep 29, 2016 at 7:41 PM, Mich Talebzadeh
> >> >>> <mi...@gmail.com> wrote:
> >> >>>>
> >> >>>> - Spark Streaming to read data from Kafka
> >> >>>> - Storing the data on HDFS using Flume
> >> >>>>
> >> >>>> You don't need Spark streaming to read data from Kafka and store on
> >> >>>> HDFS. It is a waste of resources.
> >> >>>>
> >> >>>> Couple Flume to use Kafka as source and HDFS as sink directly
> >> >>>>
> >> >>>> KafkaAgent.sources = kafka-sources
> >> >>>> KafkaAgent.sinks.hdfs-sinks.type = hdfs
> >> >>>>
> >> >>>> That will be for your batch layer. To analyse you can directly read
> >> >>>> from
> >> >>>> hdfs files with Spark or simply store data in a database of your
> >> >>>> choice via
> >> >>>> cron or something. Do not mix your batch layer with speed layer.
> >> >>>>
> >> >>>> Your speed layer will ingest the same data directly from Kafka into
> >> >>>> spark streaming and that will be  online or near real time (defined
> >> >>>> by your
> >> >>>> window).
> >> >>>>
> >> >>>> Then you have a serving layer to present data from both speed
> (the
> >> >>>> one from SS) and batch layer.
> >> >>>>
> >> >>>> HTH
> >> >>>>
> >> >>>>
> >> >>>>
> >> >>>>
> >> >>>> Dr Mich Talebzadeh
> >> >>>>
> >> >>>>
> >> >>>>
> >> >>>> LinkedIn
> >> >>>>
> >> >>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> >> >>>>
> >> >>>>
> >> >>>>
> >> >>>> http://talebzadehmich.wordpress.com
> >> >>>>
> >> >>>>
> >> >>>> Disclaimer: Use it at your own risk. Any and all responsibility for
> >> >>>> any
> >> >>>> loss, damage or destruction of data or any other property which may
> >> >>>> arise
> >> >>>> from relying on this email's technical content is explicitly
> >> >>>> disclaimed. The
> >> >>>> author will in no case be liable for any monetary damages arising
> >> >>>> from such
> >> >>>> loss, damage or destruction.
> >> >>>>
> >> >>>>
> >> >>>>
> >> >>>>
> >> >>>> On 29 September 2016 at 15:15, Ali Akhtar <al...@gmail.com>
> >> >>>> wrote:
> >> >>>>>
> >> >>>>> The web UI is actually the speed layer, it needs to be able to
> query
> >> >>>>> the data online, and show the results in real-time.
> >> >>>>>
> >> >>>>> It also needs a custom front-end, so a system like Tableau can't
> be
> >> >>>>> used, it must have a custom backend + front-end.
> >> >>>>>
> >> >>>>> Thanks for the recommendation of Flume. Do you think this will
> work:
> >> >>>>>
> >> >>>>> - Spark Streaming to read data from Kafka
> >> >>>>> - Storing the data on HDFS using Flume
> >> >>>>> - Using Spark to query the data in the backend of the web UI?
> >> >>>>>
> >> >>>>>
> >> >>>>>
> >> >>>>> On Thu, Sep 29, 2016 at 7:08 PM, Mich Talebzadeh
> >> >>>>> <mi...@gmail.com> wrote:
> >> >>>>>>
> >> >>>>>> You need a batch layer and a speed layer. Data from Kafka can be
> >> >>>>>> stored on HDFS using flume.
> >> >>>>>>
> >> >>>>>> -  Query this data to generate reports / analytics (There will
> be a
> >> >>>>>> web UI which will be the front-end to the data, and will show the
> >> >>>>>> reports)
> >> >>>>>>
> >> >>>>>> This is basically batch layer and you need something like Tableau
> >> >>>>>> or
> >> >>>>>> Zeppelin to query data
> >> >>>>>>
> >> >>>>>> You will also need spark streaming to query data online for speed
> >> >>>>>> layer. That data could be stored in some transient fabric like
> >> >>>>>> ignite or
> >> >>>>>> even druid.
> >> >>>>>>
> >> >>>>>> HTH
> >> >>>>>>
> >> >>>>>>
> >> >>>>>>
> >> >>>>>>
> >> >>>>>>
> >> >>>>>>
> >> >>>>>>
> >> >>>>>>
> >> >>>>>> Dr Mich Talebzadeh
> >> >>>>>>
> >> >>>>>>
> >> >>>>>>
> >> >>>>>> LinkedIn
> >> >>>>>>
> >> >>>>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> >> >>>>>>
> >> >>>>>>
> >> >>>>>>
> >> >>>>>> http://talebzadehmich.wordpress.com
> >> >>>>>>
> >> >>>>>>
> >> >>>>>> Disclaimer: Use it at your own risk. Any and all responsibility
> for
> >> >>>>>> any loss, damage or destruction of data or any other property
> which
> >> >>>>>> may
> >> >>>>>> arise from relying on this email's technical content is
> explicitly
> >> >>>>>> disclaimed. The author will in no case be liable for any monetary
> >> >>>>>> damages
> >> >>>>>> arising from such loss, damage or destruction.
> >> >>>>>>
> >> >>>>>>
> >> >>>>>>
> >> >>>>>>
> >> >>>>>> On 29 September 2016 at 15:01, Ali Akhtar <al...@gmail.com>
> >> >>>>>> wrote:
> >> >>>>>>>
> >> >>>>>>> It needs to be able to scale to a very large amount of data,
> yes.
> >> >>>>>>>
> >> >>>>>>> On Thu, Sep 29, 2016 at 7:00 PM, Deepak Sharma
> >> >>>>>>> <de...@gmail.com> wrote:
> >> >>>>>>>>
> >> >>>>>>>> What is the message inflow ?
> >> >>>>>>>> If it's really high , definitely spark will be of great use .
> >> >>>>>>>>
> >> >>>>>>>> Thanks
> >> >>>>>>>> Deepak
> >> >>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>> On Sep 29, 2016 19:24, "Ali Akhtar" <al...@gmail.com>
> wrote:
> >> >>>>>>>>>
> >> >>>>>>>>> I have a somewhat tricky use case, and I'm looking for ideas.
> >> >>>>>>>>>
> >> >>>>>>>>> I have 5-6 Kafka producers, reading various APIs, and writing
> >> >>>>>>>>> their
> >> >>>>>>>>> raw data into Kafka.
> >> >>>>>>>>>
> >> >>>>>>>>> I need to:
> >> >>>>>>>>>
> >> >>>>>>>>> - Do ETL on the data, and standardize it.
> >> >>>>>>>>>
> >> >>>>>>>>> - Store the standardized data somewhere (HBase / Cassandra /
> Raw
> >> >>>>>>>>> HDFS / ElasticSearch / Postgres)
> >> >>>>>>>>>
> >> >>>>>>>>> - Query this data to generate reports / analytics (There will
> be
> >> >>>>>>>>> a
> >> >>>>>>>>> web UI which will be the front-end to the data, and will show
> >> >>>>>>>>> the reports)
> >> >>>>>>>>>
> >> >>>>>>>>> Java is being used as the backend language for everything
> >> >>>>>>>>> (backend
> >> >>>>>>>>> of the web UI, as well as the ETL layer)
> >> >>>>>>>>>
> >> >>>>>>>>> I'm considering:
> >> >>>>>>>>>
> >> >>>>>>>>> - Using raw Kafka consumers, or Spark Streaming, as the ETL
> >> >>>>>>>>> layer
> >> >>>>>>>>> (receive raw data from Kafka, standardize & store it)
> >> >>>>>>>>>
> >> >>>>>>>>> - Using Cassandra, HBase, or raw HDFS, for storing the
> >> >>>>>>>>> standardized
> >> >>>>>>>>> data, and to allow queries
> >> >>>>>>>>>
> >> >>>>>>>>> - In the backend of the web UI, I could either use Spark to
> run
> >> >>>>>>>>> queries across the data (mostly filters), or directly run
> >> >>>>>>>>> queries against
> >> >>>>>>>>> Cassandra / HBase
> >> >>>>>>>>>
> >> >>>>>>>>> I'd appreciate some thoughts / suggestions on which of these
> >> >>>>>>>>> alternatives I should go with (e.g, using raw Kafka consumers
> vs
> >> >>>>>>>>> Spark for
> >> >>>>>>>>> ETL, which persistent data store to use, and how to query that
> >> >>>>>>>>> data store in
> >> >>>>>>>>> the backend of the web UI, for displaying the reports).
> >> >>>>>>>>>
> >> >>>>>>>>>
> >> >>>>>>>>> Thanks.
> >> >>>>>>>
> >> >>>>>>>
> >> >>>>>>
> >> >>>>>
> >> >>>>
> >> >>>
> >> >>
> >> >>
> >> >>
> >> >> --
> >> >> Thanks
> >> >> Deepak
> >> >> www.bigdatabig.com
> >> >> www.keosha.net
> >> >
> >> >
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
> >>
> >
> >
> >
> > --
> > Thanks
> > Deepak
> > www.bigdatabig.com
> > www.keosha.net
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>
>

Re: Architecture recommendations for a tricky use case

Posted by Cody Koeninger <co...@koeninger.org>.
I wouldn't give up the flexibility and maturity of a relational
database unless you have a very specific use case. I'm not trashing
Cassandra (I've used it), but if all I know is that you're doing
analytics, I wouldn't want to give up the ability to easily run ad-hoc
aggregations without a lot of forethought. If you're worried about
scaling, there are several options for horizontally scaling Postgres
in particular; one of the best I've worked with is Citus.
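
To make the ad-hoc point concrete: a report like "events per source per
day" is a single SQL statement against Postgres, with no up-front table
modeling. A sketch; the connection URL and the events(source,
created_at) schema are assumptions:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class AdHocReport {
    public static void main(String[] args) throws SQLException {
        try (Connection c = DriverManager.getConnection(
                     "jdbc:postgresql://dbhost/etl");  // assumption
             Statement s = c.createStatement();
             ResultSet rs = s.executeQuery(
                     "select source, date_trunc('day', created_at) as day, count(*) "
                     + "from events "
                     + "group by source, date_trunc('day', created_at) "
                     + "order by day")) {
            while (rs.next()) {
                // One row per source per day; no pre-built rollup table needed.
                System.out.printf("%s  %s  %d%n",
                        rs.getString("source"), rs.getString("day"), rs.getLong(3));
            }
        }
    }
}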

On Thu, Sep 29, 2016 at 10:15 AM, Deepak Sharma <de...@gmail.com> wrote:
> Hi Cody
> Spark direct stream is just fine for this use case.
> But why postgres and not cassandra?
> Is there anything specific here that i may not be aware?
>
> Thanks
> Deepak
>
> On Thu, Sep 29, 2016 at 8:41 PM, Cody Koeninger <co...@koeninger.org> wrote:
>>
>> How are you going to handle etl failures?  Do you care about lost /
>> duplicated data?  Are your writes idempotent?
>>
>> Absent any other information about the problem, I'd stay away from
>> cassandra/flume/hdfs/hbase/whatever, and use a spark direct stream
>> feeding postgres.
>>
>> On Thu, Sep 29, 2016 at 10:04 AM, Ali Akhtar <al...@gmail.com> wrote:
>> > Is there an advantage to that vs directly consuming from Kafka? Nothing
>> > is
>> > being done to the data except some light ETL and then storing it in
>> > Cassandra
>> >
>> > On Thu, Sep 29, 2016 at 7:58 PM, Deepak Sharma <de...@gmail.com>
>> > wrote:
>> >>
>> >> Its better you use spark's direct stream to ingest from kafka.
>> >>
>> >> On Thu, Sep 29, 2016 at 8:24 PM, Ali Akhtar <al...@gmail.com>
>> >> wrote:
>> >>>
>> >>> I don't think I need a different speed storage and batch storage. Just
>> >>> taking in raw data from Kafka, standardizing, and storing it somewhere
>> >>> where
>> >>> the web UI can query it, seems like it will be enough.
>> >>>
>> >>> I'm thinking about:
>> >>>
>> >>> - Reading data from Kafka via Spark Streaming
>> >>> - Standardizing, then storing it in Cassandra
>> >>> - Querying Cassandra from the web ui
>> >>>
>> >>> That seems like it will work. My question now is whether to use Spark
>> >>> Streaming to read Kafka, or use Kafka consumers directly.
>> >>>
>> >>>
>> >>> On Thu, Sep 29, 2016 at 7:41 PM, Mich Talebzadeh
>> >>> <mi...@gmail.com> wrote:
>> >>>>
>> >>>> - Spark Streaming to read data from Kafka
>> >>>> - Storing the data on HDFS using Flume
>> >>>>
>> >>>> You don't need Spark streaming to read data from Kafka and store on
>> >>>> HDFS. It is a waste of resources.
>> >>>>
>> >>>> Couple Flume to use Kafka as source and HDFS as sink directly
>> >>>>
>> >>>> KafkaAgent.sources = kafka-sources
>> >>>> KafkaAgent.sinks.hdfs-sinks.type = hdfs
>> >>>>
>> >>>> That will be for your batch layer. To analyse you can directly read
>> >>>> from
>> >>>> hdfs files with Spark or simply store data in a database of your
>> >>>> choice via
>> >>>> cron or something. Do not mix your batch layer with speed layer.
>> >>>>
>> >>>> Your speed layer will ingest the same data directly from Kafka into
>> >>>> spark streaming and that will be  online or near real time (defined
>> >>>> by your
>> >>>> window).
>> >>>>
>> >>>> Then you have a serving layer to present data from both speed (the
>> >>>> one from SS) and batch layer.
>> >>>>
>> >>>> HTH
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>> Dr Mich Talebzadeh
>> >>>>
>> >>>>
>> >>>>
>> >>>> LinkedIn
>> >>>>
>> >>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> >>>>
>> >>>>
>> >>>>
>> >>>> http://talebzadehmich.wordpress.com
>> >>>>
>> >>>>
>> >>>> Disclaimer: Use it at your own risk. Any and all responsibility for
>> >>>> any
>> >>>> loss, damage or destruction of data or any other property which may
>> >>>> arise
>> >>>> from relying on this email's technical content is explicitly
>> >>>> disclaimed. The
>> >>>> author will in no case be liable for any monetary damages arising
>> >>>> from such
>> >>>> loss, damage or destruction.
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>> On 29 September 2016 at 15:15, Ali Akhtar <al...@gmail.com>
>> >>>> wrote:
>> >>>>>
>> >>>>> The web UI is actually the speed layer, it needs to be able to query
>> >>>>> the data online, and show the results in real-time.
>> >>>>>
>> >>>>> It also needs a custom front-end, so a system like Tableau can't be
>> >>>>> used, it must have a custom backend + front-end.
>> >>>>>
>> >>>>> Thanks for the recommendation of Flume. Do you think this will work:
>> >>>>>
>> >>>>> - Spark Streaming to read data from Kafka
>> >>>>> - Storing the data on HDFS using Flume
>> >>>>> - Using Spark to query the data in the backend of the web UI?
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>> On Thu, Sep 29, 2016 at 7:08 PM, Mich Talebzadeh
>> >>>>> <mi...@gmail.com> wrote:
>> >>>>>>
>> >>>>>> You need a batch layer and a speed layer. Data from Kafka can be
>> >>>>>> stored on HDFS using flume.
>> >>>>>>
>> >>>>>> -  Query this data to generate reports / analytics (There will be a
>> >>>>>> web UI which will be the front-end to the data, and will show the
>> >>>>>> reports)
>> >>>>>>
>> >>>>>> This is basically batch layer and you need something like Tableau
>> >>>>>> or
>> >>>>>> Zeppelin to query data
>> >>>>>>
>> >>>>>> You will also need spark streaming to query data online for speed
>> >>>>>> layer. That data could be stored in some transient fabric like
>> >>>>>> ignite or
>> >>>>>> even druid.
>> >>>>>>
>> >>>>>> HTH
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>> Dr Mich Talebzadeh
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>> LinkedIn
>> >>>>>>
>> >>>>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>> http://talebzadehmich.wordpress.com
>> >>>>>>
>> >>>>>>
>> >>>>>> Disclaimer: Use it at your own risk. Any and all responsibility for
>> >>>>>> any loss, damage or destruction of data or any other property which
>> >>>>>> may
>> >>>>>> arise from relying on this email's technical content is explicitly
>> >>>>>> disclaimed. The author will in no case be liable for any monetary
>> >>>>>> damages
>> >>>>>> arising from such loss, damage or destruction.
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>> On 29 September 2016 at 15:01, Ali Akhtar <al...@gmail.com>
>> >>>>>> wrote:
>> >>>>>>>
>> >>>>>>> It needs to be able to scale to a very large amount of data, yes.
>> >>>>>>>
>> >>>>>>> On Thu, Sep 29, 2016 at 7:00 PM, Deepak Sharma
>> >>>>>>> <de...@gmail.com> wrote:
>> >>>>>>>>
>> >>>>>>>> What is the message inflow ?
>> >>>>>>>> If it's really high , definitely spark will be of great use .
>> >>>>>>>>
>> >>>>>>>> Thanks
>> >>>>>>>> Deepak
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> On Sep 29, 2016 19:24, "Ali Akhtar" <al...@gmail.com> wrote:
>> >>>>>>>>>
>> >>>>>>>>> I have a somewhat tricky use case, and I'm looking for ideas.
>> >>>>>>>>>
>> >>>>>>>>> I have 5-6 Kafka producers, reading various APIs, and writing
>> >>>>>>>>> their
>> >>>>>>>>> raw data into Kafka.
>> >>>>>>>>>
>> >>>>>>>>> I need to:
>> >>>>>>>>>
>> >>>>>>>>> - Do ETL on the data, and standardize it.
>> >>>>>>>>>
>> >>>>>>>>> - Store the standardized data somewhere (HBase / Cassandra / Raw
>> >>>>>>>>> HDFS / ElasticSearch / Postgres)
>> >>>>>>>>>
>> >>>>>>>>> - Query this data to generate reports / analytics (There will be
>> >>>>>>>>> a
>> >>>>>>>>> web UI which will be the front-end to the data, and will show
>> >>>>>>>>> the reports)
>> >>>>>>>>>
>> >>>>>>>>> Java is being used as the backend language for everything
>> >>>>>>>>> (backend
>> >>>>>>>>> of the web UI, as well as the ETL layer)
>> >>>>>>>>>
>> >>>>>>>>> I'm considering:
>> >>>>>>>>>
>> >>>>>>>>> - Using raw Kafka consumers, or Spark Streaming, as the ETL
>> >>>>>>>>> layer
>> >>>>>>>>> (receive raw data from Kafka, standardize & store it)
>> >>>>>>>>>
>> >>>>>>>>> - Using Cassandra, HBase, or raw HDFS, for storing the
>> >>>>>>>>> standardized
>> >>>>>>>>> data, and to allow queries
>> >>>>>>>>>
>> >>>>>>>>> - In the backend of the web UI, I could either use Spark to run
>> >>>>>>>>> queries across the data (mostly filters), or directly run
>> >>>>>>>>> queries against
>> >>>>>>>>> Cassandra / HBase
>> >>>>>>>>>
>> >>>>>>>>> I'd appreciate some thoughts / suggestions on which of these
>> >>>>>>>>> alternatives I should go with (e.g, using raw Kafka consumers vs
>> >>>>>>>>> Spark for
>> >>>>>>>>> ETL, which persistent data store to use, and how to query that
>> >>>>>>>>> data store in
>> >>>>>>>>> the backend of the web UI, for displaying the reports).
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>> Thanks.
>> >>>>>>>
>> >>>>>>>
>> >>>>>>
>> >>>>>
>> >>>>
>> >>>
>> >>
>> >>
>> >>
>> >> --
>> >> Thanks
>> >> Deepak
>> >> www.bigdatabig.com
>> >> www.keosha.net
>> >
>> >
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>>
>
>
>
> --
> Thanks
> Deepak
> www.bigdatabig.com
> www.keosha.net

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Re: Architecture recommendations for a tricky use case

Posted by Cody Koeninger <co...@koeninger.org>.
I wouldn't give up the flexibility and maturity of a relational
database unless you have a very specific use case.  I'm not trashing
Cassandra; I've used Cassandra.  But if all I know is that you're doing
analytics, I wouldn't want to give up the ability to easily run ad-hoc
aggregations without a lot of forethought.  If you're worried about
scaling, there are several options for horizontally scaling Postgres
in particular; one of the best I've worked with is Citus.

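As a concrete illustration of the ad-hoc point: against Postgres, a
brand-new report is just SQL over JDBC, with no need to have modeled the
table for that query in advance.  A minimal sketch, assuming a
hypothetical events table with source_api and created_at columns:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class AdHocReport {
        public static void main(String[] args) throws Exception {
            // An arbitrary group-by, decided at query time; a Cassandra
            // table would have to be modeled for this query up front.
            try (Connection c = DriverManager.getConnection(
                     "jdbc:postgresql://localhost/etl", "etl", "secret");
                 Statement s = c.createStatement();
                 ResultSet rs = s.executeQuery(
                     "SELECT source_api, date_trunc('day', created_at) AS day, "
                     + "count(*) FROM events GROUP BY 1, 2 ORDER BY 2")) {
                while (rs.next()) {
                    System.out.printf("%s %s %d%n",
                        rs.getString(1), rs.getString(2), rs.getLong(3));
                }
            }
        }
    }

With Citus the same SQL should run unchanged once the table has been
distributed across workers; the application code does not need to know
about the sharding.
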
On Thu, Sep 29, 2016 at 10:15 AM, Deepak Sharma <de...@gmail.com> wrote:
> Hi Cody
> Spark direct stream is just fine for this use case.
> But why Postgres and not Cassandra?
> Is there anything specific here that I may not be aware of?
>
> Thanks
> Deepak
>
> On Thu, Sep 29, 2016 at 8:41 PM, Cody Koeninger <co...@koeninger.org> wrote:
>>
>> How are you going to handle etl failures?  Do you care about lost /
>> duplicated data?  Are your writes idempotent?
>>
>> Absent any other information about the problem, I'd stay away from
>> cassandra/flume/hdfs/hbase/whatever, and use a spark direct stream
>> feeding postgres.
>>
>> On Thu, Sep 29, 2016 at 10:04 AM, Ali Akhtar <al...@gmail.com> wrote:
>> > Is there an advantage to that vs directly consuming from Kafka? Nothing
>> > is
>> > being done to the data except some light ETL and then storing it in
>> > Cassandra
>> >
>> > On Thu, Sep 29, 2016 at 7:58 PM, Deepak Sharma <de...@gmail.com>
>> > wrote:
>> >>
>> >> It's better to use Spark's direct stream to ingest from Kafka.
>> >>
>> >> On Thu, Sep 29, 2016 at 8:24 PM, Ali Akhtar <al...@gmail.com>
>> >> wrote:
>> >>>
>> >>> I don't think I need a different speed storage and batch storage. Just
>> >>> taking in raw data from Kafka, standardizing, and storing it somewhere
>> >>> where
>> >>> the web UI can query it, seems like it will be enough.
>> >>>
>> >>> I'm thinking about:
>> >>>
>> >>> - Reading data from Kafka via Spark Streaming
>> >>> - Standardizing, then storing it in Cassandra
>> >>> - Querying Cassandra from the web ui
>> >>>
>> >>> That seems like it will work. My question now is whether to use Spark
>> >>> Streaming to read Kafka, or use Kafka consumers directly.
>> >>>
>> >>>
>> >>> On Thu, Sep 29, 2016 at 7:41 PM, Mich Talebzadeh
>> >>> <mi...@gmail.com> wrote:
>> >>>>
>> >>>> - Spark Streaming to read data from Kafka
>> >>>> - Storing the data on HDFS using Flume
>> >>>>
>> >>>> You don't need Spark streaming to read data from Kafka and store on
>> >>>> HDFS. It is a waste of resources.
>> >>>>
>> >>>> Couple Flume to use Kafka as source and HDFS as sink directly
>> >>>>
>> >>>> KafkaAgent.sources = kafka-sources
>> >>>> KafkaAgent.sinks.hdfs-sinks.type = hdfs
>> >>>>
>> >>>> That will be for your batch layer. To analyse you can directly read
>> >>>> from
>> >>>> hdfs files with Spark or simply store data in a database of your
>> >>>> choice via
>> >>>> cron or something. Do not mix your batch layer with speed layer.
>> >>>>
>> >>>> Your speed layer will ingest the same data directly from Kafka into
>> >>>> spark streaming and that will be  online or near real time (defined
>> >>>> by your
>> >>>> window).
>> >>>>
>> >>>> Then you have a serving layer to present data from both speed (the
>> >>>> one from SS) and batch layer.
>> >>>>
>> >>>> HTH
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>> Dr Mich Talebzadeh
>> >>>>
>> >>>>
>> >>>>
>> >>>> LinkedIn
>> >>>>
>> >>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> >>>>
>> >>>>
>> >>>>
>> >>>> http://talebzadehmich.wordpress.com
>> >>>>
>> >>>>
>> >>>> Disclaimer: Use it at your own risk. Any and all responsibility for
>> >>>> any
>> >>>> loss, damage or destruction of data or any other property which may
>> >>>> arise
>> >>>> from relying on this email's technical content is explicitly
>> >>>> disclaimed. The
>> >>>> author will in no case be liable for any monetary damages arising
>> >>>> from such
>> >>>> loss, damage or destruction.
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>> On 29 September 2016 at 15:15, Ali Akhtar <al...@gmail.com>
>> >>>> wrote:
>> >>>>>
>> >>>>> The web UI is actually the speed layer, it needs to be able to query
>> >>>>> the data online, and show the results in real-time.
>> >>>>>
>> >>>>> It also needs a custom front-end, so a system like Tableau can't be
>> >>>>> used, it must have a custom backend + front-end.
>> >>>>>
>> >>>>> Thanks for the recommendation of Flume. Do you think this will work:
>> >>>>>
>> >>>>> - Spark Streaming to read data from Kafka
>> >>>>> - Storing the data on HDFS using Flume
>> >>>>> - Using Spark to query the data in the backend of the web UI?
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>> On Thu, Sep 29, 2016 at 7:08 PM, Mich Talebzadeh
>> >>>>> <mi...@gmail.com> wrote:
>> >>>>>>
>> >>>>>> You need a batch layer and a speed layer. Data from Kafka can be
>> >>>>>> stored on HDFS using flume.
>> >>>>>>
>> >>>>>> -  Query this data to generate reports / analytics (There will be a
>> >>>>>> web UI which will be the front-end to the data, and will show the
>> >>>>>> reports)
>> >>>>>>
>> >>>>>> This is basically batch layer and you need something like Tableau
>> >>>>>> or
>> >>>>>> Zeppelin to query data
>> >>>>>>
>> >>>>>> You will also need spark streaming to query data online for speed
>> >>>>>> layer. That data could be stored in some transient fabric like
>> >>>>>> ignite or
>> >>>>>> even druid.
>> >>>>>>
>> >>>>>> HTH
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>> Dr Mich Talebzadeh
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>> LinkedIn
>> >>>>>>
>> >>>>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>> http://talebzadehmich.wordpress.com
>> >>>>>>
>> >>>>>>
>> >>>>>> Disclaimer: Use it at your own risk. Any and all responsibility for
>> >>>>>> any loss, damage or destruction of data or any other property which
>> >>>>>> may
>> >>>>>> arise from relying on this email's technical content is explicitly
>> >>>>>> disclaimed. The author will in no case be liable for any monetary
>> >>>>>> damages
>> >>>>>> arising from such loss, damage or destruction.
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>> On 29 September 2016 at 15:01, Ali Akhtar <al...@gmail.com>
>> >>>>>> wrote:
>> >>>>>>>
>> >>>>>>> It needs to be able to scale to a very large amount of data, yes.
>> >>>>>>>
>> >>>>>>> On Thu, Sep 29, 2016 at 7:00 PM, Deepak Sharma
>> >>>>>>> <de...@gmail.com> wrote:
>> >>>>>>>>
>> >>>>>>>> What is the message inflow ?
>> >>>>>>>> If it's really high , definitely spark will be of great use .
>> >>>>>>>>
>> >>>>>>>> Thanks
>> >>>>>>>> Deepak
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> On Sep 29, 2016 19:24, "Ali Akhtar" <al...@gmail.com> wrote:
>> >>>>>>>>>
>> >>>>>>>>> I have a somewhat tricky use case, and I'm looking for ideas.
>> >>>>>>>>>
>> >>>>>>>>> I have 5-6 Kafka producers, reading various APIs, and writing
>> >>>>>>>>> their
>> >>>>>>>>> raw data into Kafka.
>> >>>>>>>>>
>> >>>>>>>>> I need to:
>> >>>>>>>>>
>> >>>>>>>>> - Do ETL on the data, and standardize it.
>> >>>>>>>>>
>> >>>>>>>>> - Store the standardized data somewhere (HBase / Cassandra / Raw
>> >>>>>>>>> HDFS / ElasticSearch / Postgres)
>> >>>>>>>>>
>> >>>>>>>>> - Query this data to generate reports / analytics (There will be
>> >>>>>>>>> a
>> >>>>>>>>> web UI which will be the front-end to the data, and will show
>> >>>>>>>>> the reports)
>> >>>>>>>>>
>> >>>>>>>>> Java is being used as the backend language for everything
>> >>>>>>>>> (backend
>> >>>>>>>>> of the web UI, as well as the ETL layer)
>> >>>>>>>>>
>> >>>>>>>>> I'm considering:
>> >>>>>>>>>
>> >>>>>>>>> - Using raw Kafka consumers, or Spark Streaming, as the ETL
>> >>>>>>>>> layer
>> >>>>>>>>> (receive raw data from Kafka, standardize & store it)
>> >>>>>>>>>
>> >>>>>>>>> - Using Cassandra, HBase, or raw HDFS, for storing the
>> >>>>>>>>> standardized
>> >>>>>>>>> data, and to allow queries
>> >>>>>>>>>
>> >>>>>>>>> - In the backend of the web UI, I could either use Spark to run
>> >>>>>>>>> queries across the data (mostly filters), or directly run
>> >>>>>>>>> queries against
>> >>>>>>>>> Cassandra / HBase
>> >>>>>>>>>
>> >>>>>>>>> I'd appreciate some thoughts / suggestions on which of these
>> >>>>>>>>> alternatives I should go with (e.g, using raw Kafka consumers vs
>> >>>>>>>>> Spark for
>> >>>>>>>>> ETL, which persistent data store to use, and how to query that
>> >>>>>>>>> data store in
>> >>>>>>>>> the backend of the web UI, for displaying the reports).
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>> Thanks.
>> >>>>>>>
>> >>>>>>>
>> >>>>>>
>> >>>>>
>> >>>>
>> >>>
>> >>
>> >>
>> >>
>> >> --
>> >> Thanks
>> >> Deepak
>> >> www.bigdatabig.com
>> >> www.keosha.net
>> >
>> >
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>>
>
>
>
> --
> Thanks
> Deepak
> www.bigdatabig.com
> www.keosha.net

Re: Architecture recommendations for a tricky use case

Posted by Deepak Sharma <de...@gmail.com>.
Hi Cody
Spark direct stream is just fine for this use case.
But why Postgres and not Cassandra?
Is there anything specific here that I may not be aware of?

Thanks
Deepak

On Thu, Sep 29, 2016 at 8:41 PM, Cody Koeninger <co...@koeninger.org> wrote:

> How are you going to handle ETL failures?  Do you care about lost /
> duplicated data?  Are your writes idempotent?
>
> Absent any other information about the problem, I'd stay away from
> Cassandra/Flume/HDFS/HBase/whatever, and use a Spark direct stream
> feeding Postgres.
>
> On Thu, Sep 29, 2016 at 10:04 AM, Ali Akhtar <al...@gmail.com> wrote:
> > Is there an advantage to that vs directly consuming from Kafka? Nothing
> is
> > being done to the data except some light ETL and then storing it in
> > Cassandra
> >
> > On Thu, Sep 29, 2016 at 7:58 PM, Deepak Sharma <de...@gmail.com>
> > wrote:
> >>
> >> It's better to use Spark's direct stream to ingest from Kafka.
> >>
> >> On Thu, Sep 29, 2016 at 8:24 PM, Ali Akhtar <al...@gmail.com>
> wrote:
> >>>
> >>> I don't think I need a different speed storage and batch storage. Just
> >>> taking in raw data from Kafka, standardizing, and storing it somewhere
> where
> >>> the web UI can query it, seems like it will be enough.
> >>>
> >>> I'm thinking about:
> >>>
> >>> - Reading data from Kafka via Spark Streaming
> >>> - Standardizing, then storing it in Cassandra
> >>> - Querying Cassandra from the web ui
> >>>
> >>> That seems like it will work. My question now is whether to use Spark
> >>> Streaming to read Kafka, or use Kafka consumers directly.
> >>>
> >>>
> >>> On Thu, Sep 29, 2016 at 7:41 PM, Mich Talebzadeh
> >>> <mi...@gmail.com> wrote:
> >>>>
> >>>> - Spark Streaming to read data from Kafka
> >>>> - Storing the data on HDFS using Flume
> >>>>
> >>>> You don't need Spark streaming to read data from Kafka and store on
> >>>> HDFS. It is a waste of resources.
> >>>>
> >>>> Couple Flume to use Kafka as source and HDFS as sink directly
> >>>>
> >>>> KafkaAgent.sources = kafka-sources
> >>>> KafkaAgent.sinks.hdfs-sinks.type = hdfs
> >>>>
> >>>> That will be for your batch layer. To analyse you can directly read
> from
> >>>> hdfs files with Spark or simply store data in a database of your
> choice via
> >>>> cron or something. Do not mix your batch layer with speed layer.
> >>>>
> >>>> Your speed layer will ingest the same data directly from Kafka into
> >>>> spark streaming and that will be  online or near real time (defined
> by your
> >>>> window).
> >>>>
> >>>> Then you have a serving layer to present data from both speed (the
> >>>> one from SS) and batch layer.
> >>>>
> >>>> HTH
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> Dr Mich Talebzadeh
> >>>>
> >>>>
> >>>>
> >>>> LinkedIn
> >>>> https://www.linkedin.com/profile/view?id=
> AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> >>>>
> >>>>
> >>>>
> >>>> http://talebzadehmich.wordpress.com
> >>>>
> >>>>
> >>>> Disclaimer: Use it at your own risk. Any and all responsibility for
> any
> >>>> loss, damage or destruction of data or any other property which may
> arise
> >>>> from relying on this email's technical content is explicitly
> disclaimed. The
> >>>> author will in no case be liable for any monetary damages arising
> from such
> >>>> loss, damage or destruction.
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> On 29 September 2016 at 15:15, Ali Akhtar <al...@gmail.com>
> wrote:
> >>>>>
> >>>>> The web UI is actually the speed layer, it needs to be able to query
> >>>>> the data online, and show the results in real-time.
> >>>>>
> >>>>> It also needs a custom front-end, so a system like Tableau can't be
> >>>>> used, it must have a custom backend + front-end.
> >>>>>
> >>>>> Thanks for the recommendation of Flume. Do you think this will work:
> >>>>>
> >>>>> - Spark Streaming to read data from Kafka
> >>>>> - Storing the data on HDFS using Flume
> >>>>> - Using Spark to query the data in the backend of the web UI?
> >>>>>
> >>>>>
> >>>>>
> >>>>> On Thu, Sep 29, 2016 at 7:08 PM, Mich Talebzadeh
> >>>>> <mi...@gmail.com> wrote:
> >>>>>>
> >>>>>> You need a batch layer and a speed layer. Data from Kafka can be
> >>>>>> stored on HDFS using flume.
> >>>>>>
> >>>>>> -  Query this data to generate reports / analytics (There will be a
> >>>>>> web UI which will be the front-end to the data, and will show the
> reports)
> >>>>>>
> >>>>>> This is basically batch layer and you need something like Tableau or
> >>>>>> Zeppelin to query data
> >>>>>>
> >>>>>> You will also need spark streaming to query data online for speed
> >>>>>> layer. That data could be stored in some transient fabric like
> ignite or
> >>>>>> even druid.
> >>>>>>
> >>>>>> HTH
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> Dr Mich Talebzadeh
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> LinkedIn
> >>>>>> https://www.linkedin.com/profile/view?id=
> AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> http://talebzadehmich.wordpress.com
> >>>>>>
> >>>>>>
> >>>>>> Disclaimer: Use it at your own risk. Any and all responsibility for
> >>>>>> any loss, damage or destruction of data or any other property which
> may
> >>>>>> arise from relying on this email's technical content is explicitly
> >>>>>> disclaimed. The author will in no case be liable for any monetary
> damages
> >>>>>> arising from such loss, damage or destruction.
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On 29 September 2016 at 15:01, Ali Akhtar <al...@gmail.com>
> >>>>>> wrote:
> >>>>>>>
> >>>>>>> It needs to be able to scale to a very large amount of data, yes.
> >>>>>>>
> >>>>>>> On Thu, Sep 29, 2016 at 7:00 PM, Deepak Sharma
> >>>>>>> <de...@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>> What is the message inflow ?
> >>>>>>>> If it's really high , definitely spark will be of great use .
> >>>>>>>>
> >>>>>>>> Thanks
> >>>>>>>> Deepak
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Sep 29, 2016 19:24, "Ali Akhtar" <al...@gmail.com> wrote:
> >>>>>>>>>
> >>>>>>>>> I have a somewhat tricky use case, and I'm looking for ideas.
> >>>>>>>>>
> >>>>>>>>> I have 5-6 Kafka producers, reading various APIs, and writing
> their
> >>>>>>>>> raw data into Kafka.
> >>>>>>>>>
> >>>>>>>>> I need to:
> >>>>>>>>>
> >>>>>>>>> - Do ETL on the data, and standardize it.
> >>>>>>>>>
> >>>>>>>>> - Store the standardized data somewhere (HBase / Cassandra / Raw
> >>>>>>>>> HDFS / ElasticSearch / Postgres)
> >>>>>>>>>
> >>>>>>>>> - Query this data to generate reports / analytics (There will be
> a
> >>>>>>>>> web UI which will be the front-end to the data, and will show
> the reports)
> >>>>>>>>>
> >>>>>>>>> Java is being used as the backend language for everything
> (backend
> >>>>>>>>> of the web UI, as well as the ETL layer)
> >>>>>>>>>
> >>>>>>>>> I'm considering:
> >>>>>>>>>
> >>>>>>>>> - Using raw Kafka consumers, or Spark Streaming, as the ETL layer
> >>>>>>>>> (receive raw data from Kafka, standardize & store it)
> >>>>>>>>>
> >>>>>>>>> - Using Cassandra, HBase, or raw HDFS, for storing the
> standardized
> >>>>>>>>> data, and to allow queries
> >>>>>>>>>
> >>>>>>>>> - In the backend of the web UI, I could either use Spark to run
> >>>>>>>>> queries across the data (mostly filters), or directly run
> queries against
> >>>>>>>>> Cassandra / HBase
> >>>>>>>>>
> >>>>>>>>> I'd appreciate some thoughts / suggestions on which of these
> >>>>>>>>> alternatives I should go with (e.g, using raw Kafka consumers vs
> Spark for
> >>>>>>>>> ETL, which persistent data store to use, and how to query that
> data store in
> >>>>>>>>> the backend of the web UI, for displaying the reports).
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Thanks.
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> >>
> >>
> >> --
> >> Thanks
> >> Deepak
> >> www.bigdatabig.com
> >> www.keosha.net
> >
> >
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>
>


-- 
Thanks
Deepak
www.bigdatabig.com
www.keosha.net

Re: Architecture recommendations for a tricky use case

Posted by Cody Koeninger <co...@koeninger.org>.
How are you going to handle ETL failures?  Do you care about lost /
duplicated data?  Are your writes idempotent?

Absent any other information about the problem, I'd stay away from
Cassandra/Flume/HDFS/HBase/whatever, and use a Spark direct stream
feeding Postgres.

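To make that concrete, here is a minimal sketch of a direct stream
feeding Postgres with idempotent writes.  It assumes the
spark-streaming-kafka-0-10 integration and Postgres 9.5+ (for ON
CONFLICT); the topic, table, and column names are placeholders:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.common.serialization.StringDeserializer;
    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaInputDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;
    import org.apache.spark.streaming.kafka010.ConsumerStrategies;
    import org.apache.spark.streaming.kafka010.KafkaUtils;
    import org.apache.spark.streaming.kafka010.LocationStrategies;

    public class EtlToPostgres {
        public static void main(String[] args) throws Exception {
            SparkConf conf = new SparkConf().setAppName("kafka-etl");
            JavaStreamingContext jssc =
                new JavaStreamingContext(conf, Durations.seconds(10));

            Map<String, Object> kafkaParams = new HashMap<>();
            kafkaParams.put("bootstrap.servers", "localhost:9092");
            kafkaParams.put("key.deserializer", StringDeserializer.class);
            kafkaParams.put("value.deserializer", StringDeserializer.class);
            kafkaParams.put("group.id", "etl");
            kafkaParams.put("auto.offset.reset", "earliest");
            kafkaParams.put("enable.auto.commit", false);

            JavaInputDStream<ConsumerRecord<String, String>> stream =
                KafkaUtils.createDirectStream(
                    jssc,
                    LocationStrategies.PreferConsistent(),
                    ConsumerStrategies.<String, String>Subscribe(
                        Collections.singletonList("raw-events"), kafkaParams));

            stream.foreachRDD(rdd -> rdd.foreachPartition(records -> {
                // One connection per partition; the upsert keyed on the
                // record key makes a replayed batch harmless (idempotent).
                try (Connection c = DriverManager.getConnection(
                         "jdbc:postgresql://localhost/etl", "etl", "secret")) {
                    PreparedStatement ps = c.prepareStatement(
                        "INSERT INTO events (id, payload) VALUES (?, ?) "
                        + "ON CONFLICT (id) DO UPDATE SET payload = EXCLUDED.payload");
                    while (records.hasNext()) {
                        ConsumerRecord<String, String> r = records.next();
                        ps.setString(1, r.key());   // natural key for idempotence
                        ps.setString(2, r.value()); // standardize here as needed
                        ps.executeUpdate();         // batching omitted for brevity
                    }
                }
            }));

            jssc.start();
            jssc.awaitTermination();
        }
    }
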
On Thu, Sep 29, 2016 at 10:04 AM, Ali Akhtar <al...@gmail.com> wrote:
> Is there an advantage to that vs directly consuming from Kafka? Nothing is
> being done to the data except some light ETL and then storing it in
> Cassandra
>
> On Thu, Sep 29, 2016 at 7:58 PM, Deepak Sharma <de...@gmail.com>
> wrote:
>>
>> It's better to use Spark's direct stream to ingest from Kafka.
>>
>> On Thu, Sep 29, 2016 at 8:24 PM, Ali Akhtar <al...@gmail.com> wrote:
>>>
>>> I don't think I need a different speed storage and batch storage. Just
>>> taking in raw data from Kafka, standardizing, and storing it somewhere where
>>> the web UI can query it, seems like it will be enough.
>>>
>>> I'm thinking about:
>>>
>>> - Reading data from Kafka via Spark Streaming
>>> - Standardizing, then storing it in Cassandra
>>> - Querying Cassandra from the web ui
>>>
>>> That seems like it will work. My question now is whether to use Spark
>>> Streaming to read Kafka, or use Kafka consumers directly.
>>>
>>>
>>> On Thu, Sep 29, 2016 at 7:41 PM, Mich Talebzadeh
>>> <mi...@gmail.com> wrote:
>>>>
>>>> - Spark Streaming to read data from Kafka
>>>> - Storing the data on HDFS using Flume
>>>>
>>>> You don't need Spark streaming to read data from Kafka and store on
>>>> HDFS. It is a waste of resources.
>>>>
>>>> Couple Flume to use Kafka as source and HDFS as sink directly
>>>>
>>>> KafkaAgent.sources = kafka-sources
>>>> KafkaAgent.sinks.hdfs-sinks.type = hdfs
>>>>
>>>> That will be for your batch layer. To analyse you can directly read from
>>>> hdfs files with Spark or simply store data in a database of your choice via
>>>> cron or something. Do not mix your batch layer with speed layer.
>>>>
>>>> Your speed layer will ingest the same data directly from Kafka into
>>>> spark streaming and that will be  online or near real time (defined by your
>>>> window).
>>>>
>>>> Then you have a serving layer to present data from both speed (the
>>>> one from SS) and batch layer.
>>>>
>>>> HTH
>>>>
>>>>
>>>>
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>>
>>>>
>>>> LinkedIn
>>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>
>>>>
>>>>
>>>> http://talebzadehmich.wordpress.com
>>>>
>>>>
>>>> Disclaimer: Use it at your own risk. Any and all responsibility for any
>>>> loss, damage or destruction of data or any other property which may arise
>>>> from relying on this email's technical content is explicitly disclaimed. The
>>>> author will in no case be liable for any monetary damages arising from such
>>>> loss, damage or destruction.
>>>>
>>>>
>>>>
>>>>
>>>> On 29 September 2016 at 15:15, Ali Akhtar <al...@gmail.com> wrote:
>>>>>
>>>>> The web UI is actually the speed layer, it needs to be able to query
>>>>> the data online, and show the results in real-time.
>>>>>
>>>>> It also needs a custom front-end, so a system like Tableau can't be
>>>>> used, it must have a custom backend + front-end.
>>>>>
>>>>> Thanks for the recommendation of Flume. Do you think this will work:
>>>>>
>>>>> - Spark Streaming to read data from Kafka
>>>>> - Storing the data on HDFS using Flume
>>>>> - Using Spark to query the data in the backend of the web UI?
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Sep 29, 2016 at 7:08 PM, Mich Talebzadeh
>>>>> <mi...@gmail.com> wrote:
>>>>>>
>>>>>> You need a batch layer and a speed layer. Data from Kafka can be
>>>>>> stored on HDFS using flume.
>>>>>>
>>>>>> -  Query this data to generate reports / analytics (There will be a
>>>>>> web UI which will be the front-end to the data, and will show the reports)
>>>>>>
>>>>>> This is basically batch layer and you need something like Tableau or
>>>>>> Zeppelin to query data
>>>>>>
>>>>>> You will also need spark streaming to query data online for speed
>>>>>> layer. That data could be stored in some transient fabric like ignite or
>>>>>> even druid.
>>>>>>
>>>>>> HTH
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> Dr Mich Talebzadeh
>>>>>>
>>>>>>
>>>>>>
>>>>>> LinkedIn
>>>>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>>
>>>>>>
>>>>>>
>>>>>> http://talebzadehmich.wordpress.com
>>>>>>
>>>>>>
>>>>>> Disclaimer: Use it at your own risk. Any and all responsibility for
>>>>>> any loss, damage or destruction of data or any other property which may
>>>>>> arise from relying on this email's technical content is explicitly
>>>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>>>> arising from such loss, damage or destruction.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 29 September 2016 at 15:01, Ali Akhtar <al...@gmail.com>
>>>>>> wrote:
>>>>>>>
>>>>>>> It needs to be able to scale to a very large amount of data, yes.
>>>>>>>
>>>>>>> On Thu, Sep 29, 2016 at 7:00 PM, Deepak Sharma
>>>>>>> <de...@gmail.com> wrote:
>>>>>>>>
>>>>>>>> What is the message inflow ?
>>>>>>>> If it's really high , definitely spark will be of great use .
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Deepak
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sep 29, 2016 19:24, "Ali Akhtar" <al...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> I have a somewhat tricky use case, and I'm looking for ideas.
>>>>>>>>>
>>>>>>>>> I have 5-6 Kafka producers, reading various APIs, and writing their
>>>>>>>>> raw data into Kafka.
>>>>>>>>>
>>>>>>>>> I need to:
>>>>>>>>>
>>>>>>>>> - Do ETL on the data, and standardize it.
>>>>>>>>>
>>>>>>>>> - Store the standardized data somewhere (HBase / Cassandra / Raw
>>>>>>>>> HDFS / ElasticSearch / Postgres)
>>>>>>>>>
>>>>>>>>> - Query this data to generate reports / analytics (There will be a
>>>>>>>>> web UI which will be the front-end to the data, and will show the reports)
>>>>>>>>>
>>>>>>>>> Java is being used as the backend language for everything (backend
>>>>>>>>> of the web UI, as well as the ETL layer)
>>>>>>>>>
>>>>>>>>> I'm considering:
>>>>>>>>>
>>>>>>>>> - Using raw Kafka consumers, or Spark Streaming, as the ETL layer
>>>>>>>>> (receive raw data from Kafka, standardize & store it)
>>>>>>>>>
>>>>>>>>> - Using Cassandra, HBase, or raw HDFS, for storing the standardized
>>>>>>>>> data, and to allow queries
>>>>>>>>>
>>>>>>>>> - In the backend of the web UI, I could either use Spark to run
>>>>>>>>> queries across the data (mostly filters), or directly run queries against
>>>>>>>>> Cassandra / HBase
>>>>>>>>>
>>>>>>>>> I'd appreciate some thoughts / suggestions on which of these
>>>>>>>>> alternatives I should go with (e.g, using raw Kafka consumers vs Spark for
>>>>>>>>> ETL, which persistent data store to use, and how to query that data store in
>>>>>>>>> the backend of the web UI, for displaying the reports).
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks.
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>>
>>
>> --
>> Thanks
>> Deepak
>> www.bigdatabig.com
>> www.keosha.net
>
>

Re: Architecture recommendations for a tricky use case

Posted by Ali Akhtar <al...@gmail.com>.
Is there an advantage to that vs directly consuming from Kafka? Nothing is
being done to the data except some light ETL and then storing it in
Cassandra.

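For comparison, the raw-consumer route looks roughly like the sketch
below (plain Kafka 0.10 consumer API; topic and group names are
placeholders).  The trade-off is that partition balancing, retries, and
scaling out across workers are then yours to manage, which is largely
what the direct stream buys you:

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class RawEtlConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("group.id", "etl");
            props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("enable.auto.commit", "false");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("raw-events"));
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(100);
                    for (ConsumerRecord<String, String> r : records) {
                        // standardize r.value() and write it to Cassandra here
                    }
                    consumer.commitSync(); // commit only after the batch is written
                }
            }
        }
    }
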
On Thu, Sep 29, 2016 at 7:58 PM, Deepak Sharma <de...@gmail.com>
wrote:

> It's better to use Spark's direct stream to ingest from Kafka.
>
> On Thu, Sep 29, 2016 at 8:24 PM, Ali Akhtar <al...@gmail.com> wrote:
>
>> I don't think I need a different speed storage and batch storage. Just
>> taking in raw data from Kafka, standardizing, and storing it somewhere
>> where the web UI can query it, seems like it will be enough.
>>
>> I'm thinking about:
>>
>> - Reading data from Kafka via Spark Streaming
>> - Standardizing, then storing it in Cassandra
>> - Querying Cassandra from the web ui
>>
>> That seems like it will work. My question now is whether to use Spark
>> Streaming to read Kafka, or use Kafka consumers directly.
>>
>>
>> On Thu, Sep 29, 2016 at 7:41 PM, Mich Talebzadeh <
>> mich.talebzadeh@gmail.com> wrote:
>>
>>> - Spark Streaming to read data from Kafka
>>> - Storing the data on HDFS using Flume
>>>
>>> You don't need Spark streaming to read data from Kafka and store on
>>> HDFS. It is a waste of resources.
>>>
>>> Couple Flume to use Kafka as source and HDFS as sink directly
>>>
>>> KafkaAgent.sources = kafka-sources
>>> KafkaAgent.sinks.hdfs-sinks.type = hdfs
>>>
>>> That will be for your batch layer. To analyse you can directly read from
>>> hdfs files with Spark or simply store data in a database of your choice via
>>> cron or something. Do not mix your batch layer with speed layer.
>>>
>>> Your speed layer will ingest the same data directly from Kafka into
>>> spark streaming and that will be  online or near real time (defined by your
>>> window).
>>>
>>> Then you have a serving layer to present data from both speed (the
>>> one from SS) and batch layer.
>>>
>>> HTH
>>>
>>>
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>> On 29 September 2016 at 15:15, Ali Akhtar <al...@gmail.com> wrote:
>>>
>>>> The web UI is actually the speed layer, it needs to be able to query
>>>> the data online, and show the results in real-time.
>>>>
>>>> It also needs a custom front-end, so a system like Tableau can't be
>>>> used, it must have a custom backend + front-end.
>>>>
>>>> Thanks for the recommendation of Flume. Do you think this will work:
>>>>
>>>> - Spark Streaming to read data from Kafka
>>>> - Storing the data on HDFS using Flume
>>>> - Using Spark to query the data in the backend of the web UI?
>>>>
>>>>
>>>>
>>>> On Thu, Sep 29, 2016 at 7:08 PM, Mich Talebzadeh <
>>>> mich.talebzadeh@gmail.com> wrote:
>>>>
>>>>> You need a batch layer and a speed layer. Data from Kafka can be
>>>>> stored on HDFS using flume.
>>>>>
>>>>> -  Query this data to generate reports / analytics (There will be a
>>>>> web UI which will be the front-end to the data, and will show the reports)
>>>>>
>>>>> This is basically batch layer and you need something like Tableau or
>>>>> Zeppelin to query data
>>>>>
>>>>> You will also need spark streaming to query data online for speed
>>>>> layer. That data could be stored in some transient fabric like ignite or
>>>>> even druid.
>>>>>
>>>>> HTH
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Dr Mich Talebzadeh
>>>>>
>>>>>
>>>>>
>>>>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>>
>>>>>
>>>>>
>>>>> http://talebzadehmich.wordpress.com
>>>>>
>>>>>
>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>>> any loss, damage or destruction of data or any other property which may
>>>>> arise from relying on this email's technical content is explicitly
>>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>>> arising from such loss, damage or destruction.
>>>>>
>>>>>
>>>>>
>>>>> On 29 September 2016 at 15:01, Ali Akhtar <al...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> It needs to be able to scale to a very large amount of data, yes.
>>>>>>
>>>>>> On Thu, Sep 29, 2016 at 7:00 PM, Deepak Sharma <deepakmca05@gmail.com
>>>>>> > wrote:
>>>>>>
>>>>>>> What is the message inflow ?
>>>>>>> If it's really high , definitely spark will be of great use .
>>>>>>>
>>>>>>> Thanks
>>>>>>> Deepak
>>>>>>>
>>>>>>> On Sep 29, 2016 19:24, "Ali Akhtar" <al...@gmail.com> wrote:
>>>>>>>
>>>>>>>> I have a somewhat tricky use case, and I'm looking for ideas.
>>>>>>>>
>>>>>>>> I have 5-6 Kafka producers, reading various APIs, and writing their
>>>>>>>> raw data into Kafka.
>>>>>>>>
>>>>>>>> I need to:
>>>>>>>>
>>>>>>>> - Do ETL on the data, and standardize it.
>>>>>>>>
>>>>>>>> - Store the standardized data somewhere (HBase / Cassandra / Raw
>>>>>>>> HDFS / ElasticSearch / Postgres)
>>>>>>>>
>>>>>>>> - Query this data to generate reports / analytics (There will be a
>>>>>>>> web UI which will be the front-end to the data, and will show the reports)
>>>>>>>>
>>>>>>>> Java is being used as the backend language for everything (backend
>>>>>>>> of the web UI, as well as the ETL layer)
>>>>>>>>
>>>>>>>> I'm considering:
>>>>>>>>
>>>>>>>> - Using raw Kafka consumers, or Spark Streaming, as the ETL layer
>>>>>>>> (receive raw data from Kafka, standardize & store it)
>>>>>>>>
>>>>>>>> - Using Cassandra, HBase, or raw HDFS, for storing the standardized
>>>>>>>> data, and to allow queries
>>>>>>>>
>>>>>>>> - In the backend of the web UI, I could either use Spark to run
>>>>>>>> queries across the data (mostly filters), or directly run queries against
>>>>>>>> Cassandra / HBase
>>>>>>>>
>>>>>>>> I'd appreciate some thoughts / suggestions on which of these
>>>>>>>> alternatives I should go with (e.g, using raw Kafka consumers vs Spark for
>>>>>>>> ETL, which persistent data store to use, and how to query that data store
>>>>>>>> in the backend of the web UI, for displaying the reports).
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks.
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
>
> --
> Thanks
> Deepak
> www.bigdatabig.com
> www.keosha.net
>

Re: Architecture recommendations for a tricky use case

Posted by Deepak Sharma <de...@gmail.com>.
It's better to use Spark's direct stream to ingest from Kafka.

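One thing the direct stream adds over a plain consumer is per-batch
offset ranges, so the commit can be tied to a successful write.  A
sketch of the usual pattern, assuming a stream built with
KafkaUtils.createDirectStream from spark-streaming-kafka-0-10:

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.spark.streaming.api.java.JavaInputDStream;
    import org.apache.spark.streaming.kafka010.CanCommitOffsets;
    import org.apache.spark.streaming.kafka010.HasOffsetRanges;
    import org.apache.spark.streaming.kafka010.OffsetRange;

    public class OffsetAwareEtl {
        // `stream` is assumed to come from KafkaUtils.createDirectStream.
        static void process(JavaInputDStream<ConsumerRecord<String, String>> stream) {
            stream.foreachRDD(rdd -> {
                OffsetRange[] offsets = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
                // ... standardize and persist this batch ...
                // Commit to Kafka only after the write succeeds: at-least-once
                // delivery, safe when the downstream write is idempotent.
                ((CanCommitOffsets) stream.inputDStream()).commitAsync(offsets);
            });
        }
    }
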
On Thu, Sep 29, 2016 at 8:24 PM, Ali Akhtar <al...@gmail.com> wrote:

> I don't think I need a different speed storage and batch storage. Just
> taking in raw data from Kafka, standardizing, and storing it somewhere
> where the web UI can query it, seems like it will be enough.
>
> I'm thinking about:
>
> - Reading data from Kafka via Spark Streaming
> - Standardizing, then storing it in Cassandra
> - Querying Cassandra from the web ui
>
> That seems like it will work. My question now is whether to use Spark
> Streaming to read Kafka, or use Kafka consumers directly.
>
>
> On Thu, Sep 29, 2016 at 7:41 PM, Mich Talebzadeh <
> mich.talebzadeh@gmail.com> wrote:
>
>> - Spark Streaming to read data from Kafka
>> - Storing the data on HDFS using Flume
>>
>> You don't need Spark streaming to read data from Kafka and store on HDFS.
>> It is a waste of resources.
>>
>> Couple Flume to use Kafka as source and HDFS as sink directly
>>
>> KafkaAgent.sources = kafka-sources
>> KafkaAgent.sinks.hdfs-sinks.type = hdfs
>>
>> That will be for your batch layer. To analyse you can directly read from
>> hdfs files with Spark or simply store data in a database of your choice via
>> cron or something. Do not mix your batch layer with speed layer.
>>
>> Your speed layer will ingest the same data directly from Kafka into spark
>> streaming and that will be  online or near real time (defined by your
>> window).
>>
>> Then you have a serving layer to present data from both speed (the one
>> from SS) and batch layer.
>>
>> HTH
>>
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 29 September 2016 at 15:15, Ali Akhtar <al...@gmail.com> wrote:
>>
>>> The web UI is actually the speed layer, it needs to be able to query the
>>> data online, and show the results in real-time.
>>>
>>> It also needs a custom front-end, so a system like Tableau can't be
>>> used, it must have a custom backend + front-end.
>>>
>>> Thanks for the recommendation of Flume. Do you think this will work:
>>>
>>> - Spark Streaming to read data from Kafka
>>> - Storing the data on HDFS using Flume
>>> - Using Spark to query the data in the backend of the web UI?
>>>
>>>
>>>
>>> On Thu, Sep 29, 2016 at 7:08 PM, Mich Talebzadeh <
>>> mich.talebzadeh@gmail.com> wrote:
>>>
>>>> You need a batch layer and a speed layer. Data from Kafka can be stored
>>>> on HDFS using flume.
>>>>
>>>> -  Query this data to generate reports / analytics (There will be a web
>>>> UI which will be the front-end to the data, and will show the reports)
>>>>
>>>> This is basically batch layer and you need something like Tableau or
>>>> Zeppelin to query data
>>>>
>>>> You will also need spark streaming to query data online for speed
>>>> layer. That data could be stored in some transient fabric like ignite or
>>>> even druid.
>>>>
>>>> HTH
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>>
>>>>
>>>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>
>>>>
>>>>
>>>> http://talebzadehmich.wordpress.com
>>>>
>>>>
>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>> any loss, damage or destruction of data or any other property which may
>>>> arise from relying on this email's technical content is explicitly
>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>> arising from such loss, damage or destruction.
>>>>
>>>>
>>>>
>>>> On 29 September 2016 at 15:01, Ali Akhtar <al...@gmail.com> wrote:
>>>>
>>>>> It needs to be able to scale to a very large amount of data, yes.
>>>>>
>>>>> On Thu, Sep 29, 2016 at 7:00 PM, Deepak Sharma <de...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> What is the message inflow ?
>>>>>> If it's really high , definitely spark will be of great use .
>>>>>>
>>>>>> Thanks
>>>>>> Deepak
>>>>>>
>>>>>> On Sep 29, 2016 19:24, "Ali Akhtar" <al...@gmail.com> wrote:
>>>>>>
>>>>>>> I have a somewhat tricky use case, and I'm looking for ideas.
>>>>>>>
>>>>>>> I have 5-6 Kafka producers, reading various APIs, and writing their
>>>>>>> raw data into Kafka.
>>>>>>>
>>>>>>> I need to:
>>>>>>>
>>>>>>> - Do ETL on the data, and standardize it.
>>>>>>>
>>>>>>> - Store the standardized data somewhere (HBase / Cassandra / Raw
>>>>>>> HDFS / ElasticSearch / Postgres)
>>>>>>>
>>>>>>> - Query this data to generate reports / analytics (There will be a
>>>>>>> web UI which will be the front-end to the data, and will show the reports)
>>>>>>>
>>>>>>> Java is being used as the backend language for everything (backend
>>>>>>> of the web UI, as well as the ETL layer)
>>>>>>>
>>>>>>> I'm considering:
>>>>>>>
>>>>>>> - Using raw Kafka consumers, or Spark Streaming, as the ETL layer
>>>>>>> (receive raw data from Kafka, standardize & store it)
>>>>>>>
>>>>>>> - Using Cassandra, HBase, or raw HDFS, for storing the standardized
>>>>>>> data, and to allow queries
>>>>>>>
>>>>>>> - In the backend of the web UI, I could either use Spark to run
>>>>>>> queries across the data (mostly filters), or directly run queries against
>>>>>>> Cassandra / HBase
>>>>>>>
>>>>>>> I'd appreciate some thoughts / suggestions on which of these
>>>>>>> alternatives I should go with (e.g, using raw Kafka consumers vs Spark for
>>>>>>> ETL, which persistent data store to use, and how to query that data store
>>>>>>> in the backend of the web UI, for displaying the reports).
>>>>>>>
>>>>>>>
>>>>>>> Thanks.
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>


-- 
Thanks
Deepak
www.bigdatabig.com
www.keosha.net

Re: Architecture recommendations for a tricky use case

Posted by Ali Akhtar <al...@gmail.com>.
I don't think I need separate speed and batch storage. Just taking in raw
data from Kafka, standardizing it, and storing it somewhere the web UI can
query seems like it will be enough.

I'm thinking about:

- Reading data from Kafka via Spark Streaming
- Standardizing, then storing it in Cassandra
- Querying Cassandra from the web ui

That seems like it will work. My question now is whether to use Spark
Streaming to read Kafka, or use Kafka consumers directly.
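
For the Cassandra step, a rough sketch with the spark-cassandra-connector
Java API (untested; the keyspace, table and Event bean are made up for
illustration):

import static com.datastax.spark.connector.japi.CassandraJavaUtil.mapToRow;
import java.io.Serializable;
import com.datastax.spark.connector.japi.CassandraStreamingJavaUtil;
import org.apache.spark.streaming.api.java.JavaDStream;

public class CassandraSink {

  // Hypothetical standardized record; field names must match the table columns.
  public static class Event implements Serializable {
    private String id;
    private String source;
    private long ts;
    public String getId() { return id; }
    public void setId(String id) { this.id = id; }
    public String getSource() { return source; }
    public void setSource(String source) { this.source = source; }
    public long getTs() { return ts; }
    public void setTs(long ts) { this.ts = ts; }
  }

  public static void save(JavaDStream<Event> standardized) {
    // Writes each micro-batch to reports.events; the web UI then reads the
    // same table with the regular Cassandra Java driver.
    CassandraStreamingJavaUtil.javaFunctions(standardized)
        .writerBuilder("reports", "events", mapToRow(Event.class))
        .saveToCassandra();
  }
}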


On Thu, Sep 29, 2016 at 7:41 PM, Mich Talebzadeh <mi...@gmail.com>
wrote:

> - Spark Streaming to read data from Kafka
> - Storing the data on HDFS using Flume
>
> You don't need Spark streaming to read data from Kafka and store on HDFS.
> It is a waste of resources.
>
> Couple Flume to use Kafka as source and HDFS as sink directly
>
> KafkaAgent.sources = kafka-sources
> KafkaAgent.sinks.hdfs-sinks.type = hdfs
>
> That will be for your batch layer. To analyse you can directly read from
> hdfs files with Spark or simply store data in a database of your choice via
> cron or something. Do not mix your batch layer with speed layer.
>
> Your speed layer will ingest the same data directly from Kafka into spark
> streaming and that will be  online or near real time (defined by your
> window).
>
> Then you have a a serving layer to present data from both speed  (the one
> from SS) and batch layer.
>
> HTH
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 29 September 2016 at 15:15, Ali Akhtar <al...@gmail.com> wrote:
>
>> The web UI is actually the speed layer, it needs to be able to query the
>> data online, and show the results in real-time.
>>
>> It also needs a custom front-end, so a system like Tableau can't be used,
>> it must have a custom backend + front-end.
>>
>> Thanks for the recommendation of Flume. Do you think this will work:
>>
>> - Spark Streaming to read data from Kafka
>> - Storing the data on HDFS using Flume
>> - Using Spark to query the data in the backend of the web UI?
>>
>>
>>
>> On Thu, Sep 29, 2016 at 7:08 PM, Mich Talebzadeh <
>> mich.talebzadeh@gmail.com> wrote:
>>
>>> You need a batch layer and a speed layer. Data from Kafka can be stored
>>> on HDFS using flume.
>>>
>>> -  Query this data to generate reports / analytics (There will be a web
>>> UI which will be the front-end to the data, and will show the reports)
>>>
>>> This is basically batch layer and you need something like Tableau or
>>> Zeppelin to query data
>>>
>>> You will also need spark streaming to query data online for speed layer.
>>> That data could be stored in some transient fabric like ignite or even
>>> druid.
>>>
>>> HTH
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>> On 29 September 2016 at 15:01, Ali Akhtar <al...@gmail.com> wrote:
>>>
>>>> It needs to be able to scale to a very large amount of data, yes.
>>>>
>>>> On Thu, Sep 29, 2016 at 7:00 PM, Deepak Sharma <de...@gmail.com>
>>>> wrote:
>>>>
>>>>> What is the message inflow ?
>>>>> If it's really high , definitely spark will be of great use .
>>>>>
>>>>> Thanks
>>>>> Deepak
>>>>>
>>>>> On Sep 29, 2016 19:24, "Ali Akhtar" <al...@gmail.com> wrote:
>>>>>
>>>>>> I have a somewhat tricky use case, and I'm looking for ideas.
>>>>>>
>>>>>> I have 5-6 Kafka producers, reading various APIs, and writing their
>>>>>> raw data into Kafka.
>>>>>>
>>>>>> I need to:
>>>>>>
>>>>>> - Do ETL on the data, and standardize it.
>>>>>>
>>>>>> - Store the standardized data somewhere (HBase / Cassandra / Raw HDFS
>>>>>> / ElasticSearch / Postgres)
>>>>>>
>>>>>> - Query this data to generate reports / analytics (There will be a
>>>>>> web UI which will be the front-end to the data, and will show the reports)
>>>>>>
>>>>>> Java is being used as the backend language for everything (backend of
>>>>>> the web UI, as well as the ETL layer)
>>>>>>
>>>>>> I'm considering:
>>>>>>
>>>>>> - Using raw Kafka consumers, or Spark Streaming, as the ETL layer
>>>>>> (receive raw data from Kafka, standardize & store it)
>>>>>>
>>>>>> - Using Cassandra, HBase, or raw HDFS, for storing the standardized
>>>>>> data, and to allow queries
>>>>>>
>>>>>> - In the backend of the web UI, I could either use Spark to run
>>>>>> queries across the data (mostly filters), or directly run queries against
>>>>>> Cassandra / HBase
>>>>>>
>>>>>> I'd appreciate some thoughts / suggestions on which of these
>>>>>> alternatives I should go with (e.g, using raw Kafka consumers vs Spark for
>>>>>> ETL, which persistent data store to use, and how to query that data store
>>>>>> in the backend of the web UI, for displaying the reports).
>>>>>>
>>>>>>
>>>>>> Thanks.
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Architecture recommendations for a tricky use case

Posted by Deepak Sharma <de...@gmail.com>.
Since the inflow is huge, Flume would also need to run with multiple
channels in a distributed fashion, so resource utilization will be high
there as well.

Thanks
Deepak

On Thu, Sep 29, 2016 at 8:11 PM, Mich Talebzadeh <mi...@gmail.com>
wrote:

> - Spark Streaming to read data from Kafka
> - Storing the data on HDFS using Flume
>
> You don't need Spark streaming to read data from Kafka and store on HDFS.
> It is a waste of resources.
>
> Couple Flume to use Kafka as source and HDFS as sink directly
>
> KafkaAgent.sources = kafka-sources
> KafkaAgent.sinks.hdfs-sinks.type = hdfs
>
> That will be for your batch layer. To analyse you can directly read from
> hdfs files with Spark or simply store data in a database of your choice via
> cron or something. Do not mix your batch layer with speed layer.
>
> Your speed layer will ingest the same data directly from Kafka into spark
> streaming and that will be  online or near real time (defined by your
> window).
>
> Then you have a a serving layer to present data from both speed  (the one
> from SS) and batch layer.
>
> HTH
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 29 September 2016 at 15:15, Ali Akhtar <al...@gmail.com> wrote:
>
>> The web UI is actually the speed layer, it needs to be able to query the
>> data online, and show the results in real-time.
>>
>> It also needs a custom front-end, so a system like Tableau can't be used,
>> it must have a custom backend + front-end.
>>
>> Thanks for the recommendation of Flume. Do you think this will work:
>>
>> - Spark Streaming to read data from Kafka
>> - Storing the data on HDFS using Flume
>> - Using Spark to query the data in the backend of the web UI?
>>
>>
>>
>> On Thu, Sep 29, 2016 at 7:08 PM, Mich Talebzadeh <
>> mich.talebzadeh@gmail.com> wrote:
>>
>>> You need a batch layer and a speed layer. Data from Kafka can be stored
>>> on HDFS using flume.
>>>
>>> -  Query this data to generate reports / analytics (There will be a web
>>> UI which will be the front-end to the data, and will show the reports)
>>>
>>> This is basically batch layer and you need something like Tableau or
>>> Zeppelin to query data
>>>
>>> You will also need spark streaming to query data online for speed layer.
>>> That data could be stored in some transient fabric like ignite or even
>>> druid.
>>>
>>> HTH
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>> On 29 September 2016 at 15:01, Ali Akhtar <al...@gmail.com> wrote:
>>>
>>>> It needs to be able to scale to a very large amount of data, yes.
>>>>
>>>> On Thu, Sep 29, 2016 at 7:00 PM, Deepak Sharma <de...@gmail.com>
>>>> wrote:
>>>>
>>>>> What is the message inflow ?
>>>>> If it's really high , definitely spark will be of great use .
>>>>>
>>>>> Thanks
>>>>> Deepak
>>>>>
>>>>> On Sep 29, 2016 19:24, "Ali Akhtar" <al...@gmail.com> wrote:
>>>>>
>>>>>> I have a somewhat tricky use case, and I'm looking for ideas.
>>>>>>
>>>>>> I have 5-6 Kafka producers, reading various APIs, and writing their
>>>>>> raw data into Kafka.
>>>>>>
>>>>>> I need to:
>>>>>>
>>>>>> - Do ETL on the data, and standardize it.
>>>>>>
>>>>>> - Store the standardized data somewhere (HBase / Cassandra / Raw HDFS
>>>>>> / ElasticSearch / Postgres)
>>>>>>
>>>>>> - Query this data to generate reports / analytics (There will be a
>>>>>> web UI which will be the front-end to the data, and will show the reports)
>>>>>>
>>>>>> Java is being used as the backend language for everything (backend of
>>>>>> the web UI, as well as the ETL layer)
>>>>>>
>>>>>> I'm considering:
>>>>>>
>>>>>> - Using raw Kafka consumers, or Spark Streaming, as the ETL layer
>>>>>> (receive raw data from Kafka, standardize & store it)
>>>>>>
>>>>>> - Using Cassandra, HBase, or raw HDFS, for storing the standardized
>>>>>> data, and to allow queries
>>>>>>
>>>>>> - In the backend of the web UI, I could either use Spark to run
>>>>>> queries across the data (mostly filters), or directly run queries against
>>>>>> Cassandra / HBase
>>>>>>
>>>>>> I'd appreciate some thoughts / suggestions on which of these
>>>>>> alternatives I should go with (e.g, using raw Kafka consumers vs Spark for
>>>>>> ETL, which persistent data store to use, and how to query that data store
>>>>>> in the backend of the web UI, for displaying the reports).
>>>>>>
>>>>>>
>>>>>> Thanks.
>>>>>>
>>>>>
>>>>
>>>
>>
>


-- 
Thanks
Deepak
www.bigdatabig.com
www.keosha.net

Re: Architecture recommendations for a tricky use case

Posted by Mich Talebzadeh <mi...@gmail.com>.
- Spark Streaming to read data from Kafka
- Storing the data on HDFS using Flume

You don't need Spark Streaming to read data from Kafka and store it on
HDFS. That is a waste of resources.

Couple Flume directly to Kafka as the source and HDFS as the sink:

KafkaAgent.sources = kafka-source
KafkaAgent.sinks = hdfs-sink
KafkaAgent.sinks.hdfs-sink.type = hdfs
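
Filled out, the complete agent definition might look roughly like this
(assuming Flume 1.7's Kafka source; the broker, topic and HDFS path are
placeholders):

KafkaAgent.sources = kafka-source
KafkaAgent.channels = mem-channel
KafkaAgent.sinks = hdfs-sink

KafkaAgent.sources.kafka-source.type = org.apache.flume.source.kafka.KafkaSource
KafkaAgent.sources.kafka-source.kafka.bootstrap.servers = broker1:9092
KafkaAgent.sources.kafka-source.kafka.topics = raw-events
KafkaAgent.sources.kafka-source.channels = mem-channel

KafkaAgent.channels.mem-channel.type = memory

KafkaAgent.sinks.hdfs-sink.type = hdfs
KafkaAgent.sinks.hdfs-sink.hdfs.path = hdfs:///data/raw/%Y-%m-%d
KafkaAgent.sinks.hdfs-sink.hdfs.fileType = DataStream
KafkaAgent.sinks.hdfs-sink.hdfs.useLocalTimeStamp = true
KafkaAgent.sinks.hdfs-sink.channel = mem-channel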

That will be for your batch layer. To analyse it, you can read the HDFS
files directly with Spark, or simply load the data into a database of your
choice via cron or similar. Do not mix your batch layer with your speed layer.

Your speed layer will ingest the same data directly from Kafka into Spark
Streaming, and that will be online or near real time (defined by your
window).

Then you have a serving layer to present data from both the speed layer
(the one fed by Spark Streaming) and the batch layer.

HTH




Dr Mich Talebzadeh



LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 29 September 2016 at 15:15, Ali Akhtar <al...@gmail.com> wrote:

> The web UI is actually the speed layer, it needs to be able to query the
> data online, and show the results in real-time.
>
> It also needs a custom front-end, so a system like Tableau can't be used,
> it must have a custom backend + front-end.
>
> Thanks for the recommendation of Flume. Do you think this will work:
>
> - Spark Streaming to read data from Kafka
> - Storing the data on HDFS using Flume
> - Using Spark to query the data in the backend of the web UI?
>
>
>
> On Thu, Sep 29, 2016 at 7:08 PM, Mich Talebzadeh <
> mich.talebzadeh@gmail.com> wrote:
>
>> You need a batch layer and a speed layer. Data from Kafka can be stored
>> on HDFS using flume.
>>
>> -  Query this data to generate reports / analytics (There will be a web
>> UI which will be the front-end to the data, and will show the reports)
>>
>> This is basically batch layer and you need something like Tableau or
>> Zeppelin to query data
>>
>> You will also need spark streaming to query data online for speed layer.
>> That data could be stored in some transient fabric like ignite or even
>> druid.
>>
>> HTH
>>
>>
>>
>>
>>
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 29 September 2016 at 15:01, Ali Akhtar <al...@gmail.com> wrote:
>>
>>> It needs to be able to scale to a very large amount of data, yes.
>>>
>>> On Thu, Sep 29, 2016 at 7:00 PM, Deepak Sharma <de...@gmail.com>
>>> wrote:
>>>
>>>> What is the message inflow ?
>>>> If it's really high , definitely spark will be of great use .
>>>>
>>>> Thanks
>>>> Deepak
>>>>
>>>> On Sep 29, 2016 19:24, "Ali Akhtar" <al...@gmail.com> wrote:
>>>>
>>>>> I have a somewhat tricky use case, and I'm looking for ideas.
>>>>>
>>>>> I have 5-6 Kafka producers, reading various APIs, and writing their
>>>>> raw data into Kafka.
>>>>>
>>>>> I need to:
>>>>>
>>>>> - Do ETL on the data, and standardize it.
>>>>>
>>>>> - Store the standardized data somewhere (HBase / Cassandra / Raw HDFS
>>>>> / ElasticSearch / Postgres)
>>>>>
>>>>> - Query this data to generate reports / analytics (There will be a web
>>>>> UI which will be the front-end to the data, and will show the reports)
>>>>>
>>>>> Java is being used as the backend language for everything (backend of
>>>>> the web UI, as well as the ETL layer)
>>>>>
>>>>> I'm considering:
>>>>>
>>>>> - Using raw Kafka consumers, or Spark Streaming, as the ETL layer
>>>>> (receive raw data from Kafka, standardize & store it)
>>>>>
>>>>> - Using Cassandra, HBase, or raw HDFS, for storing the standardized
>>>>> data, and to allow queries
>>>>>
>>>>> - In the backend of the web UI, I could either use Spark to run
>>>>> queries across the data (mostly filters), or directly run queries against
>>>>> Cassandra / HBase
>>>>>
>>>>> I'd appreciate some thoughts / suggestions on which of these
>>>>> alternatives I should go with (e.g, using raw Kafka consumers vs Spark for
>>>>> ETL, which persistent data store to use, and how to query that data store
>>>>> in the backend of the web UI, for displaying the reports).
>>>>>
>>>>>
>>>>> Thanks.
>>>>>
>>>>
>>>
>>
>

Re: Architecture recommendations for a tricky use case

Posted by Ali Akhtar <al...@gmail.com>.
The web UI is actually the speed layer; it needs to be able to query the
data online and show the results in real time.

It also needs a custom front-end, so a system like Tableau can't be used;
it must have a custom backend + front-end.

Thanks for the recommendation of Flume. Do you think this will work:

- Spark Streaming to read data from Kafka
- Storing the data on HDFS using Flume
- Using Spark to query the data in the backend of the web UI?
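
For the last step, the backend query could be as simple as this (a sketch,
assuming Spark 2.x and Parquet output under a made-up HDFS path):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ReportQuery {
  public static void main(String[] args) {
    // One long-lived session in the web UI's backend, not one per request.
    SparkSession spark = SparkSession.builder()
        .appName("web-ui-reports")
        .getOrCreate();

    // Standardized data written by the ETL layer; the path is a placeholder.
    Dataset<Row> events = spark.read().parquet("hdfs:///data/standardized/events");

    // The reports are mostly filters, e.g. everything from one source API.
    Dataset<Row> bySource = events.filter("source = 'api-1'");
    bySource.show();
  }
}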



On Thu, Sep 29, 2016 at 7:08 PM, Mich Talebzadeh <mi...@gmail.com>
wrote:

> You need a batch layer and a speed layer. Data from Kafka can be stored on
> HDFS using flume.
>
> -  Query this data to generate reports / analytics (There will be a web UI
> which will be the front-end to the data, and will show the reports)
>
> This is basically batch layer and you need something like Tableau or
> Zeppelin to query data
>
> You will also need spark streaming to query data online for speed layer.
> That data could be stored in some transient fabric like ignite or even
> druid.
>
> HTH
>
>
>
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 29 September 2016 at 15:01, Ali Akhtar <al...@gmail.com> wrote:
>
>> It needs to be able to scale to a very large amount of data, yes.
>>
>> On Thu, Sep 29, 2016 at 7:00 PM, Deepak Sharma <de...@gmail.com>
>> wrote:
>>
>>> What is the message inflow ?
>>> If it's really high , definitely spark will be of great use .
>>>
>>> Thanks
>>> Deepak
>>>
>>> On Sep 29, 2016 19:24, "Ali Akhtar" <al...@gmail.com> wrote:
>>>
>>>> I have a somewhat tricky use case, and I'm looking for ideas.
>>>>
>>>> I have 5-6 Kafka producers, reading various APIs, and writing their raw
>>>> data into Kafka.
>>>>
>>>> I need to:
>>>>
>>>> - Do ETL on the data, and standardize it.
>>>>
>>>> - Store the standardized data somewhere (HBase / Cassandra / Raw HDFS /
>>>> ElasticSearch / Postgres)
>>>>
>>>> - Query this data to generate reports / analytics (There will be a web
>>>> UI which will be the front-end to the data, and will show the reports)
>>>>
>>>> Java is being used as the backend language for everything (backend of
>>>> the web UI, as well as the ETL layer)
>>>>
>>>> I'm considering:
>>>>
>>>> - Using raw Kafka consumers, or Spark Streaming, as the ETL layer
>>>> (receive raw data from Kafka, standardize & store it)
>>>>
>>>> - Using Cassandra, HBase, or raw HDFS, for storing the standardized
>>>> data, and to allow queries
>>>>
>>>> - In the backend of the web UI, I could either use Spark to run queries
>>>> across the data (mostly filters), or directly run queries against Cassandra
>>>> / HBase
>>>>
>>>> I'd appreciate some thoughts / suggestions on which of these
>>>> alternatives I should go with (e.g, using raw Kafka consumers vs Spark for
>>>> ETL, which persistent data store to use, and how to query that data store
>>>> in the backend of the web UI, for displaying the reports).
>>>>
>>>>
>>>> Thanks.
>>>>
>>>
>>
>

Re: Architecture recommendations for a tricky use case

Posted by Mich Talebzadeh <mi...@gmail.com>.
You need a batch layer and a speed layer. Data from Kafka can be stored on
HDFS using Flume.

-  Query this data to generate reports / analytics (There will be a web UI
which will be the front-end to the data, and will show the reports)

This is basically the batch layer, and you need something like Tableau or
Zeppelin to query the data.

You will also need Spark Streaming to query data online for the speed
layer. That data could be stored in some transient fabric like Ignite or
even Druid.

HTH








Dr Mich Talebzadeh



LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 29 September 2016 at 15:01, Ali Akhtar <al...@gmail.com> wrote:

> It needs to be able to scale to a very large amount of data, yes.
>
> On Thu, Sep 29, 2016 at 7:00 PM, Deepak Sharma <de...@gmail.com>
> wrote:
>
>> What is the message inflow ?
>> If it's really high , definitely spark will be of great use .
>>
>> Thanks
>> Deepak
>>
>> On Sep 29, 2016 19:24, "Ali Akhtar" <al...@gmail.com> wrote:
>>
>>> I have a somewhat tricky use case, and I'm looking for ideas.
>>>
>>> I have 5-6 Kafka producers, reading various APIs, and writing their raw
>>> data into Kafka.
>>>
>>> I need to:
>>>
>>> - Do ETL on the data, and standardize it.
>>>
>>> - Store the standardized data somewhere (HBase / Cassandra / Raw HDFS /
>>> ElasticSearch / Postgres)
>>>
>>> - Query this data to generate reports / analytics (There will be a web
>>> UI which will be the front-end to the data, and will show the reports)
>>>
>>> Java is being used as the backend language for everything (backend of
>>> the web UI, as well as the ETL layer)
>>>
>>> I'm considering:
>>>
>>> - Using raw Kafka consumers, or Spark Streaming, as the ETL layer
>>> (receive raw data from Kafka, standardize & store it)
>>>
>>> - Using Cassandra, HBase, or raw HDFS, for storing the standardized
>>> data, and to allow queries
>>>
>>> - In the backend of the web UI, I could either use Spark to run queries
>>> across the data (mostly filters), or directly run queries against Cassandra
>>> / HBase
>>>
>>> I'd appreciate some thoughts / suggestions on which of these
>>> alternatives I should go with (e.g, using raw Kafka consumers vs Spark for
>>> ETL, which persistent data store to use, and how to query that data store
>>> in the backend of the web UI, for displaying the reports).
>>>
>>>
>>> Thanks.
>>>
>>
>

Re: Architecture recommendations for a tricky use case

Posted by Ali Akhtar <al...@gmail.com>.
It needs to be able to scale to a very large amount of data, yes.

On Thu, Sep 29, 2016 at 7:00 PM, Deepak Sharma <de...@gmail.com>
wrote:

> What is the message inflow ?
> If it's really high , definitely spark will be of great use .
>
> Thanks
> Deepak
>
> On Sep 29, 2016 19:24, "Ali Akhtar" <al...@gmail.com> wrote:
>
>> I have a somewhat tricky use case, and I'm looking for ideas.
>>
>> I have 5-6 Kafka producers, reading various APIs, and writing their raw
>> data into Kafka.
>>
>> I need to:
>>
>> - Do ETL on the data, and standardize it.
>>
>> - Store the standardized data somewhere (HBase / Cassandra / Raw HDFS /
>> ElasticSearch / Postgres)
>>
>> - Query this data to generate reports / analytics (There will be a web UI
>> which will be the front-end to the data, and will show the reports)
>>
>> Java is being used as the backend language for everything (backend of the
>> web UI, as well as the ETL layer)
>>
>> I'm considering:
>>
>> - Using raw Kafka consumers, or Spark Streaming, as the ETL layer
>> (receive raw data from Kafka, standardize & store it)
>>
>> - Using Cassandra, HBase, or raw HDFS, for storing the standardized data,
>> and to allow queries
>>
>> - In the backend of the web UI, I could either use Spark to run queries
>> across the data (mostly filters), or directly run queries against Cassandra
>> / HBase
>>
>> I'd appreciate some thoughts / suggestions on which of these alternatives
>> I should go with (e.g, using raw Kafka consumers vs Spark for ETL, which
>> persistent data store to use, and how to query that data store in the
>> backend of the web UI, for displaying the reports).
>>
>>
>> Thanks.
>>
>

Re: Architecture recommendations for a tricky use case

Posted by Deepak Sharma <de...@gmail.com>.
What is the message inflow?
If it's really high, definitely Spark will be of great use.

Thanks
Deepak

On Sep 29, 2016 19:24, "Ali Akhtar" <al...@gmail.com> wrote:

> I have a somewhat tricky use case, and I'm looking for ideas.
>
> I have 5-6 Kafka producers, reading various APIs, and writing their raw
> data into Kafka.
>
> I need to:
>
> - Do ETL on the data, and standardize it.
>
> - Store the standardized data somewhere (HBase / Cassandra / Raw HDFS /
> ElasticSearch / Postgres)
>
> - Query this data to generate reports / analytics (There will be a web UI
> which will be the front-end to the data, and will show the reports)
>
> Java is being used as the backend language for everything (backend of the
> web UI, as well as the ETL layer)
>
> I'm considering:
>
> - Using raw Kafka consumers, or Spark Streaming, as the ETL layer (receive
> raw data from Kafka, standardize & store it)
>
> - Using Cassandra, HBase, or raw HDFS, for storing the standardized data,
> and to allow queries
>
> - In the backend of the web UI, I could either use Spark to run queries
> across the data (mostly filters), or directly run queries against Cassandra
> / HBase
>
> I'd appreciate some thoughts / suggestions on which of these alternatives
> I should go with (e.g, using raw Kafka consumers vs Spark for ETL, which
> persistent data store to use, and how to query that data store in the
> backend of the web UI, for displaying the reports).
>
>
> Thanks.
>

Re: Architecture recommendations for a tricky use case

Posted by Avi Flax <av...@parkassist.com>.
> On Sep 29, 2016, at 16:39, Ali Akhtar <al...@gmail.com> wrote:
> 
> Why did you choose Druid over Postgres / Cassandra / Elasticsearch?

Well, to be clear, we haven’t chosen it yet — we’re evaluating it.

That said, it is looking quite promising for our use case.

The Druid docs say it well:

> Druid is an open source data store designed for OLAP queries on event data.

And that’s exactly what we need. The other options you listed are excellent systems, but they’re more general than Druid. Because Druid is specifically focused on OLAP queries on event data, it has features and properties that make it very well suited to such use cases.

In addition, Druid has built-in support for ingesting events from Kafka topics and making those events available for querying with very low latency. This is very attractive for my use case.

If you’d like to learn more about Druid I recommend this talk from last month at Strange Loop: https://www.youtube.com/watch?v=vbH8E0nH2Nw

HTH!

Avi

————
Software Architect @ Park Assist
We’re hiring! http://tech.parkassist.com/jobs/


Re: Architecture recommendations for a tricky use case

Posted by Ali Akhtar <al...@gmail.com>.
Avi,

Why did you choose Druid over Postgres / Cassandra / Elasticsearch?

On Fri, Sep 30, 2016 at 1:09 AM, Avi Flax <av...@parkassist.com> wrote:

>
> > On Sep 29, 2016, at 09:54, Ali Akhtar <al...@gmail.com> wrote:
> >
> > I'd appreciate some thoughts / suggestions on which of these
> alternatives I
> > should go with (e.g, using raw Kafka consumers vs Spark for ETL, which
> > persistent data store to use, and how to query that data store in the
> > backend of the web UI, for displaying the reports).
>
> Hi Ali, I’m no expert in any of this, but I’m working on a project that is
> broadly similar to yours, and FWIW I’m evaluating Druid as the datastore
> which would host the queryable data and, well, actually handle and fulfill
> queries.
>
> Since Druid has built-in support for streaming ingestion from Kafka
> topics, I’m tentatively thinking of doing my ETL in a stream processing
> topology (I’m using Kafka Streams, FWIW), which would write the events
> destined for Druid into certain topics, from which Druid would ingest those
> events.
>
> HTH,
> Avi
>
> ————
> Software Architect @ Park Assist
> We’re hiring! http://tech.parkassist.com/jobs/
>
>

Re: Architecture recommendations for a tricky use case

Posted by Avi Flax <av...@parkassist.com>.
> On Sep 29, 2016, at 09:54, Ali Akhtar <al...@gmail.com> wrote:
> 
> I'd appreciate some thoughts / suggestions on which of these alternatives I
> should go with (e.g, using raw Kafka consumers vs Spark for ETL, which
> persistent data store to use, and how to query that data store in the
> backend of the web UI, for displaying the reports).

Hi Ali, I’m no expert in any of this, but I’m working on a project that is broadly similar to yours, and FWIW I’m evaluating Druid as the datastore which would host the queryable data and, well, actually handle and fulfill queries.

Since Druid has built-in support for streaming ingestion from Kafka topics, I’m tentatively thinking of doing my ETL in a stream processing topology (I’m using Kafka Streams, FWIW), which would write the events destined for Druid into certain topics, from which Druid would ingest those events.
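
To make that concrete, a rough sketch of such a topology in Java might look
like the following (illustrative only, not actual project code; the topic
names and the standardize() helper are invented):

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class EtlTopology {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "etl-standardizer");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092"); // placeholder
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> raw = builder.stream("raw-events"); // invented topic name
        raw.mapValues(EtlTopology::standardize)                     // the ETL step
           .to("standardized-events");                              // the topic Druid would ingest from

        new KafkaStreams(builder.build(), props).start();
    }

    // Placeholder for whatever normalization the raw API payloads need.
    private static String standardize(String rawJson) {
        return rawJson.trim();
    }
}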

HTH,
Avi

————
Software Architect @ Park Assist
We’re hiring! http://tech.parkassist.com/jobs/


Re: Architecture recommendations for a tricky use case

Posted by Mich Talebzadeh <mi...@gmail.com>.
Hi Michael,

How about Druid <http://druid.io/> here?

Hive ORC tables are another option; they support streaming data ingest
<https://cwiki.apache.org/confluence/display/Hive/Streaming+Data+Ingest>
from Flume and Storm.

However, Spark cannot read ORC transactional tables because of the delta
files, unless compaction has been done (a nightmare).

HTH


Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com

*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 29 September 2016 at 17:01, Michael Segel <ms...@hotmail.com>
wrote:

> Ok… so what’s the tricky part?
> Spark Streaming isn’t real time so if you don’t mind a slight delay in
> processing… it would work.
>
> The drawback is that you now have a long running Spark Job (assuming under
> YARN) and that could become a problem in terms of security and resources.
> (How well does Yarn handle long running jobs these days in a secured
> Cluster? Steve L. may have some insight… )
>
> Raw HDFS would become a problem because Apache HDFS is still a worm. (Do
> you want to write your own compaction code? Or use Hive 1.x+?)
>
> HBase? Depending on your admin… stability could be a problem.
> Cassandra? That would be a separate cluster and that in itself could be a
> problem…
>
> YMMV so you need to address the pros/cons of each tool specific to your
> environment and skill level.
>
> HTH
>
> -Mike
>
> > On Sep 29, 2016, at 8:54 AM, Ali Akhtar <al...@gmail.com> wrote:
> >
> > I have a somewhat tricky use case, and I'm looking for ideas.
> >
> > I have 5-6 Kafka producers, reading various APIs, and writing their raw
> data into Kafka.
> >
> > I need to:
> >
> > - Do ETL on the data, and standardize it.
> >
> > - Store the standardized data somewhere (HBase / Cassandra / Raw HDFS /
> ElasticSearch / Postgres)
> >
> > - Query this data to generate reports / analytics (There will be a web
> UI which will be the front-end to the data, and will show the reports)
> >
> > Java is being used as the backend language for everything (backend of
> the web UI, as well as the ETL layer)
> >
> > I'm considering:
> >
> > - Using raw Kafka consumers, or Spark Streaming, as the ETL layer
> (receive raw data from Kafka, standardize & store it)
> >
> > - Using Cassandra, HBase, or raw HDFS, for storing the standardized
> data, and to allow queries
> >
> > - In the backend of the web UI, I could either use Spark to run queries
> across the data (mostly filters), or directly run queries against Cassandra
> / HBase
> >
> > I'd appreciate some thoughts / suggestions on which of these
> alternatives I should go with (e.g, using raw Kafka consumers vs Spark for
> ETL, which persistent data store to use, and how to query that data store
> in the backend of the web UI, for displaying the reports).
> >
> >
> > Thanks.
>
>

Re: Architecture recommendations for a tricky use case

Posted by Michael Segel <ms...@hotmail.com>.
The OP mentioned HBase or HDFS as persisted storage. Therefore they have to be running YARN if they are considering Spark.
(Assuming that you're not trying to do a separate storage / compute model and run standalone Spark outside your cluster. You can, but you have more moving parts…)

I never said anything about putting something on a public network. I mentioned running a secured cluster.
You don't deal with PII or other regulated data, do you?


If you read my original post, you are correct that we don't have a lot of real information, if any.
Based on what the OP said, there are design considerations, since every tool he mentioned has pluses and minuses, and the problem isn't really that challenging unless there is something extraordinary, like high velocity or some other constraint.

BTW, depending on scale and velocity… your relational engines may become problematic.
HTH

-Mike


> On Sep 29, 2016, at 1:51 PM, Cody Koeninger <co...@koeninger.org> wrote:
> 
> The OP didn't say anything about Yarn, and why are you contemplating
> putting Kafka or Spark on public networks to begin with?
> 
> Gwen's right, absent any actual requirements this is kind of pointless.
> 
> On Thu, Sep 29, 2016 at 1:27 PM, Michael Segel
> <ms...@hotmail.com> wrote:
>> Spark standalone is not Yarn… or secure for that matter… ;-)
>> 
>>> On Sep 29, 2016, at 11:18 AM, Cody Koeninger <co...@koeninger.org> wrote:
>>> 
>>> Spark streaming helps with aggregation because
>>> 
>>> A. raw kafka consumers have no built in framework for shuffling
>>> amongst nodes, short of writing into an intermediate topic (I'm not
>>> touching Kafka Streams here, I don't have experience), and
>>> 
>>> B. it deals with batches, so you can transactionally decide to commit
>>> or rollback your aggregate data and your offsets.  Otherwise your
>>> offsets and data store can get out of sync, leading to lost /
>>> duplicate data.
>>> 
>>> Regarding long running spark jobs, I have streaming jobs in the
>>> standalone manager that have been running for 6 months or more.
>>> 
>>> On Thu, Sep 29, 2016 at 11:01 AM, Michael Segel
>>> <ms...@hotmail.com> wrote:
>>>> Ok… so what’s the tricky part?
>>>> Spark Streaming isn’t real time so if you don’t mind a slight delay in processing… it would work.
>>>> 
>>>> The drawback is that you now have a long running Spark Job (assuming under YARN) and that could become a problem in terms of security and resources.
>>>> (How well does Yarn handle long running jobs these days in a secured Cluster? Steve L. may have some insight… )
>>>> 
>>>> Raw HDFS would become a problem because Apache HDFS is still a worm. (Do you want to write your own compaction code? Or use Hive 1.x+?)
>>>> 
>>>> HBase? Depending on your admin… stability could be a problem.
>>>> Cassandra? That would be a separate cluster and that in itself could be a problem…
>>>> 
>>>> YMMV so you need to address the pros/cons of each tool specific to your environment and skill level.
>>>> 
>>>> HTH
>>>> 
>>>> -Mike
>>>> 
>>>>> On Sep 29, 2016, at 8:54 AM, Ali Akhtar <al...@gmail.com> wrote:
>>>>> 
>>>>> I have a somewhat tricky use case, and I'm looking for ideas.
>>>>> 
>>>>> I have 5-6 Kafka producers, reading various APIs, and writing their raw data into Kafka.
>>>>> 
>>>>> I need to:
>>>>> 
>>>>> - Do ETL on the data, and standardize it.
>>>>> 
>>>>> - Store the standardized data somewhere (HBase / Cassandra / Raw HDFS / ElasticSearch / Postgres)
>>>>> 
>>>>> - Query this data to generate reports / analytics (There will be a web UI which will be the front-end to the data, and will show the reports)
>>>>> 
>>>>> Java is being used as the backend language for everything (backend of the web UI, as well as the ETL layer)
>>>>> 
>>>>> I'm considering:
>>>>> 
>>>>> - Using raw Kafka consumers, or Spark Streaming, as the ETL layer (receive raw data from Kafka, standardize & store it)
>>>>> 
>>>>> - Using Cassandra, HBase, or raw HDFS, for storing the standardized data, and to allow queries
>>>>> 
>>>>> - In the backend of the web UI, I could either use Spark to run queries across the data (mostly filters), or directly run queries against Cassandra / HBase
>>>>> 
>>>>> I'd appreciate some thoughts / suggestions on which of these alternatives I should go with (e.g, using raw Kafka consumers vs Spark for ETL, which persistent data store to use, and how to query that data store in the backend of the web UI, for displaying the reports).
>>>>> 
>>>>> 
>>>>> Thanks.
>>>> 
>> 


Re: Architecture recommendations for a tricky use case

Posted by Cody Koeninger <co...@koeninger.org>.
The OP didn't say anything about YARN, and why are you contemplating
putting Kafka or Spark on public networks to begin with?

Gwen's right: absent any actual requirements, this is kind of pointless.

On Thu, Sep 29, 2016 at 1:27 PM, Michael Segel
<ms...@hotmail.com> wrote:
> Spark standalone is not Yarn… or secure for that matter… ;-)
>
>> On Sep 29, 2016, at 11:18 AM, Cody Koeninger <co...@koeninger.org> wrote:
>>
>> Spark streaming helps with aggregation because
>>
>> A. raw kafka consumers have no built in framework for shuffling
>> amongst nodes, short of writing into an intermediate topic (I'm not
>> touching Kafka Streams here, I don't have experience), and
>>
>> B. it deals with batches, so you can transactionally decide to commit
>> or rollback your aggregate data and your offsets.  Otherwise your
>> offsets and data store can get out of sync, leading to lost /
>> duplicate data.
>>
>> Regarding long running spark jobs, I have streaming jobs in the
>> standalone manager that have been running for 6 months or more.
>>
>> On Thu, Sep 29, 2016 at 11:01 AM, Michael Segel
>> <ms...@hotmail.com> wrote:
>>> Ok… so what’s the tricky part?
>>> Spark Streaming isn’t real time so if you don’t mind a slight delay in processing… it would work.
>>>
>>> The drawback is that you now have a long running Spark Job (assuming under YARN) and that could become a problem in terms of security and resources.
>>> (How well does Yarn handle long running jobs these days in a secured Cluster? Steve L. may have some insight… )
>>>
>>> Raw HDFS would become a problem because Apache HDFS is still a worm. (Do you want to write your own compaction code? Or use Hive 1.x+?)
>>>
>>> HBase? Depending on your admin… stability could be a problem.
>>> Cassandra? That would be a separate cluster and that in itself could be a problem…
>>>
>>> YMMV so you need to address the pros/cons of each tool specific to your environment and skill level.
>>>
>>> HTH
>>>
>>> -Mike
>>>
>>>> On Sep 29, 2016, at 8:54 AM, Ali Akhtar <al...@gmail.com> wrote:
>>>>
>>>> I have a somewhat tricky use case, and I'm looking for ideas.
>>>>
>>>> I have 5-6 Kafka producers, reading various APIs, and writing their raw data into Kafka.
>>>>
>>>> I need to:
>>>>
>>>> - Do ETL on the data, and standardize it.
>>>>
>>>> - Store the standardized data somewhere (HBase / Cassandra / Raw HDFS / ElasticSearch / Postgres)
>>>>
>>>> - Query this data to generate reports / analytics (There will be a web UI which will be the front-end to the data, and will show the reports)
>>>>
>>>> Java is being used as the backend language for everything (backend of the web UI, as well as the ETL layer)
>>>>
>>>> I'm considering:
>>>>
>>>> - Using raw Kafka consumers, or Spark Streaming, as the ETL layer (receive raw data from Kafka, standardize & store it)
>>>>
>>>> - Using Cassandra, HBase, or raw HDFS, for storing the standardized data, and to allow queries
>>>>
>>>> - In the backend of the web UI, I could either use Spark to run queries across the data (mostly filters), or directly run queries against Cassandra / HBase
>>>>
>>>> I'd appreciate some thoughts / suggestions on which of these alternatives I should go with (e.g, using raw Kafka consumers vs Spark for ETL, which persistent data store to use, and how to query that data store in the backend of the web UI, for displaying the reports).
>>>>
>>>>
>>>> Thanks.
>>>
>

Re: Architecture recommendations for a tricky use case

Posted by Michael Segel <ms...@hotmail.com>.
Spark standalone is not YARN… or secure for that matter… ;-)

> On Sep 29, 2016, at 11:18 AM, Cody Koeninger <co...@koeninger.org> wrote:
> 
> Spark streaming helps with aggregation because
> 
> A. raw kafka consumers have no built in framework for shuffling
> amongst nodes, short of writing into an intermediate topic (I'm not
> touching Kafka Streams here, I don't have experience), and
> 
> B. it deals with batches, so you can transactionally decide to commit
> or rollback your aggregate data and your offsets.  Otherwise your
> offsets and data store can get out of sync, leading to lost /
> duplicate data.
> 
> Regarding long running spark jobs, I have streaming jobs in the
> standalone manager that have been running for 6 months or more.
> 
> On Thu, Sep 29, 2016 at 11:01 AM, Michael Segel
> <ms...@hotmail.com> wrote:
>> Ok… so what’s the tricky part?
>> Spark Streaming isn’t real time so if you don’t mind a slight delay in processing… it would work.
>> 
>> The drawback is that you now have a long running Spark Job (assuming under YARN) and that could become a problem in terms of security and resources.
>> (How well does Yarn handle long running jobs these days in a secured Cluster? Steve L. may have some insight… )
>> 
>> Raw HDFS would become a problem because Apache HDFS is still a worm. (Do you want to write your own compaction code? Or use Hive 1.x+?)
>> 
>> HBase? Depending on your admin… stability could be a problem.
>> Cassandra? That would be a separate cluster and that in itself could be a problem…
>> 
>> YMMV so you need to address the pros/cons of each tool specific to your environment and skill level.
>> 
>> HTH
>> 
>> -Mike
>> 
>>> On Sep 29, 2016, at 8:54 AM, Ali Akhtar <al...@gmail.com> wrote:
>>> 
>>> I have a somewhat tricky use case, and I'm looking for ideas.
>>> 
>>> I have 5-6 Kafka producers, reading various APIs, and writing their raw data into Kafka.
>>> 
>>> I need to:
>>> 
>>> - Do ETL on the data, and standardize it.
>>> 
>>> - Store the standardized data somewhere (HBase / Cassandra / Raw HDFS / ElasticSearch / Postgres)
>>> 
>>> - Query this data to generate reports / analytics (There will be a web UI which will be the front-end to the data, and will show the reports)
>>> 
>>> Java is being used as the backend language for everything (backend of the web UI, as well as the ETL layer)
>>> 
>>> I'm considering:
>>> 
>>> - Using raw Kafka consumers, or Spark Streaming, as the ETL layer (receive raw data from Kafka, standardize & store it)
>>> 
>>> - Using Cassandra, HBase, or raw HDFS, for storing the standardized data, and to allow queries
>>> 
>>> - In the backend of the web UI, I could either use Spark to run queries across the data (mostly filters), or directly run queries against Cassandra / HBase
>>> 
>>> I'd appreciate some thoughts / suggestions on which of these alternatives I should go with (e.g, using raw Kafka consumers vs Spark for ETL, which persistent data store to use, and how to query that data store in the backend of the web UI, for displaying the reports).
>>> 
>>> 
>>> Thanks.
>> 


Re: Architecture recommendations for a tricky use case

Posted by Cody Koeninger <co...@koeninger.org>.
Spark Streaming helps with aggregation because

A. raw Kafka consumers have no built-in framework for shuffling
amongst nodes, short of writing into an intermediate topic (I'm not
touching Kafka Streams here, I don't have experience with it), and

B. it deals with batches, so you can transactionally decide to commit
or roll back your aggregate data and your offsets. Otherwise your
offsets and data store can get out of sync, leading to lost /
duplicate data.

Regarding long-running Spark jobs, I have streaming jobs in the
standalone manager that have been running for 6 months or more.
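
As an illustration of point B, the commit-results-and-offsets pattern might
be sketched like this (a rough fragment only; it assumes the
spark-streaming-kafka-0-10 integration plus java.sql imports, and the JDBC
URL and table names are invented):

// Given a JavaInputDStream<ConsumerRecord<String, String>> named "stream"
// from KafkaUtils.createDirectStream; this runs on the driver per batch.
stream.foreachRDD(rdd -> {
    OffsetRange[] ranges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
    long total = rdd.count(); // stand-in for the real aggregate

    try (Connection c = DriverManager.getConnection("jdbc:postgresql://db/reports")) {
        c.setAutoCommit(false);
        try (PreparedStatement agg = c.prepareStatement(
                 "insert into batch_totals (n) values (?)");
             PreparedStatement off = c.prepareStatement(
                 "update kafka_offsets set until_offset = ? "
                 + "where topic = ? and kafka_partition = ?")) {
            agg.setLong(1, total);
            agg.executeUpdate();
            for (OffsetRange r : ranges) {
                off.setLong(1, r.untilOffset());
                off.setString(2, r.topic());
                off.setInt(3, r.partition());
                off.executeUpdate();
            }
        }
        c.commit(); // aggregates and offsets land together, or not at all
    }
});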

On Thu, Sep 29, 2016 at 11:01 AM, Michael Segel
<ms...@hotmail.com> wrote:
> Ok… so what’s the tricky part?
> Spark Streaming isn’t real time so if you don’t mind a slight delay in processing… it would work.
>
> The drawback is that you now have a long running Spark Job (assuming under YARN) and that could become a problem in terms of security and resources.
> (How well does Yarn handle long running jobs these days in a secured Cluster? Steve L. may have some insight… )
>
> Raw HDFS would become a problem because Apache HDFS is still a worm. (Do you want to write your own compaction code? Or use Hive 1.x+?)
>
> HBase? Depending on your admin… stability could be a problem.
> Cassandra? That would be a separate cluster and that in itself could be a problem…
>
> YMMV so you need to address the pros/cons of each tool specific to your environment and skill level.
>
> HTH
>
> -Mike
>
>> On Sep 29, 2016, at 8:54 AM, Ali Akhtar <al...@gmail.com> wrote:
>>
>> I have a somewhat tricky use case, and I'm looking for ideas.
>>
>> I have 5-6 Kafka producers, reading various APIs, and writing their raw data into Kafka.
>>
>> I need to:
>>
>> - Do ETL on the data, and standardize it.
>>
>> - Store the standardized data somewhere (HBase / Cassandra / Raw HDFS / ElasticSearch / Postgres)
>>
>> - Query this data to generate reports / analytics (There will be a web UI which will be the front-end to the data, and will show the reports)
>>
>> Java is being used as the backend language for everything (backend of the web UI, as well as the ETL layer)
>>
>> I'm considering:
>>
>> - Using raw Kafka consumers, or Spark Streaming, as the ETL layer (receive raw data from Kafka, standardize & store it)
>>
>> - Using Cassandra, HBase, or raw HDFS, for storing the standardized data, and to allow queries
>>
>> - In the backend of the web UI, I could either use Spark to run queries across the data (mostly filters), or directly run queries against Cassandra / HBase
>>
>> I'd appreciate some thoughts / suggestions on which of these alternatives I should go with (e.g, using raw Kafka consumers vs Spark for ETL, which persistent data store to use, and how to query that data store in the backend of the web UI, for displaying the reports).
>>
>>
>> Thanks.
>

Re: Architecture recommendations for a tricky use case

Posted by Michael Segel <ms...@hotmail.com>.
Ok… so what’s the tricky part?
Spark Streaming isn’t real time, so if you don’t mind a slight delay in processing… it would work.

The drawback is that you now have a long-running Spark job (assuming under YARN), and that could become a problem in terms of security and resources.
(How well does YARN handle long-running jobs these days in a secured cluster? Steve L. may have some insight… )

Raw HDFS would become a problem because Apache HDFS is still WORM (write once, read many) storage. (Do you want to write your own compaction code? Or use Hive 1.x+?)

HBase? Depending on your admin… stability could be a problem.
Cassandra? That would be a separate cluster, and that in itself could be a problem…

YMMV, so you need to address the pros/cons of each tool specific to your environment and skill level.

HTH

-Mike

> On Sep 29, 2016, at 8:54 AM, Ali Akhtar <al...@gmail.com> wrote:
> 
> I have a somewhat tricky use case, and I'm looking for ideas.
> 
> I have 5-6 Kafka producers, reading various APIs, and writing their raw data into Kafka.
> 
> I need to:
> 
> - Do ETL on the data, and standardize it.
> 
> - Store the standardized data somewhere (HBase / Cassandra / Raw HDFS / ElasticSearch / Postgres)
> 
> - Query this data to generate reports / analytics (There will be a web UI which will be the front-end to the data, and will show the reports)
> 
> Java is being used as the backend language for everything (backend of the web UI, as well as the ETL layer)
> 
> I'm considering:
> 
> - Using raw Kafka consumers, or Spark Streaming, as the ETL layer (receive raw data from Kafka, standardize & store it)
> 
> - Using Cassandra, HBase, or raw HDFS, for storing the standardized data, and to allow queries
> 
> - In the backend of the web UI, I could either use Spark to run queries across the data (mostly filters), or directly run queries against Cassandra / HBase
> 
> I'd appreciate some thoughts / suggestions on which of these alternatives I should go with (e.g, using raw Kafka consumers vs Spark for ETL, which persistent data store to use, and how to query that data store in the backend of the web UI, for displaying the reports).
> 
> 
> Thanks.

