Posted to user@spark.apache.org by Mich Talebzadeh <mi...@gmail.com> on 2016/09/15 08:35:29 UTC

Best way to present data collected by Flume through Spark

Hi,

I am fishing for some ideas.

In the design we get prices through Kafka into Flume and store them on HDFS
as text files. We can then use Spark with Zeppelin to present the data to the
users.

This works. However, I am aware that once the volume of flat files rises,
one needs to do housekeeping; you don't want to read all the files every time.

A more viable alternative would be to read the data periodically into some
form of table (Hive etc.) through an hourly cron job, so the batch process has
accurate data that is up to date to the last hour.
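
Something along these lines is what I have in mind for that hourly load (only
a sketch: the path, column names and table name are made up for illustration,
and it crudely rebuilds the table on each run rather than loading incrementally):

import org.apache.spark.sql.SparkSession

// Hypothetical hourly load: read the text files Flume has landed on HDFS and
// rebuild a Hive table from them. Path, column names and table name are
// made up for illustration; an incremental load would only pick up new files.
val spark = SparkSession.builder()
  .appName("HourlyPriceLoad")
  .enableHiveSupport()
  .getOrCreate()

val prices = spark.read
  .option("header", false)
  .csv("hdfs://namenode:9000/data/prices/")
  .toDF("index", "timecreated", "security", "price")

prices.write.mode("overwrite").saveAsTable("marketdata_batch")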

Such a table-based approach would certainly be an easier option for the users as well.

I was wondering what the best strategy would be here: Druid, Hive, others?

The business case is that users may want to access older data, so a database
of some sort would be a better solution. In all likelihood they want a week's
worth of data.

Thanks

Dr Mich Talebzadeh



LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.

Re: Best way to present data collected by Flume through Spark

Posted by Mich Talebzadeh <mi...@gmail.com>.
Hi Sean,

At the moment I am using Zeppelin with Spark SQL to get data from Hive, so any
connection for visualisation has to go through this sort of API.

I know Tableau only uses SQL. Zeppelin can use Spark SQL directly or go
through the Spark Thrift Server.

The question is that a user may want to create a join or something involving
many tables, and the preference would be to use some sort of database for that.

In this case Hive is running on the Spark engine, so we are not talking about
MapReduce and its associated latency.

That Hive element can easily be swapped out. Our requirement is to present
multiple tables to a dashboard and let the users slice and dice.

The factors are not just speed but also functionality. At the moment Zeppelin
uses Spark SQL. I can get rid of Hive and replace it with something else, but I
think I still need a tabular interface to the Flume-delivered data.
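
For example, once the Flume-delivered prices and any reference data are
exposed as tables or views, a Zeppelin paragraph can run an ad-hoc join in
Spark SQL along these lines (the reference table and its columns are
placeholders, not our actual schema):

// Hypothetical Zeppelin paragraph: ad-hoc join of the price data with a
// reference table. The reference table and column names are placeholders.
val joined = spark.sql("""
  SELECT p.security, r.sector, AVG(p.price) AS avg_price
  FROM marketData p
  JOIN security_ref r
    ON p.security = r.security
  GROUP BY p.security, r.sector
""")
joined.show(20, false)

The same SQL could equally be issued over JDBC against the Spark Thrift Server
from Tableau or a similar tool.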

I will be happy to consider all options.

Thanks

Dr Mich Talebzadeh



LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 16 September 2016 at 08:46, Sean Owen <so...@cloudera.com> wrote:

> Why Hive and why precompute data at 15 minute latency? there are
> several ways here to query the source data directly with no extra step
> or latency here. Even Spark SQL is real-time-ish for queries on the
> source data, and Impala (or heck Drill etc) are.
>
> On Thu, Sep 15, 2016 at 10:56 PM, Mich Talebzadeh
> <mi...@gmail.com> wrote:
> > OK this seems to be working for the "Batch layer". I will try to create a
> > functional diagram for it
> >
> > Publisher sends prices every two seconds
> > Kafka receives data
> > Flume delivers data from Kafka to HDFS on text files time stamped
> > A Hive ORC external table (source table) is created on the directory
> where
> > flume writes continuously
> > All temporary flume tables are prefixed by "." (hidden files), so Hive
> > external table does not see those
> > Every price row includes a timestamp
> > A conventional Hive table (target table) is created with all columns from
> > the external table + two additional columns with one being a timestamp
> from
> > Hive
> > A cron job set up that runs ever 15 minutes  as below
> > 0,15,30,45 00-23 * * 1-5 (/home/hduser/dba/bin/populate_marketData.ksh
> -D
> > test > /var/tmp/populate_marketData_test.err 2>&1)
> >
> > This cron as can be seen runs runs every 15 minutes and refreshes the
> Hive
> > target table with the new data. New data meaning the price created time >
> > MAX(price created time) from the target table
> >
> > Target table statistics are updated at each run. It takes an average of 2
> > minutes to run the job
> > Thu Sep 15 22:45:01 BST 2016  ======= Started
> > /home/hduser/dba/bin/populate_marketData.ksh  =======
> > 15/09/2016 22:45:09.09
> > 15/09/2016 22:46:57.57
> > 2016-09-15T22:46:10
> > 2016-09-15T22:46:57
> > Thu Sep 15 22:47:21 BST 2016  ======= Completed
> > /home/hduser/dba/bin/populate_marketData.ksh  =======
> >
> >
> > So the target table is 15 minutes out of sync with flume data which is
> not
> > bad.
> >
> > Assuming that I replace ORC tables with Parquet, druid whatever, that
> can be
> > done pretty easily. However, although I am using Zeppelin here, people
> may
> > decide to use Tableau, QlikView etc which we need to think about the
> > connectivity between these notebooks and the underlying database. I know
> > Tableau and it is very SQL centric and works with ODBC and JDBC drivers
> or
> > native drivers. For example I know that Tableau comes with Hive supplied
> > ODBC drivers. I am not sure these database have drivers for Druid etc?
> >
> > Let me know your thoughts.
> >
> > Cheers
> >
> > Dr Mich Talebzadeh
> >
>

Re: Best way to present data collected by Flume through Spark

Posted by Sean Owen <so...@cloudera.com>.
Why Hive, and why precompute data at 15-minute latency? There are several
ways to query the source data directly with no extra step or latency. Even
Spark SQL is real-time-ish for queries on the source data, and Impala (or,
heck, Drill etc.) certainly is.
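
For example, something like this queries the landed files directly with no
intermediate load step (the path and column names are placeholders):

// Hypothetical direct query over the landed text files, no precompute step.
val prices = spark.read
  .option("header", false)
  .csv("hdfs://namenode:9000/data/prices/")
  .toDF("index", "timecreated", "security", "price")

prices.createOrReplaceTempView("prices_raw")

spark.sql(
  "SELECT security, MAX(CAST(price AS double)) AS max_price " +
  "FROM prices_raw GROUP BY security"
).show()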

On Thu, Sep 15, 2016 at 10:56 PM, Mich Talebzadeh
<mi...@gmail.com> wrote:
> OK this seems to be working for the "Batch layer". I will try to create a
> functional diagram for it
>
> Publisher sends prices every two seconds
> Kafka receives data
> Flume delivers data from Kafka to HDFS on text files time stamped
> A Hive ORC external table (source table) is created on the directory where
> flume writes continuously
> All temporary flume tables are prefixed by "." (hidden files), so Hive
> external table does not see those
> Every price row includes a timestamp
> A conventional Hive table (target table) is created with all columns from
> the external table + two additional columns with one being a timestamp from
> Hive
> A cron job set up that runs ever 15 minutes  as below
> 0,15,30,45 00-23 * * 1-5 (/home/hduser/dba/bin/populate_marketData.ksh -D
> test > /var/tmp/populate_marketData_test.err 2>&1)
>
> This cron as can be seen runs runs every 15 minutes and refreshes the Hive
> target table with the new data. New data meaning the price created time >
> MAX(price created time) from the target table
>
> Target table statistics are updated at each run. It takes an average of 2
> minutes to run the job
> Thu Sep 15 22:45:01 BST 2016  ======= Started
> /home/hduser/dba/bin/populate_marketData.ksh  =======
> 15/09/2016 22:45:09.09
> 15/09/2016 22:46:57.57
> 2016-09-15T22:46:10
> 2016-09-15T22:46:57
> Thu Sep 15 22:47:21 BST 2016  ======= Completed
> /home/hduser/dba/bin/populate_marketData.ksh  =======
>
>
> So the target table is 15 minutes out of sync with flume data which is not
> bad.
>
> Assuming that I replace ORC tables with Parquet, druid whatever, that can be
> done pretty easily. However, although I am using Zeppelin here, people may
> decide to use Tableau, QlikView etc which we need to think about the
> connectivity between these notebooks and the underlying database. I know
> Tableau and it is very SQL centric and works with ODBC and JDBC drivers or
> native drivers. For example I know that Tableau comes with Hive supplied
> ODBC drivers. I am not sure these database have drivers for Druid etc?
>
> Let me know your thoughts.
>
> Cheers
>
> Dr Mich Talebzadeh
>

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Re: Best way to present data collected by Flume through Spark

Posted by Mich Talebzadeh <mi...@gmail.com>.
OK, this seems to be working for the "Batch layer". I will try to create a
functional diagram for it.


   1. The publisher sends prices every two seconds
   2. Kafka receives the data
   3. Flume delivers the data from Kafka to HDFS as time-stamped text files
   4. A Hive external table (source table) is created on the directory where
   Flume writes continuously
   5. All in-flight Flume files are prefixed with "." (hidden files), so the
   Hive external table does not see them
   6. Every price row includes a timestamp
   7. A conventional Hive table (target table, stored as ORC) is created with
   all columns from the external table plus two additional columns, one of
   which is a timestamp set by Hive
   8. A cron job is set up that runs every 15 minutes, as below
   9. 0,15,30,45 00-23 * * 1-5
   (/home/hduser/dba/bin/populate_marketData.ksh -D test >
   /var/tmp/populate_marketData_test.err 2>&1)

   10. As can be seen, this cron job runs every 15 minutes and refreshes the
   Hive target table with the new data, "new" meaning rows whose price-created
   time is > MAX(price created time) in the target table. A sketch of the
   refresh logic follows this list.
      1. Target table statistics are updated at each run. It takes an
      average of 2 minutes to run the job
      2. Thu Sep 15 22:45:01 BST 2016  ======= Started
      /home/hduser/dba/bin/populate_marketData.ksh  =======
      15/09/2016 22:45:09.09
      15/09/2016 22:46:57.57
      2016-09-15T22:46:10
      2016-09-15T22:46:57
      Thu Sep 15 22:47:21 BST 2016  ======= Completed
      /home/hduser/dba/bin/populate_marketData.ksh  =======
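
For illustration, the refresh logic inside populate_marketData.ksh boils down
to something like the Spark SQL below; the two extra column names (op_time,
op_type) are my placeholders here, not necessarily the real ones:

// Sketch of the 15-minute refresh: copy across only rows newer than what the
// target table already holds. The extra column names are placeholders, and
// the real script could equally drive this through Hive itself (e.g. beeline).
spark.sql("""
  INSERT INTO TABLE marketData
  SELECT e.*,
         current_timestamp() AS op_time,
         'populate_marketData' AS op_type
  FROM externalMarketData e
  WHERE e.timecreated > (SELECT MAX(m.timecreated) FROM marketData m)
""")
// Note: on the very first run MAX() is NULL, so the load needs seeding
// (or a COALESCE around the subquery).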


So the target table is at most 15 minutes out of sync with the Flume data,
which is not bad.

Replacing the ORC tables with Parquet, Druid or whatever can be done pretty
easily. However, although I am using Zeppelin here, people may decide to use
Tableau, QlikView etc., so we need to think about the connectivity between
these tools and the underlying database. I know Tableau well: it is very
SQL-centric and works with ODBC, JDBC or native drivers; for example, Tableau
comes with Hive ODBC drivers. I am not sure those tools have drivers for
Druid etc.?

Let me know your thoughts.

Cheers

Dr Mich Talebzadeh



LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 15 September 2016 at 16:35, Mich Talebzadeh <mi...@gmail.com>
wrote:

> Thanks guys.
>
> This is I would propose to  proceed.
>
>
>    1. For speed layer. get data through Spark Streaming (SS), loop over
>    rdd for individual prices, do some work and alert the trader. This works OK
>    but obviously is expensive resource wise
>    2. In the same SS one cannot process RDD to disk  -->
>    cachedRDD.saveAsTextFile("/data/prices/prices_" +
>    System.currentTimeMillis.toString) and also deal with individual
>    prices. It is too time consuming.
>    3. Given 2 decided to keep individual prices ONLY processing through
>    SS
>    4. Decided to use Flume to hook to Kafka and write the incoming
>    messages to HDFS. The same Kafka now feed SS and Flume. This works fine but
>    one ends up with loads of files
>    5. I prefer to get the prices into HDFS as is (golden source etc) and
>    then treat them
>    6. Decided to use Zeppelin on prices with Spark FP. It works OK but
>    you don't want to load data from all files. As the number of files grows,
>    the load slows down. This is business driven. A user may want to see old
>    data. So a flat file read through spark csv is not best --> val df =
>    spark.read.option("header", false).csv("hdfs://rhes564:9000/data/prices/prices.*[1-9]").
>    The silly [1-9] is to avoid reading tmp files that flume creates
>    temporary!
>    7. It works in principal but not ideal so the issue of reading to a
>    table periodically
>
>
> [image: Inline images 1]
>
>
> This graph took more than a minute to produce using Zeppelin. This is an
> extreme case with 20 different securities loaded (circles, dummy
> securities). Now this is the batch layer. So I want to read the data into
> database of some sort and let the users use Spark SQL on Zeppelin to query
> it. Data does not have to be extremely up to date May be within the past 15
> minutes
>
> Obviously Spark can read from text files or Parquet files. So at this
> batch layer if we decide to use Zeppelin for now we can use Spark SQL or
> Spark Thrift Server or use some JDBC connection to database. I have some
> colleagues off shore who are building a real time dashboard so it should
> read from both Batch database and any database that SS alerts and write to
> a table (at the moment ORC).
>
> If the price is > 95 then it is buy signal, alert and post to a table
>
> Price on C4 hit 98.86354
> Price on T1 hit 99.47362
> Price on D3 hit 97.75991
> Price on D3 hit 98.90905
> Price on C3 hit 98.25477
>
> So thinking loud going back to batch layer, for now I created an external
> Hive table on the files directory. Will that do the job? It is an ORC table
>
> CREATE EXTERNAL TABLE externalMarketData (
>      INDEX int
>    , TIMECREATED string
>    , SECURITY string
>    , PRICE float
> )
> COMMENT 'From prices Kakfa delived by Flume'
> ROW FORMAT serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
> STORED AS TEXTFILE
> LOCATION '/data/prices/'
> --TBLPROPERTIES ("skip.header.line.count"="1")
>
> Will that do the job? External tables in Hive? IT is ORC. Then I can
> create an internal table in Hive to insert/select from the external table
> to the internal table
>
> CREATE TABLE marketData (
>      INDEX int
>    , TIMECREATED string
>    , SECURITY string
>    , PRICE float
> )
> CLUSTERED BY (INDEX) INTO 256 BUCKETS
> STORED AS ORC
> TBLPROPERTIES (
> "orc.create.index"="true",
> "orc.bloom.filter.columns"="ID",
> "orc.bloom.filter.fpp"="0.05",
> "orc.compress"="SNAPPY",
> "orc.stripe.size"="16777216",
> "orc.row.index.stride"="10000" )
> ;
>
> So I guess with predicate push down it may be performant. This can be a
> Parquet as well. Also notes that the number of columns is minimal at the
> moment.
>
> Cheers
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 15 September 2016 at 15:46, Jeff Nadler <jn...@srcginc.com> wrote:
>
>> Yes we do something very similar and it's working well:
>>
>> Kafka ->
>> Spark Streaming (write temp files, serialized RDDs) ->
>> Spark Batch Application (build partitioned Parquet files on HDFS; this is
>> needed because building Parquet files of a reasonable size is too slow for
>> streaming) ->
>> query with SparkSQL
>>
>>
>> On Thu, Sep 15, 2016 at 7:33 AM, Sean Owen <so...@cloudera.com> wrote:
>>
>>> If your core requirement is ad-hoc real-time queries over the data,
>>> then the standard Hadoop-centric answer would be:
>>>
>>> Ingest via Kafka,
>>> maybe using Flume, or possibly Spark Streaming, to read and land the
>>> data, in...
>>> Parquet on HDFS or possibly Kudu, and
>>> Impala to query
>>>
>>> >> On 15 September 2016 at 09:35, Mich Talebzadeh <
>>> mich.talebzadeh@gmail.com>
>>> >> wrote:
>>> >>>
>>> >>> Hi,
>>> >>>
>>> >>> This is for fishing for some ideas.
>>> >>>
>>> >>> In the design we get prices directly through Kafka into Flume and
>>> store
>>> >>> it on HDFS as text files
>>> >>> We can then use Spark with Zeppelin to present data to the users.
>>> >>>
>>> >>> This works. However, I am aware that once the volume of flat files
>>> rises
>>> >>> one needs to do housekeeping. You don't want to read all files every
>>> time.
>>> >>>
>>> >>> A more viable alternative would be to read data into some form of
>>> tables
>>> >>> (Hive etc) periodically through an hourly cron set up so batch
>>> process will
>>> >>> have up to date and accurate data up to last hour.
>>> >>>
>>> >>> That certainly be an easier option for the users as well.
>>> >>>
>>> >>> I was wondering what would be the best strategy here. Druid, Hive
>>> others?
>>> >>>
>>> >>> The business case here is that users may want to access older data
>>> so a
>>> >>> database of some sort will be a better solution? In all likelihood
>>> they want
>>> >>> a week's data.
>>> >>>
>>> >>> Thanks
>>> >>>
>>> >>> Dr Mich Talebzadeh
>>> >>>
>>> >>>
>>> >>>
>>> >>> LinkedIn
>>> >>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJ
>>> d6zP6AcPCCdOABUrV8Pw
>>> >>>
>>> >>>
>>> >>>
>>> >>> http://talebzadehmich.wordpress.com
>>> >>>
>>> >>>
>>> >>> Disclaimer: Use it at your own risk. Any and all responsibility for
>>> any
>>> >>> loss, damage or destruction of data or any other property which may
>>> arise
>>> >>> from relying on this email's technical content is explicitly
>>> disclaimed. The
>>> >>> author will in no case be liable for any monetary damages arising
>>> from such
>>> >>> loss, damage or destruction.
>>> >>>
>>> >>>
>>> >>
>>> >>
>>> >
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>>>
>>>
>>
>

Re: Best way to present data collected by Flume through Spark

Posted by Mich Talebzadeh <mi...@gmail.com>.
Thanks guys.

This is how I would propose to proceed.


   1. For the speed layer, get data through Spark Streaming (SS), loop over
   the RDD for individual prices, do some work and alert the trader. This
   works OK but is obviously expensive resource-wise
   2. In the same SS job one cannot both persist the RDD to disk  -->
   cachedRDD.saveAsTextFile("/data/prices/prices_" +
   System.currentTimeMillis.toString) and also deal with individual prices;
   it is too time-consuming
   3. Given 2, I decided to keep SS for processing individual prices only
   4. Decided to use Flume to hook into Kafka and write the incoming messages
   to HDFS. The same Kafka stream now feeds both SS and Flume. This works fine
   but one ends up with loads of files
   5. I prefer to get the prices into HDFS as is (golden source etc.) and then
   treat them
   6. Decided to use Zeppelin on the prices with Spark FP. It works OK, but
   you don't want to load data from all the files: as the number of files
   grows, the load slows down. This is business driven; a user may want to see
   old data. So a flat-file read through the Spark CSV reader is not ideal -->
   val df = spark.read.option("header", false).csv("hdfs://rhes564:9000/data/prices/prices.*[1-9]").
   The silly [1-9] is to avoid reading the tmp files that Flume creates
   temporarily (see the sketch after this list)
   7. It works in principle but is not ideal, hence the issue of reading into
   a table periodically
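
A slightly tidied version of that read, only as a sketch (host, port and path
are from my test setup; the column names and the cast are assumptions about
the CSV layout):

import org.apache.spark.sql.functions.col

// Read the completed Flume files only; the [1-9] glob skips the in-flight
// files. Column names are assumptions about the CSV layout.
val df = spark.read
  .option("header", false)
  .csv("hdfs://rhes564:9000/data/prices/prices.*[1-9]")
  .toDF("index", "timecreated", "security", "price")
  .withColumn("price", col("price").cast("double"))
  .cache()

df.createOrReplaceTempView("prices_raw")

With the temp view registered, the Zeppelin paragraphs can stay in Spark SQL.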


[image: Inline images 1]


This graph took more than a minute to produce using Zeppelin. It is an extreme
case, with 20 different securities loaded (the circles are dummy securities).
Now, this is the batch layer, so I want to read the data into a database of
some sort and let the users use Spark SQL in Zeppelin to query it. The data
does not have to be extremely up to date; within the past 15 minutes will do.

Obviously Spark can read from text files or Parquet files. So at this batch
layer, if we stick with Zeppelin for now, we can use Spark SQL, the Spark
Thrift Server, or some JDBC connection to a database. I have some offshore
colleagues building a real-time dashboard, so it should read both from the
batch database and from whatever table SS writes its alerts to (at the moment
ORC).

If the price is > 95 then it is a buy signal; alert and post to a table:

Price on C4 hit 98.86354
Price on T1 hit 99.47362
Price on D3 hit 97.75991
Price on D3 hit 98.90905
Price on C3 hit 98.25477
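
The speed-layer check itself is roughly the following (a sketch only: it
assumes a DStream[String] of CSV lines "index,timecreated,security,price"
already obtained from the Kafka direct stream; the threshold and the println
stand in for the real alert and table write):

import org.apache.spark.streaming.dstream.DStream

// Sketch of the speed-layer alert: `lines` is assumed to be a DStream[String]
// of CSV records "index,timecreated,security,price" from the Kafka stream.
def alertOnHighPrices(lines: DStream[String]): Unit = {
  val highPrices = lines
    .map(_.split(","))
    .filter(_.length == 4)
    .map(fields => (fields(2), fields(3).toDouble))   // (security, price)
    .filter { case (_, price) => price > 95.0 }

  highPrices.foreachRDD { rdd =>
    // In the real job this would alert the trader and write to a table;
    // here we just print, collecting to the driver (fine for a sketch).
    rdd.collect().foreach { case (security, price) =>
      println(s"Price on $security hit $price")
    }
  }
}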

So, thinking out loud and going back to the batch layer: for now I have
created an external Hive table on the files directory (the internal target
table is ORC). Will that do the job?

CREATE EXTERNAL TABLE externalMarketData (
     INDEX int
   , TIMECREATED string
   , SECURITY string
   , PRICE float
)
COMMENT 'Prices from Kafka delivered by Flume'
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE
LOCATION '/data/prices/'
-- TBLPROPERTIES ("skip.header.line.count"="1")
;

Will an external table in Hive do the job here? I can then create an internal
ORC table in Hive and insert/select from the external table into it:

CREATE TABLE marketData (
     INDEX int
   , TIMECREATED string
   , SECURITY string
   , PRICE float
)
CLUSTERED BY (INDEX) INTO 256 BUCKETS
STORED AS ORC
TBLPROPERTIES (
"orc.create.index"="true",
"orc.bloom.filter.columns"="INDEX",
"orc.bloom.filter.fpp"="0.05",
"orc.compress"="SNAPPY",
"orc.stripe.size"="16777216",
"orc.row.index.stride"="10000" )
;

So I guess with predicate pushdown it may be performant. This could be Parquet
as well. Also note that the number of columns is minimal at the moment.
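
For instance, a query of this shape is what I would expect pushdown to help
with; the literal values are made up, and whether pushdown actually kicks in
depends on how Spark reads the ORC table (e.g. the spark.sql.orc.filterPushdown
setting), so it needs verifying:

// Example query whose WHERE clause can be pushed down to the ORC reader so
// that stripes failing the min/max statistics are skipped. Values are made up.
spark.sql("""
  SELECT timecreated, security, price
  FROM marketData
  WHERE security = 'C4'
    AND price > 95
""").show()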

Cheers




Dr Mich Talebzadeh



LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 15 September 2016 at 15:46, Jeff Nadler <jn...@srcginc.com> wrote:

> Yes we do something very similar and it's working well:
>
> Kafka ->
> Spark Streaming (write temp files, serialized RDDs) ->
> Spark Batch Application (build partitioned Parquet files on HDFS; this is
> needed because building Parquet files of a reasonable size is too slow for
> streaming) ->
> query with SparkSQL
>
>
> On Thu, Sep 15, 2016 at 7:33 AM, Sean Owen <so...@cloudera.com> wrote:
>
>> If your core requirement is ad-hoc real-time queries over the data,
>> then the standard Hadoop-centric answer would be:
>>
>> Ingest via Kafka,
>> maybe using Flume, or possibly Spark Streaming, to read and land the
>> data, in...
>> Parquet on HDFS or possibly Kudu, and
>> Impala to query
>>
>> >> On 15 September 2016 at 09:35, Mich Talebzadeh <
>> mich.talebzadeh@gmail.com>
>> >> wrote:
>> >>>
>> >>> Hi,
>> >>>
>> >>> This is for fishing for some ideas.
>> >>>
>> >>> In the design we get prices directly through Kafka into Flume and
>> store
>> >>> it on HDFS as text files
>> >>> We can then use Spark with Zeppelin to present data to the users.
>> >>>
>> >>> This works. However, I am aware that once the volume of flat files
>> rises
>> >>> one needs to do housekeeping. You don't want to read all files every
>> time.
>> >>>
>> >>> A more viable alternative would be to read data into some form of
>> tables
>> >>> (Hive etc) periodically through an hourly cron set up so batch
>> process will
>> >>> have up to date and accurate data up to last hour.
>> >>>
>> >>> That certainly be an easier option for the users as well.
>> >>>
>> >>> I was wondering what would be the best strategy here. Druid, Hive
>> others?
>> >>>
>> >>> The business case here is that users may want to access older data so
>> a
>> >>> database of some sort will be a better solution? In all likelihood
>> they want
>> >>> a week's data.
>> >>>
>> >>> Thanks
>> >>>
>> >>> Dr Mich Talebzadeh
>> >>>
>> >>>
>> >>>
>> >>> LinkedIn
>> >>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJ
>> d6zP6AcPCCdOABUrV8Pw
>> >>>
>> >>>
>> >>>
>> >>> http://talebzadehmich.wordpress.com
>> >>>
>> >>>
>> >>> Disclaimer: Use it at your own risk. Any and all responsibility for
>> any
>> >>> loss, damage or destruction of data or any other property which may
>> arise
>> >>> from relying on this email's technical content is explicitly
>> disclaimed. The
>> >>> author will in no case be liable for any monetary damages arising
>> from such
>> >>> loss, damage or destruction.
>> >>>
>> >>>
>> >>
>> >>
>> >
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>>
>>
>

Re: Best way to present data collected by Flume through Spark

Posted by Jeff Nadler <jn...@srcginc.com>.
Yes we do something very similar and it's working well:

Kafka ->
Spark Streaming (write temp files, serialized RDDs) ->
Spark Batch Application (build partitioned Parquet files on HDFS; this is
needed because building Parquet files of a reasonable size is too slow for
streaming) ->
query with SparkSQL
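
For what it's worth, the middle batch step is conceptually along these lines
(a sketch only: it assumes the staged files are CSV text, whereas our real job
reads its own serialized format, and the paths and partition column are made
up):

import org.apache.spark.sql.functions.{col, substring}

// Sketch of the batch compaction step: read the small staged files and
// rewrite them as date-partitioned Parquet for SparkSQL. Paths and schema
// are assumptions.
val staged = spark.read
  .option("header", false)
  .csv("hdfs://namenode:9000/staging/prices/")
  .toDF("index", "timecreated", "security", "price")
  .withColumn("trade_date", substring(col("timecreated"), 1, 10))

staged.repartition(col("trade_date"))
  .write
  .mode("append")
  .partitionBy("trade_date")
  .parquet("hdfs://namenode:9000/warehouse/prices_parquet/")

Queries then just hit the Parquet path (or a table defined over it) with
SparkSQL.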


On Thu, Sep 15, 2016 at 7:33 AM, Sean Owen <so...@cloudera.com> wrote:

> If your core requirement is ad-hoc real-time queries over the data,
> then the standard Hadoop-centric answer would be:
>
> Ingest via Kafka,
> maybe using Flume, or possibly Spark Streaming, to read and land the data,
> in...
> Parquet on HDFS or possibly Kudu, and
> Impala to query
>
> >> On 15 September 2016 at 09:35, Mich Talebzadeh <
> mich.talebzadeh@gmail.com>
> >> wrote:
> >>>
> >>> Hi,
> >>>
> >>> This is for fishing for some ideas.
> >>>
> >>> In the design we get prices directly through Kafka into Flume and store
> >>> it on HDFS as text files
> >>> We can then use Spark with Zeppelin to present data to the users.
> >>>
> >>> This works. However, I am aware that once the volume of flat files
> rises
> >>> one needs to do housekeeping. You don't want to read all files every
> time.
> >>>
> >>> A more viable alternative would be to read data into some form of
> tables
> >>> (Hive etc) periodically through an hourly cron set up so batch process
> will
> >>> have up to date and accurate data up to last hour.
> >>>
> >>> That certainly be an easier option for the users as well.
> >>>
> >>> I was wondering what would be the best strategy here. Druid, Hive
> others?
> >>>
> >>> The business case here is that users may want to access older data so a
> >>> database of some sort will be a better solution? In all likelihood
> they want
> >>> a week's data.
> >>>
> >>> Thanks
> >>>
> >>> Dr Mich Talebzadeh
> >>>
> >>>
> >>>
> >>> LinkedIn
> >>> https://www.linkedin.com/profile/view?id=
> AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> >>>
> >>>
> >>>
> >>> http://talebzadehmich.wordpress.com
> >>>
> >>>
> >>> Disclaimer: Use it at your own risk. Any and all responsibility for any
> >>> loss, damage or destruction of data or any other property which may
> arise
> >>> from relying on this email's technical content is explicitly
> disclaimed. The
> >>> author will in no case be liable for any monetary damages arising from
> such
> >>> loss, damage or destruction.
> >>>
> >>>
> >>
> >>
> >
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>
>

Re: Best way to present data collected by Flume through Spark

Posted by Sean Owen <so...@cloudera.com>.
If your core requirement is ad-hoc real-time queries over the data,
then the standard Hadoop-centric answer would be:

Ingest via Kafka,
maybe using Flume, or possibly Spark Streaming, to read and land the data, in...
Parquet on HDFS or possibly Kudu, and
Impala to query

>> On 15 September 2016 at 09:35, Mich Talebzadeh <mi...@gmail.com>
>> wrote:
>>>
>>> Hi,
>>>
>>> This is for fishing for some ideas.
>>>
>>> In the design we get prices directly through Kafka into Flume and store
>>> it on HDFS as text files
>>> We can then use Spark with Zeppelin to present data to the users.
>>>
>>> This works. However, I am aware that once the volume of flat files rises
>>> one needs to do housekeeping. You don't want to read all files every time.
>>>
>>> A more viable alternative would be to read data into some form of tables
>>> (Hive etc) periodically through an hourly cron set up so batch process will
>>> have up to date and accurate data up to last hour.
>>>
>>> That certainly be an easier option for the users as well.
>>>
>>> I was wondering what would be the best strategy here. Druid, Hive others?
>>>
>>> The business case here is that users may want to access older data so a
>>> database of some sort will be a better solution? In all likelihood they want
>>> a week's data.
>>>
>>> Thanks
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn
>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> Disclaimer: Use it at your own risk. Any and all responsibility for any
>>> loss, damage or destruction of data or any other property which may arise
>>> from relying on this email's technical content is explicitly disclaimed. The
>>> author will in no case be liable for any monetary damages arising from such
>>> loss, damage or destruction.
>>>
>>>
>>
>>
>

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Re: Best way to present data collected by Flume through Spark

Posted by Sachin Janani <sj...@snappydata.io>.
Hi Mich,

I agree that the technology stack you describe is harder to manage because of
the number of different components involved (HDFS, Flume, Kafka etc.). One
solution would be a DB that can support mixed workloads (OLTP, OLAP, streaming
etc.), and I think snappydata <http://www.snappydata.io/> fits your problem
well. It is an open-source distributed in-memory data store with Spark as the
computational engine, and it supports real-time operational analytics,
delivering stream analytics, OLTP (online transaction processing) and OLAP
(online analytical processing) in a single integrated cluster. As it is
developed on top of Spark, your existing Spark code will work as is. Please
have a look:
http://www.snappydata.io/
http://snappydatainc.github.io/snappydata/


Thanks and Regards,
Sachin Janani

On Thu, Sep 15, 2016 at 7:16 PM, Mich Talebzadeh <mi...@gmail.com>
wrote:

> any ideas on this?
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 15 September 2016 at 09:35, Mich Talebzadeh <mi...@gmail.com>
> wrote:
>
>> Hi,
>>
>> This is for fishing for some ideas.
>>
>> In the design we get prices directly through Kafka into Flume and store
>> it on HDFS as text files
>> We can then use Spark with Zeppelin to present data to the users.
>>
>> This works. However, I am aware that once the volume of flat files rises
>> one needs to do housekeeping. You don't want to read all files every time.
>>
>> A more viable alternative would be to read data into some form of tables
>> (Hive etc) periodically through an hourly cron set up so batch process will
>> have up to date and accurate data up to last hour.
>>
>> That certainly be an easier option for the users as well.
>>
>> I was wondering what would be the best strategy here. Druid, Hive others?
>>
>> The business case here is that users may want to access older data so a
>> database of some sort will be a better solution? In all likelihood they
>> want a week's data.
>>
>> Thanks
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>
>

Re: Best way to present data collected by Flume through Spark

Posted by Mich Talebzadeh <mi...@gmail.com>.
Any ideas on this?

Dr Mich Talebzadeh



LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 15 September 2016 at 09:35, Mich Talebzadeh <mi...@gmail.com>
wrote:

> Hi,
>
> This is for fishing for some ideas.
>
> In the design we get prices directly through Kafka into Flume and store
> it on HDFS as text files
> We can then use Spark with Zeppelin to present data to the users.
>
> This works. However, I am aware that once the volume of flat files rises
> one needs to do housekeeping. You don't want to read all files every time.
>
> A more viable alternative would be to read data into some form of tables
> (Hive etc) periodically through an hourly cron set up so batch process will
> have up to date and accurate data up to last hour.
>
> That certainly be an easier option for the users as well.
>
> I was wondering what would be the best strategy here. Druid, Hive others?
>
> The business case here is that users may want to access older data so a
> database of some sort will be a better solution? In all likelihood they
> want a week's data.
>
> Thanks
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>