Posted to user@cassandra.apache.org by Renato Perini <re...@gmail.com> on 2015/07/23 02:04:56 UTC

Cassandra - Spark - Flume: best architecture for log analytics.

Problem: Log analytics.

Solutions:
        1) Aggregating logs using Flume and storing the aggregations 
into Cassandra. Spark reads data from Cassandra, makes some computations 
and writes the results to distinct tables, still in Cassandra (sketched 
just below).
        2) Aggregating logs using Flume to a sink, streaming data 
directly into Spark. Spark makes some computations and stores the results 
in Cassandra.
        3) *** your solution ***
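
For reference, here is a minimal sketch of what option 1 could look like
with the open-source spark-cassandra-connector (Spark 1.x era). The
keyspace, table and column names ("logs", "raw_events", "events_by_hour",
"hour", "level", "total") are made up for illustration:

    import com.datastax.spark.connector._
    import org.apache.spark.{SparkConf, SparkContext}

    object BatchRollup {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("log-rollup")
          .set("spark.cassandra.connection.host", "127.0.0.1") // a C* contact point
        val sc = new SparkContext(conf)

        // Read the raw aggregates Flume stored, count events per (hour, level)...
        sc.cassandraTable("logs", "raw_events")
          .map(row => ((row.getString("hour"), row.getString("level")), 1L))
          .reduceByKey(_ + _)
          .map { case ((hour, level), total) => (hour, level, total) }
          // ...and write the results to a distinct table, still in Cassandra.
          .saveToCassandra("logs", "events_by_hour",
            SomeColumns("hour", "level", "total"))

        sc.stop()
      }
    }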

Which is the best workflow for this task?
I would like to set up something flexible enough to allow me to use batch 
processing and real-time streaming without major fuss.

Thank you in advance.




Re: Cassandra - Spark - Flume: best architecture for log analytics.

Posted by Edward Ribeiro <ed...@gmail.com>.
Disclaimer: I have worked for DataStax.

Cassandra is fairly good for log analytics and has been used in many places
for that (
https://www.usenix.org/conference/lisa14/conference-program/presentation/josephsen
). Of course, requirements vary from place to place, but it has been a good
fit. Spark and Cassandra have very nice integration, so a Spark worker will
usually read C* rows from a local node instead of bulk loading from remote
nodes, for example (see: https://www.youtube.com/watch?v=_gFgU3phogQ ).
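
As a small illustration of that locality, assuming a SparkContext sc built
with the connector on the classpath and spark.cassandra.connection.host set,
and a hypothetical table logs.raw_events:

    import com.datastax.spark.connector._

    val rdd = sc.cassandraTable("logs", "raw_events")
    // The connector splits the scan along Cassandra token ranges and sets
    // each partition's preferred location, so a Spark worker co-located
    // with a C* node reads from the local replica instead of pulling the
    // data across the network.
    println(rdd.partitions.length) // number of token-range splits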

A third solution, as you asked for, would be:

3) Aggregating logs using Flume and sending the aggregations to one or more
topics on Kafka. Have Spark workers read from the topics, make some
computations and write the results to distinct tables in Cassandra. (see
https://www.youtube.com/watch?v=GBOk7vh8OgU and
http://blog.sematext.com/2015/04/22/monitoring-stream-processing-tools-cassandra-kafka-and-spark/
 )
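
A rough sketch of 3), assuming Spark Streaming's direct Kafka API (Spark
1.x), a hypothetical topic "flume-logs", and made-up keyspace/table names;
a real table would likely include a time bucket in the primary key so
per-batch counts don't overwrite each other:

    import kafka.serializer.StringDecoder
    import com.datastax.spark.connector._
    import com.datastax.spark.connector.streaming._
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    object KafkaLogStream {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("kafka-log-stream")
          .set("spark.cassandra.connection.host", "127.0.0.1")
        val ssc = new StreamingContext(conf, Seconds(10))

        // Receiverless (direct) stream over the hypothetical "flume-logs" topic.
        val lines = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
            ssc,
            Map("metadata.broker.list" -> "localhost:9092"),
            Set("flume-logs"))
          .map(_._2) // keep only the message body

        // Toy computation: count lines per log level in each 10s batch,
        // assuming the level is the first whitespace-separated token.
        lines.map(line => (line.split("\\s+")(0), 1L))
          .reduceByKey(_ + _)
          .saveToCassandra("logs", "level_counts", SomeColumns("level", "total"))

        ssc.start()
        ssc.awaitTermination()
      }
    }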

In fact, I guess 1) and 3) are good candidates for an architecture, so try
them out and see which fits best.

Regards,
Ed


Re: Cassandra - Spark - Flume: best architecture for log analytics.

Posted by Ipremyadav <ip...@gmail.com>.
Though DSE Cassandra comes with Hadoop integration, this is clearly a use case for Hadoop. 
Any reason why Cassandra is your first choice?




Re: Cassandra - Spark - Flume: best architecture for log analytics.

Posted by Pierre Devops <pi...@gmail.com>.
Cassandra is not very good at massive/bulk reads if you need to
retrieve and compute over a large amount of data on multiple machines using
something like Spark or Hadoop (otherwise you'll have to hack your way in
and process the SSTables directly, something which is not "natively"
supported).

However, it's very good for storing and retrieving data once it has been
processed and sorted. That's why I would opt for solution 2), or for another
solution which processes the data before inserting it into Cassandra and
doesn't use Cassandra as a temporary store.
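
A minimal sketch of solution 2) along those lines, using Spark Streaming's
pull-based Flume integration (Spark 1.x); the sink host/port, keyspace and
table names are made up, and the "level is the first token" parsing is just
a stand-in for real processing:

    import com.datastax.spark.connector._
    import com.datastax.spark.connector.streaming._
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.flume.FlumeUtils

    object FlumeLogStream {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("flume-log-stream")
          .set("spark.cassandra.connection.host", "127.0.0.1")
        val ssc = new StreamingContext(conf, Seconds(10))

        // Pull events from a Flume agent configured with the Spark sink.
        val events = FlumeUtils.createPollingStream(ssc, "flume-agent", 9999)

        events.map(e => new String(e.event.getBody.array())) // event body as a log line
          .map(line => (line.split("\\s+")(0), 1L))          // assume level is the first token
          .reduceByKey(_ + _)
          .saveToCassandra("logs", "level_counts", SomeColumns("level", "total"))

        ssc.start()
        ssc.awaitTermination()
      }
    }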
