Posted to user@cassandra.apache.org by Paulo Gabriel Poiati <pa...@gmail.com> on 2010/05/11 20:52:53 UTC

Real-time Web Analysis tool using Cassandra. Doubts...

Hi all.

I'm thinking about implementing a real-time web analysis (WA) tool using
Cassandra as my storage, but I have some questions first.

I'm considering Cassandra because of its excellent write performance,
horizontal scalability and its tunable consistency level.

- First of all, my initial idea is to have two CFs: one for raw client
requests (~10 million+ per day) and another for metrics aggregated over
fixed time intervals such as 1min, 5min, 15min... Is this a good approach?

- Is it a good idea to use an OrderPreservingPartitioner to maintain the
order of my requests in the raw data CF? Or is the overhead too big?

- Initially the cluster will contain only three nodes. Is that a problem
(too few, maybe)?

- I think the best way to do the aggregation is through a Hadoop
MapReduce job. Right? Is there any other way to consider?

- Is Cassandra really suitable for this? Maybe HBase is better in this case?

If there is anything else you want to make me aware of, please do.

Thanks,
Paulo Poiati.

Re: Real-time Web Analysis tool using Cassandra. Doubts...

Posted by Utku Can Topçu <ut...@topcu.gen.tr>.
What makes Cassandra a poor choice is the fact that you can't use a
key range as input for the map phase in Hadoop.


On Wed, May 12, 2010 at 4:37 PM, Jonathan Ellis <jb...@gmail.com> wrote:

> On Tue, May 11, 2010 at 1:52 PM, Paulo Gabriel Poiati
> <pa...@gmail.com> wrote:
> > - First of all, my initial idea is to have two CFs: one for raw client
> > requests (~10 million+ per day) and another for metrics aggregated over
> > fixed time intervals such as 1min, 5min, 15min... Is this a good approach?
>
> Sure.
>
> > - Is it a good idea to use an OrderPreservingPartitioner to maintain the
> > order of my requests in the raw data CF? Or is the overhead too big?
>
> The problem with OPP isn't overhead (it is lower-overhead than RP) but
> the tendency to have hotspots in sequentially-written data.
>
> > - Initially the cluster will contain only three nodes. Is that a problem
> > (too few, maybe)?
>
> You'll have to do some load testing to see.
>
> > - I think the best way to do the aggregation is through a Hadoop
> > MapReduce job. Right? Is there any other way to consider?
>
> Map/Reduce is usually better than rolling your own because it
> parallelizes for you.
>
> > - Is Cassandra really suitable for this? Maybe HBase is better in this
> > case?
>
> Nothing here makes me think "Cassandra is a poor choice."
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of Riptano, the source for professional Cassandra support
> http://riptano.com
>

Re: Real-time Web Analysis tool using Cassandra. Doubts...

Posted by Jonathan Ellis <jb...@gmail.com>.
On Tue, May 11, 2010 at 1:52 PM, Paulo Gabriel Poiati
<pa...@gmail.com> wrote:
> - First of all, my initial idea is to have two CFs: one for raw client
> requests (~10 million+ per day) and another for metrics aggregated over
> fixed time intervals such as 1min, 5min, 15min... Is this a good approach?

Sure.
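
To make the two-CF idea concrete, here is a minimal sketch of how raw
request timestamps could be rolled up into the 1min, 5min and 15min
buckets mentioned in the question. It is plain Python with a Counter
standing in for the aggregated CF; the row key layout
"metric:interval:bucket_start" and all function names are assumptions
made for illustration, not something proposed in this thread.

```python
from collections import Counter

# Interval lengths in seconds for the 1min, 5min and 15min roll-ups.
INTERVALS = {"1min": 60, "5min": 300, "15min": 900}


def bucket_start(ts, interval_seconds):
    """Floor a Unix timestamp (in seconds) to the start of its interval."""
    return ts - (ts % interval_seconds)


def aggregate_key(metric, interval_name, ts):
    """Hypothetical row key for the aggregated CF: metric:interval:bucket_start."""
    start = bucket_start(ts, INTERVALS[interval_name])
    return "%s:%s:%d" % (metric, interval_name, start)


def roll_up(raw_timestamps, metric="hits"):
    """Count raw requests per (interval, bucket); the Counter stands in for the CF."""
    counts = Counter()
    for ts in raw_timestamps:
        for name in INTERVALS:
            counts[aggregate_key(metric, name, ts)] += 1
    return counts


counts = roll_up([0, 59, 60, 299, 300, 900])
```

Each raw request contributes one increment per interval, so the
aggregated CF stays small relative to the raw one and a dashboard can
read a single row per bucket.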

> - Is it a good idea to use an OrderPreservingPartitioner to maintain the
> order of my requests in the raw data CF? Or is the overhead too big?

The problem with OPP isn't overhead (it is lower-overhead than RP) but
the tendency to have hotspots in sequentially-written data.
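
The hotspot effect can be illustrated with a toy ring, sketched below
under stated assumptions: three hypothetical nodes, made-up string
tokens for the order-preserving case, and an MD5 hash standing in for
what a hashing partitioner does. This is not Cassandra code, only a
simulation of where sequential timestamp keys would land.

```python
import hashlib
from bisect import bisect_left
from collections import Counter

# Hypothetical 3-node ring. Each node owns the keys up to and including its token.
OPP_TOKENS = ["2010-04", "2010-08", "2010-12"]                # made-up string tokens
HASH_TOKENS = [(2 ** 128 // 3) * (i + 1) for i in range(3)]   # evenly spaced integers


def owner(token, ring):
    """Index of the first node whose token is >= the given token, wrapping around."""
    return bisect_left(ring, token) % len(ring)


def opp_owner(key):
    """Order-preserving placement: the key itself is compared against the tokens."""
    return owner(key, OPP_TOKENS)


def hashed_owner(key):
    """Hashed placement: the MD5 digest of the key is compared against the tokens."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return owner(int(digest, 16), HASH_TOKENS)


def load_per_node(keys, place):
    """Count how many keys each node would receive under a placement function."""
    return Counter(place(k) for k in keys)


# One minute of sequentially written, timestamp-keyed requests.
keys = ["2010-05-11T20:52:%02d" % s for s in range(60)]
opp_load = load_per_node(keys, opp_owner)
hashed_load = load_per_node(keys, hashed_owner)
```

With the order-preserving placement every key in that minute falls into
the same node's range, while the hashed placement scatters them around
the ring, at the cost of losing key order.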

> - Initially the cluster will contain only three nodes. Is that a problem
> (too few, maybe)?

You'll have to do some load testing to see.

> - I think the best way to do the aggregation is through a Hadoop
> MapReduce job. Right? Is there any other way to consider?

Map/Reduce is usually better than rolling your own because it
parallelizes for you.
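
As a rough illustration of why the map/reduce shape parallelizes well,
here is a single-process sketch of the three phases for a per-minute
page count. It does not use Hadoop; the (timestamp, page) record layout
and the function names are assumptions for this example. The point is
that each map call looks at one record and each reduce call at one key,
so a framework is free to spread both across machines.

```python
from collections import defaultdict
from itertools import chain


def map_request(request):
    """Map phase: emit ((page, minute_bucket), 1) for one raw request."""
    ts, page = request
    yield (page, ts - ts % 60), 1


def shuffle(pairs):
    """Group emitted values by key, as the framework would between the phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key, values):
    """Reduce phase: sum the counts emitted for one key."""
    return key, sum(values)


def run_job(requests):
    """Run map, shuffle and reduce in one process; a framework distributes these."""
    mapped = chain.from_iterable(map_request(r) for r in requests)
    return dict(reduce_counts(k, v) for k, v in shuffle(mapped).items())


# Hypothetical raw requests as (unix_timestamp, page) pairs.
result = run_job([(0, "/a"), (30, "/a"), (61, "/a"), (10, "/b")])
```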

> - Is Cassandra really suitable for this? Maybe HBase is better in this case?

Nothing here makes me think "Cassandra is a poor choice."

-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com