Posted to user@cassandra.apache.org by Oleg Dulin <ol...@gmail.com> on 2012/08/21 22:08:18 UTC

Data aggregation -- help me design a solution

Here are my requirements.

We use Cassandra.

I get millions of invoice line items into the system. As I load them I 
need to build up some data structures.

* Invoice line items by invoice id (each line item has an invoice id on
it), with total dollar value
* Invoice line items by customer id, with total dollar value
* Invoice line items by territory, with total dollar value

In all of those cases, what we want is to see the total by a given 
attribute, that's all there is to it.

Line items may change daily, e.g. a territory may change or the values
may be corrected. In this case I need to update the aggregations
accordingly.

Here are my ideas:

- I can use counters and store the data in buckets
- I can just store the data in buckets and do the math in Java

In both cases the challenge is that the items can be updated, which
means I need to look up the current version of an item and decide how to
proceed. That puts a huge performance penalty on the application (the
number of line items we receive is in the millions and we need to
process them in a timely fashion).
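
To make the update problem concrete, here is a minimal in-memory sketch of the bookkeeping involved: when a line item arrives again, its previous contribution is subtracted from every aggregate it touched before the new one is added. The class, field, and method names are hypothetical, amounts are assumed to be stored as cents, and plain Java maps stand in for whatever Cassandra storage is eventually chosen.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: Java maps stand in for the real storage.
class LineItemAggregator {
    /** Hypothetical line item; the amount is in cents to avoid floating-point drift. */
    static final class LineItem {
        final String id, invoiceId, customerId, territory;
        final long amountCents;

        LineItem(String id, String invoiceId, String customerId,
                 String territory, long amountCents) {
            this.id = id;
            this.invoiceId = invoiceId;
            this.customerId = customerId;
            this.territory = territory;
            this.amountCents = amountCents;
        }
    }

    private final Map<String, LineItem> current = new HashMap<>();
    private final Map<String, Long> byInvoice = new HashMap<>();
    private final Map<String, Long> byCustomer = new HashMap<>();
    private final Map<String, Long> byTerritory = new HashMap<>();

    /** Applies a new or updated line item to all three aggregates. */
    void apply(LineItem item) {
        // This put() is the lookup of the current version the text worries about.
        LineItem previous = current.put(item.id, item);
        if (previous != null) {
            // Undo the old contribution under the OLD keys (the territory may have changed).
            adjust(previous, -previous.amountCents);
        }
        adjust(item, item.amountCents);
    }

    private void adjust(LineItem item, long delta) {
        byInvoice.merge(item.invoiceId, delta, Long::sum);
        byCustomer.merge(item.customerId, delta, Long::sum);
        byTerritory.merge(item.territory, delta, Long::sum);
    }

    long totalByInvoice(String invoiceId) { return byInvoice.getOrDefault(invoiceId, 0L); }
    long totalByCustomer(String customerId) { return byCustomer.getOrDefault(customerId, 0L); }
    long totalByTerritory(String territory) { return byTerritory.getOrDefault(territory, 0L); }
}
```

For example, loading li-1 ($10.00, east) and li-2 ($5.00, east) and then re-sending li-1 as $12.00 in west leaves east at $5.00 and west at $12.00. The previous version has to be available to do this, which is exactly the lookup cost in question.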

Help me out here -- any ideas on how I could design this in Cassandra?


Regards,
Oleg



Re: Data aggregation -- help me design a solution

Posted by Milind Parikh <mi...@gmail.com>.
Assuming that:

1. the majority of the line items are new,

2. the lookup of an existing line item will dictate the performance of the
system, because reads are slower than writes in C*, and

3. you are using counters in C*,

you can eliminate that problem by implementing a Bloom filter or a similar
structure (e.g. a stable Bloom filter) to figure out whether you actually
need to go to C* at all to READ an existing line item.

IF YOU NEED TO GO TO C* FOR READS, handle that event (receiving a line item
that already exists) in a separate set of threads, DECRing the chosen
counters by the previous value of the invoice line item.
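
A minimal sketch of the filter idea, assuming line items are identified by a string id. This is a plain Bloom filter built on java.util.BitSet, not the stable variant mentioned above, and the class name, method name, and sizing parameters are all hypothetical. A "false" answer means the id has definitely not been seen by this filter, so the read can be skipped; a "true" answer means it may have been seen, so the read is still required.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

// Hypothetical sketch of a plain (not stable) Bloom filter over line-item ids.
class SeenLineItems {
    private final BitSet bits;
    private final int size;       // number of bits in the filter
    private final int hashCount;  // number of probes per id

    SeenLineItems(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    /**
     * Records the id and reports whether it MIGHT have been recorded before.
     * false = definitely new to this filter; true = possibly seen (may be a false positive).
     */
    synchronized boolean markAndCheck(String lineItemId) {
        // Double hashing: derive all probe positions from two base hashes.
        long h1 = fnv1a(lineItemId, 0xcbf29ce484222325L);
        long h2 = fnv1a(lineItemId, 0x84222325cbf29ceL) | 1L; // keep the step odd
        boolean maybeSeen = true;
        for (int i = 0; i < hashCount; i++) {
            int index = (int) Math.floorMod(h1 + i * h2, (long) size);
            if (!bits.get(index)) {
                maybeSeen = false;
                bits.set(index);
            }
        }
        return maybeSeen;
    }

    // 64-bit FNV-1a style hash with a caller-supplied starting value.
    private static long fnv1a(String s, long seed) {
        long hash = seed;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            hash ^= (b & 0xff);
            hash *= 0x100000001b3L;
        }
        return hash;
    }
}
```

One caveat with this sketch: the filter lives in process memory, so after a restart it would have to be rebuilt from the stored line items. Until then it would report existing items as new, and their old values would never be decremented.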


HTH
Regards
Milind




Re: Data aggregation -- help me design a solution

Posted by Guillermo Winkler <gw...@inconcertcc.com>.
Oleg,

If you have the aggregates in counters you only need to read the current
counter when adding/removing invoice lines.

In this situation you only need to make sure that this sequence:

+ Read current counter value
+ Update current value according to newly created/updated lines

is done safely, so that concurrent updates don't mess up the current
counter.

Assuming you don't need to have the counters updated in "real time" you can
also batch the counter update in Java/Redis/Whatever and do the updates in
C* less often.
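
A rough sketch of the batching idea, assuming the aggregates are addressed by a string key such as "territory:east". Deltas are accumulated in memory and handed to the store once per flush. The class name and the CounterSink interface are hypothetical stand-ins for whatever client code would actually issue the counter updates to C*.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: accumulate counter deltas in memory, write them out in batches.
class CounterBatcher {
    /** Hypothetical sink; in practice this would send one counter increment to C*. */
    interface CounterSink {
        void increment(String counterKey, long delta);
    }

    private Map<String, Long> pending = new HashMap<>();

    /** Adds a positive or negative delta for one aggregate. */
    synchronized void add(String counterKey, long delta) {
        pending.merge(counterKey, delta, Long::sum);
    }

    /** Swaps out the pending map and writes one net delta per key. */
    void flush(CounterSink sink) {
        Map<String, Long> batch;
        synchronized (this) {
            batch = pending;
            pending = new HashMap<>();
        }
        for (Map.Entry<String, Long> entry : batch.entrySet()) {
            if (entry.getValue() != 0L) { // skip keys whose changes cancelled out
                sink.increment(entry.getKey(), entry.getValue());
            }
        }
    }
}
```

Calling flush() from a scheduled task every few seconds would turn many small updates into one write per key. The sketch does not handle a sink failure in the middle of a flush; since counter increments are not idempotent, blindly retrying a partially written batch could double-count.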

Best,
Guille
