Posted to user@cassandra.apache.org by Kasper Petersen <ka...@sybogames.com> on 2014/04/01 11:51:11 UTC

Finding cut-off points

Hi,

I have a large number (possibly >100 million) of (id uuid, score int) entries
in Cassandra. At regular intervals of, let's say, 30-60 minutes, I need to
find the cut-off scores needed to be in the top 0.1%, 33%, and 66% of all
scores.

What would be a good approach to this problem?

All the data won't fit into memory, so regular in-memory sorting on the
application side isn't possible (unless I do an external merge sort with
files on disk, which feels like a bad solution).

Iterating over the data once and building a histogram would cut the
required memory usage significantly, but I'm afraid it could still end up
being "too big". Are there any easier ways to do these computations?
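
For concreteness, the histogram pass I have in mind would look roughly like
the sketch below. The assumptions are mine and not verified: the DataStax
Python driver, a hypothetical game.scores table, and non-negative scores
with a known upper bound; a real job would normally split the scan across
token ranges rather than issuing one full-table SELECT.

# Minimal sketch of the single-pass histogram idea. Memory use is one
# counter per bin; scores are assumed non-negative and <= MAX_SCORE.
from cassandra.cluster import Cluster

NUM_BINS = 100000
MAX_SCORE = 1000000   # assumed upper bound on score

def build_histogram(session):
    hist = [0] * NUM_BINS
    total = 0
    # The driver pages through the result set lazily; a production job
    # would usually split this scan across token ranges instead.
    for row in session.execute("SELECT score FROM game.scores"):
        b = min(row.score * NUM_BINS // (MAX_SCORE + 1), NUM_BINS - 1)
        hist[b] += 1
        total += 1
    return hist, total

def cutoff_for_top(hist, total, fraction):
    """Smallest score (to within one bin width) such that at least
    `fraction` of all entries score at or above it."""
    needed = fraction * total
    seen = 0
    for b in range(NUM_BINS - 1, -1, -1):   # walk from the highest bin down
        seen += hist[b]
        if seen >= needed:
            return b * (MAX_SCORE + 1) // NUM_BINS   # bin's lower bound
    return 0

if __name__ == "__main__":
    session = Cluster(["127.0.0.1"]).connect()
    hist, total = build_histogram(session)
    for frac in (0.001, 0.33, 0.66):
        print("top", frac, "cut-off ~", cutoff_for_top(hist, total, frac))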

Lastly, I've thought about using analytics tools to compute these things
for me: would setting up Hadoop and/or Pig help me do this in a way that
makes the results accessible to the application servers once done? I've had
a hard time finding guides on how to set them up and what exactly I'd be
able to do with them afterwards. Any pointers would be much appreciated.
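
Whatever ends up computing the cut-offs, my current thinking for making
them accessible is to have the job overwrite a tiny Cassandra table that
the application servers read directly. A sketch, again using the Python
driver; the score_cutoffs table and the numbers in it are made up:

# Hypothetical result table the periodic job overwrites, e.g.:
#   CREATE TABLE game.score_cutoffs (
#       name text PRIMARY KEY,   -- 'top_0_1', 'top_33', 'top_66'
#       score int,
#       updated_at timestamp
#   );
from datetime import datetime, timezone
from cassandra.cluster import Cluster

def publish_cutoffs(session, cutoffs):
    # cutoffs is a dict such as {'top_0_1': 9875, 'top_33': 4210,
    # 'top_66': 1830} (placeholder numbers, purely for illustration).
    insert = session.prepare(
        "INSERT INTO game.score_cutoffs (name, score, updated_at) "
        "VALUES (?, ?, ?)")
    now = datetime.now(timezone.utc)
    for name, score in cutoffs.items():
        session.execute(insert, (name, score, now))

if __name__ == "__main__":
    session = Cluster(["127.0.0.1"]).connect()
    publish_cutoffs(session, {"top_0_1": 9875, "top_33": 4210,
                              "top_66": 1830})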


Best regards,
Kasper

Re: Finding cut-off points

Posted by Steven A Robenalt <sr...@stanford.edu>.
Hi Kasper,

I'd suggest taking a look at Spark, Storm, or Samza (all Apache projects)
as a possible approach. Depending on your needs and your existing
infrastructure, one of them may work better than the others for you.
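
As a rough illustration of the Spark route, with a recent Spark and the
spark-cassandra-connector on the classpath (both assumptions on my part,
along with the game.scores naming from your mail; I haven't run this
against your schema), the whole computation can come down to a handful of
lines:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("score-cutoffs")
         .config("spark.cassandra.connection.host", "127.0.0.1")
         .getOrCreate())

scores = (spark.read
          .format("org.apache.spark.sql.cassandra")
          .options(keyspace="game", table="scores")
          .load())

# Top 0.1% / 33% / 66% cut-offs correspond to the 99.9th / 67th / 34th
# percentiles of the score distribution. approxQuantile trades a small,
# configurable error (here 0.1%) for a single distributed pass.
p34, p67, p999 = scores.approxQuantile("score", [0.34, 0.67, 0.999], 0.001)
print("top 66% >=", p34, "top 33% >=", p67, "top 0.1% >=", p999)

spark.stop()

Storm and Samza are stream processors, so they'd be a better fit if you
ever want the cut-offs maintained continuously as scores arrive rather than
recomputed every 30-60 minutes.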

Steve

-- 
Steve Robenalt
Software Architect
HighWire | Stanford University
425 Broadway St, Redwood City, CA 94063

srobenal@stanford.edu
http://highwire.stanford.edu