Posted to dev@samza.apache.org by Uwe Dauernheim <uw...@dauernheim.net> on 2015/02/17 20:35:23 UTC

Modeling charts

I am trying to model a music charts system to get familiar with Samza.
A chart is the top N entries, ranked by count, of a map from unique
track ID (basically a song) to a counter (basically the number of plays
of that track) during a sliding time window.

The problem I see is the ever-growing size of this map, as the ID
space of tracks can be quite large (let's pick 2E6). Not all of these
IDs will be played (and thus need counting) within a given time window
(let's pick 1 hour), but it's not obvious to me when to prune the map
as the time window slides.

I assume dealing with sliding time windows is a common case in stream
processing, so Samza presumably provides some useful API for it. Does an
example or tutorial for this kind of sliding time-window counting exist?
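
Independent of any Samza API, the pruning question above can be made
concrete with a small sketch. It assumes plays arrive with non-decreasing
timestamps and that a window accurate to one bucket (here one minute) is
good enough. The class name SlidingPlayCounts and its methods are
hypothetical names chosen for illustration, not part of Samza. Counts are
kept per time bucket, and a bucket is dropped as soon as it falls out of
the window, so the map only ever holds tracks played within the window.

```python
from collections import Counter, deque


class SlidingPlayCounts:
    """Per-track play counts over a sliding window, pruned bucket by bucket."""

    def __init__(self, window_seconds=3600, bucket_seconds=60):
        self.bucket_seconds = bucket_seconds
        self.num_buckets = window_seconds // bucket_seconds
        self.buckets = deque()    # (bucket_index, Counter), oldest first
        self.totals = Counter()   # running totals across all live buckets

    def _expire(self, current_bucket):
        # Drop every bucket that has slid out of the window and subtract
        # its counts; tracks whose total reaches zero leave the map.
        oldest_allowed = current_bucket - self.num_buckets + 1
        while self.buckets and self.buckets[0][0] < oldest_allowed:
            _, expired = self.buckets.popleft()
            self.totals.subtract(expired)
            for track_id in expired:
                if self.totals[track_id] <= 0:
                    del self.totals[track_id]

    def record(self, track_id, timestamp):
        # Assumes timestamps (in seconds) never go backwards.
        bucket = int(timestamp // self.bucket_seconds)
        self._expire(bucket)
        if not self.buckets or self.buckets[-1][0] != bucket:
            self.buckets.append((bucket, Counter()))
        self.buckets[-1][1][track_id] += 1
        self.totals[track_id] += 1

    def top(self, n, now):
        self._expire(int(now // self.bucket_seconds))
        return self.totals.most_common(n)


# Small example with a 2-minute window and 1-minute buckets.
counts = SlidingPlayCounts(window_seconds=120, bucket_seconds=60)
counts.record("a", 0)
counts.record("a", 30)
counts.record("b", 70)
top_early = counts.top(2, now=100)  # both buckets still inside the window
top_late = counts.top(2, now=130)   # the first bucket has expired
```

The trade-off in this sketch is memory against precision: smaller buckets
track the window edge more closely but mean more counters to keep around.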

Re: Modeling charts

Posted by Uwe Dauernheim <uw...@dauernheim.net>.
Thanks Fang, I will do some reading up.
/Uwe



Re: Modeling charts

Posted by Yan Fang <ya...@gmail.com>.
Hi Uwe,

Your use case looks to me more like a state-management case. What comes
to mind is:
1) Every time a song is played, you update the count for that song. You do
not keep the map in memory since, as you said, it could grow quite large.
Instead, you use Samza's built-in key-value storage. (You do all this in
the process method.)

2) You scan the whole key-value DB every, say, one hour. (You do all this
in the window method.)

* This could also provide better fault tolerance. For example, if your
machine goes down during the hour, you will not lose any counts, because
the key-value DB is restored.
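
The two steps above can be sketched roughly as follows. This is a plain
Python illustration, not Samza code: a dict stands in for the key-value
store, and ChartTask, process and window are hypothetical names that only
mirror the roles of the process and window methods. Clearing the store
after each scan is an added assumption, made so that counts restart every
period and the store stays bounded.

```python
import heapq


class ChartTask:
    """Counts plays per track and produces the top N on each window tick."""

    def __init__(self, top_n, store=None):
        self.top_n = top_n
        # Stand-in for a persistent key-value store (assumption: a dict).
        self.store = {} if store is None else store

    def process(self, track_id):
        # Called once per play event: bump that track's counter in the store.
        self.store[track_id] = self.store.get(track_id, 0) + 1

    def window(self):
        # Called periodically (e.g. hourly): scan every entry, keep the top N.
        chart = heapq.nlargest(
            self.top_n, self.store.items(), key=lambda kv: kv[1]
        )
        # Assumption: reset the counts so the next period starts from zero.
        self.store.clear()
        return chart


task = ChartTask(top_n=2)
for track in ["a", "b", "a", "c", "a", "b"]:
    task.process(track)
chart = task.window()
```

Note that resetting on every tick gives a chart per fixed period rather
than a truly sliding window.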

Some relevant links:
*
http://samza.apache.org/learn/documentation/0.8/container/state-management.html#windowed-aggregation
*
http://samza.apache.org/learn/documentation/0.8/container/state-management.html#approaches-to-managing-task-state
*
http://samza.apache.org/learn/documentation/0.8/container/state-management.html#key-value-storage

Hope this helps.

Cheers,

Fang, Yan
yanfang724@gmail.com
+1 (206) 849-4108
