Posted to user@storm.apache.org by Xavier Daull <xa...@gmail.com> on 2014/04/24 19:40:08 UTC

time aggregated data and shared structures efficiency

I have already developed a Python script (not using Storm) which transforms
a stream of millions of historical prices for different items (provided in
one common CSV) and outputs a dedicated stream for each item with enriched
data in real time. This script aggregates, in real time, the latest item
price with past data to compute moving averages and slopes over different
timeframes (month/week/day/hour), and adds to it the latest data from the
nearest items (neighbours). The goal is to feed models for price
prediction. To manage the time-aggregated data and nearest-neighbour data,
I use a shared buffer of the recent data needed for aggregation, the latest
computed data for each item, and some shared timestamp indexes.
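For concreteness, here is a minimal sketch (not my actual script, and the class name is made up for illustration) of the kind of delta-update I mean: keep a running window sum and adjust it as prices enter and leave, so the window is never re-summed.

```python
from collections import deque

class RollingMean:
    """Moving average over a fixed-size window, updated by delta:
    add the incoming price, subtract the evicted one, so the full
    window is never recomputed."""

    def __init__(self, size):
        self.size = size
        self.window = deque()
        self.total = 0.0

    def update(self, price):
        self.window.append(price)
        self.total += price
        if len(self.window) > self.size:
            self.total -= self.window.popleft()  # evict the oldest price
        return self.total / len(self.window)

rm = RollingMean(3)
means = [rm.update(p) for p in [10.0, 20.0, 30.0, 40.0]]
# e.g. after the fourth price the window holds (20, 30, 40), mean 30.0
```

The same structure, with one instance per timeframe (hour/day/week/month), is what makes the script cheap per tick.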

I am wondering whether I would really benefit from moving this script to
Storm, and how.

My first understanding of Storm is that I should:
- create a dedicated spout class to fetch the price data.
- create a dedicated bolt class to aggregate the data (moving averages /
slopes / cross-aggregated data between items).

Where should I put the shared buffers and data required to efficiently
compute my time-aggregated data and nearest-neighbour data?
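To make the question concrete: my understanding is that Storm can route tuples to bolt instances by a grouping on some field, so per-item state could live inside each bolt instance rather than in one global structure. A rough plain-Python stand-in for that idea (hypothetical names, not the real Storm API; Storm itself would do the routing via a fields grouping):

```python
class ItemAggregatorBolt:
    """Plain-Python stand-in for a bolt: each instance keeps state only
    for the items routed to it, the way a fields grouping on item_id
    would partition the stream in Storm."""

    def __init__(self):
        self.last_price = {}  # per-item state local to this instance

    def process(self, item_id, price):
        prev = self.last_price.get(item_id, price)
        self.last_price[item_id] = price
        return price - prev  # e.g. price delta since the previous tuple

# deterministic hash-partition across two instances, mimicking a grouping
bolts = [ItemAggregatorBolt(), ItemAggregatorBolt()]
for item_id, price in [("a", 10.0), ("b", 5.0), ("a", 12.0)]:
    bolts[ord(item_id[0]) % 2].process(item_id, price)
```

Is this the intended pattern, i.e. each bolt task owns the state for its share of the items, so no structure is shared across tasks?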

Will the topology hurt performance compared to my in-memory data
management? My current script, even though it is in Python, benefits
greatly from efficient buffered computation (no recomputation, delta
averages...), minimal data manipulation, and minimal memory access and
computation.

Thank you for your advice.
Xavier

Re: time aggregated data and shared structures efficiency

Posted by Xavier Daull <xa...@gmail.com>.
Hi, any hints? Is my previous question unclear?

Let me try to reformulate:
- How should I manage the shared buffers and data required for
moving/sliding-window management and nearest-neighbour data aggregation?
- Would I really benefit from moving to Storm compared with my current
script, given that in-memory data management greatly speeds up my current
process?

Thank you,
Xavier

