Posted to user@spark.apache.org by ju...@free.fr on 2017/07/19 11:30:27 UTC
Feature Generation for Large datasets composed of many time series
Hello,
I want to create a lib which generates features for potentially very
large datasets.
Each file 'F' of my dataset is composed of at least :
- an id ( string or int )
- a timestamp ( or a long value )
- a value ( int or string )
I want my tool to :
- compute aggregate functions for many pairs 'instant + duration'
===> FOR EXAMPLE :
===== compute, for the instant 't = 2001-01-01', aggregate functions over
the data between 't-1month and t' and between 't-12months and t-9months',
and this FOR EACH ID !
( aggregate functions such as min/max/count/distinct/last/mode or user
defined )
My constraints :
- I don't want to compute aggregates for every tuple of 'F'
---> I want to provide a list of pairs 'instant + duration' ( a
potentially large list )
- My 'window' defined by the duration may be really large ( but may
contain only a few values... )
- I may have many ids...
- I may have many timestamps...
========================================================
Let me describe this with some kind of example to see if SPARK ( SPARK
STREAMING ? ) may help me to do that :
Let's imagine that I have all my data in a DB or a file with the
following columns :
id | timestamp(ms) | value
A | 1000000 | 100
A | 1000500 | 66
B | 1000000 | 100
B | 1000010 | 50
B | 1000020 | 200
B | 2500000 | 500
( The timestamp is a long value, so as to be able to express dates in ms
from 0000-01-01 to today )
I want to compute operations such as min, max, average, last on the
value column, for these pairs :
-> instant = 1000500 / [-1000ms, 0ms] ( i.e. : aggregate the data
between t-1000ms and t )
-> instant = 1333333 / [-5000ms, -2500ms] ( i.e. : aggregate the data
between t-5000ms and t-2500ms )
And this will produce this kind of output :
id | timestamp(ms) | min_value | max_value | avg_value | last_value
-------------------------------------------------------------------
A | 1000500 | min... | max.... | avg.... | last....
B | 1000500 | min... | max.... | avg.... | last....
A | 1333333 | min... | max.... | avg.... | last....
B | 1333333 | min... | max.... | avg.... | last....
Do you think we can do this efficiently with spark and/or spark
streaming, and do you have an idea on "how" ?
Thanks a lot !
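For concreteness, one way the computation above could be expressed with
plain batch DataFrames is a range join between the events and the list of
pairs, followed by one aggregation per (id, instant). A minimal sketch in
Scala against Spark 2.x; the names 'events' and 'windows' ( and the local
master ) are assumptions for illustration, not anything from this thread :

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.appName("feature-gen").master("local[*]").getOrCreate()
import spark.implicits._

// id | timestamp(ms) | value, as in the sample above
val events = Seq(
  ("A", 1000000L, 100.0), ("A", 1000500L, 66.0),
  ("B", 1000000L, 100.0), ("B", 1000010L, 50.0),
  ("B", 1000020L, 200.0), ("B", 2500000L, 500.0)
).toDF("id", "ts", "value")

// the provided list of pairs 'instant + duration', as offsets around t
val windows = Seq(
  (1000500L, -1000L, 0L),     // [t-1000ms, t]
  (1333333L, -5000L, -2500L)  // [t-5000ms, t-2500ms]
).toDF("instant", "lo", "hi")

// Range join: each event matches every window whose interval contains
// its timestamp; then aggregate once per (id, instant). Note an inner
// join drops (id, instant) pairs whose window matched nothing.
val features = events
  .join(windows, $"ts" >= $"instant" + $"lo" && $"ts" <= $"instant" + $"hi")
  .groupBy($"id", $"instant")
  .agg(
    min($"value").as("min_value"),
    max($"value").as("max_value"),
    avg($"value").as("avg_value"),
    // "last by timestamp" via max-of-struct, since last() alone is
    // order-dependent
    max(struct($"ts", $"value")).getField("value").as("last_value")
  )

features.show()

The join is a cartesian product in the worst case, so broadcasting the
smaller 'windows' side ( or bucketing instants by time range first ) is
what keeps it tractable when the pair list grows.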
Re: Feature generation / aggregate functions / timeseries
Posted by Georg Heiler <ge...@gmail.com>.
Also, the RDD StatCounter will already compute most of your desired
metrics, as will df.describe:
https://databricks.com/blog/2015/06/02/statistical-and-mathematical-functions-with-dataframes-in-spark.html
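For instance, a quick sketch reusing the 'events' DataFrame from the
sketch in the first message ( an assumed name, not code from this
thread ) :

// df.describe: count, mean, stddev, min, max for a numeric column
events.describe("value").show()

// The RDD route: stats() on an RDD[Double] returns a StatCounter holding
// count, mean, stdev, min and max, computed in a single pass
val valueStats = events.select("value").rdd.map(_.getDouble(0)).stats()
println(valueStats)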
Georg Heiler <ge...@gmail.com> wrote on Thu, 14 Dec 2017 at
19:40:
> Look at custom UDAF functions.
Re: Feature generation / aggregate functions / timeseries
Posted by Georg Heiler <ge...@gmail.com>.
Look at custom UDAF functions.
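For reference, a minimal sketch of what such a UDAF could look like with
the Spark 2.x UserDefinedAggregateFunction API ( later deprecated in
Spark 3 in favour of Aggregator ). The aggregate chosen here, the value
carried by the latest timestamp, is a hypothetical illustration, not
something specified in this thread :

import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

// Reduces (ts, value) rows to the value observed at the latest ts.
class LastByTimestamp extends UserDefinedAggregateFunction {
  def inputSchema: StructType =
    new StructType().add("ts", LongType).add("value", DoubleType)
  // buffer: latest timestamp seen so far, and the value it carried
  def bufferSchema: StructType =
    new StructType().add("bestTs", LongType).add("bestValue", DoubleType)
  def dataType: DataType = DoubleType
  def deterministic: Boolean = true

  def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer(0) = Long.MinValue
    buffer(1) = null
  }
  def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    if (!input.isNullAt(0) && input.getLong(0) >= buffer.getLong(0)) {
      buffer(0) = input.getLong(0)
      buffer(1) = input.get(1)
    }
  }
  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    if (buffer2.getLong(0) >= buffer1.getLong(0)) {
      buffer1(0) = buffer2.getLong(0)
      buffer1(1) = buffer2.get(1)
    }
  }
  def evaluate(buffer: Row): Any = buffer.get(1)
}

Once instantiated it composes with the built-ins, e.g.
events.groupBy("id").agg(new LastByTimestamp()(col("ts"), col("value"))),
and the same buffer/update/merge shape covers 'mode' or anything else the
built-in aggregates lack.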
Feature generation / aggregate functions / timeseries
Posted by ju...@free.fr.
Hi dear spark community !
I want to create a lib which generates features for potentially very
large datasets, so I believe spark could be a nice tool for that.
Let me explain what I need to do :
Each file 'F' of my dataset is composed of at least :
- an id ( string or int )
- a timestamp ( or a long value )
- a value ( generally a double )
I want my tool to :
- compute aggregate functions for many pairs 'instant + duration'
===> FOR EXAMPLE :
===== compute, for the instant 't = 2001-01-01', aggregate functions over
the data between 't-1month and t' and between 't-12months and t-9months',
and this FOR EACH ID !
( aggregate functions such as
min/max/count/distinct/last/mode/kurtosis... or even user defined ! See
the note on built-ins just below. )
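As a side note, most of those already ship as built-in aggregates in
org.apache.spark.sql.functions on Spark 2.x; a quick check, again
assuming the 'events' DataFrame from the sketch in the first message :

import org.apache.spark.sql.functions._

events.groupBy("id").agg(
  min("value"), max("value"), count("value"),
  countDistinct("value"), last("value"), kurtosis("value")
).show()
// 'mode' is the odd one out: no built-in aggregate in Spark 2.x, so it
// needs a UDAF along the lines of the LastByTimestamp sketch above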
My constraints :
- I don't want to compute aggregates for every tuple of 'F'
---> I want to provide a list of pairs 'instant + duration' ( a
potentially large list )
- My 'window' defined by the duration may be really large ( but may
contain only a few values... )
- I may have many ids...
- I may have many timestamps...
========================================================
Let me describe this with some kind of example to see if SPARK ( SPARK
STREAMING ? ) may help me to do that :
Let's imagine that I have all my data in a DB or a file with the
following columns :
id | timestamp(ms) | value
A | 1000000 | 100
A | 1000500 | 66
B | 1000000 | 100
B | 1000010 | 50
B | 1000020 | 200
B | 2500000 | 500
( The timestamp is a long value, so as to be able to express dates in ms
from 0000-01-01 to today )
I want to compute operations such as min, max, average, last on the
value column, for these pairs :
-> instant = 1000500 / [-1000ms, 0ms] ( i.e. : aggregate the data
between t-1000ms and t )
-> instant = 1333333 / [-5000ms, -2500ms] ( i.e. : aggregate the data
between t-5000ms and t-2500ms )
And this will produce this kind of output :
id | timestamp(ms) | min_value | max_value | avg_value | last_value
-------------------------------------------------------------------
A | 1000500 | min... | max.... | avg.... | last....
B | 1000500 | min... | max.... | avg.... | last....
A | 1333333 | min... | max.... | avg.... | last....
B | 1333333 | min... | max.... | avg.... | last....
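To make the expected numbers concrete with the sample data above : at
instant 1000500 with window [-1000ms, 0ms], id A covers the rows at
timestamps 1000000 and 1000500, so min = 66, max = 100, avg = 83 and
last = 66; id B covers 1000000, 1000010 and 1000020, so min = 50,
max = 200, avg = 116.67 and last = 200. At instant 1333333 the window
[t-5000ms, t-2500ms] spans [1328333, 1330833] and matches no rows at
all, so those two output rows would be empty ( or simply absent,
depending on the join used ).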
Do you think we can do this efficiently with spark and/or spark
streaming, and do you have an idea on "how" ?
( I have tested some solutions but I'm not really satisfied ATM... )
Thanks a lot Community :)