Posted to user@beam.apache.org by ju...@free.fr on 2017/07/19 11:31:13 UTC

Feature Generation for Large datasets composed of many time series


Hello,

I want to create a lib which generates features for potentially very 
large datasets.

Each file 'F' of my dataset contains at least:
- an id ( string or int )
- a timestamp ( or a long value )
- a value ( int or string )

I want my tool to:
- compute aggregate functions for many 'instant + duration' couples
===> FOR EXAMPLE:
===== compute, for the instant 't = 2001-01-01', aggregate functions
for the data between 't-1month and t' and between 't-12months and
t-9months', and this FOR EACH ID!
( aggregate functions such as min/max/count/distinct/last/mode or user
defined )

My constraints:
- I don't want to compute aggregates for every tuple of 'F'
---> I want to provide a list of 'instant + duration' couples
( potentially a large list )
- the 'window' defined by the duration may be really large ( but may
contain only a few values... )
- I may have many ids...
- I may have many timestamps...

========================================================
========================================================
========================================================

Let me describe this with an example, to see whether Apache Beam can
help me do it:

Let's imagine that I have all my data in a DB or a file with the
following columns:
id | timestamp(ms) | value
A | 1000000 |  100
A | 1000500 |  66
B | 1000000 |  100
B | 1000010 |  50
B | 1000020 |  200
B | 2500000 |  500

( The timestamp is a long value, so that dates from 0000-01-01 to
today can be expressed in ms. )

I want to compute operations such as min, max, average, last on the
value column, for these couples:
-> instant = 1000500 / [-1000ms, 0ms] ( i.e. aggregate the data
between t-1000ms and t )
-> instant = 1333333 / [-5000ms, -2500ms] ( i.e. aggregate the data
between t-5000ms and t-2500ms )
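
In code, I picture the inputs roughly like this ( just a sketch, in
Java; all the names are illustrative ):

// One observation from the dataset 'F'.
class Record {
  final String id;       // e.g. "A"
  final long timestamp;  // ms since 0000-01-01
  final long value;      // the observed value

  Record(String id, long timestamp, long value) {
    this.id = id;
    this.timestamp = timestamp;
    this.value = value;
  }
}

// One 'instant + duration' couple: aggregate the records whose
// timestamp falls in [instant + startOffset, instant + endOffset].
class QueryWindow {
  final long instant;      // e.g. 1333333
  final long startOffset;  // e.g. -5000
  final long endOffset;    // e.g. -2500

  QueryWindow(long instant, long startOffset, long endOffset) {
    this.instant = instant;
    this.startOffset = startOffset;
    this.endOffset = endOffset;
  }

  boolean contains(long ts) {
    return ts >= instant + startOffset && ts <= instant + endOffset;
  }
}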


And this will produce this kind of output:

id | timestamp(ms) | min_value | max_value | avg_value | last_value
-------------------------------------------------------------------
A | 1000500        | min...    | max....   | avg....   | last....
B | 1000500        | min...    | max....   | avg....   | last....
A | 1333333        | min...    | max....   | avg....   | last....
B | 1333333        | min...    | max....   | avg....   | last....



Do you think this can be done efficiently with Apache Beam, and do you
have an idea of "how"?


Thanks a lot ....

Re: Feature Generation for Large datasets composed of many time series

Posted by Lukasz Cwik <lc...@google.com>.
The more ids the better as this increases the parallelism in your pipeline.
Also, the aggregations that you listed like min/max/average are very
efficient operations to perform on datasets.

Cassandra is already supported:
https://github.com/apache/beam/tree/master/sdks/java/io/cassandra
Using a data format which is common between Apache Beam and Apache
Spark should make it relatively easy to pass the data from one
processing system to the next.
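
For example, a write could look roughly like this (a sketch only: it
assumes a 'Feature' entity class of your own, annotated for the
DataStax object mapper, which CassandraIO uses to map objects to rows):

import java.util.Arrays;
import com.datastax.driver.mapping.annotations.Column;
import com.datastax.driver.mapping.annotations.PartitionKey;
import com.datastax.driver.mapping.annotations.Table;
import org.apache.beam.sdk.io.cassandra.CassandraIO;
import org.apache.beam.sdk.values.PCollection;

public class CassandraSink {

  // Hypothetical entity matching a 'features' table in Cassandra.
  @Table(keyspace = "features", name = "features")
  public static class Feature {
    @PartitionKey @Column(name = "id") public String id;
    @Column(name = "ts") public long timestamp;
    @Column(name = "min_value") public long minValue;
    public Feature() {}  // the mapper needs a no-arg constructor
  }

  // 'features' is the aggregated output of your pipeline.
  public static void write(PCollection<Feature> features) {
    features.apply("WriteToCassandra",
        CassandraIO.<Feature>write()
            .withHosts(Arrays.asList("cassandra-host"))
            .withPort(9042)
            .withKeyspace("features")
            .withEntity(Feature.class));
  }
}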



On Mon, Jul 24, 2017 at 2:00 AM, <ju...@free.fr> wrote:

> [...]

Re: Feature Generation for Large datasets composed of many time series

Posted by ju...@free.fr.
Ok thanks !

That's exactly the kind of thing I was imagining with Apache Beam.

I still have a few questions:
- regarding performance, will this be efficient? Even with a large
"window" / many ids / values / timestamps...?
- my goal after all this is to store the result in Cassandra and/or to
use the final dataset with Apache Spark. Will it be easy to do this?

Thanks again Lukasz !

On 2017-07-23 20:42, Lukasz Cwik wrote:
> [...]

Re: Feature Generation for Large datasets composed of many time series

Posted by Lukasz Cwik <lc...@google.com>.
You can do this efficiently with Apache Beam, but you would need to
write code which converts a user's expression into a set of PTransforms,
or create a few pipeline variants for commonly computed outcomes. There
are already many transforms which can compute things like min, max, and
average; take a look at the javadoc [1]. It seems like you would want to
structure your pipeline like:
ReadFromFile -> FilterRecordsBasedUponTimestamp ->
Min.perKey/Max.perKey/Average.perKey/... -> OutputToFile
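
As a rough sketch, a pipeline for a single (instant, range) couple and
a single aggregate might look like the following (the file path, the
pipe-separated format, and all the names here are assumptions for
illustration, not a prescription):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Filter;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.Min;
import org.apache.beam.sdk.transforms.ToString;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TypeDescriptors;

public class FeaturePipeline {
  public static void main(String[] args) {
    // One 'instant + duration' couple: keep data in [t-5000ms, t-2500ms].
    final long instant = 1333333L;
    final long lo = instant - 5000L;
    final long hi = instant - 2500L;

    Pipeline p =
        Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply("ReadFromFile", TextIO.read().from("/path/to/F.csv"))
        // Parse "id|timestamp|value" lines and keep only the time range.
        .apply("FilterRecordsBasedUponTimestamp", Filter.by((String line) -> {
          long ts = Long.parseLong(line.split("\\|")[1].trim());
          return ts >= lo && ts <= hi;
        }))
        // Key each value by its id so the aggregation runs per id.
        .apply("KeyByIdExtractValue",
            MapElements.into(TypeDescriptors.kvs(
                    TypeDescriptors.strings(), TypeDescriptors.longs()))
                .via((String line) -> {
                  String[] f = line.split("\\|");
                  return KV.of(f[0].trim(), Long.parseLong(f[2].trim()));
                }))
        // Min per id; Max.longsPerKey(), Mean.perKey(), etc. work the same.
        .apply("MinPerId", Min.longsPerKey())
        .apply("FormatOutput", ToString.kvs())
        .apply("OutputToFile", TextIO.write().to("/path/to/min_per_id"));

    p.run().waitUntilFinish();
  }
}

For many couples you would parameterize this over your list of
(instant, range) pairs, or tag each record with every range it falls
into and aggregate per (id, range) key.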

It doesn't seem like windowing/triggers would provide you much value based
upon what you describe.

Also, it sounds like you would be interested in the ongoing SQL
development, which would allow users to write these kinds of queries
without needing to write a complicated pipeline. The feature branch [2]
is looking to be merged into master soon and to become part of the next
release.
1:
https://beam.apache.org/documentation/sdks/javadoc/2.0.0/index.html?org/apache/beam/sdk/transforms/Min.html
2: https://github.com/apache/beam/tree/DSL_SQL

On Wed, Jul 19, 2017 at 4:31 AM, <ju...@free.fr> wrote:

> [...]