Posted to dev@beam.apache.org by ra...@gmail.com, ra...@gmail.com on 2018/10/11 02:17:26 UTC

BEAM-2953, Timeseries library

RE: Pull Request : https://github.com/apache/beam/pull/6540

I have been working on a generalized set of timeseries transforms, with the goal of shielding the user from some of the common problems of working with timeseries in Beam batch / stream mode. I would love to get feedback, comments, and ideas, and I hope, after things flesh out more, collaborators! Of course it will not cover all issues in the timeseries problem space, but from many interactions and discussions over the last couple of years, I feel it has the potential to help with a large enough set of use cases to make it a worthwhile endeavor.

Primary goals:
- Remove as much "boilerplate" as possible from common timeseries pre-processing tasks.
- Deal with a couple of the harder problems with timeseries when processed as a stream in a distributed system. Some example use cases (which we use the State API and timers to solve):
  - IoT: A device sends a signal when something changes, but sends nothing when there has been no update, to save battery. The absence of data downstream does not mean that there is no information; it has just not been observed. (Of course the IoT device could have gone boom, but in the absence of new data the last known value is assumed until some TTL is reached.)
  - Finance: Ticks in FX data arrive with Ask and Bid prices as they change; if no Ask or Bid price is seen, the last known value is assumed.
- Provide some common sinks as reference, for example output of TensorFlow Sequence Examples onto storage systems. The initial sinks in the pull request are based on Google Cloud sinks, but I hope this will be expanded to other platforms with the help of some of the good folks on this thread!
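
To make the "last known value until some TTL" behaviour from the IoT and finance examples above concrete, here is a minimal sketch in plain Java. It is only an illustration of the semantics, not the State API / timer code in the pull request: the class name LastKnownValueFill, the fill method, and all of its parameters are hypothetical, and it assumes fixed-size intervals and a TTL measured from the observation time to the start of the interval being filled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical illustration only: plain Java over an in-memory map, not the
// Beam State API / timer implementation from the pull request.
class LastKnownValueFill {

    /**
     * Returns one value per fixed-size interval in [startMs, endMs).
     * An interval takes the latest observation made before the interval ends.
     * If that observation is older than ttlMs (measured from the observation
     * time to the interval start), the interval is left empty (null).
     */
    static List<Double> fill(NavigableMap<Long, Double> observations,
                             long startMs, long endMs, long intervalMs, long ttlMs) {
        if (intervalMs <= 0) {
            throw new IllegalArgumentException("intervalMs must be positive");
        }
        List<Double> filled = new ArrayList<>();
        for (long t = startMs; t < endMs; t += intervalMs) {
            // Latest observation with a timestamp strictly before the interval end.
            Map.Entry<Long, Double> last = observations.floorEntry(t + intervalMs - 1);
            boolean usable = last != null && t - last.getKey() <= ttlMs;
            filled.add(usable ? last.getValue() : null);
        }
        return filled;
    }

    public static void main(String[] args) {
        // A device reports 1.0 at t=0 ms and 2.0 at t=2000 ms, then goes silent.
        NavigableMap<Long, Double> obs = new TreeMap<>();
        obs.put(0L, 1.0);
        obs.put(2_000L, 2.0);
        // One-second intervals over six seconds, with a two-second TTL.
        System.out.println(fill(obs, 0L, 6_000L, 1_000L, 2_000L));
        // prints [1.0, 1.0, 2.0, 2.0, 2.0, null]
    }
}
```

In a streaming pipeline the same decision has to be made without seeing the whole series up front, which is where per-key state (to hold the last value) and timers (to emit into otherwise empty windows and to expire the value at the TTL) come in.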

In order to make this a tractable problem, there are some fundamental assumptions that have been made. 

The raw timeseries data will be translated to a common representation. The first pass of this is below. The user's main 'coding task' will be to convert their objects to one of:

Single property:
https://github.com/rezarokni/beam/blob/timeseries/sdks/java/extensions/timeseries/src/main/proto/TimeSeriesData.proto#L66

Multivariate: 
https://github.com/rezarokni/beam/blob/timeseries/sdks/java/extensions/timeseries/src/main/proto/TimeSeriesData.proto#L75

The primary utility of this library is for stream processing. While it will work fine in batch mode, there are already many established tools for dealing with timeseries data that has landed in a data store.
This library is not intended as a data analytics tool. The output of the library has the potential to be very useful within analytics tools, but that is a side benefit.

It would be great to get feedback, and if you are interested in helping more directly, please ping me.

Cheers

Reza