You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@flink.apache.org by Thodoris Bitsakis <t4...@gmail.com> on 2018/05/28 10:41:05 UTC

ML in Streaming API

Hello and thanks for the subscription!

I am using Streaming API to develop a ML algorithm and i would like your
opinions regarding the following issues: 

1) The input is read from a big size file with d-dimensional points, and i
want to perform a parallel count window. In each parallel count window, i
want to perform a function that maintains a list of buckets in memory in
order to be checkpointed(exploiting state feature) . For every parallel
count window some(0*)! of the buckets will be updated or deleted.

My thoughts: 
As there is no logical key and there is no parallel countWinowAll, the
correct way is to perform a parallel flatmap operator? But then i assume
that i must implement a custom buffering of input data using ListState to
implement the countwindow? Also i could use again another ListState to
maintain the list of buckets in memory. But then every time i want to update
a specific buffer of the listState i must clear the ListState and reinsert
all buffers again(not Optimal for big buffers)? 

The other way is to use a deterministic pseudo-key and use
keyby.countwindow. The number of different keys will be the number of
parallelism. In order to update some of the buckets for every key (parallel
instance) i am considering the use of mapState(UK=Bucket index,UV= Bucket
elements). In that case i think the use of pseudo-key is not the best
technique? and also i am going to use unnecessary data shuffle (keyby)?

	What is the best way? Or is there another way to solve the previous issues?

2)When there is no more input data (EOF) or when a user “asks” for a
part-evaluation of the ML algorithm through an external source, i want to
collect the list of buckets from the parallel operator instances to another
reduce-style operator with parallelism 1 to find the final list (classic
scenario of map-reduce). When there is no user query or EOF, I don't want
the parallel operator instances to emit anything.

My thoughts: I don't know how the user will “ask” the flink parallel
operator instances  (parallel count window) to emit their results to the
downstream operator of parallelism 1. I don't know how the operator
instances will know that the file ended (if i use keyby.countwindow i can
use a custom trigger with timer? Else in flatmap case? )

The concept is that the list of buckets in each parallel operator instance
is a local Sketch and i want to collect the local sketches when the user
“asks” to calculate the final Sketch.

Any thoughts are very much appreciated!!! Thank you in advance.
<http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/t1474/Untitled.png> 



--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/