You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@flink.apache.org by Daniel Suo <ds...@CS.Princeton.EDU> on 2016/11/04 16:45:13 UTC

Parallelizing DataStream operations on Array elements

Hello!

I have a data source that emits Arrays that I collect into windows via
countWindow. Rather than parallelize my subsequent operations by groups of
these arrays, I'd like to parallelize my operations across the elements of
the array (rows rather than columns, if you will) within each window.

Some context: I'm attempting a time series analysis across some number of
voxels. Each time step, I receive an Array of voxel data, but I'd like to
analyze the voxels across time.

It sounds like this approach mixes DataStream and DataSet concepts (where
each window is a DataSet), which I know are not supported. Perhaps there is
some other way to accomplish this task?

Thanks!
Daniel

Re: Parallelizing DataStream operations on Array elements

Posted by Till Rohrmann <tr...@apache.org>.
In order to parallelize by voxel you have to do a keyBy(rowId) given that
rowId is the same as voxel id.

Glad to hear that you’ve resolved the problem :-)

Cheers,
Till
​

On Sat, Nov 5, 2016 at 2:47 AM, danielsuo <ds...@cs.princeton.edu> wrote:

> I was able to resolve my issue by collecting all the 'column' Arrays via
> countWindowAll and using flatMap to emit 'row' Arrays.
>
>
>
> --
> View this message in context: http://apache-flink-user-
> mailing-list-archive.2336050.n4.nabble.com/Parallelizing-
> DataStream-operations-on-Array-elements-tp9911p9917.html
> Sent from the Apache Flink User Mailing List archive. mailing list archive
> at Nabble.com.
>

Re: Parallelizing DataStream operations on Array elements

Posted by danielsuo <ds...@cs.princeton.edu>.
I was able to resolve my issue by collecting all the 'column' Arrays via
countWindowAll and using flatMap to emit 'row' Arrays.



--
View this message in context: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Parallelizing-DataStream-operations-on-Array-elements-tp9911p9917.html
Sent from the Apache Flink User Mailing List archive. mailing list archive at Nabble.com.

Re: Parallelizing DataStream operations on Array elements

Posted by danielsuo <ds...@cs.princeton.edu>.
Till Rohrmann wrote
> I'm not sure whether I grasp the whole problem, but can't you split
> thevector up into the different rows, group by the row index and then
> applysome kind of continuous aggregation or window function?

So I could flatMap my incoming Arrays into (rowId, arrayElement) and gather
them appropriately in a window operation.Here is brief code to describe the
problem:

// Source emits Array[Double]
val input: DataStream[Array[Double]] = env.addSource(new MyArraySource())

// Collect windowSize Array[Double]
input.countWindowAll(windowSize, slideLength)



Now I have a windowSize (representing time) by arrayLength (representing
voxels) matrix. Flink lets me parallelize by time easily, but I'd like to
parallelize by voxel.



--
View this message in context: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Parallelizing-DataStream-operations-on-Array-elements-tp9911p9916.html
Sent from the Apache Flink User Mailing List archive. mailing list archive at Nabble.com.

Re: Parallelizing DataStream operations on Array elements

Posted by Till Rohrmann <tr...@apache.org>.
Hi Daniel,

I'm not sure whether I grasp the whole problem, but can't you split the
vector up into the different rows, group by the row index and then apply
some kind of continuous aggregation or window function?

Maybe it helps if you can share some of your code with the community to
discuss the implementation.

Cheers,
Till

On Fri, Nov 4, 2016 at 5:45 PM, Daniel Suo <ds...@cs.princeton.edu> wrote:

> Hello!
>
> I have a data source that emits Arrays that I collect into windows via
> countWindow. Rather than parallelize my subsequent operations by groups of
> these arrays, I'd like to parallelize my operations across the elements of
> the array (rows rather than columns, if you will) within each window.
>
> Some context: I'm attempting a time series analysis across some number of
> voxels. Each time step, I receive an Array of voxel data, but I'd like to
> analyze the voxels across time.
>
> It sounds like this approach mixes DataStream and DataSet concepts (where
> each window is a DataSet), which I know are not supported. Perhaps there is
> some other way to accomplish this task?
>
> Thanks!
> Daniel
>