Posted to user@flink.apache.org by otherwise777 <wo...@onzichtbaar.net> on 2016/10/30 07:25:55 UTC

Looping over a DataSet and accesing another DataSet

I'm currently trying to implement this algorithm [1], which requires me to
loop over one DataSet (the edges) and access another DataSet (the vertices).
For this loop I use a map operation (I'm not sure whether this is the correct
way of looping over a DataSet), but I don't know how to access the elements
of another DataSet while I'm looping over one.

I know Gelly also has iterative support for this kind of thing, but its
iterations loop over the vertices and not the edges.

[1] http://prntscr.com/d0qeyd



--
View this message in context: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Looping-over-a-DataSet-and-accesing-another-DataSet-tp9778.html
Sent from the Apache Flink User Mailing List archive. mailing list archive at Nabble.com.

Re: Looping over a DataSet and accesing another DataSet

Posted by otherwise777 <wo...@onzichtbaar.net>.
I just found out that I can use arrays as tuple values, so never mind that
question.




Re: Looping over a DataSet and accesing another DataSet

Posted by otherwise777 <wo...@onzichtbaar.net>.
I did mean the iteration, yes. I have solved the problem for now by rewriting
the algorithm in Gelly's gather-sum-apply (GSA) model. Thanks for the tips.

I had another question regarding the original message, about appending items
to a list: how would I do that? As far as I know, it's not possible to put a
list or array in a Tuple element, right?






Re: Looping over a DataSet and accesing another DataSet

Posted by Greg Hogan <co...@greghogan.com>.
By 'loop' do you refer to an iteration? The output of a bulk iteration is
processed as the input of the following iteration. Values updated in an
iteration are available in the next iteration just as values updated by an
operator are available to the following operator.
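
The data flow described above can be sketched in plain Java. This is only an illustration with in-memory lists, not Flink code: the class `BulkIterationSketch`, the helper `iterate`, and the "add one" step function are hypothetical names chosen for the example. In the DataSet API the corresponding calls would be `DataSet.iterate(maxIterations)` and `closeWith(...)`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

public class BulkIterationSketch {
    // Runs `step` a fixed number of times; the output of one pass becomes
    // the input of the next, which mirrors the bulk-iteration data flow.
    public static <T> List<T> iterate(List<T> initial,
                                      Function<List<T>, List<T>> step,
                                      int maxIterations) {
        List<T> current = initial;
        for (int i = 0; i < maxIterations; i++) {
            current = step.apply(current);
        }
        return current;
    }

    public static void main(String[] args) {
        List<Integer> start = Arrays.asList(1, 2, 3);
        // Hypothetical step function: increment every value by one.
        Function<List<Integer>, List<Integer>> addOne = values -> {
            List<Integer> next = new ArrayList<>();
            for (int v : values) {
                next.add(v + 1);
            }
            return next;
        };
        System.out.println(iterate(start, addOne, 3)); // prints [4, 5, 6]
    }
}
```

The step function never modifies its input; it only produces a new collection, which is the same contract an operator inside an iteration has.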

Your chosen algorithm may not be a good fit for distributed processing
frameworks like Flink, Spark, and Hadoop. You may need to recast your
problem into an appropriate, scalable algorithm. Both the Gelly and Machine
Learning libraries have good examples of efficient, scalable algorithms
(Flink's "examples" demonstrate specific functionality).

Greg

On Mon, Oct 31, 2016 at 8:52 AM, otherwise777 <wo...@onzichtbaar.net>
wrote:

> Thank you for your reply; this is new information for me.
>
> Regarding the algorithm, I took a closer look and I don't think it will
> work with joins. When looping over the edge set (u,v) we need to be able
> to read and write A[u] and A[v]. If I join them, each joined record gets
> its own copy of that value, so a change made in one copy is not visible in
> the others.
>
> For example, I have the following edges:
>  u v
>  1 2
>  1 3
>
> With vertices and values:
>  1 a
>  2 b
>  3 c
>
> If I join them I get:
>  u v u' v'
>  1 2 a b
>  1 3 a c
>
> If I loop over the joined set and change the u' value of the first record
> to "d", then in my next loop step it will still be "a".

Re: Looping over a DataSet and accesing another DataSet

Posted by otherwise777 <wo...@onzichtbaar.net>.
Thank you for your reply; this is new information for me.

Regarding the algorithm, I took a closer look and I don't think it will
work with joins. When looping over the edge set (u,v) we need to be able
to read and write A[u] and A[v]. If I join them, each joined record gets
its own copy of that value, so a change made in one copy is not visible in
the others.

For example, I have the following edges:
 u v
 1 2
 1 3

With vertices and values:
 1 a
 2 b
 3 c

If I join them I get:
 u v u' v'
 1 2 a b
 1 3 a c

If I loop over the joined set and change the u' value of the first record
to "d", then in my next loop step it will still be "a".
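
One possible way around the copy problem, sketched in plain Java rather than Flink code: instead of changing the joined copy, emit the change as a (vertex id, new value) update, fold the updates back into the vertex values, and join again in the next round. The class `JoinUpdateSketch` and the helpers `join` and `applyUpdates` are hypothetical names for this illustration. Note the assumption this makes: an update becomes visible only in the next round, not later in the same pass, so it is not equivalent to a sequential loop.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class JoinUpdateSketch {
    // Joins each edge {u, v} with the current vertex values,
    // producing rows of the form "u v A[u] A[v]".
    public static List<String> join(int[][] edges, Map<Integer, String> values) {
        List<String> rows = new ArrayList<>();
        for (int[] e : edges) {
            rows.add(e[0] + " " + e[1] + " " + values.get(e[0]) + " " + values.get(e[1]));
        }
        return rows;
    }

    // Applies (vertex id -> new value) updates and returns a new vertex map;
    // the input map is left untouched, as an immutable data set would be.
    public static Map<Integer, String> applyUpdates(Map<Integer, String> values,
                                                    Map<Integer, String> updates) {
        Map<Integer, String> next = new LinkedHashMap<>(values);
        next.putAll(updates);
        return next;
    }

    public static void main(String[] args) {
        int[][] edges = {{1, 2}, {1, 3}};
        Map<Integer, String> values = new LinkedHashMap<>();
        values.put(1, "a");
        values.put(2, "b");
        values.put(3, "c");

        System.out.println(join(edges, values)); // prints [1 2 a b, 1 3 a c]

        // The change made while processing the first joined row,
        // expressed as an update instead of an in-place write.
        Map<Integer, String> updates = new LinkedHashMap<>();
        updates.put(1, "d");

        Map<Integer, String> next = applyUpdates(values, updates);
        System.out.println(join(edges, next)); // prints [1 2 d b, 1 3 d c]
    }
}
```

After the updates are merged, both joined rows see "d" for vertex 1, because the join is recomputed from the updated vertex values.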





Re: Looping over a DataSet and accesing another DataSet

Posted by Greg Hogan <co...@greghogan.com>.
The DataSet API only supports binary joins but one can simulate an n-ary
join by chaining successive join operations.

Your algorithm requires a global ordering on edges, requiring a parallelism
of 1, and will not scale in a distributed processing system. Flink excels
at processing bulk (larger than memory) data in serial.
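
To illustrate why a dependence on edge order forces sequential processing, here is a small plain-Java sketch. The update rule A[v] = A[u] + 1 is a hypothetical stand-in, not the linked algorithm, and `EdgeOrderSketch` and `run` are names made up for the example. The same two edges give different results depending on the order in which they are processed.

```java
import java.util.Arrays;

public class EdgeOrderSketch {
    // Processes edges {u, v} strictly in the given order, applying the
    // hypothetical rule A[v] = A[u] + 1 and mutating shared state in place.
    public static int[] run(int[][] edges, int vertexCount) {
        int[] a = new int[vertexCount + 1]; // index 0 unused; all values start at 0
        for (int[] e : edges) {
            a[e[1]] = a[e[0]] + 1;
        }
        return a;
    }

    public static void main(String[] args) {
        int[][] order1 = {{1, 2}, {2, 3}};
        int[][] order2 = {{2, 3}, {1, 2}};
        System.out.println(Arrays.toString(run(order1, 3))); // prints [0, 0, 1, 2]
        System.out.println(Arrays.toString(run(order2, 3))); // prints [0, 0, 1, 1]
    }
}
```

If the edges were split across parallel workers, no single processing order would exist, so a result that depends on the order would not be well defined.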

Greg

On Mon, Oct 31, 2016 at 5:54 AM, otherwise777 <wo...@onzichtbaar.net>
wrote:

> Thank you for your reply and explanation. I think there is one issue with
> your method, though: you said that I should join the key-value pairs of A
> with the edge set (u,v) on v. This would work; however, I need to access
> not only A[v] in one iteration but also A[u], so if I join on v that won't
> be possible.
>
> Did I understand it correctly?

Re: Looping over a DataSet and accesing another DataSet

Posted by otherwise777 <wo...@onzichtbaar.net>.
Thank you for your reply and explanation. I think there is one issue with
your method, though: you said that I should join the key-value pairs of A
with the edge set (u,v) on v. This would work; however, I need to access
not only A[v] in one iteration but also A[u], so if I join on v that won't
be possible.

Did I understand it correctly?




Re: Looping over a DataSet and accesing another DataSet

Posted by Gábor Gévay <gg...@gmail.com>.
Hello,

In Flink, one commonly used way to access data from multiple DataSets at
the same time is to perform a join (what Flink calls simply a "join" is an
equi-join [1]), just as in the database world.

For example, in the algorithm that you linked, you access A[u] for
every edge (u,v). I assume that you have stored A in a DataSet of
(index, value) pairs. You can achieve this access pattern by
performing a join, and in the join condition you specify that the
first endpoint of the edge should be equal to the index of A. This
way, you get a DataSet where every record contains an edge (u,v) and
also A[u], so you can do a map on this where the UDF of your map will
get (u,v) and A[u].

Your algorithm also accesses A[v], which can be achieved by performing
a second join that is similar to the first (using the result of the
first).
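
The two joins can be sketched in plain Java as follows. This is an in-memory illustration, not Flink code: `ChainedJoinSketch`, `joinOn`, and the numeric values standing in for A are assumptions made for the example. In the DataSet API the first step would look roughly like `edges.join(a).where(0).equalTo(0)`, followed by a second join keyed on v.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ChainedJoinSketch {
    // Equi-join: for every left row and every pair {index, value} of A with
    // left[keyPos] == index, emit the left row with the value appended.
    public static List<long[]> joinOn(List<long[]> left, int keyPos, List<long[]> a) {
        List<long[]> out = new ArrayList<>();
        for (long[] row : left) {
            for (long[] pair : a) {
                if (row[keyPos] == pair[0]) {
                    long[] joined = Arrays.copyOf(row, row.length + 1);
                    joined[row.length] = pair[1];
                    out.add(joined);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<long[]> edges = Arrays.asList(new long[]{1, 2}, new long[]{1, 3});
        // A stored as (index, value) pairs; the values are made up.
        List<long[]> a = Arrays.asList(
                new long[]{1, 10}, new long[]{2, 20}, new long[]{3, 30});

        List<long[]> withAu = joinOn(edges, 0, a);     // rows (u, v, A[u])
        List<long[]> withAuAv = joinOn(withAu, 1, a);  // rows (u, v, A[u], A[v])
        for (long[] row : withAuAv) {
            System.out.println(Arrays.toString(row));
        }
        // prints [1, 2, 10, 20] and then [1, 3, 10, 30]
    }
}
```

A map over the final rows then sees (u,v), A[u], and A[v] together in a single record.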

However, the updating of P will be trickier to translate to Flink.
I'm not sure I understand the linked algorithm correctly: does every
element of P contain a list, and does the + mean appending an element to a
list (in the line P[v] = P[u] + v)?

Best,
Gábor

[1] https://en.wikipedia.org/wiki/Join_(SQL)#Equi-join



2016-10-30 8:25 GMT+01:00 otherwise777 <wo...@onzichtbaar.net>:
> I'm currently trying to implement this algorithm [1], which requires me to
> loop over one DataSet (the edges) and access another DataSet (the vertices).
> For this loop I use a map operation (I'm not sure whether this is the correct
> way of looping over a DataSet), but I don't know how to access the elements
> of another DataSet while I'm looping over one.
>
> I know Gelly also has iterative support for this kind of thing, but its
> iterations loop over the vertices and not the edges.
>
> [1] http://prntscr.com/d0qeyd