Posted to user@spark.apache.org by Luis Guerra <lu...@gmail.com> on 2014/09/25 22:17:22 UTC

"Ungroup" data

Hi everyone,

I need some advice on how to do the following. I have an RDD of vectors
(each one a Vector(Int, Int, Int, Int)). I need to group the data and then
apply a function to every group that compares consecutive items within the
group and retains a value whenever a condition from the comparison is true.
The retained value has to be appended to the end of each vector.

Here is an example:

(1, 2, 5, 2)
(1, 3, 4, 4)
(1, 3, 7, 3)
(1, 3, 4, 8)

The data are grouped by the first two fields. Within each group I have to
compare consecutive values of the fourth field: the first value in the
group is used as the initial retained value, and whenever the next value is
greater than the previous one, it becomes the new retained value. The
retained value is appended to each vector. So the output should be:

(1, 2, 5, 2, 2)
(1, 3, 4, 4, 4)
(1, 3, 7, 3, 4)
(1, 3, 4, 8, 8)
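
A minimal sketch of the rule for a single group, in plain Python rather
than Spark, may make it more concrete. It assumes that each value is
compared against the currently retained value, so the appended field is a
running maximum of the fourth field. The name append_retained is
hypothetical and only used for illustration.

```python
from itertools import accumulate

def append_retained(group):
    """Append the running maximum of the fourth field to each row of one group.

    `group` is a list of 4-tuples that already share the same first two
    fields, in their original order. (Hypothetical helper, for illustration.)
    """
    fourth = [row[3] for row in group]
    # accumulate(..., max) yields the largest value seen so far at each step,
    # starting from the first value in the group.
    retained = accumulate(fourth, max)
    return [row + (r,) for row, r in zip(group, retained)]

# The (1, 3) group from the example above.
group = [(1, 3, 4, 4), (1, 3, 7, 3), (1, 3, 4, 8)]
result = append_retained(group)
for row in result:
    print(row)
```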

My attempt is a groupBy followed by a map with a for loop inside, in which
I have to build a vector of vectors with the new field added. However, I
cannot get the right output because I cannot add a new field to the vector.
I also do not know what the map should return so that the data have the
same shape as the original once they have been grouped. Besides, I suspect
that a for loop is not the best way to iterate through the elements of
each group. Finally, maybe this can be done with other operations such as
reduceByKey.
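
The shape being asked about (group, process, then flatten back to one row
per input row) can be sketched in plain Python as follows. This is not
Spark code: it assumes that rows within a group keep their original order,
and the name group_and_ungroup is hypothetical. In Spark, the flattening
step would roughly correspond to a flatMap over the grouped values.

```python
from collections import defaultdict

def group_and_ungroup(rows):
    """Group rows by their first two fields, append the retained value,
    and flatten the groups back into a single list of 5-tuples."""
    groups = defaultdict(list)  # key -> rows, in arrival order
    for row in rows:
        groups[(row[0], row[1])].append(row)

    out = []
    for members in groups.values():
        retained = members[0][3]  # first value in the group is the initial one
        for row in members:
            if row[3] > retained:
                retained = row[3]
            out.append(row + (retained,))  # one flat output row per input row
    return out

rows = [(1, 2, 5, 2), (1, 3, 4, 4), (1, 3, 7, 3), (1, 3, 4, 8)]
flat = group_and_ungroup(rows)
for row in flat:
    print(row)
```

Each output row is a new 5-element tuple built from the original row, so
nothing has to be added to an existing vector in place.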

Any pointers would be much appreciated. Thanks in advance!