Posted to user@spark.apache.org by Richard Siebeling <rs...@gmail.com> on 2016/01/15 15:27:16 UTC

Stacking transformations and using intermediate results in the next transformation

Hi,

we're stacking multiple RDD operations on top of each other. For example,
as a source we have an RDD[List[String]] like

["a", "b, c", "d"]
["a", "d, a", "d"]

In the first step we split the second column into two columns, in the next
step we filter the data on column 3 = "c", and in the final step we do
something else. The point is that it needs to be flexible: the user adds
custom transformations and they are stacked on top of each other as in the
example above, so which transformations will be added is therefore not
known upfront.

The transformations themselves are no problem, but we want to keep track
of the added columns, the dropped columns, and the updated columns. In the
example above, the second column is dropped and two new columns are added.

The intermediate result here will be

["a", "b", "c", "d"]
["a", "d", "a", "d"]

And the final result will be

["a", "b", "c", "d"]

What I would like to know is, after each transformation, which columns are
added, which columns are dropped, and which ones are updated. This is
information that's needed to execute the next transformation.
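To make the pipeline concrete, here is a minimal, Spark-free sketch that mimics the two steps above on plain Scala collections instead of an RDD (the same map/filter calls would apply to an actual RDD[List[String]]; column names and the comma-split rule are assumptions for illustration):

```scala
// Source rows, each a List[String] standing in for one RDD record.
val source: List[List[String]] = List(
  List("a", "b, c", "d"),
  List("a", "d, a", "d")
)

// Step 1: split the second column ("b, c" -> "b", "c") into two columns.
val split: List[List[String]] = source.map { row =>
  row.take(1) ++ row(1).split(",\\s*").toList ++ row.drop(2)
}
// Intermediate result:
//   List("a", "b", "c", "d")
//   List("a", "d", "a", "d")

// Step 2: filter on column 3 (index 2) == "c".
val filtered: List[List[String]] = split.filter(row => row(2) == "c")
// Final result:
//   List("a", "b", "c", "d")
```

On a real RDD the calls would be `rdd.map(...)` and `rdd.filter(...)`; the per-row logic is identical.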

I was thinking of two possible scenarios:

1. capture the metadata and store it in the RDD, effectively creating an
RDD[(List[String], List[Column], List[Column], List[Column])], where the
last three List[Column]s contain the new, dropped, and updated columns.
This will result in an RDD with a lot of extra information on each row;
that information is not needed on every row but rather once for the whole
split transformation
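A variant of scenario 1 that avoids the per-row overhead would be to return the column delta once per transformation, alongside the data, rather than embedding it in every row. A hypothetical sketch (the `ColumnDelta` case class, `splitColumn` helper, and column names are all made up for illustration; `data` stands in for the RDD):

```scala
// Metadata describing what one transformation did to the columns.
case class ColumnDelta(
  added: List[String],
  dropped: List[String],
  updated: List[String]
)

// Split column `col` on commas; return the new data plus a single
// ColumnDelta for the whole transformation instead of one per row.
def splitColumn(
  data: List[List[String]],
  col: Int,
  newNames: List[String],
  oldName: String
): (List[List[String]], ColumnDelta) = {
  val out = data.map { row =>
    row.take(col) ++ row(col).split(",\\s*").toList ++ row.drop(col + 1)
  }
  (out, ColumnDelta(added = newNames, dropped = List(oldName), updated = Nil))
}

val (result, delta) = splitColumn(
  List(List("a", "b, c", "d")),
  col = 1,
  newNames = List("col2a", "col2b"),
  oldName = "col2"
)
// delta now records that "col2" was dropped and "col2a"/"col2b" were added.
```

The delta then lives on the driver next to the RDD reference, so the next transformation can consult it without shipping metadata through every record.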

2. use accumulators to store the new, updated, and dropped columns, but I
don't think this is feasible

Are there any better scenarios, or how could I accomplish something like this?

thanks in advance,
Richard