Posted to user@spark.apache.org by Ashwin Raju <th...@gmail.com> on 2017/12/15 08:00:10 UTC

Recompute Spark outputs intelligently

Hi,

We have a batch processing application that reads log files spanning
multiple days, runs transformations and aggregations on them using Spark,
and saves various intermediate outputs to Parquet. These jobs take many
hours to run. This pipeline is deployed at many customer sites with some
site-specific variations.
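
For concreteness, a simplified, anonymized sketch of one such stage might
look roughly like this (paths, schema, and column names are made up):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    object DailyAggregation {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("daily-aggregation").getOrCreate()

        // Read several days of raw log files (path and schema are illustrative).
        val logs = spark.read.json("/data/raw/logs/2017/12/*")

        // Example transformation + aggregation: error counts per host per day.
        val dailyErrors = logs
          .filter(col("level") === "ERROR")
          .groupBy(col("host"), to_date(col("timestamp")).as("day"))
          .agg(count(lit(1)).as("error_count"))

        // Persist the intermediate output to Parquet for downstream stages.
        dailyErrors.write.mode("overwrite").parquet("/data/intermediate/daily_errors")

        spark.stop()
      }
    }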

When we want to make changes to this data pipeline, we delete all the
intermediate output and recompute from the point of change. On some sites,
we hand-write a series of "migration" transformations so we do not have to
spend hours recomputing. The reasons for these changes are typically bugs
we have found in our data transformations or new features added to the
pipeline.
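
A hand-written migration is usually a small Spark job along these lines
(again, the paths and the patched column are only illustrative):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    object MigrateDailyErrors {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("migrate-daily-errors").getOrCreate()

        // Start from the existing Parquet output instead of the raw logs.
        val existing = spark.read.parquet("/data/intermediate/daily_errors")

        // Suppose a bug produced negative counts; patch just the affected
        // column using columns that are already present.
        val migrated = existing.withColumn(
          "error_count",
          when(col("error_count") < 0, lit(0)).otherwise(col("error_count")))

        // Write to a new location and swap paths afterwards; overwriting the
        // dataset being read in the same job is not safe.
        migrated.write.mode("overwrite").parquet("/data/intermediate/daily_errors_v2")

        spark.stop()
      }
    }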

As you can probably tell, maintaining all these versions and figuring out
which migrations to perform is a headache. Ideally, when we apply an
updated pipeline, we could automatically figure out which columns need to
be recomputed and which can be left as is.
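
Conceptually, something like the following per-column version check is
what we have in mind (purely illustrative, with made-up version numbers;
not an existing Spark feature as far as we know):

    object ColumnVersionCheck {
      // Versions of the column-producing logic in the current pipeline build.
      val currentVersions: Map[String, Int] = Map(
        "error_count" -> 2, // bumped after a bug fix
        "host"        -> 1,
        "day"         -> 1)

      // Versions recorded when the existing output was written (in practice
      // this could live in a small metadata file next to the Parquet data).
      val storedVersions: Map[String, Int] = Map(
        "error_count" -> 1,
        "host"        -> 1,
        "day"         -> 1)

      def main(args: Array[String]): Unit = {
        // A column is stale if its stored version is missing or differs.
        val stale = currentVersions.collect {
          case (colName, v) if storedVersions.get(colName).forall(_ != v) => colName
        }
        println(s"Columns to recompute: ${stale.mkString(", ")}")
      }
    }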

Is there a best practice in the Spark ecosystem for this problem? Perhaps
a metadata or data lineage system we could use? I'm curious whether this
is a common problem that has already been addressed.

Thanks,
Ashwin