Posted to user@spark.apache.org by Aleksander Eskilson <al...@gmail.com> on 2018/09/05 14:16:25 UTC

[ML] Setting Non-Transform Params for a Pipeline & PipelineModel

I had originally sent this to the Dev list since the API discussed here is
still marked as experimental in portions, but it occurs to me this may
still be a general-use question. Sorry for the cross-listing.

In a nutshell, what I'd like to do is instantiate a Pipeline (or extension
class of Pipeline) with metadata that is copied to the PipelineModel when
fitted, and can be read again when the fitted model is persisted and loaded
by another consumer. These metadata are specific to the PipelineModel more
than any particular Transform or the Estimator declared as part of the
Pipeline: the intent is that the PipelineModel params can be read by a
downstream consumer of the loaded model, but the value that the params
should take will only be known to the creator of the Pipeline/trainer of
the PipelineModel.

It seems that Pipeline and PipelineModel support the Params interface, like
Transform and Estimator do. It seems I can extend Pipeline to a custom
class CustomPipeline, where the constructor could enforce that my metadata
Params are set. However, when the Pipeline is *fit*, the resultant
PipelineModel doesn't seem to include the original CustomPipeline's params,
only params from the individual Transform steps.

From a read of the code, it seems that the *fit* method will copy over the
Stages to the PipelineModel, and those will be persisted (along with the
Stages' Params) during *write*, *but* any Params belonging to the Pipeline
are not copied to the PipelineModel (as only Stages are considered during
copy, not the ParamMap of the Pipeline) [1].
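
To make the flow I am describing concrete, here is a toy model in plain
Python. It is not Spark code: the classes only mimic the shape of Pipeline,
PipelineModel and a stage, and the `owner` param and the `fit` body are
assumptions that mirror my reading of [1], not the actual implementation.

```python
# Toy model only: these classes imitate the shape of Spark's Pipeline API
# but are NOT Spark code. The "owner" param is a hypothetical example.
class Params:
    def __init__(self, **params):
        self.params = dict(params)


class Stage(Params):
    """Stand-in for an already-fitted transform step."""


class PipelineModel(Params):
    def __init__(self, stages):
        super().__init__()  # the model starts with an empty param map
        self.stages = stages


class Pipeline(Params):
    def __init__(self, stages, **params):
        super().__init__(**params)
        self.stages = stages

    def fit(self):
        # Only the stages are handed to the model; self.params is never read.
        return PipelineModel(list(self.stages))


class CustomPipeline(Pipeline):
    """Hypothetical subclass whose constructor requires a metadata param."""

    def __init__(self, stages, owner):
        super().__init__(stages, owner=owner)


pipeline = CustomPipeline([Stage(inputCol="text")], owner="team-a")
model = pipeline.fit()

print(model.stages[0].params)  # {'inputCol': 'text'}: stage params survive
print(model.params)            # {}: pipeline-level params are gone
```

If that reading is right, a fix would amount to copying the Pipeline's own
param map onto the model inside `fit`, in addition to passing the stages.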

Is this a correct read of the flow here? That a CustomPipeline extension of
Pipeline with member Params does not get those non-Transform Params copied
into the fitted PipelineModel?

If so, would a feature enhancement including Pipeline-specific Params being
copyable into the fitted PipelineModel be considered acceptable?

Or should there be another way to include metadata *about* the Pipeline
such that the metadata is copyable to the fitted PipelineModel, and able to
be persisted with PipelineModel *write* and read again with PipelineModel
*load*? My first attempt at this has been to extend the Pipeline class
itself with member params, but this doesn't seem to do the trick given how
Params are actually copied only for Stages between Pipeline and the fitted
PipelineModel.

It occurs to me I could write a custom *withMetadata* transform Stage which
would really just be an identity function but with the desired Params built
in, and that those Params would get copied with the other Stages, but as
discussed at the top, this particular use-case for metadata isn't about any
particular Transform, but more about metadata for the whole Pipeline.
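
A toy sketch of that workaround, again in plain Python rather than Spark
code. `WithMetadata`, `Lowercase`, the `owner` param and the `metadata`
helper are all hypothetical names used only for illustration.

```python
# Toy model only, not Spark code: stages are plain objects with a param dict.
class WithMetadata:
    """Hypothetical identity stage: carries params, leaves the data alone."""

    def __init__(self, **params):
        self.params = dict(params)

    def transform(self, rows):
        return rows  # identity: the input is returned unchanged


class Lowercase:
    """Ordinary stage, included only so the pipeline does some real work."""

    def __init__(self):
        self.params = {}

    def transform(self, rows):
        return [r.lower() for r in rows]


class PipelineModel:
    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        # Apply each stage in order, feeding its output to the next one.
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows

    def metadata(self):
        # A consumer has to search the stages to find the metadata carrier.
        for stage in self.stages:
            if isinstance(stage, WithMetadata):
                return stage.params
        return {}


model = PipelineModel([WithMetadata(owner="team-a"), Lowercase()])
print(model.transform(["Hello", "WORLD"]))  # ['hello', 'world']
print(model.metadata())                     # {'owner': 'team-a'}
```

The sketch also shows the drawback: the metadata lives on one stage, so a
consumer has to know which stage to look for instead of reading it from the
model itself.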

Alek

[1] --
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/Pipeline.scala#L135