Posted to user@spark.apache.org by Artemis User <ar...@dtechspace.com> on 2021/01/12 18:58:06 UTC

Customizing K-Means for Anomaly Detection

First some background:

  * We want to use the k-means model for anomaly detection on a
    multi-dimensional dataset.  The current k-means implementation in
    Spark is designed for clustering, not for anomaly detection.  Once
    a model is trained and the pipeline is instantiated, the prediction
    data frame generated by the transform function only associates each
    data point with a cluster.  To enable anomaly detection, we would
    need to calculate the distance from each data point to its nearest
    cluster centroid and compare it with a predefined threshold to
    determine anomalies (e.g. normal = distance <= threshold, and
    anomaly = distance > threshold).
  * The anomaly detection procedure (i.e. calculating the distances and
    comparing them with the threshold) occurs outside the ML pipeline
    (i.e. after invoking the transform method). This causes problems
    when we try to persist the pipeline model and later load and use it
    in production. We would really like one Estimator to do the whole
    process, from ingesting data to anomaly detection, in a single
    pipeline, without the extra code at the end (i.e. after
    pipeline.transform() is called).
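
For illustration, the thresholding rule described in the background can
be sketched in plain Python, without Spark.  The function names, the
centers, the threshold, and the sample points are all made up for this
example; Euclidean distance is assumed.

```python
import math


def nearest_center_distance(point, centers):
    """Euclidean distance from `point` to the closest centroid."""
    return min(math.dist(point, c) for c in centers)


def is_anomaly(point, centers, threshold):
    """normal = distance <= threshold, anomaly = distance > threshold."""
    return nearest_center_distance(point, centers) > threshold


# Hypothetical values, for illustration only.
centers = [(0.0, 0.0), (10.0, 10.0)]
threshold = 2.0
points = [(0.5, 0.5), (9.0, 10.0), (5.0, 5.0)]

# Only the last point is farther than the threshold from every centroid.
labels = [is_anomaly(p, centers, threshold) for p in points]
```

In Spark, the same computation would have to run per row against the
centroids held by the fitted KMeansModel, which is what the questions
that follow are about.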

Questions:

  * We wanted to write a custom Transformer and append it to the end of
    the Pipeline to enable anomaly detection for the test dataset,
    but it requires the clusterCenters from the KMeansModel stage.  We
    can’t figure out how to pass this data, which comes from a fitted
    stage, to a later stage at runtime. Any ideas?
  * Is there a way to add a callback to the KMeansModel to persist the
    clusterCenters in the dataframe or in a file?  Or to add a ParamMap
    that sets this parameter dynamically at runtime?

Thanks a lot in advance!

-- ND


Re: Customizing K-Means for Anomaly Detection

Posted by Sean Owen <sr...@gmail.com>.
You could fit the k-means pipeline, get the cluster centers, create a
Transformer using that info, then create a new PipelineModel including all
the original elements and the new Transformer. Does that work?
It's not out of the question to expose a new parameter in KMeansModel that
lets you also add a column with the cost; I'd review that kind of PR.
