Posted to github@beam.apache.org by "damccorm (via GitHub)" <gi...@apache.org> on 2023/05/08 13:58:52 UTC

[GitHub] [beam] damccorm commented on a diff in pull request #25823: Featue/online clustering

damccorm commented on code in PR #25823:
URL: https://github.com/apache/beam/pull/25823#discussion_r1187468031


##########
sdks/python/apache_beam/examples/california_housing_clustering.py:
##########
@@ -1,16 +1,50 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""A pipeline that uses OnlineClustering transform to group houses
+with a similar value together.
+
+This example uses the California Housing Prices dataset from kaggle.
+https://www.kaggle.com/datasets/camnugent/california-housing-prices
+
+In the first step of the pipeline, the clustering model is trained
+using the OnlineKMeans transform, then the AssignClusterLabels
+transform assigns a cluster to each record in the dataset. This
+transform makes use of the RunInference API under the hood.
+
+In order to set this example up, you will need one thing.
+1. Download the data from kaggle as csv
+
+california_housing_clustering.py --input /tmp/housing.csv

Review Comment:
   ```suggestion
   In order to run this example:
   1. Download the data from Kaggle as a CSV file
   2. Run `python california_housing_clustering.py --input <path/to/housing.csv>`
   ```



##########
sdks/python/apache_beam/examples/california_housing_clustering.py:
##########
@@ -40,17 +74,30 @@ def run(
   if not test_pipeline:
     pipeline = beam.Pipeline(options=pipeline_options)
 
-  data = (pipeline | read_csv(known_args.input))
+  data = pipeline | read_csv(known_args.input)
 
   features = ['longitude', 'latitude', 'median_income']
 
   housing_features = to_pcollection(data[features])
 
+  # 1. Calculate clustering centers and save model to persistent storage
   _ = (
       housing_features
       | beam.Map(lambda record: list(record))
-      | OnlineClustering(
-          OnlineKMeans, n_clusters=6, batch_size=256, cluster_args={})
+      | "Train clustering model" >> OnlineClustering(
+          OnlineKMeans,
+          n_clusters=6,
+          batch_size=256,
+          cluster_args={},
+          checkpoints_path='/tmp/checkpoints'))

Review Comment:
   Can we make the checkpoints path configurable as a pipeline option? This would allow the pipeline to be run on remote runners (e.g. Dataflow).
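
For illustration, a minimal sketch of what exposing the checkpoints path as a pipeline option could look like. The `--checkpoints_path` flag, its default value, and the `parse_known_args` helper shown here are assumptions for this sketch, not code from the PR:

```python
import argparse


def parse_known_args(argv):
  """Hypothetical argument parsing for the example."""
  parser = argparse.ArgumentParser()
  parser.add_argument(
      '--input', dest='input', required=True, help='Path to housing.csv.')
  parser.add_argument(
      '--checkpoints_path',
      dest='checkpoints_path',
      default='/tmp/checkpoints',
      help='Local or remote (e.g. gs://...) directory for model checkpoints.')
  return parser.parse_known_args(argv)


# Inside run(), the hard-coded path would then become:
#   | "Train clustering model" >> OnlineClustering(
#       OnlineKMeans,
#       n_clusters=6,
#       batch_size=256,
#       cluster_args={},
#       checkpoints_path=known_args.checkpoints_path)
```

On a remote runner such as Dataflow, a GCS path (e.g. `gs://<bucket>/checkpoints`) could then be supplied instead of a local path.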



##########
sdks/python/apache_beam/examples/online_clustering.py:
##########
@@ -1,20 +1,65 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+import datetime
 from operator import itemgetter
-from typing import Optional
 
+import joblib
 import numpy as np
 import pandas as pd
 from sklearn.cluster import MiniBatchKMeans
 
 import apache_beam as beam
-from apache_beam import pvalue
-
-from apache_beam.coders import PickleCoder, VarIntCoder
-from apache_beam.ml.inference.base import PredictionResult
+from apache_beam.coders import PickleCoder
+from apache_beam.coders import VarIntCoder
+from apache_beam.io.filesystems import FileSystems
+from apache_beam.ml.inference.base import RunInference
+from apache_beam.ml.inference.sklearn_inference import ModelFileType
+from apache_beam.ml.inference.sklearn_inference import SklearnModelHandlerNumpy
 from apache_beam.transforms import core
 from apache_beam.transforms import ptransform
 from apache_beam.transforms.userstate import ReadModifyWriteStateSpec
 
 
+class SaveModel(beam.DoFn):
+  """Saves trained clustering model to persistent storage"""
+  def __init__(self, checkpoints_path: str):
+    self.checkpoints_path = checkpoints_path
+
+  def process(self, model):
+    # generate ISO 8601
+    iso_timestamp = datetime.datetime.utcnow().strftime("%Y%m%dT%H%M%SZ")
+    checkpoint_name = f'{self.checkpoints_path}/{iso_timestamp}.checkpoint'
+    latest_checkpoint = f'{self.checkpoints_path}/latest.checkpoint'
+    # rename previous checkpoint
+    if FileSystems.exists(latest_checkpoint):
+      FileSystems.rename([latest_checkpoint], [checkpoint_name])
+    file = FileSystems.create(latest_checkpoint, 'wb')
+    if not joblib:
+      raise ImportError(
+          'Could not import joblib in this execution environment. '
+          'For help with managing dependencies on Python workers, '
+          'see https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/'  # pylint: disable=line-too-long
+      )
+
+    joblib.dump(model, file)
+
+    yield checkpoint_name

Review Comment:
   I think we should yield `(checkpoint_name, model)` to give users the ability to use the model directly if they want to.
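
For illustration, `SaveModel.process` with that suggestion applied could look roughly like this (a sketch; the joblib availability check from the diff is omitted for brevity, and only the final yield changes):

```python
  def process(self, model):
    # Same checkpointing logic as in the diff above.
    iso_timestamp = datetime.datetime.utcnow().strftime("%Y%m%dT%H%M%SZ")
    checkpoint_name = f'{self.checkpoints_path}/{iso_timestamp}.checkpoint'
    latest_checkpoint = f'{self.checkpoints_path}/latest.checkpoint'
    if FileSystems.exists(latest_checkpoint):
      FileSystems.rename([latest_checkpoint], [checkpoint_name])
    file = FileSystems.create(latest_checkpoint, 'wb')
    joblib.dump(model, file)

    # Emit the checkpoint path together with the trained model so downstream
    # transforms can use the model object directly instead of reloading it.
    yield checkpoint_name, model
```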



##########
sdks/python/apache_beam/examples/california_housing_clustering.py:
##########
@@ -40,17 +74,30 @@ def run(
   if not test_pipeline:
     pipeline = beam.Pipeline(options=pipeline_options)
 
-  data = (pipeline | read_csv(known_args.input))
+  data = pipeline | read_csv(known_args.input)
 
   features = ['longitude', 'latitude', 'median_income']
 
   housing_features = to_pcollection(data[features])
 
+  # 1. Calculate clustering centers and save model to persistent storage
   _ = (
       housing_features
       | beam.Map(lambda record: list(record))
-      | OnlineClustering(
-          OnlineKMeans, n_clusters=6, batch_size=256, cluster_args={})
+      | "Train clustering model" >> OnlineClustering(
+          OnlineKMeans,
+          n_clusters=6,
+          batch_size=256,
+          cluster_args={},
+          checkpoints_path='/tmp/checkpoints'))
+
+  # 2. Calculate labels for all records in the dataset
+  # using the trained clustering model
+  _ = (

Review Comment:
   We can just call out that this introduces the problem of not classifying any data until all of the data has passed through, and note that this could be resolved with windowing and side inputs.
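
As a rough, self-contained illustration of that pattern (a toy per-window mean stands in for the clustering model; none of this is code from the PR), each element can be labeled against a side input that is recomputed per window, so results are produced as windows close rather than only after the whole dataset has been read:

```python
import apache_beam as beam
from apache_beam.transforms import window


def label(value, threshold):
  # Toy "cluster" assignment relative to the per-window statistic.
  return value, 'high' if value >= threshold else 'low'


with beam.Pipeline() as p:
  values = (
      p
      | beam.Create([(i, float(i)) for i in range(10)])  # (timestamp, value)
      | beam.MapTuple(lambda ts, v: window.TimestampedValue(v, ts))
      | beam.WindowInto(window.FixedWindows(5)))

  # Stand-in for the evolving model: a statistic recomputed for every window.
  per_window_mean = (
      values
      | beam.CombineGlobally(
          beam.combiners.MeanCombineFn()).without_defaults())

  # Each element is matched against the side input of its own window.
  _ = (
      values
      | beam.Map(label, threshold=beam.pvalue.AsSingleton(per_window_mean))
      | beam.Map(print))
```

In the example pipeline, the trained clustering model (or its latest checkpoint) would play the role of `per_window_mean`.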



##########
sdks/python/apache_beam/examples/california_housing_clustering.py:
##########
@@ -40,17 +74,30 @@ def run(
   if not test_pipeline:
     pipeline = beam.Pipeline(options=pipeline_options)
 
-  data = (pipeline | read_csv(known_args.input))
+  data = pipeline | read_csv(known_args.input)
 
   features = ['longitude', 'latitude', 'median_income']
 
   housing_features = to_pcollection(data[features])
 
+  # 1. Calculate clustering centers and save model to persistent storage
   _ = (
       housing_features
       | beam.Map(lambda record: list(record))
-      | OnlineClustering(
-          OnlineKMeans, n_clusters=6, batch_size=256, cluster_args={})
+      | "Train clustering model" >> OnlineClustering(
+          OnlineKMeans,
+          n_clusters=6,
+          batch_size=256,
+          cluster_args={},
+          checkpoints_path='/tmp/checkpoints'))
+
+  # 2. Calculate labels for all records in the dataset
+  # using the trained clustering model
+  _ = (

Review Comment:
   I don't think having two totally separate transforms is what we want. This transform should still depend on the first transform, and probably shouldn't use RunInference (since RunInference won't reload the model over time). Otherwise, this likely either won't work at all (and will throw when loading the model), or it will only ever use the clusters generated by the first few datapoints (and won't update over time).
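
To illustrate what that dependency could look like (assuming, per the suggestion on `SaveModel` above, that the training branch emits the trained model as a PCollection `trained_model`; `AssignLabels` is a hypothetical DoFn, not part of this PR):

```python
class AssignLabels(beam.DoFn):
  """Hypothetical labeling step that consumes the trained model directly."""
  def process(self, record, model):
    # The model arrives as a side input from the training branch, so this
    # step cannot start before training has produced a model and does not
    # need to reload a checkpoint from a fixed path via RunInference.
    yield record, int(model.predict([record])[0])


_ = (
    housing_features
    | 'To list for labeling' >> beam.Map(list)
    | 'Assign cluster labels' >> beam.ParDo(
        AssignLabels(), model=beam.pvalue.AsSingleton(trained_model))
    | 'Print labels' >> beam.Map(print))
```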



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org