You are viewing a plain text version of this content. The canonical link for it is here.

Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2022/04/13 21:21:04 UTC

[GitHub] [beam] ryanthompson591 opened a new pull request, #17368: [BEAM-13983] Sklearn loader

ryanthompson591 opened a new pull request, #17368:
URL: https://github.com/apache/beam/pull/17368

   **Please** add a meaningful description for your change here
   Adds a loader for loading and predicting on sklearn models.
   
   
   ------------------------
   
   Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:
   
    - [ ] [**Choose reviewer(s)**](https://beam.apache.org/contribute/#make-your-change) and mention them in a comment (`R: @username`).
    - [ ] Format the pull request title like `[BEAM-XXX] Fixes bug in ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.
    - [ ] Update `CHANGES.md` with noteworthy changes.
    - [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).
   
   See the [Contributor Guide](https://beam.apache.org/contribute) for more tips on [how to make review process smoother](https://beam.apache.org/contribute/#make-reviewers-job-easier).
   
   To check the build health, please visit [https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md](https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md)
   
   GitHub Actions Tests Status (on master branch)
   ------------------------------------------------------------------------------------------------
   [![Build python source distribution and wheels](https://github.com/apache/beam/workflows/Build%20python%20source%20distribution%20and%20wheels/badge.svg?branch=master&event=schedule)](https://github.com/apache/beam/actions?query=workflow%3A%22Build+python+source+distribution+and+wheels%22+branch%3Amaster+event%3Aschedule)
   [![Python tests](https://github.com/apache/beam/workflows/Python%20tests/badge.svg?branch=master&event=schedule)](https://github.com/apache/beam/actions?query=workflow%3A%22Python+Tests%22+branch%3Amaster+event%3Aschedule)
   [![Java tests](https://github.com/apache/beam/workflows/Java%20Tests/badge.svg?branch=master&event=schedule)](https://github.com/apache/beam/actions?query=workflow%3A%22Java+Tests%22+branch%3Amaster+event%3Aschedule)
   
   See [CI.md](https://github.com/apache/beam/blob/master/CI.md) for more information about GitHub Actions CI.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [beam] ryanthompson591 commented on a diff in pull request #17368: [BEAM-13983] Sklearn Loader for RunInference

Posted by GitBox <gi...@apache.org>.

ryanthompson591 commented on code in PR #17368:
URL: https://github.com/apache/beam/pull/17368#discussion_r851385588


##########
sdks/python/setup.py:
##########
@@ -159,6 +159,7 @@ def get_version():
 
 REQUIRED_TEST_PACKAGES = [
     'freezegun>=0.3.12',
+    'joblib>=1.1.0',

Review Comment:
   It is in required packages, but not in test packages for whatever reason.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [beam] asf-ci commented on pull request #17368: [BEAM-13983] Sklearn loader

Posted by GitBox <gi...@apache.org>.

asf-ci commented on PR #17368:
URL: https://github.com/apache/beam/pull/17368#issuecomment-1098497719

   Can one of the admins verify this patch?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [beam] ryanthompson591 commented on a diff in pull request #17368: [BEAM-13983] Sklearn Loader for RunInference

Posted by GitBox <gi...@apache.org>.

ryanthompson591 commented on code in PR #17368:
URL: https://github.com/apache/beam/pull/17368#discussion_r853148651


##########
sdks/python/apache_beam/ml/inference/sklearn_loader.py:
##########
@@ -0,0 +1,73 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import abc
+import enum
+import pickle
+import sys
+from dataclasses import dataclass
+from typing import Any
+from typing import Iterable
+from typing import List
+
+import joblib
+import numpy
+
+import apache_beam.ml.inference.api as api
+import apache_beam.ml.inference.base as base
+import sklearn_loader
+from apache_beam.io.filesystems import FileSystems
+
+
+class SerializationType(enum.Enum):
+  PICKLE = 1
+  JOBLIB = 2
+
+
+class SKLearnInferenceRunner(base.InferenceRunner):
+  def run_inference(self, batch: List[numpy.array],
+                    model: Any) -> Iterable[numpy.array]:
+    # vectorize data for better performance
+    vectorized_batch = numpy.stack(batch, axis=0)
+    predictions = model.predict(vectorized_batch)
+    return [api.PredictionResult(x, y) for x, y in zip(batch, predictions)]
+
+  def get_num_bytes(self, batch: List[numpy.array]) -> int:
+    """Returns the number of bytes of data for a batch."""
+    return sum(sys.getsizeof(element) for element in batch)
+
+
+class SKLearnModelLoader(base.ModelLoader):
+  def __init__(
+      self,
+      serialization: SerializationType = SerializationType.PICKLE,

Review Comment:
   Ok, let's go with model_file_type, since that seems a little more clear than model_save_type.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [beam] ryanthompson591 commented on a diff in pull request #17368: [BEAM-13983] Sklearn Loader for RunInference

Posted by GitBox <gi...@apache.org>.

ryanthompson591 commented on code in PR #17368:
URL: https://github.com/apache/beam/pull/17368#discussion_r856226165


##########
sdks/python/apache_beam/ml/inference/sklearn_loader.py:
##########
@@ -0,0 +1,71 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import enum
+import pickle
+import sys
+from typing import Any
+from typing import Iterable
+from typing import List
+
+import joblib

Review Comment:
   Done.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [beam] TheNeuralBit commented on a diff in pull request #17368: [BEAM-13983] Sklearn Loader for RunInference

Posted by GitBox <gi...@apache.org>.

TheNeuralBit commented on code in PR #17368:
URL: https://github.com/apache/beam/pull/17368#discussion_r856490682


##########
sdks/python/apache_beam/ml/inference/sklearn_loader.py:
##########
@@ -0,0 +1,78 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import enum
+import pickle
+import sys
+from typing import Any
+from typing import Iterable
+from typing import List
+
+import numpy
+
+from apache_beam.io.filesystems import FileSystems
+from apache_beam.ml.inference.api import PredictionResult
+from apache_beam.ml.inference.base import InferenceRunner
+from apache_beam.ml.inference.base import ModelLoader
+
+try:
+  import joblib
+except ImportError:
+  # joblib is an optional dependency.
+  pass
+
+
+class ModelFileType(enum.Enum):
+  PICKLE = 1
+  JOBLIB = 2
+
+
+class SklearnInferenceRunner(InferenceRunner):
+  def run_inference(self, batch: List[numpy.array],
+                    model: Any) -> Iterable[numpy.array]:
+    # vectorize data for better performance
+    vectorized_batch = numpy.stack(batch, axis=0)
+    predictions = model.predict(vectorized_batch)
+    return [PredictionResult(x, y) for x, y in zip(batch, predictions)]
+
+  def get_num_bytes(self, batch: List[numpy.array]) -> int:
+    """Returns the number of bytes of data for a batch."""
+    return sum(sys.getsizeof(element) for element in batch)
+
+
+class SklearnModelLoader(ModelLoader):
+  def __init__(
+      self,
+      model_file_type: ModelFileType = ModelFileType.PICKLE,
+      model_uri: str = ''):
+    self._model_file_type = model_file_type
+    self._model_uri = model_uri
+    self._inference_runner = SklearnInferenceRunner()
+
+  def load_model(self):
+    """Loads and initializes a model for processing."""
+    file = FileSystems.open(self._model_uri, 'rb')
+    if self._model_file_type == ModelFileType.PICKLE:
+      return pickle.load(file)
+    elif self._model_file_type == ModelFileType.JOBLIB:
+      if not joblib:
+        raise ImportError('Joblib not available in SklearnModelLoader.')

Review Comment:
   Note this will be an execution time error causing workers to crash, which isn't a great experience. You might consider pointing this to some documentation about setting up dependencies.



##########
sdks/python/setup.py:
##########
@@ -169,6 +170,7 @@ def get_version():
     'pytest>=4.4.0,<5.0',
     'pytest-xdist>=1.29.0,<2',
     'pytest-timeout>=1.3.3,<2',
+    'scikit-learn>=0.24.2',

Review Comment:
   It would be good to get an answer on the above (and similarly for joblib)



##########
sdks/python/apache_beam/ml/inference/sklearn_loader.py:
##########
@@ -0,0 +1,78 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import enum
+import pickle
+import sys
+from typing import Any
+from typing import Iterable
+from typing import List
+
+import numpy
+
+from apache_beam.io.filesystems import FileSystems
+from apache_beam.ml.inference.api import PredictionResult
+from apache_beam.ml.inference.base import InferenceRunner
+from apache_beam.ml.inference.base import ModelLoader
+
+try:
+  import joblib
+except ImportError:
+  # joblib is an optional dependency.
+  pass
+
+
+class ModelFileType(enum.Enum):
+  PICKLE = 1
+  JOBLIB = 2
+
+
+class SklearnInferenceRunner(InferenceRunner):
+  def run_inference(self, batch: List[numpy.array],
+                    model: Any) -> Iterable[numpy.array]:
+    # vectorize data for better performance
+    vectorized_batch = numpy.stack(batch, axis=0)
+    predictions = model.predict(vectorized_batch)
+    return [PredictionResult(x, y) for x, y in zip(batch, predictions)]
+
+  def get_num_bytes(self, batch: List[numpy.array]) -> int:
+    """Returns the number of bytes of data for a batch."""
+    return sum(sys.getsizeof(element) for element in batch)
+
+
+class SklearnModelLoader(ModelLoader):
+  def __init__(
+      self,
+      model_file_type: ModelFileType = ModelFileType.PICKLE,
+      model_uri: str = ''):
+    self._model_file_type = model_file_type
+    self._model_uri = model_uri
+    self._inference_runner = SklearnInferenceRunner()
+
+  def load_model(self):
+    """Loads and initializes a model for processing."""
+    file = FileSystems.open(self._model_uri, 'rb')
+    if self._model_file_type == ModelFileType.PICKLE:
+      return pickle.load(file)
+    elif self._model_file_type == ModelFileType.JOBLIB:
+      if not joblib:
+        raise ImportError('Joblib not available in SklearnModelLoader.')
+      return joblib.load(file)
+    raise TypeError('Unsupported serialization type.')

Review Comment:
   nit: I might make this an assertion error since it's a state we shouldn't get to given ModelFileType is an enum.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [beam] yeandy commented on a diff in pull request #17368: [BEAM-13983] Sklearn Loader for RunInference

Posted by GitBox <gi...@apache.org>.

yeandy commented on code in PR #17368:
URL: https://github.com/apache/beam/pull/17368#discussion_r851304959


##########
sdks/python/setup.py:
##########
@@ -169,6 +170,7 @@ def get_version():
     'pytest>=4.4.0,<5.0',
     'pytest-xdist>=1.29.0,<2',
     'pytest-timeout>=1.3.3,<2',
+    'scikit-learn>=0.24.2',

Review Comment:
   How does this affect `scikit-learn` in the [base_image_requirements_manual.txt](https://github.com/apache/beam/blob/master/sdks/python/container/base_image_requirements_manual.txt#L42)?



##########
sdks/python/apache_beam/ml/inference/sklearn_loader.py:
##########
@@ -0,0 +1,73 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import abc
+import enum
+import pickle
+import sys
+from dataclasses import dataclass
+from typing import Any
+from typing import Iterable
+from typing import List
+
+import joblib
+import numpy
+
+import apache_beam.ml.inference.api as api
+import apache_beam.ml.inference.base as base
+import sklearn_loader
+from apache_beam.io.filesystems import FileSystems
+
+
+class SerializationType(enum.Enum):
+  PICKLE = 1
+  JOBLIB = 2
+
+
+class SKLearnInferenceRunner(base.InferenceRunner):
+  def run_inference(self, batch: List[numpy.array],
+                    model: Any) -> Iterable[numpy.array]:
+    # vectorize data for better performance
+    vectorized_batch = numpy.stack(batch, axis=0)
+    predictions = model.predict(vectorized_batch)
+    return [api.PredictionResult(x, y) for x, y in zip(batch, predictions)]
+
+  def get_num_bytes(self, batch: List[numpy.array]) -> int:
+    """Returns the number of bytes of data for a batch."""
+    return sum(sys.getsizeof(element) for element in batch)
+
+
+class SKLearnModelLoader(base.ModelLoader):
+  def __init__(
+      self,
+      serialization: SerializationType = SerializationType.PICKLE,

Review Comment:
   ```suggestion
         serialization_type: SerializationType = SerializationType.PICKLE,
   ```
   Would it be more clear to have `serialization_type`? 



##########
sdks/python/apache_beam/ml/inference/sklearn_loader.py:
##########
@@ -0,0 +1,73 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import abc
+import enum
+import pickle
+import sys
+from dataclasses import dataclass
+from typing import Any
+from typing import Iterable
+from typing import List
+
+import joblib
+import numpy
+
+import apache_beam.ml.inference.api as api
+import apache_beam.ml.inference.base as base
+import sklearn_loader
+from apache_beam.io.filesystems import FileSystems
+
+
+class SerializationType(enum.Enum):
+  PICKLE = 1
+  JOBLIB = 2
+
+
+class SKLearnInferenceRunner(base.InferenceRunner):
+  def run_inference(self, batch: List[numpy.array],
+                    model: Any) -> Iterable[numpy.array]:
+    # vectorize data for better performance
+    vectorized_batch = numpy.stack(batch, axis=0)
+    predictions = model.predict(vectorized_batch)
+    return [api.PredictionResult(x, y) for x, y in zip(batch, predictions)]
+
+  def get_num_bytes(self, batch: List[numpy.array]) -> int:
+    """Returns the number of bytes of data for a batch."""
+    return sum(sys.getsizeof(element) for element in batch)

Review Comment:
   Does `sys.getsizeof(element)` return the size in bytes of all features in the `element` numpy array? i.e. if each `element` has 4 numeric features, with each feature as 4 bytes, then it will return 16? And then if we have 2 `elements` in the `batch`, then we will return 32?



##########
sdks/python/setup.py:
##########
@@ -159,6 +159,7 @@ def get_version():
 
 REQUIRED_TEST_PACKAGES = [
     'freezegun>=0.3.12',
+    'joblib>=1.1.0',

Review Comment:
   Should we have `joblib` be in the `REQUIRED_PACKAGES`? technically it's being used in the regular `sklearn_loader.py` file.



##########
sdks/python/apache_beam/ml/inference/sklearn_loader.py:
##########
@@ -0,0 +1,73 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import abc
+import enum
+import pickle
+import sys
+from dataclasses import dataclass
+from typing import Any
+from typing import Iterable
+from typing import List
+
+import joblib
+import numpy
+
+import apache_beam.ml.inference.api as api
+import apache_beam.ml.inference.base as base
+import sklearn_loader
+from apache_beam.io.filesystems import FileSystems
+
+
+class SerializationType(enum.Enum):
+  PICKLE = 1
+  JOBLIB = 2
+
+
+class SKLearnInferenceRunner(base.InferenceRunner):
+  def run_inference(self, batch: List[numpy.array],
+                    model: Any) -> Iterable[numpy.array]:
+    # vectorize data for better performance
+    vectorized_batch = numpy.stack(batch, axis=0)
+    predictions = model.predict(vectorized_batch)
+    return [api.PredictionResult(x, y) for x, y in zip(batch, predictions)]
+
+  def get_num_bytes(self, batch: List[numpy.array]) -> int:
+    """Returns the number of bytes of data for a batch."""
+    return sum(sys.getsizeof(element) for element in batch)
+
+
+class SKLearnModelLoader(base.ModelLoader):
+  def __init__(
+      self,
+      serialization: SerializationType = SerializationType.PICKLE,
+      model_uri: str = ''):
+    self._serialization = serialization
+    self._model_uri = model_uri

Review Comment:
   Should we try to be as consistent as possible across frameworks? For example, for Pytorch, I use `state_dict_path`, but I could change it to be `state_dict_uri`. 



##########
sdks/python/apache_beam/ml/inference/sklearn_loader.py:
##########
@@ -0,0 +1,73 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import abc
+import enum
+import pickle
+import sys
+from dataclasses import dataclass
+from typing import Any
+from typing import Iterable
+from typing import List
+
+import joblib
+import numpy
+
+import apache_beam.ml.inference.api as api
+import apache_beam.ml.inference.base as base
+import sklearn_loader
+from apache_beam.io.filesystems import FileSystems
+
+
+class SerializationType(enum.Enum):
+  PICKLE = 1
+  JOBLIB = 2
+
+
+class SKLearnInferenceRunner(base.InferenceRunner):

Review Comment:
   Silly question, but for the sake of consistency and ease of use, what should the naming convention be for the different frameworks? For example, technically `PyTorch` has capital P and T, but in my implementation, I use `Pytorch` for simplicity. (I can change it though)
   
   And for `Scikit-learn`, it's often abbreviated as `sklearn` or `Sklearn`, without the capital K or L. Should we change it to `SklearnInferenceRunner`?
   



##########
sdks/python/apache_beam/ml/inference/sklearn_loader.py:
##########
@@ -0,0 +1,73 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import abc
+import enum
+import pickle
+import sys
+from dataclasses import dataclass
+from typing import Any
+from typing import Iterable
+from typing import List
+
+import joblib
+import numpy
+
+import apache_beam.ml.inference.api as api
+import apache_beam.ml.inference.base as base
+import sklearn_loader
+from apache_beam.io.filesystems import FileSystems
+
+
+class SerializationType(enum.Enum):
+  PICKLE = 1
+  JOBLIB = 2
+
+
+class SKLearnInferenceRunner(base.InferenceRunner):
+  def run_inference(self, batch: List[numpy.array],
+                    model: Any) -> Iterable[numpy.array]:
+    # vectorize data for better performance
+    vectorized_batch = numpy.stack(batch, axis=0)
+    predictions = model.predict(vectorized_batch)
+    return [api.PredictionResult(x, y) for x, y in zip(batch, predictions)]

Review Comment:
   Nice!



##########
sdks/python/apache_beam/ml/inference/sklearn_loader.py:
##########
@@ -0,0 +1,73 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import abc
+import enum
+import pickle
+import sys
+from dataclasses import dataclass
+from typing import Any
+from typing import Iterable
+from typing import List
+
+import joblib
+import numpy
+
+import apache_beam.ml.inference.api as api
+import apache_beam.ml.inference.base as base
+import sklearn_loader
+from apache_beam.io.filesystems import FileSystems
+
+
+class SerializationType(enum.Enum):
+  PICKLE = 1
+  JOBLIB = 2
+
+
+class SKLearnInferenceRunner(base.InferenceRunner):
+  def run_inference(self, batch: List[numpy.array],
+                    model: Any) -> Iterable[numpy.array]:
+    # vectorize data for better performance
+    vectorized_batch = numpy.stack(batch, axis=0)
+    predictions = model.predict(vectorized_batch)
+    return [api.PredictionResult(x, y) for x, y in zip(batch, predictions)]
+
+  def get_num_bytes(self, batch: List[numpy.array]) -> int:
+    """Returns the number of bytes of data for a batch."""
+    return sum(sys.getsizeof(element) for element in batch)
+
+
+class SKLearnModelLoader(base.ModelLoader):
+  def __init__(
+      self,
+      serialization: SerializationType = SerializationType.PICKLE,
+      model_uri: str = ''):
+    self._serialization = serialization
+    self._model_uri = model_uri
+    self._inference_runner = SKLearnInferenceRunner()
+
+  def load_model(self):
+    """Loads and initializes a model for processing."""
+    file = FileSystems.open(self._model_uri, 'rb')
+    if self._serialization == SerializationType.PICKLE:
+      return pickle.load(file)
+    elif self._serialization == SerializationType.JOBLIB:
+      return joblib.load(file)
+    raise ValueError('No supported serialization type.')

Review Comment:
   Assuming the user is only picking from the `SerializationType` enums (and using type checking), we will never hit this case, right?
   
   Can we add in the error message the value of `self._serialization`? And also add a test for this?
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [beam] yeandy commented on a diff in pull request #17368: [BEAM-13983] Sklearn Loader for RunInference

Posted by GitBox <gi...@apache.org>.

yeandy commented on code in PR #17368:
URL: https://github.com/apache/beam/pull/17368#discussion_r852194193


##########
sdks/python/apache_beam/ml/inference/sklearn_loader_test.py:
##########
@@ -151,6 +151,12 @@ def test_bad_file_raises(self):
             SklearnModelLoader(model_uri='/var/bad_file_name'))
         pipeline.run()
 
+  def test_bad_input_type_raises(self):
+    with tempfile.NamedTemporaryFile() as file:
+      with self.assertRaises(TypeError):

Review Comment:
   just to be super clear that it's the custom error that you wrote, it might be better to do `self.assertRaisesRegex(TypeError, 'Unsupported serialization type.')`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [beam] codecov[bot] commented on pull request #17368: [BEAM-13983] Sklearn Loader for RunInference

Posted by GitBox <gi...@apache.org>.

codecov[bot] commented on PR #17368:
URL: https://github.com/apache/beam/pull/17368#issuecomment-1099620459

   # [Codecov](https://codecov.io/gh/apache/beam/pull/17368?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#17368](https://codecov.io/gh/apache/beam/pull/17368?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (9a5e7f2) into [master](https://codecov.io/gh/apache/beam/commit/dffa7c160fb19728b2ed4d9462459915a9bf737a?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (dffa7c1) will **increase** coverage by `0.05%`.
   > The diff coverage is `97.36%`.
   
   ```diff
   @@            Coverage Diff             @@
   ##           master   #17368      +/-   ##
   ==========================================
   + Coverage   73.94%   74.00%   +0.05%     
   ==========================================
     Files         684      686       +2     
     Lines       89519    89773     +254     
   ==========================================
   + Hits        66194    66435     +241     
   - Misses      22165    22178      +13     
     Partials     1160     1160              
   ```
   
   | Flag | Coverage Δ | |
   |---|---|---|
   | python | `83.68% <97.36%> (+0.04%)` | :arrow_up: |
   
   Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#carryforward-flags-in-the-pull-request-comment) to find out more.
   
   | [Impacted Files](https://codecov.io/gh/apache/beam/pull/17368?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [.../python/apache\_beam/ml/inference/sklearn\_loader.py](https://codecov.io/gh/apache/beam/pull/17368/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-c2Rrcy9weXRob24vYXBhY2hlX2JlYW0vbWwvaW5mZXJlbmNlL3NrbGVhcm5fbG9hZGVyLnB5) | `97.36% <97.36%> (ø)` | |
   | [...eam/runners/portability/fn\_api\_runner/execution.py](https://codecov.io/gh/apache/beam/pull/17368/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-c2Rrcy9weXRob24vYXBhY2hlX2JlYW0vcnVubmVycy9wb3J0YWJpbGl0eS9mbl9hcGlfcnVubmVyL2V4ZWN1dGlvbi5weQ==) | `92.25% <0.00%> (-0.81%)` | :arrow_down: |
   | [...apache\_beam/runners/dataflow/internal/apiclient.py](https://codecov.io/gh/apache/beam/pull/17368/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-c2Rrcy9weXRob24vYXBhY2hlX2JlYW0vcnVubmVycy9kYXRhZmxvdy9pbnRlcm5hbC9hcGljbGllbnQucHk=) | `77.36% <0.00%> (-0.21%)` | :arrow_down: |
   | [.../runners/portability/fn\_api\_runner/translations.py](https://codecov.io/gh/apache/beam/pull/17368/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-c2Rrcy9weXRob24vYXBhY2hlX2JlYW0vcnVubmVycy9wb3J0YWJpbGl0eS9mbl9hcGlfcnVubmVyL3RyYW5zbGF0aW9ucy5weQ==) | `92.98% <0.00%> (-0.20%)` | :arrow_down: |
   | [...hon/apache\_beam/runners/worker/bundle\_processor.py](https://codecov.io/gh/apache/beam/pull/17368/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-c2Rrcy9weXRob24vYXBhY2hlX2JlYW0vcnVubmVycy93b3JrZXIvYnVuZGxlX3Byb2Nlc3Nvci5weQ==) | `93.39% <0.00%> (-0.13%)` | :arrow_down: |
   | [sdks/python/apache\_beam/io/iobase.py](https://codecov.io/gh/apache/beam/pull/17368/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-c2Rrcy9weXRob24vYXBhY2hlX2JlYW0vaW8vaW9iYXNlLnB5) | `86.18% <0.00%> (-0.04%)` | :arrow_down: |
   | [setup.py](https://codecov.io/gh/apache/beam/pull/17368/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-c2V0dXAucHk=) | `0.00% <0.00%> (ø)` | |
   | [...on/apache\_beam/runners/dataflow/dataflow\_runner.py](https://codecov.io/gh/apache/beam/pull/17368/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-c2Rrcy9weXRob24vYXBhY2hlX2JlYW0vcnVubmVycy9kYXRhZmxvdy9kYXRhZmxvd19ydW5uZXIucHk=) | `82.89% <0.00%> (ø)` | |
   | [sdks/python/apache\_beam/ml/inference/base.py](https://codecov.io/gh/apache/beam/pull/17368/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-c2Rrcy9weXRob24vYXBhY2hlX2JlYW0vbWwvaW5mZXJlbmNlL2Jhc2UucHk=) | `92.53% <0.00%> (ø)` | |
   | [sdks/python/apache\_beam/dataframe/frames.py](https://codecov.io/gh/apache/beam/pull/17368/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-c2Rrcy9weXRob24vYXBhY2hlX2JlYW0vZGF0YWZyYW1lL2ZyYW1lcy5weQ==) | `95.26% <0.00%> (+0.05%)` | :arrow_up: |
   | ... and [8 more](https://codecov.io/gh/apache/beam/pull/17368/diff?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/beam/pull/17368?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/beam/pull/17368?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [dffa7c1...9a5e7f2](https://codecov.io/gh/apache/beam/pull/17368?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [beam] yeandy commented on a diff in pull request #17368: [BEAM-13983] Sklearn Loader for RunInference

Posted by GitBox <gi...@apache.org>.

yeandy commented on code in PR #17368:
URL: https://github.com/apache/beam/pull/17368#discussion_r851341847


##########
sdks/python/apache_beam/ml/inference/sklearn_loader.py:
##########
@@ -0,0 +1,73 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import abc
+import enum
+import pickle
+import sys
+from dataclasses import dataclass
+from typing import Any
+from typing import Iterable
+from typing import List
+
+import joblib
+import numpy
+
+import apache_beam.ml.inference.api as api
+import apache_beam.ml.inference.base as base
+import sklearn_loader
+from apache_beam.io.filesystems import FileSystems
+
+
+class SerializationType(enum.Enum):
+  PICKLE = 1
+  JOBLIB = 2
+
+
+class SKLearnInferenceRunner(base.InferenceRunner):
+  def run_inference(self, batch: List[numpy.array],
+                    model: Any) -> Iterable[numpy.array]:
+    # vectorize data for better performance
+    vectorized_batch = numpy.stack(batch, axis=0)
+    predictions = model.predict(vectorized_batch)
+    return [api.PredictionResult(x, y) for x, y in zip(batch, predictions)]
+
+  def get_num_bytes(self, batch: List[numpy.array]) -> int:
+    """Returns the number of bytes of data for a batch."""
+    return sum(sys.getsizeof(element) for element in batch)
+
+
+class SKLearnModelLoader(base.ModelLoader):
+  def __init__(
+      self,
+      serialization: SerializationType = SerializationType.PICKLE,
+      model_uri: str = ''):
+    self._serialization = serialization
+    self._model_uri = model_uri

Review Comment:
   The standard Pytorch way is just a single file storing the model spec parameters.
   
   I'll probably stick with `state_dict_path` because that's the Pytorch [convention](https://pytorch.org/tutorials/beginner/saving_loading_models.html) of storing saved models.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [beam] yeandy commented on a diff in pull request #17368: [BEAM-13983] Sklearn Loader for RunInference

Posted by GitBox <gi...@apache.org>.

yeandy commented on code in PR #17368:
URL: https://github.com/apache/beam/pull/17368#discussion_r851340578


##########
sdks/python/apache_beam/ml/inference/sklearn_loader.py:
##########
@@ -0,0 +1,73 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import abc
+import enum
+import pickle
+import sys
+from dataclasses import dataclass
+from typing import Any
+from typing import Iterable
+from typing import List
+
+import joblib
+import numpy
+
+import apache_beam.ml.inference.api as api
+import apache_beam.ml.inference.base as base
+import sklearn_loader
+from apache_beam.io.filesystems import FileSystems
+
+
+class SerializationType(enum.Enum):
+  PICKLE = 1
+  JOBLIB = 2
+
+
+class SKLearnInferenceRunner(base.InferenceRunner):
+  def run_inference(self, batch: List[numpy.array],
+                    model: Any) -> Iterable[numpy.array]:
+    # vectorize data for better performance
+    vectorized_batch = numpy.stack(batch, axis=0)
+    predictions = model.predict(vectorized_batch)
+    return [api.PredictionResult(x, y) for x, y in zip(batch, predictions)]
+
+  def get_num_bytes(self, batch: List[numpy.array]) -> int:
+    """Returns the number of bytes of data for a batch."""
+    return sum(sys.getsizeof(element) for element in batch)
+
+
+class SKLearnModelLoader(base.ModelLoader):
+  def __init__(
+      self,
+      serialization: SerializationType = SerializationType.PICKLE,

Review Comment:
   Personally, I'd rank `serialization_method`, then `model_save_type`. But I agree that it's not that important since there would be documentation. I'll leave it up to you want to change the name away from `serialization`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [beam] ryanthompson591 commented on a diff in pull request #17368: [BEAM-13983] Sklearn Loader for RunInference

Posted by GitBox <gi...@apache.org>.

ryanthompson591 commented on code in PR #17368:
URL: https://github.com/apache/beam/pull/17368#discussion_r857769694


##########
sdks/python/apache_beam/ml/inference/sklearn_loader.py:
##########
@@ -0,0 +1,78 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import enum
+import pickle
+import sys
+from typing import Any
+from typing import Iterable
+from typing import List
+
+import numpy
+
+from apache_beam.io.filesystems import FileSystems
+from apache_beam.ml.inference.api import PredictionResult
+from apache_beam.ml.inference.base import InferenceRunner
+from apache_beam.ml.inference.base import ModelLoader
+
+try:
+  import joblib
+except ImportError:
+  # joblib is an optional dependency.
+  pass
+
+
+class ModelFileType(enum.Enum):
+  PICKLE = 1
+  JOBLIB = 2
+
+
+class SklearnInferenceRunner(InferenceRunner):
+  def run_inference(self, batch: List[numpy.array],
+                    model: Any) -> Iterable[numpy.array]:
+    # vectorize data for better performance
+    vectorized_batch = numpy.stack(batch, axis=0)
+    predictions = model.predict(vectorized_batch)
+    return [PredictionResult(x, y) for x, y in zip(batch, predictions)]
+
+  def get_num_bytes(self, batch: List[numpy.array]) -> int:
+    """Returns the number of bytes of data for a batch."""
+    return sum(sys.getsizeof(element) for element in batch)
+
+
+class SklearnModelLoader(ModelLoader):
+  def __init__(
+      self,
+      model_file_type: ModelFileType = ModelFileType.PICKLE,
+      model_uri: str = ''):
+    self._model_file_type = model_file_type
+    self._model_uri = model_uri
+    self._inference_runner = SklearnInferenceRunner()
+
+  def load_model(self):
+    """Loads and initializes a model for processing."""
+    file = FileSystems.open(self._model_uri, 'rb')
+    if self._model_file_type == ModelFileType.PICKLE:
+      return pickle.load(file)
+    elif self._model_file_type == ModelFileType.JOBLIB:
+      if not joblib:
+        raise ImportError('Joblib not available in SklearnModelLoader.')
+      return joblib.load(file)
+    raise TypeError('Unsupported serialization type.')

Review Comment:
   Done.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [beam] ryanthompson591 commented on pull request #17368: [BEAM-13983] Sklearn Loader for RunInference

Posted by GitBox <gi...@apache.org>.

ryanthompson591 commented on PR #17368:
URL: https://github.com/apache/beam/pull/17368#issuecomment-1101598836

   R: @TheNeuralBit


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [beam] ryanthompson591 commented on a diff in pull request #17368: [BEAM-13983] Sklearn Loader for RunInference

Posted by GitBox <gi...@apache.org>.

ryanthompson591 commented on code in PR #17368:
URL: https://github.com/apache/beam/pull/17368#discussion_r855705844


##########
sdks/python/setup.py:
##########
@@ -159,6 +159,7 @@ def get_version():
 
 REQUIRED_TEST_PACKAGES = [
     'freezegun>=0.3.12',
+    'joblib>=1.1.0',

Review Comment:
   OK.  I've made joblib optional.
   
   Unless its necessary I'll keep it out of REQUIRED_PACKAGES for now.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [beam] ryanthompson591 commented on a diff in pull request #17368: [BEAM-13983] Sklearn Loader for RunInference

Posted by GitBox <gi...@apache.org>.

ryanthompson591 commented on code in PR #17368:
URL: https://github.com/apache/beam/pull/17368#discussion_r851347805


##########
sdks/python/apache_beam/ml/inference/sklearn_loader.py:
##########
@@ -0,0 +1,73 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import abc
+import enum
+import pickle
+import sys
+from dataclasses import dataclass
+from typing import Any
+from typing import Iterable
+from typing import List
+
+import joblib
+import numpy
+
+import apache_beam.ml.inference.api as api
+import apache_beam.ml.inference.base as base
+import sklearn_loader
+from apache_beam.io.filesystems import FileSystems
+
+
+class SerializationType(enum.Enum):
+  PICKLE = 1
+  JOBLIB = 2
+
+
+class SKLearnInferenceRunner(base.InferenceRunner):

Review Comment:
   Looking at sklearn in particular it is always lower case.
   
   Elsewhere in Beam we do have consecutive capitals like:
   PValue
   ReadFromTFRecord
   
   I think for PyTorch you should just follow what you think users would expect (WRT naming).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [beam] TheNeuralBit merged pull request #17368: [BEAM-13983] Sklearn Loader for RunInference

Posted by GitBox <gi...@apache.org>.

TheNeuralBit merged PR #17368:
URL: https://github.com/apache/beam/pull/17368


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [beam] ryanthompson591 commented on a diff in pull request #17368: [BEAM-13983] Sklearn Loader for RunInference

Posted by GitBox <gi...@apache.org>.

ryanthompson591 commented on code in PR #17368:
URL: https://github.com/apache/beam/pull/17368#discussion_r851329698


##########
sdks/python/apache_beam/ml/inference/sklearn_loader.py:
##########
@@ -0,0 +1,73 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import abc
+import enum
+import pickle
+import sys
+from dataclasses import dataclass
+from typing import Any
+from typing import Iterable
+from typing import List
+
+import joblib
+import numpy
+
+import apache_beam.ml.inference.api as api
+import apache_beam.ml.inference.base as base
+import sklearn_loader
+from apache_beam.io.filesystems import FileSystems
+
+
+class SerializationType(enum.Enum):
+  PICKLE = 1
+  JOBLIB = 2
+
+
+class SKLearnInferenceRunner(base.InferenceRunner):
+  def run_inference(self, batch: List[numpy.array],
+                    model: Any) -> Iterable[numpy.array]:
+    # vectorize data for better performance
+    vectorized_batch = numpy.stack(batch, axis=0)
+    predictions = model.predict(vectorized_batch)
+    return [api.PredictionResult(x, y) for x, y in zip(batch, predictions)]
+
+  def get_num_bytes(self, batch: List[numpy.array]) -> int:
+    """Returns the number of bytes of data for a batch."""
+    return sum(sys.getsizeof(element) for element in batch)
+
+
+class SKLearnModelLoader(base.ModelLoader):
+  def __init__(
+      self,
+      serialization: SerializationType = SerializationType.PICKLE,

Review Comment:
   Yeah, I don't think its worth having a really long name.  Users can use documentation to choose.
   
   Do you have a preference:
   file_type
   serialization_method
   import_lib
   model_save_type
   load_model_lib
   model_lib



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [beam] asf-ci commented on pull request #17368: [BEAM-13983] Sklearn loader

Posted by GitBox <gi...@apache.org>.

asf-ci commented on PR #17368:
URL: https://github.com/apache/beam/pull/17368#issuecomment-1098497717

   Can one of the admins verify this patch?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [beam] ryanthompson591 commented on pull request #17368: [BEAM-13983] Sklearn Loader for RunInference

Posted by GitBox <gi...@apache.org>.

ryanthompson591 commented on PR #17368:
URL: https://github.com/apache/beam/pull/17368#issuecomment-1100127642

   R: @yeandy 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [beam] ryanthompson591 commented on a diff in pull request #17368: [BEAM-13983] Sklearn Loader for RunInference

Posted by GitBox <gi...@apache.org>.

ryanthompson591 commented on code in PR #17368:
URL: https://github.com/apache/beam/pull/17368#discussion_r851385786


##########
sdks/python/apache_beam/ml/inference/sklearn_loader.py:
##########
@@ -0,0 +1,73 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import abc
+import enum
+import pickle
+import sys
+from dataclasses import dataclass
+from typing import Any
+from typing import Iterable
+from typing import List
+
+import joblib
+import numpy
+
+import apache_beam.ml.inference.api as api
+import apache_beam.ml.inference.base as base
+import sklearn_loader
+from apache_beam.io.filesystems import FileSystems
+
+
+class SerializationType(enum.Enum):
+  PICKLE = 1
+  JOBLIB = 2
+
+
+class SKLearnInferenceRunner(base.InferenceRunner):
+  def run_inference(self, batch: List[numpy.array],
+                    model: Any) -> Iterable[numpy.array]:
+    # vectorize data for better performance
+    vectorized_batch = numpy.stack(batch, axis=0)
+    predictions = model.predict(vectorized_batch)
+    return [api.PredictionResult(x, y) for x, y in zip(batch, predictions)]

Review Comment:
   yeah, feel free to use in the pytorch impl.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [beam] yeandy commented on a diff in pull request #17368: [BEAM-13983] Sklearn Loader for RunInference

Posted by GitBox <gi...@apache.org>.

yeandy commented on code in PR #17368:
URL: https://github.com/apache/beam/pull/17368#discussion_r851315082


##########
sdks/python/apache_beam/ml/inference/sklearn_loader.py:
##########
@@ -0,0 +1,73 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import abc
+import enum
+import pickle
+import sys
+from dataclasses import dataclass
+from typing import Any
+from typing import Iterable
+from typing import List
+
+import joblib
+import numpy
+
+import apache_beam.ml.inference.api as api
+import apache_beam.ml.inference.base as base

Review Comment:
   Would it be more clear to do absolute imports
   `from apache_beam.ml.inference.base import InferenceRunner`?
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [beam] ryanthompson591 commented on a diff in pull request #17368: [BEAM-13983] Sklearn Loader for RunInference

Posted by GitBox <gi...@apache.org>.

ryanthompson591 commented on code in PR #17368:
URL: https://github.com/apache/beam/pull/17368#discussion_r851358859


##########
sdks/python/apache_beam/ml/inference/sklearn_loader.py:
##########
@@ -0,0 +1,73 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import abc
+import enum
+import pickle
+import sys
+from dataclasses import dataclass
+from typing import Any
+from typing import Iterable
+from typing import List
+
+import joblib
+import numpy
+
+import apache_beam.ml.inference.api as api
+import apache_beam.ml.inference.base as base
+import sklearn_loader
+from apache_beam.io.filesystems import FileSystems
+
+
+class SerializationType(enum.Enum):
+  PICKLE = 1
+  JOBLIB = 2
+
+
+class SKLearnInferenceRunner(base.InferenceRunner):
+  def run_inference(self, batch: List[numpy.array],
+                    model: Any) -> Iterable[numpy.array]:
+    # vectorize data for better performance
+    vectorized_batch = numpy.stack(batch, axis=0)
+    predictions = model.predict(vectorized_batch)
+    return [api.PredictionResult(x, y) for x, y in zip(batch, predictions)]
+
+  def get_num_bytes(self, batch: List[numpy.array]) -> int:
+    """Returns the number of bytes of data for a batch."""
+    return sum(sys.getsizeof(element) for element in batch)
+
+
+class SKLearnModelLoader(base.ModelLoader):
+  def __init__(
+      self,
+      serialization: SerializationType = SerializationType.PICKLE,
+      model_uri: str = ''):
+    self._serialization = serialization
+    self._model_uri = model_uri
+    self._inference_runner = SKLearnInferenceRunner()
+
+  def load_model(self):
+    """Loads and initializes a model for processing."""
+    file = FileSystems.open(self._model_uri, 'rb')
+    if self._serialization == SerializationType.PICKLE:
+      return pickle.load(file)
+    elif self._serialization == SerializationType.JOBLIB:
+      return joblib.load(file)
+    raise ValueError('No supported serialization type.')

Review Comment:
   Yeah, this shouldn't happen, I added a simple test just to make sure.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [beam] ryanthompson591 commented on a diff in pull request #17368: [BEAM-13983] Sklearn Loader for RunInference

Posted by GitBox <gi...@apache.org>.

ryanthompson591 commented on code in PR #17368:
URL: https://github.com/apache/beam/pull/17368#discussion_r851391753


##########
sdks/python/setup.py:
##########
@@ -169,6 +170,7 @@ def get_version():
     'pytest>=4.4.0,<5.0',
     'pytest-xdist>=1.29.0,<2',
     'pytest-timeout>=1.3.3,<2',
+    'scikit-learn>=0.24.2',

Review Comment:
   AFAIK this is just for testing, like unit and e2e tests.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [beam] ryanthompson591 commented on a diff in pull request #17368: [BEAM-13983] Sklearn Loader for RunInference

Posted by GitBox <gi...@apache.org>.

ryanthompson591 commented on code in PR #17368:
URL: https://github.com/apache/beam/pull/17368#discussion_r856252751


##########
sdks/python/apache_beam/ml/inference/sklearn_loader.py:
##########
@@ -0,0 +1,73 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import abc
+import enum
+import pickle
+import sys
+from dataclasses import dataclass
+from typing import Any
+from typing import Iterable
+from typing import List
+
+import joblib
+import numpy
+
+import apache_beam.ml.inference.api as api
+import apache_beam.ml.inference.base as base
+import sklearn_loader
+from apache_beam.io.filesystems import FileSystems
+
+
+class SerializationType(enum.Enum):
+  PICKLE = 1
+  JOBLIB = 2
+
+
+class SKLearnInferenceRunner(base.InferenceRunner):
+  def run_inference(self, batch: List[numpy.array],
+                    model: Any) -> Iterable[numpy.array]:
+    # vectorize data for better performance
+    vectorized_batch = numpy.stack(batch, axis=0)
+    predictions = model.predict(vectorized_batch)
+    return [api.PredictionResult(x, y) for x, y in zip(batch, predictions)]
+
+  def get_num_bytes(self, batch: List[numpy.array]) -> int:
+    """Returns the number of bytes of data for a batch."""
+    return sum(sys.getsizeof(element) for element in batch)
+
+
+class SKLearnModelLoader(base.ModelLoader):
+  def __init__(
+      self,
+      serialization: SerializationType = SerializationType.PICKLE,
+      model_uri: str = ''):
+    self._serialization = serialization
+    self._model_uri = model_uri

Review Comment:
   I agree.
   
   A state dict is a python dictionary object used by pytorch. It's strange to use that name for sklearn. I like model uri, model_path is fine too.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [beam] ryanthompson591 commented on a diff in pull request #17368: [BEAM-13983] Sklearn Loader for RunInference

Posted by GitBox <gi...@apache.org>.

ryanthompson591 commented on code in PR #17368:
URL: https://github.com/apache/beam/pull/17368#discussion_r857769301


##########
sdks/python/apache_beam/ml/inference/sklearn_loader.py:
##########
@@ -0,0 +1,78 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import enum
+import pickle
+import sys
+from typing import Any
+from typing import Iterable
+from typing import List
+
+import numpy
+
+from apache_beam.io.filesystems import FileSystems
+from apache_beam.ml.inference.api import PredictionResult
+from apache_beam.ml.inference.base import InferenceRunner
+from apache_beam.ml.inference.base import ModelLoader
+
+try:
+  import joblib
+except ImportError:
+  # joblib is an optional dependency.
+  pass
+
+
+class ModelFileType(enum.Enum):
+  PICKLE = 1
+  JOBLIB = 2
+
+
+class SklearnInferenceRunner(InferenceRunner):
+  def run_inference(self, batch: List[numpy.array],
+                    model: Any) -> Iterable[numpy.array]:
+    # vectorize data for better performance
+    vectorized_batch = numpy.stack(batch, axis=0)
+    predictions = model.predict(vectorized_batch)
+    return [PredictionResult(x, y) for x, y in zip(batch, predictions)]
+
+  def get_num_bytes(self, batch: List[numpy.array]) -> int:
+    """Returns the number of bytes of data for a batch."""
+    return sum(sys.getsizeof(element) for element in batch)
+
+
+class SklearnModelLoader(ModelLoader):
+  def __init__(
+      self,
+      model_file_type: ModelFileType = ModelFileType.PICKLE,
+      model_uri: str = ''):
+    self._model_file_type = model_file_type
+    self._model_uri = model_uri
+    self._inference_runner = SklearnInferenceRunner()
+
+  def load_model(self):
+    """Loads and initializes a model for processing."""
+    file = FileSystems.open(self._model_uri, 'rb')
+    if self._model_file_type == ModelFileType.PICKLE:
+      return pickle.load(file)
+    elif self._model_file_type == ModelFileType.JOBLIB:
+      if not joblib:
+        raise ImportError('Joblib not available in SklearnModelLoader.')

Review Comment:
   Are you saying that we should not crash here? In my opinion, if this requirement is missing the pipeline should fail and crash since this whole transform will simply not work. This is consistent with when other transforms fail.
   
   I made this error message a little more consistent, and pointed to docs.
   
   Other import errors:
   
   https://github.com/apache/beam/blob/3f2e3c7c9eccb9d40370cbc70e9a451a4b5573f5/sdks/python/apache_beam/ml/gcp/visionml.py#L41
   
   https://github.com/apache/beam/blob/3f2e3c7c9eccb9d40370cbc70e9a451a4b5573f5/sdks/python/apache_beam/runners/dataflow/dataflow_metrics.py#L298
    
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [beam] asf-ci commented on pull request #17368: [BEAM-13983] Sklearn loader

Posted by GitBox <gi...@apache.org>.

asf-ci commented on PR #17368:
URL: https://github.com/apache/beam/pull/17368#issuecomment-1098497726

   Can one of the admins verify this patch?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [beam] ryanthompson591 commented on a diff in pull request #17368: [BEAM-13983] Sklearn Loader for RunInference

Posted by GitBox <gi...@apache.org>.

ryanthompson591 commented on code in PR #17368:
URL: https://github.com/apache/beam/pull/17368#discussion_r851328207


##########
sdks/python/apache_beam/ml/inference/sklearn_loader.py:
##########
@@ -0,0 +1,73 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import abc
+import enum
+import pickle
+import sys
+from dataclasses import dataclass
+from typing import Any
+from typing import Iterable
+from typing import List
+
+import joblib
+import numpy
+
+import apache_beam.ml.inference.api as api
+import apache_beam.ml.inference.base as base
+import sklearn_loader
+from apache_beam.io.filesystems import FileSystems
+
+
+class SerializationType(enum.Enum):
+  PICKLE = 1
+  JOBLIB = 2
+
+
+class SKLearnInferenceRunner(base.InferenceRunner):
+  def run_inference(self, batch: List[numpy.array],
+                    model: Any) -> Iterable[numpy.array]:
+    # vectorize data for better performance
+    vectorized_batch = numpy.stack(batch, axis=0)
+    predictions = model.predict(vectorized_batch)
+    return [api.PredictionResult(x, y) for x, y in zip(batch, predictions)]
+
+  def get_num_bytes(self, batch: List[numpy.array]) -> int:
+    """Returns the number of bytes of data for a batch."""
+    return sum(sys.getsizeof(element) for element in batch)
+
+
+class SKLearnModelLoader(base.ModelLoader):
+  def __init__(
+      self,
+      serialization: SerializationType = SerializationType.PICKLE,
+      model_uri: str = ''):
+    self._serialization = serialization
+    self._model_uri = model_uri

Review Comment:
   I think it's fine to use different parameters. In my view, path communicates more a directory, and URI a single file.
   
   TFX-BSL is using saved_model_spec (which is a proto) that contains model_path. As far as I can tell, TF saves models to a path rather than a URI or file.
   https://cloud.google.com/blog/topics/developers-practitioners/using-tfx-inference-dataflow-large-scale-ml-inference-patterns
   
   What does pytorch do? Is it a single file or a path with a bunch of data?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [beam] ryanthompson591 commented on a diff in pull request #17368: [BEAM-13983] Sklearn Loader for RunInference

Posted by GitBox <gi...@apache.org>.

ryanthompson591 commented on code in PR #17368:
URL: https://github.com/apache/beam/pull/17368#discussion_r851355795


##########
sdks/python/apache_beam/ml/inference/sklearn_loader.py:
##########
@@ -0,0 +1,73 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import abc
+import enum
+import pickle
+import sys
+from dataclasses import dataclass
+from typing import Any
+from typing import Iterable
+from typing import List
+
+import joblib
+import numpy
+
+import apache_beam.ml.inference.api as api
+import apache_beam.ml.inference.base as base
+import sklearn_loader
+from apache_beam.io.filesystems import FileSystems
+
+
+class SerializationType(enum.Enum):
+  PICKLE = 1
+  JOBLIB = 2
+
+
+class SKLearnInferenceRunner(base.InferenceRunner):
+  def run_inference(self, batch: List[numpy.array],
+                    model: Any) -> Iterable[numpy.array]:
+    # vectorize data for better performance
+    vectorized_batch = numpy.stack(batch, axis=0)
+    predictions = model.predict(vectorized_batch)
+    return [api.PredictionResult(x, y) for x, y in zip(batch, predictions)]
+
+  def get_num_bytes(self, batch: List[numpy.array]) -> int:
+    """Returns the number of bytes of data for a batch."""
+    return sum(sys.getsizeof(element) for element in batch)

Review Comment:
   Yeah, you pretty much got it. There is a little overhead that getsizeof measures, but this would be similar to what serialization does.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [beam] ryanthompson591 commented on a diff in pull request #17368: [BEAM-13983] Sklearn Loader for RunInference

Posted by GitBox <gi...@apache.org>.

ryanthompson591 commented on code in PR #17368:
URL: https://github.com/apache/beam/pull/17368#discussion_r853147297


##########
sdks/python/apache_beam/ml/inference/sklearn_loader.py:
##########
@@ -0,0 +1,73 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import abc
+import enum
+import pickle
+import sys
+from dataclasses import dataclass
+from typing import Any
+from typing import Iterable
+from typing import List
+
+import joblib
+import numpy
+
+import apache_beam.ml.inference.api as api
+import apache_beam.ml.inference.base as base

Review Comment:
   Done, I prefer the absolute imports in this case as I think they're more clear.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [beam] TheNeuralBit commented on a diff in pull request #17368: [BEAM-13983] Sklearn Loader for RunInference

Posted by GitBox <gi...@apache.org>.

TheNeuralBit commented on code in PR #17368:
URL: https://github.com/apache/beam/pull/17368#discussion_r854308117


##########
sdks/python/apache_beam/ml/inference/sklearn_loader.py:
##########
@@ -0,0 +1,71 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import enum
+import pickle
+import sys
+from typing import Any
+from typing import Iterable
+from typing import List
+
+import joblib

Review Comment:
   I think should we catch the ImportError on this since it's an optional dependency. We should only fail if the user tries to load a model with `ModelFileType.JOBLIB` and joblib failed to import.



##########
sdks/python/setup.py:
##########
@@ -159,6 +159,7 @@ def get_version():
 
 REQUIRED_TEST_PACKAGES = [
     'freezegun>=0.3.12',
+    'joblib>=1.1.0',

Review Comment:
   > It is in required packages
   
   I think Andy is referring to this when he says `REQUIRED_PACKAGES`: https://github.com/apache/beam/blob/e4d2050ccbaafb90428ab6c0cc494039f6282dae/sdks/python/setup.py#L123-L152
   
   AFAICT joblib isn't there or anywhere else in setup.py. What do you mean by that?
   
   Regardless, I think it's appropriate to just add joblib in the test packages, since it's an optional dependency (most Beam users can get along without it, and even SklearnRunInference users can get along without it, unless they change the default `model_file_type` to joblib). That being said for an optional dependency, we may want to be more lenient (see my next comment)



##########
sdks/python/apache_beam/ml/inference/sklearn_loader.py:
##########
@@ -0,0 +1,73 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import abc
+import enum
+import pickle
+import sys
+from dataclasses import dataclass
+from typing import Any
+from typing import Iterable
+from typing import List
+
+import joblib
+import numpy
+
+import apache_beam.ml.inference.api as api
+import apache_beam.ml.inference.base as base
+import sklearn_loader
+from apache_beam.io.filesystems import FileSystems
+
+
+class SerializationType(enum.Enum):
+  PICKLE = 1
+  JOBLIB = 2
+
+
+class SKLearnInferenceRunner(base.InferenceRunner):
+  def run_inference(self, batch: List[numpy.array],
+                    model: Any) -> Iterable[numpy.array]:
+    # vectorize data for better performance
+    vectorized_batch = numpy.stack(batch, axis=0)
+    predictions = model.predict(vectorized_batch)
+    return [api.PredictionResult(x, y) for x, y in zip(batch, predictions)]
+
+  def get_num_bytes(self, batch: List[numpy.array]) -> int:
+    """Returns the number of bytes of data for a batch."""
+    return sum(sys.getsizeof(element) for element in batch)
+
+
+class SKLearnModelLoader(base.ModelLoader):
+  def __init__(
+      self,
+      serialization: SerializationType = SerializationType.PICKLE,
+      model_uri: str = ''):
+    self._serialization = serialization
+    self._model_uri = model_uri

Review Comment:
   It sounds like a single filepath is the appropriate interface for both sklearn and pytorch, why not be consistent?



##########
sdks/python/setup.py:
##########
@@ -169,6 +170,7 @@ def get_version():
     'pytest>=4.4.0,<5.0',
     'pytest-xdist>=1.29.0,<2',
     'pytest-timeout>=1.3.3,<2',
+    'scikit-learn>=0.24.2',

Review Comment:
   Is there a reason for the lower bound on the sklearn version? If this doesn't work with earlier versions we should make sure to communicate that somehow.



##########
sdks/python/apache_beam/ml/inference/sklearn_loader_test.py:
##########
@@ -151,6 +151,12 @@ def test_bad_file_raises(self):
             SklearnModelLoader(model_uri='/var/bad_file_name'))
         pipeline.run()
 
+  def test_bad_input_type_raises(self):
+    with tempfile.NamedTemporaryFile() as file:
+      with self.assertRaises(TypeError):

Review Comment:
   +1



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [beam] TheNeuralBit commented on a diff in pull request #17368: [BEAM-13983] Sklearn Loader for RunInference

Posted by GitBox <gi...@apache.org>.

TheNeuralBit commented on code in PR #17368:
URL: https://github.com/apache/beam/pull/17368#discussion_r857832187


##########
sdks/python/apache_beam/ml/inference/sklearn_loader.py:
##########
@@ -0,0 +1,78 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import enum
+import pickle
+import sys
+from typing import Any
+from typing import Iterable
+from typing import List
+
+import numpy
+
+from apache_beam.io.filesystems import FileSystems
+from apache_beam.ml.inference.api import PredictionResult
+from apache_beam.ml.inference.base import InferenceRunner
+from apache_beam.ml.inference.base import ModelLoader
+
+try:
+  import joblib
+except ImportError:
+  # joblib is an optional dependency.
+  pass
+
+
+class ModelFileType(enum.Enum):
+  PICKLE = 1
+  JOBLIB = 2
+
+
+class SklearnInferenceRunner(InferenceRunner):
+  def run_inference(self, batch: List[numpy.array],
+                    model: Any) -> Iterable[numpy.array]:
+    # vectorize data for better performance
+    vectorized_batch = numpy.stack(batch, axis=0)
+    predictions = model.predict(vectorized_batch)
+    return [PredictionResult(x, y) for x, y in zip(batch, predictions)]
+
+  def get_num_bytes(self, batch: List[numpy.array]) -> int:
+    """Returns the number of bytes of data for a batch."""
+    return sum(sys.getsizeof(element) for element in batch)
+
+
+class SklearnModelLoader(ModelLoader):
+  def __init__(
+      self,
+      model_file_type: ModelFileType = ModelFileType.PICKLE,
+      model_uri: str = ''):
+    self._model_file_type = model_file_type
+    self._model_uri = model_uri
+    self._inference_runner = SklearnInferenceRunner()
+
+  def load_model(self):
+    """Loads and initializes a model for processing."""
+    file = FileSystems.open(self._model_uri, 'rb')
+    if self._model_file_type == ModelFileType.PICKLE:
+      return pickle.load(file)
+    elif self._model_file_type == ModelFileType.JOBLIB:
+      if not joblib:
+        raise ImportError('Joblib not available in SklearnModelLoader.')

Review Comment:
   No I wasn't suggesting we shouldn't crash here, we don't have a choice :)
   
   The options to improve would be:
   - fail faster, ideally at pipeline construction time - there's not great options for this since we don't know what dependencies will be installed on the worker then.
   - make the execution time error very clear and helpful (what I'm suggesting).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org