Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2022/04/20 16:08:04 UTC
[GitHub] [beam] TheNeuralBit commented on a diff in pull request #17368: [BEAM-13983] Sklearn Loader for RunInference

TheNeuralBit commented on code in PR #17368:
URL: https://github.com/apache/beam/pull/17368#discussion_r854308117


##########
sdks/python/apache_beam/ml/inference/sklearn_loader.py:
##########
@@ -0,0 +1,71 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import enum
+import pickle
+import sys
+from typing import Any
+from typing import Iterable
+from typing import List
+
+import joblib

Review Comment:
   I think we should catch the ImportError on this, since joblib is an optional dependency. We should only fail if the user tries to load a model with `ModelFileType.JOBLIB` and joblib failed to import.
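
One way the deferred failure described above could look, as a minimal sketch rather than the PR's actual code. The `ModelFileType` enum mirrors the name used in the comment, and `load_model` is a hypothetical helper introduced only for illustration; it takes an already-open binary file object.

```python
import enum
import pickle

try:
  import joblib
except ImportError:
  # joblib is optional: remember that it is missing and only fail later,
  # when a joblib-serialized model is actually requested.
  joblib = None


class ModelFileType(enum.Enum):
  # Assumed enum, named after the one mentioned in the review comment.
  PICKLE = 1
  JOBLIB = 2


def load_model(file, model_file_type=ModelFileType.PICKLE):
  """Hypothetical helper: loads a model from an open binary file object."""
  if model_file_type == ModelFileType.JOBLIB:
    if joblib is None:
      raise ImportError(
          'Loading a model with ModelFileType.JOBLIB requires joblib, '
          'which could not be imported.')
    return joblib.load(file)
  return pickle.load(file)
```

With this shape, importing the module never fails because of joblib, and users who stay on the pickle default are unaffected.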



##########
sdks/python/setup.py:
##########
@@ -159,6 +159,7 @@ def get_version():
 
 REQUIRED_TEST_PACKAGES = [
     'freezegun>=0.3.12',
+    'joblib>=1.1.0',

Review Comment:
   > It is in required packages
   
   I think Andy is referring to this when he says `REQUIRED_PACKAGES`: https://github.com/apache/beam/blob/e4d2050ccbaafb90428ab6c0cc494039f6282dae/sdks/python/setup.py#L123-L152
   
   AFAICT joblib isn't there or anywhere else in setup.py. What do you mean by that?
   
   Regardless, I think it's appropriate to just add joblib to the test packages, since it's an optional dependency (most Beam users can get along without it, and even SklearnRunInference users can get along without it unless they change the default `model_file_type` to joblib). That being said, for an optional dependency we may want to be more lenient (see my next comment).



##########
sdks/python/apache_beam/ml/inference/sklearn_loader.py:
##########
@@ -0,0 +1,73 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import abc
+import enum
+import pickle
+import sys
+from dataclasses import dataclass
+from typing import Any
+from typing import Iterable
+from typing import List
+
+import joblib
+import numpy
+
+import apache_beam.ml.inference.api as api
+import apache_beam.ml.inference.base as base
+import sklearn_loader
+from apache_beam.io.filesystems import FileSystems
+
+
+class SerializationType(enum.Enum):
+  PICKLE = 1
+  JOBLIB = 2
+
+
+class SKLearnInferenceRunner(base.InferenceRunner):
+  def run_inference(self, batch: List[numpy.array],
+                    model: Any) -> Iterable[numpy.array]:
+    # vectorize data for better performance
+    vectorized_batch = numpy.stack(batch, axis=0)
+    predictions = model.predict(vectorized_batch)
+    return [api.PredictionResult(x, y) for x, y in zip(batch, predictions)]
+
+  def get_num_bytes(self, batch: List[numpy.array]) -> int:
+    """Returns the number of bytes of data for a batch."""
+    return sum(sys.getsizeof(element) for element in batch)
+
+
+class SKLearnModelLoader(base.ModelLoader):
+  def __init__(
+      self,
+      serialization: SerializationType = SerializationType.PICKLE,
+      model_uri: str = ''):
+    self._serialization = serialization
+    self._model_uri = model_uri

Review Comment:
   It sounds like a single filepath is the appropriate interface for both sklearn and pytorch, so why not be consistent?



##########
sdks/python/setup.py:
##########
@@ -169,6 +170,7 @@ def get_version():
     'pytest>=4.4.0,<5.0',
     'pytest-xdist>=1.29.0,<2',
     'pytest-timeout>=1.3.3,<2',
+    'scikit-learn>=0.24.2',

Review Comment:
   Is there a reason for the lower bound on the sklearn version? If this doesn't work with earlier versions we should make sure to communicate that somehow.
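
If earlier versions turn out not to work, one possible way to communicate that is a warning when an older scikit-learn is detected. The sketch below is only an illustration under that assumption: `MIN_TESTED_SKLEARN`, `parse_version`, and `warn_if_untested` are hypothetical names, the minimum is copied from the lower bound in this diff, and the caller is assumed to pass in `sklearn.__version__`.

```python
import warnings

# Assumed minimum, taken from the 'scikit-learn>=0.24.2' bound in this diff.
MIN_TESTED_SKLEARN = (0, 24, 2)


def parse_version(version):
  """Parses the leading numeric components of a version string."""
  parts = []
  for piece in version.split('.'):
    digits = ''
    for ch in piece:
      if not ch.isdigit():
        break
      digits += ch
    if not digits:
      break
    parts.append(int(digits))
  return tuple(parts)


def warn_if_untested(version, minimum=MIN_TESTED_SKLEARN):
  """Warns and returns True if `version` is older than `minimum`."""
  if parse_version(version) < minimum:
    warnings.warn(
        'scikit-learn %s is older than the minimum tested version %s.' %
        (version, '.'.join(str(p) for p in minimum)))
    return True
  return False
```

A caller could run `warn_if_untested(sklearn.__version__)` once when the loader module is imported.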



##########
sdks/python/apache_beam/ml/inference/sklearn_loader_test.py:
##########
@@ -151,6 +151,12 @@ def test_bad_file_raises(self):
             SklearnModelLoader(model_uri='/var/bad_file_name'))
         pipeline.run()
 
+  def test_bad_input_type_raises(self):
+    with tempfile.NamedTemporaryFile() as file:
+      with self.assertRaises(TypeError):

Review Comment:
   +1


