Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2022/07/21 20:53:25 UTC

[GitHub] [beam] fabito opened a new issue, #22402: [Feature Request]: Ability to detect (and maybe reload ?) corrupted models when using the RunInference API

fabito opened a new issue, #22402:
URL: https://github.com/apache/beam/issues/22402

   ### What would you like to happen?
   
   I am running a pipeline to extract image embeddings using `open-clip-torch` and the RunInference API in Dataflow.
   Sometimes, especially when the `DataflowRunner` triggers a scale-up, we get unhealthy workers due to corrupted model files.
   Whenever that happens, the whole job fails.
   Would it be possible to detect corrupted model files and reload them?
   For more details, see the log below:
   
   ```
   Traceback (most recent call last):
     File "/venv/lib/python3.8/site-packages/apache_beam/runners/worker/sdk_worker.py", line 284, in _execute
       response = task()
     File "/venv/lib/python3.8/site-packages/apache_beam/runners/worker/sdk_worker.py", line 357, in <lambda>
       lambda: self.create_worker().do_instruction(request), request)
     File "/venv/lib/python3.8/site-packages/apache_beam/runners/worker/sdk_worker.py", line 597, in do_instruction
       return getattr(self, request_type)(
     File "/venv/lib/python3.8/site-packages/apache_beam/runners/worker/sdk_worker.py", line 628, in process_bundle
       bundle_processor = self.bundle_processor_cache.get(
     File "/venv/lib/python3.8/site-packages/apache_beam/runners/worker/sdk_worker.py", line 458, in get
       processor = bundle_processor.BundleProcessor(
     File "/venv/lib/python3.8/site-packages/apache_beam/runners/worker/bundle_processor.py", line 873, in __init__
       op.setup()
     File "apache_beam/runners/worker/operations.py", line 833, in apache_beam.runners.worker.operations.DoOperation.setup
     File "apache_beam/runners/worker/operations.py", line 882, in apache_beam.runners.worker.operations.DoOperation.setup
     File "apache_beam/runners/common.py", line 1471, in apache_beam.runners.common.DoFnRunner.setup
     File "apache_beam/runners/common.py", line 1467, in apache_beam.runners.common.DoFnRunner._invoke_lifecycle_method
     File "apache_beam/runners/common.py", line 1507, in apache_beam.runners.common.DoFnRunner._reraise_augmented
     File "apache_beam/runners/common.py", line 1465, in apache_beam.runners.common.DoFnRunner._invoke_lifecycle_method
     File "apache_beam/runners/common.py", line 551, in apache_beam.runners.common.DoFnInvoker.invoke_setup
     File "/venv/lib/python3.8/site-packages/apache_beam/ml/inference/base.py", line 374, in setup
       self._model = self._load_model()
     File "/venv/lib/python3.8/site-packages/apache_beam/ml/inference/base.py", line 369, in _load_model
       return self._shared_model_handle.acquire(load)
     File "/venv/lib/python3.8/site-packages/apache_beam/utils/shared.py", line 305, in acquire
       return _shared_map.acquire(self._key, constructor_fn, tag)
     File "/venv/lib/python3.8/site-packages/apache_beam/utils/shared.py", line 246, in acquire
       result = control_block.acquire(constructor_fn, tag)
     File "/venv/lib/python3.8/site-packages/apache_beam/utils/shared.py", line 139, in acquire
       result = constructor_fn()
     File "/venv/lib/python3.8/site-packages/apache_beam/ml/inference/base.py", line 358, in load
       model = self._model_handler.load_model()
     File "/venv/lib/python3.8/site-packages/apache_beam/ml/inference/base.py", line 146, in load_model
       return self._unkeyed.load_model()
     File "/venv/lib/python3.8/site-packages/conjurer/feature_extractor/embedder/clip.py", line 51, in load_model
       model = open_clip.create_model(self.model_name, pretrained=self.pretrained, device=self._device)
     File "/venv/lib/python3.8/site-packages/open_clip/factory.py", line 108, in create_model
       model.load_state_dict(load_state_dict(checkpoint_path))
     File "/venv/lib/python3.8/site-packages/open_clip/factory.py", line 50, in load_state_dict
       checkpoint = torch.load(checkpoint_path, map_location=map_location)
     File "/venv/lib/python3.8/site-packages/torch/serialization.py", line 713, in load
       return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
     File "/venv/lib/python3.8/site-packages/torch/serialization.py", line 938, in _legacy_load
       typed_storage._storage._set_from_file(
   RuntimeError: unexpected EOF, expected 1443121 more bytes. The file might be corrupted. [while running 'OpenClipEmbedder(ViT-B-32-quickgelu)/PyTorchRunInference/ParDo(_RunInferenceDoFn)-ptransform-122']
   ```
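
   For illustration, the behavior I have in mind would look roughly like the hypothetical helper below, which verifies the checkpoint against a known SHA256 digest and re-downloads it once before giving up. `download_fn` and `expected_sha256` are placeholders here, not existing Beam or open_clip APIs:

   ```python
   import hashlib
   import os
   from typing import Callable

   import torch


   def load_checkpoint_verified(
           checkpoint_path: str,
           expected_sha256: str,  # placeholder: published digest of the weights
           download_fn: Callable[[], None],  # placeholder: re-fetches the file
           max_attempts: int = 2):
       """Load a checkpoint, re-downloading once if it fails an integrity check."""
       for _ in range(max_attempts):
           if os.path.exists(checkpoint_path):
               with open(checkpoint_path, 'rb') as f:
                   digest = hashlib.sha256(f.read()).hexdigest()
               if digest == expected_sha256:
                   return torch.load(checkpoint_path, map_location='cpu')
               # Drop the corrupted cached copy before trying again.
               os.remove(checkpoint_path)
           download_fn()
       raise RuntimeError(
           f'{checkpoint_path} failed integrity check after {max_attempts} attempts')
   ```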
   
   ### Issue Priority
   
   Priority: 3
   
   ### Issue Component
   
   Component: sdk-py-core


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] tvalentyn commented on issue #22402: [Feature Request]: Ability to detect (and maybe reload ?) corrupted models when using the RunInference API

tvalentyn commented on issue #22402:
URL: https://github.com/apache/beam/issues/22402#issuecomment-1201533720

   @pranavbhandari24  also reported the following error:
   
   ```
   Error message from worker: Traceback (most recent call last):
     File "apache_beam/runners/common.py", line 1465, in apache_beam.runners.common.DoFnRunner._invoke_lifecycle_method
     File "apache_beam/runners/common.py", line 551, in apache_beam.runners.common.DoFnInvoker.invoke_setup
     File "/usr/local/lib/python3.7/site-packages/apache_beam/ml/inference/base.py", line 374, in setup
       self._model = self._load_model()
     File "/usr/local/lib/python3.7/site-packages/apache_beam/ml/inference/base.py", line 369, in _load_model
       return self._shared_model_handle.acquire(load)
     File "/usr/local/lib/python3.7/site-packages/apache_beam/utils/shared.py", line 305, in acquire
       return _shared_map.acquire(self._key, constructor_fn, tag)
     File "/usr/local/lib/python3.7/site-packages/apache_beam/utils/shared.py", line 246, in acquire
       result = control_block.acquire(constructor_fn, tag)
     File "/usr/local/lib/python3.7/site-packages/apache_beam/utils/shared.py", line 139, in acquire
       result = constructor_fn()
     File "/usr/local/lib/python3.7/site-packages/apache_beam/ml/inference/base.py", line 358, in load
       model = self._model_handler.load_model()
     File "/usr/local/lib/python3.7/site-packages/apache_beam/ml/inference/sklearn_inference.py", line 100, in load_model
       return _load_model(self._model_uri, self._model_file_type)
     File "/usr/local/lib/python3.7/site-packages/apache_beam/ml/inference/sklearn_inference.py", line 51, in _load_model
       return pickle.load(file)
   _pickle.UnpicklingError: pickle data was truncated
   ```
   
   We originally thought that the model was corrupted at the storage location, but maybe there is an issue with loading the model over Beam IOs.
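
   One way to sanity-check the at-rest copy (a sketch, assuming the model sits at a URI Beam's filesystems can read) is to compare the size the filesystem reports with the bytes we can actually read:

   ```python
   from apache_beam.io.filesystems import FileSystems


   def check_model_size(model_uri: str) -> None:
       """Compare the size reported at rest with the bytes actually readable."""
       metadata = FileSystems.match([model_uri])[0].metadata_list[0]
       with FileSystems.open(model_uri) as f:
           read_bytes = len(f.read())
       if read_bytes != metadata.size_in_bytes:
           raise IOError(f'{model_uri}: read {read_bytes} bytes, '
                         f'expected {metadata.size_in_bytes} at rest')
   ```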


[GitHub] [beam] fabito commented on issue #22402: [Feature Request]: Ability to detect (and maybe reload ?) corrupted models when using the RunInference API

fabito commented on issue #22402:
URL: https://github.com/apache/beam/issues/22402#issuecomment-1201765621

   > > Yes I have it: 2022-07-21_18_25_09-17162064527459510398
   > 
   > Let me know if we have permission to investigate this job.
   
   Yes, proceed.


[GitHub] [beam] tvalentyn commented on issue #22402: [Feature Request]: Ability to detect (and maybe reload ?) corrupted models when using the RunInference API

tvalentyn commented on issue #22402:
URL: https://github.com/apache/beam/issues/22402#issuecomment-1201797610

   Sounds reasonable. OK, it sounds like this is not actionable for Beam.


[GitHub] [beam] fabito commented on issue #22402: [Feature Request]: Ability to detect (and maybe reload ?) corrupted models when using the RunInference API

fabito commented on issue #22402:
URL: https://github.com/apache/beam/issues/22402#issuecomment-1201770776

   > I am mainly interested how the job was failing in face of the expected retry behavior: were the errors identical during each iteration?
   
   I am not 100% sure how to interpret the logs.
   I can see 4 identical errors; do they represent the retry attempts?
   
   ![Screenshot from 2022-08-02 10-02-26](https://user-images.githubusercontent.com/308613/182253792-ad0c27f6-b3ab-4360-8197-23a173f344b4.png)
   
   
   


[GitHub] [beam] tvalentyn commented on issue #22402: [Feature Request]: Ability to detect (and maybe reload ?) corrupted models when using the RunInference API

tvalentyn commented on issue #22402:
URL: https://github.com/apache/beam/issues/22402#issuecomment-1201757722

   @fabito do you have a Dataflow Job ID we could investigate?


[GitHub] [beam] tvalentyn commented on issue #22402: [Feature Request]: Ability to detect (and maybe reload ?) corrupted models when using the RunInference API

tvalentyn commented on issue #22402:
URL: https://github.com/apache/beam/issues/22402#issuecomment-1201530237

   Thanks for reporting and adding the run-inference label.


[GitHub] [beam] fabito commented on issue #22402: [Feature Request]: Ability to detect (and maybe reload ?) corrupted models when using the RunInference API

fabito commented on issue #22402:
URL: https://github.com/apache/beam/issues/22402#issuecomment-1201777279

   Does the pretrained model get re-downloaded before each attempt? (It doesn't look like it.)


[GitHub] [beam] tvalentyn commented on issue #22402: [Feature Request]: Ability to detect (and maybe reload ?) corrupted models when using the RunInference API

tvalentyn commented on issue #22402:
URL: https://github.com/apache/beam/issues/22402#issuecomment-1201799018

   Given that you already seem to be using a custom container, I would go for the last option.
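
   As a sketch of that option (the script name and `RUN` step are assumed, using the model/tag from this thread), a small warm-up script invoked at image build time would bake the weights into the container, so workers never download them at startup:

   ```python
   # warm_cache.py -- assumed to be invoked from the custom container's
   # Dockerfile via `RUN python warm_cache.py`, so the checkpoint is downloaded
   # into ~/.cache/clip at build time instead of at worker startup.
   import open_clip

   open_clip.create_model(
       'ViT-B-32-quickgelu', pretrained='laion400m_e32', device='cpu')
   ```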


[GitHub] [beam] github-actions[bot] commented on issue #22402: [Feature Request]: Ability to detect (and maybe reload ?) corrupted models when using the RunInference API

github-actions[bot] commented on issue #22402:
URL: https://github.com/apache/beam/issues/22402#issuecomment-1191927168

   Label run-inrefence cannot be managed because it does not exist in the repo. Please check your spelling.


[GitHub] [beam] fabito commented on issue #22402: [Feature Request]: Ability to detect (and maybe reload ?) corrupted models when using the RunInference API

fabito commented on issue #22402:
URL: https://github.com/apache/beam/issues/22402#issuecomment-1201787189

   > I wonder if the file the pipeline is trying to read is still around in GCS or its storage location, and whether we can check that the file didn't get modified at rest.
   
   The models are not stored in GCS (though they could be); they are actually fetched from public URIs:
   
   https://github.com/mlfoundations/open_clip/blob/15bb1f7347acdfc2e5e3069455a84fbd188aa4f2/src/open_clip/pretrained.py#L52


[GitHub] [beam] tvalentyn commented on issue #22402: [Feature Request]: Ability to detect (and maybe reload ?) corrupted models when using the RunInference API

tvalentyn commented on issue #22402:
URL: https://github.com/apache/beam/issues/22402#issuecomment-1201764445

   Let me know if we have permission to investigate this job.


[GitHub] [beam] fabito commented on issue #22402: [Feature Request]: Ability to detect (and maybe reload ?) corrupted models when using the RunInference API

fabito commented on issue #22402:
URL: https://github.com/apache/beam/issues/22402#issuecomment-1201796326

   Interesting.
   Looking at the `download_pretrained` function, it looks like it caches the file in a local directory, `~/.cache/clip`.
   In addition, it doesn't always perform the SHA256 checksum check.
   
   https://github.com/mlfoundations/open_clip/blob/15bb1f7347acdfc2e5e3069455a84fbd188aa4f2/src/open_clip/pretrained.py#L127
   
   This could probably be solved by enabling the checksum check (preferable), disabling local caching, or embedding the pretrained model in the Docker image; a sketch of the caching mitigation follows.
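
   For the caching mitigation, a minimal sketch (assuming open_clip's default cache location) that wipes the local checkpoint cache before `load_model` runs, so a truncated file left by an interrupted download can't be reused on retry:

   ```python
   import pathlib
   import shutil

   # Assumed default open_clip cache location; wipe it so a truncated file
   # left by an interrupted download can't be picked up on the next attempt.
   cache_dir = pathlib.Path.home() / '.cache' / 'clip'
   shutil.rmtree(cache_dir, ignore_errors=True)
   ```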
   


[GitHub] [beam] fabito commented on issue #22402: [Feature Request]: Ability to detect (and maybe reload ?) corrupted models when using the RunInference API

fabito commented on issue #22402:
URL: https://github.com/apache/beam/issues/22402#issuecomment-1191932838

   .add-labels run-inference


[GitHub] [beam] tvalentyn commented on issue #22402: [Feature Request]: Ability to detect (and maybe reload ?) corrupted models when using the RunInference API

tvalentyn commented on issue #22402:
URL: https://github.com/apache/beam/issues/22402#issuecomment-1201764237

   With your permission, I can work with Dataflow support to help take a look. I am mainly interested in how the job was failing in the face of the expected retry behavior: were the errors identical during each iteration?


[GitHub] [beam] tvalentyn commented on issue #22402: [Feature Request]: Ability to detect (and maybe reload ?) corrupted models when using the RunInference API

tvalentyn commented on issue #22402:
URL: https://github.com/apache/beam/issues/22402#issuecomment-1201534564

   The workers should retry at least 4 times though before failing the job.


[GitHub] [beam] tvalentyn commented on issue #22402: [Feature Request]: Ability to detect (and maybe reload ?) corrupted models when using the RunInference API

tvalentyn commented on issue #22402:
URL: https://github.com/apache/beam/issues/22402#issuecomment-1201791011

   As far as RunInference is concerned, we call `OpenClipTorchModelHandlerTensor.load_model()` on each retry. Looking at the stack trace, that's where the error is coming from.
   
   ```
   2022-07-21 18:46:51.096 PDT
       model = open_clip.create_model(self.model_name, pretrained=self.pretrained, device=self._device)
     File "/venv/lib/python3.8/site-packages/open_clip/factory.py", line 108, in create_model
       model.load_state_dict(load_state_dict(checkpoint_path))
     File "/venv/lib/python3.8/site-packages/open_clip/factory.py", line 50, in load_state_dict
       checkpoint = torch.load(checkpoint_path, map_location=map_location)
     File "/venv/lib/python3.8/site-packages/torch/serialization.py", line 713, in load
       return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
     File "/venv/lib/python3.8/site-packages/torch/serialization.py", line 938, in _legacy_load
   ```


[GitHub] [beam] tvalentyn commented on issue #22402: [Feature Request]: Ability to detect (and maybe reload ?) corrupted models when using the RunInference API

tvalentyn commented on issue #22402:
URL: https://github.com/apache/beam/issues/22402#issuecomment-1201774123

   I wonder if the file the pipeline is trying to read is still around in GCS or its storage location, and whether we can check that the file didn't get modified at rest.


[GitHub] [beam] tvalentyn commented on issue #22402: [Feature Request]: Ability to detect (and maybe reload ?) corrupted models when using the RunInference API

tvalentyn commented on issue #22402:
URL: https://github.com/apache/beam/issues/22402#issuecomment-1201800686

   Thanks for trying RunInference. Feel free to pass any other feedback you have. 


[GitHub] [beam] fabito commented on issue #22402: [Feature Request]: Ability to detect (and maybe reload ?) corrupted models when using the RunInference API

fabito commented on issue #22402:
URL: https://github.com/apache/beam/issues/22402#issuecomment-1191926135

   .add-labels run-inrefence


[GitHub] [beam] fabito commented on issue #22402: [Feature Request]: Ability to detect (and maybe reload ?) corrupted models when using the RunInference API

fabito commented on issue #22402:
URL: https://github.com/apache/beam/issues/22402#issuecomment-1201757286

   > If you're using GPU, can you also provide information on your setup such as the type of GPU, # of GPUs, type of machine?
   
   I am not using a GPU. The embedding extraction pipeline is running on Dataflow using:
   
   ```
   --region=europe-west4 
   --flexrs_goal=COST_OPTIMIZED 
   --machine_type=n1-standard-2
   ```


[GitHub] [beam] yeandy commented on issue #22402: [Feature Request]: Ability to detect (and maybe reload ?) corrupted models when using the RunInference API

yeandy commented on issue #22402:
URL: https://github.com/apache/beam/issues/22402#issuecomment-1201665976

   Thanks @fabito for the report. Assuming file integrity is not the issue, we may need to investigate our model-loading logic. A few questions:
   
   1. Did you test to see if this also occurs with `DirectRunner` (or other runners)? Does it only occur with `DataflowRunner`?
   2. Can you please provide the values for `self.model_name`, `self.pretrained`, `self._device`? If you're using GPU, can you also provide information on your setup such as the type of GPU, # of GPUs, type of machine?
   
   @pranavbhandari24 
   Did you specify anything for the `--pickle_library` pipeline option? Or are you just using default?


[GitHub] [beam] tvalentyn closed issue #22402: [Feature Request]: Ability to detect (and maybe reload ?) corrupted models when using the RunInference API

tvalentyn closed issue #22402: [Feature Request]: Ability to detect (and maybe reload ?) corrupted models when using the RunInference API
URL: https://github.com/apache/beam/issues/22402


[GitHub] [beam] tvalentyn commented on issue #22402: [Feature Request]: Ability to detect (and maybe reload ?) corrupted models when using the RunInference API

tvalentyn commented on issue #22402:
URL: https://github.com/apache/beam/issues/22402#issuecomment-1201773102

   Yes, those are 4 retry attempts, and every time they fail with the same `RuntimeError: unexpected EOF, expected 1610513 more bytes.`


[GitHub] [beam] tvalentyn commented on issue #22402: [Feature Request]: Ability to detect (and maybe reload ?) corrupted models when using the RunInference API

tvalentyn commented on issue #22402:
URL: https://github.com/apache/beam/issues/22402#issuecomment-1201535338

   cc: @yeandy @ryanthompson591 


[GitHub] [beam] fabito commented on issue #22402: [Feature Request]: Ability to detect (and maybe reload ?) corrupted models when using the RunInference API

fabito commented on issue #22402:
URL: https://github.com/apache/beam/issues/22402#issuecomment-1201754823

   Hi @yeandy ,
   
   > Can you please provide the values for self.model_name, self.pretrained, self._device?
   
   Here is my custom implementation of `ModelHandler` (nearly a copy of the existing `PytorchModelHandlerTensor`):
   
   ```python
   import logging
   from copy import deepcopy
   from typing import Any, Dict, Iterable, Optional, Sequence

   import open_clip
   import torch
   from torchvision import transforms

   from apache_beam.ml.inference.base import ModelHandler, PredictionResult
   # Beam's private helper for moving tensors to the target device.
   from apache_beam.ml.inference.pytorch_inference import _convert_to_device


   class OpenClipTorchModelHandlerTensor(ModelHandler[torch.Tensor, PredictionResult, torch.nn.Module]):
   
       def __init__(self, model_name: str = 'ViT-B-32-quickgelu', pretrained: str = 'laion400m_e32', device: str = 'CPU'):
           self.pretrained = pretrained
           self.model_name = model_name
           if device == 'GPU' and torch.cuda.is_available():
               self._device = torch.device('cuda')
           else:
               self._device = torch.device('cpu')
   
           if model_name in open_clip.factory._MODEL_CONFIGS:
               logging.info(f'Loading {model_name} model config.')
               self.model_cfg = deepcopy(open_clip.factory._MODEL_CONFIGS[model_name])
           else:
               raise ValueError('Invalid open clip model name')
   
       def load_model(self) -> torch.nn.Module:
           model = open_clip.create_model(self.model_name, pretrained=self.pretrained, device=self._device)
           return model.visual
   
       def run_inference(
               self,
               batch: Sequence[torch.Tensor],
               model: torch.nn.Module,
               inference_args: Optional[Dict[str, Any]] = None
       ) -> Iterable[PredictionResult]:
           batched_tensors = torch.stack(batch)
           batched_tensors = _convert_to_device(batched_tensors, self._device)
           with torch.no_grad():
               predictions = model(batched_tensors)
           return [PredictionResult(x, y) for x, y in zip(batch, predictions)]
   
       def get_num_bytes(self, batch: Sequence[torch.Tensor]) -> int:
           """
           Returns:
               The number of bytes of data for a batch of Tensors.
           """
           return sum((el.element_size() for tensor in batch for el in tensor))
   
       def get_metrics_namespace(self) -> str:
           """
           Returns:
              A namespace for metrics collected by the RunInference transform.
           """
           return 'RunInferenceOpenClipTorch'
   
       def preprocess_transform(self) -> transforms.Compose:
           image_size = self.model_cfg['vision_cfg']['image_size']
           return open_clip.transform.image_transform(image_size, False)
   ```
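
   For context, a minimal usage sketch (`pil_images` is a placeholder for in-memory PIL images; pipeline options omitted):

   ```python
   import apache_beam as beam
   from apache_beam.ml.inference.base import RunInference

   handler = OpenClipTorchModelHandlerTensor()
   preprocess = handler.preprocess_transform()

   with beam.Pipeline() as p:
       _ = (p
            | 'CreateImages' >> beam.Create(pil_images)  # placeholder input
            | 'Preprocess' >> beam.Map(preprocess)       # PIL image -> torch.Tensor
            | 'Embed' >> RunInference(handler))
   ```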
   
   


[GitHub] [beam] fabito commented on issue #22402: [Feature Request]: Ability to detect (and maybe reload ?) corrupted models when using the RunInference API

fabito commented on issue #22402:
URL: https://github.com/apache/beam/issues/22402#issuecomment-1201761644

   > @fabito do you have a Dataflow Job ID we could investigate?
   
   Yes, I have it. But how do you plan to access it?


[GitHub] [beam] fabito commented on issue #22402: [Feature Request]: Ability to detect (and maybe reload ?) corrupted models when using the RunInference API

fabito commented on issue #22402:
URL: https://github.com/apache/beam/issues/22402#issuecomment-1201799780

   Makes sense! Thanks for helping investigate.
   

