Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2022/07/21 20:53:25 UTC
[GitHub] [beam] fabito opened a new issue, #22402: [Feature Request]: Ability to detect (and maybe reload ?) corrupted models when using the RunInference API
fabito opened a new issue, #22402:
URL: https://github.com/apache/beam/issues/22402
### What would you like to happen?
I am running a pipeline to extract image embeddings using `open-clip-torch` and the RunInference API in Dataflow.
Sometimes, especially when the `DataflowRunner` triggers a scale-up, we get unhealthy workers due to corrupted model files.
Whenever that happens, the whole job fails.
Would it be possible to detect corrupted model files and reload them?
For more details, see the log below:
```
Traceback (most recent call last):
File "/venv/lib/python3.8/site-packages/apache_beam/runners/worker/sdk_worker.py", line 284, in _execute
response = task()
File "/venv/lib/python3.8/site-packages/apache_beam/runners/worker/sdk_worker.py", line 357, in <lambda>
lambda: self.create_worker().do_instruction(request), request)
File "/venv/lib/python3.8/site-packages/apache_beam/runners/worker/sdk_worker.py", line 597, in do_instruction
return getattr(self, request_type)(
File "/venv/lib/python3.8/site-packages/apache_beam/runners/worker/sdk_worker.py", line 628, in process_bundle
bundle_processor = self.bundle_processor_cache.get(
File "/venv/lib/python3.8/site-packages/apache_beam/runners/worker/sdk_worker.py", line 458, in get
processor = bundle_processor.BundleProcessor(
File "/venv/lib/python3.8/site-packages/apache_beam/runners/worker/bundle_processor.py", line 873, in __init__
op.setup()
File "apache_beam/runners/worker/operations.py", line 833, in apache_beam.runners.worker.operations.DoOperation.setup
File "apache_beam/runners/worker/operations.py", line 882, in apache_beam.runners.worker.operations.DoOperation.setup
File "apache_beam/runners/common.py", line 1471, in apache_beam.runners.common.DoFnRunner.setup
File "apache_beam/runners/common.py", line 1467, in apache_beam.runners.common.DoFnRunner._invoke_lifecycle_method
File "apache_beam/runners/common.py", line 1507, in apache_beam.runners.common.DoFnRunner._reraise_augmented
File "apache_beam/runners/common.py", line 1465, in apache_beam.runners.common.DoFnRunner._invoke_lifecycle_method
File "apache_beam/runners/common.py", line 551, in apache_beam.runners.common.DoFnInvoker.invoke_setup
File "/venv/lib/python3.8/site-packages/apache_beam/ml/inference/base.py", line 374, in setup
self._model = self._load_model()
File "/venv/lib/python3.8/site-packages/apache_beam/ml/inference/base.py", line 369, in _load_model
return self._shared_model_handle.acquire(load)
File "/venv/lib/python3.8/site-packages/apache_beam/utils/shared.py", line 305, in acquire
return _shared_map.acquire(self._key, constructor_fn, tag)
File "/venv/lib/python3.8/site-packages/apache_beam/utils/shared.py", line 246, in acquire
result = control_block.acquire(constructor_fn, tag)
File "/venv/lib/python3.8/site-packages/apache_beam/utils/shared.py", line 139, in acquire
result = constructor_fn()
File "/venv/lib/python3.8/site-packages/apache_beam/ml/inference/base.py", line 358, in load
model = self._model_handler.load_model()
File "/venv/lib/python3.8/site-packages/apache_beam/ml/inference/base.py", line 146, in load_model
return self._unkeyed.load_model()
File "/venv/lib/python3.8/site-packages/conjurer/feature_extractor/embedder/clip.py", line 51, in load_model
model = open_clip.create_model(self.model_name, pretrained=self.pretrained, device=self._device)
File "/venv/lib/python3.8/site-packages/open_clip/factory.py", line 108, in create_model
model.load_state_dict(load_state_dict(checkpoint_path))
File "/venv/lib/python3.8/site-packages/open_clip/factory.py", line 50, in load_state_dict
checkpoint = torch.load(checkpoint_path, map_location=map_location)
File "/venv/lib/python3.8/site-packages/torch/serialization.py", line 713, in load
return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
File "/venv/lib/python3.8/site-packages/torch/serialization.py", line 938, in _legacy_load
typed_storage._storage._set_from_file(
RuntimeError: unexpected EOF, expected 1443121 more bytes. The file might be corrupted. [while running 'OpenClipEmbedder(ViT-B-32-quickgelu)/PyTorchRunInference/ParDo(_RunInferenceDoFn)-ptransform-122']
```
### Issue Priority
Priority: 3
### Issue Component
Component: sdk-py-core
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@beam.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [beam] tvalentyn commented on issue #22402: [Feature Request]: Ability to detect (and maybe reload ?) corrupted models when using the RunInference API
Posted by GitBox <gi...@apache.org>.
tvalentyn commented on issue #22402:
URL: https://github.com/apache/beam/issues/22402#issuecomment-1201533720
@pranavbhandari24 also reported the following error:
```
Error message from worker: Traceback (most recent call last):
File "apache_beam/runners/common.py", line 1465, in apache_beam.runners.common.DoFnRunner._invoke_lifecycle_method
File "apache_beam/runners/common.py", line 551, in apache_beam.runners.common.DoFnInvoker.invoke_setup
File "/usr/local/lib/python3.7/site-packages/apache_beam/ml/inference/base.py", line 374, in setup
self._model = self._load_model()
File "/usr/local/lib/python3.7/site-packages/apache_beam/ml/inference/base.py", line 369, in _load_model
return self._shared_model_handle.acquire(load)
File "/usr/local/lib/python3.7/site-packages/apache_beam/utils/shared.py", line 305, in acquire
return _shared_map.acquire(self._key, constructor_fn, tag)
File "/usr/local/lib/python3.7/site-packages/apache_beam/utils/shared.py", line 246, in acquire
result = control_block.acquire(constructor_fn, tag)
File "/usr/local/lib/python3.7/site-packages/apache_beam/utils/shared.py", line 139, in acquire
result = constructor_fn()
File "/usr/local/lib/python3.7/site-packages/apache_beam/ml/inference/base.py", line 358, in load
model = self._model_handler.load_model()
File "/usr/local/lib/python3.7/site-packages/apache_beam/ml/inference/sklearn_inference.py", line 100, in load_model
return _load_model(self._model_uri, self._model_file_type)
File "/usr/local/lib/python3.7/site-packages/apache_beam/ml/inference/sklearn_inference.py", line 51, in _load_model
return pickle.load(file)
_pickle.UnpicklingError: pickle data was truncated
```
We originally thought that the model was corrupted at the storage location, but maybe we have an issue with loading the model over Beam IOs.
[GitHub] [beam] fabito commented on issue #22402: [Feature Request]: Ability to detect (and maybe reload ?) corrupted models when using the RunInference API
Posted by GitBox <gi...@apache.org>.
fabito commented on issue #22402:
URL: https://github.com/apache/beam/issues/22402#issuecomment-1201765621
> > Yes I have it: 2022-07-21_18_25_09-17162064527459510398
>
> Let me know if we have permission to investigate this job.
Yes, proceed.
[GitHub] [beam] tvalentyn commented on issue #22402: [Feature Request]: Ability to detect (and maybe reload ?) corrupted models when using the RunInference API
Posted by GitBox <gi...@apache.org>.
tvalentyn commented on issue #22402:
URL: https://github.com/apache/beam/issues/22402#issuecomment-1201797610
Sounds reasonable. OK, sounds like this is not actionable for Beam.
[GitHub] [beam] fabito commented on issue #22402: [Feature Request]: Ability to detect (and maybe reload ?) corrupted models when using the RunInference API
Posted by GitBox <gi...@apache.org>.
fabito commented on issue #22402:
URL: https://github.com/apache/beam/issues/22402#issuecomment-1201770776
> I am mainly interested how the job was failing in face of the expected retry behavior: were the errors identical during each iteration?
I am not 100% sure how to interpret the logs.
I can see 4 identical errors, do they represent the retry attempts?
![Screenshot from 2022-08-02 10-02-26](https://user-images.githubusercontent.com/308613/182253792-ad0c27f6-b3ab-4360-8197-23a173f344b4.png)
[GitHub] [beam] tvalentyn commented on issue #22402: [Feature Request]: Ability to detect (and maybe reload ?) corrupted models when using the RunInference API
Posted by GitBox <gi...@apache.org>.
tvalentyn commented on issue #22402:
URL: https://github.com/apache/beam/issues/22402#issuecomment-1201757722
@fabito do you have a Dataflow Job ID we could investigate?
[GitHub] [beam] tvalentyn commented on issue #22402: [Feature Request]: Ability to detect (and maybe reload ?) corrupted models when using the RunInference API
Posted by GitBox <gi...@apache.org>.
tvalentyn commented on issue #22402:
URL: https://github.com/apache/beam/issues/22402#issuecomment-1201530237
Thanks for reporting and adding the run-inference label.
[GitHub] [beam] fabito commented on issue #22402: [Feature Request]: Ability to detect (and maybe reload ?) corrupted models when using the RunInference API
Posted by GitBox <gi...@apache.org>.
fabito commented on issue #22402:
URL: https://github.com/apache/beam/issues/22402#issuecomment-1201777279
Does the pre-trained model get re-downloaded before each attempt? (It doesn't look like it.)
[GitHub] [beam] tvalentyn commented on issue #22402: [Feature Request]: Ability to detect (and maybe reload ?) corrupted models when using the RunInference API
Posted by GitBox <gi...@apache.org>.
tvalentyn commented on issue #22402:
URL: https://github.com/apache/beam/issues/22402#issuecomment-1201799018
Given that you already seem to be using a custom container, I would go for the last option.
[GitHub] [beam] github-actions[bot] commented on issue #22402: [Feature Request]: Ability to detect (and maybe reload ?) corrupted models when using the RunInference API
Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on issue #22402:
URL: https://github.com/apache/beam/issues/22402#issuecomment-1191927168
Label run-inrefence cannot be managed because it does not exist in the repo. Please check your spelling.
[GitHub] [beam] fabito commented on issue #22402: [Feature Request]: Ability to detect (and maybe reload ?) corrupted models when using the RunInference API
Posted by GitBox <gi...@apache.org>.
fabito commented on issue #22402:
URL: https://github.com/apache/beam/issues/22402#issuecomment-1201787189
> I wonder if the file the pipeline is trying to read is still around in GCS or its location. I wonder if we can check whether the file didn't get modified at rest.
The models are not stored in GCS (though they could be); they are actually fetched from public URIs:
https://github.com/mlfoundations/open_clip/blob/15bb1f7347acdfc2e5e3069455a84fbd188aa4f2/src/open_clip/pretrained.py#L52
[GitHub] [beam] tvalentyn commented on issue #22402: [Feature Request]: Ability to detect (and maybe reload ?) corrupted models when using the RunInference API
Posted by GitBox <gi...@apache.org>.
tvalentyn commented on issue #22402:
URL: https://github.com/apache/beam/issues/22402#issuecomment-1201764445
Let me know if we have permission to investigate this job.
[GitHub] [beam] fabito commented on issue #22402: [Feature Request]: Ability to detect (and maybe reload ?) corrupted models when using the RunInference API
Posted by GitBox <gi...@apache.org>.
fabito commented on issue #22402:
URL: https://github.com/apache/beam/issues/22402#issuecomment-1201796326
Interesting.
Looking at the `download_pretrained` function, it looks like it caches the file in a local dir, `~/.cache/clip`.
In addition, it doesn't always perform the SHA256 checksum check.
https://github.com/mlfoundations/open_clip/blob/15bb1f7347acdfc2e5e3069455a84fbd188aa4f2/src/open_clip/pretrained.py#L127
It could probably be solved by enabling the checksum check (preferable), disabling local caching, or embedding the pre-trained model in the Docker image.
[GitHub] [beam] fabito commented on issue #22402: [Feature Request]: Ability to detect (and maybe reload ?) corrupted models when using the RunInference API
Posted by GitBox <gi...@apache.org>.
fabito commented on issue #22402:
URL: https://github.com/apache/beam/issues/22402#issuecomment-1191932838
.add-labels run-inference
[GitHub] [beam] tvalentyn commented on issue #22402: [Feature Request]: Ability to detect (and maybe reload ?) corrupted models when using the RunInference API
Posted by GitBox <gi...@apache.org>.
tvalentyn commented on issue #22402:
URL: https://github.com/apache/beam/issues/22402#issuecomment-1201764237
With your permission, I can work with Dataflow support to help take a look. I am mainly interested how the job was failing in face of the expected retry behavior: were the errors identical during each iteration?
[GitHub] [beam] tvalentyn commented on issue #22402: [Feature Request]: Ability to detect (and maybe reload ?) corrupted models when using the RunInference API
Posted by GitBox <gi...@apache.org>.
tvalentyn commented on issue #22402:
URL: https://github.com/apache/beam/issues/22402#issuecomment-1201534564
The workers should retry at least 4 times though before failing the job.
[GitHub] [beam] tvalentyn commented on issue #22402: [Feature Request]: Ability to detect (and maybe reload ?) corrupted models when using the RunInference API
Posted by GitBox <gi...@apache.org>.
tvalentyn commented on issue #22402:
URL: https://github.com/apache/beam/issues/22402#issuecomment-1201791011
As far as RunInference is concerned, we call `OpenClipTorchModelHandlerTensor.load_model()` each time upon retry. Looking at the stacktrace, that's where the error is coming from.
```
2022-07-21 18:46:51.096 PDT
    model = open_clip.create_model(self.model_name, pretrained=self.pretrained, device=self._device)
  File "/venv/lib/python3.8/site-packages/open_clip/factory.py", line 108, in create_model
    model.load_state_dict(load_state_dict(checkpoint_path))
  File "/venv/lib/python3.8/site-packages/open_clip/factory.py", line 50, in load_state_dict
    checkpoint = torch.load(checkpoint_path, map_location=map_location)
  File "/venv/lib/python3.8/site-packages/torch/serialization.py", line 713, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/venv/lib/python3.8/site-packages/torch/serialization.py", line 938, in _legacy_load
```
[GitHub] [beam] tvalentyn commented on issue #22402: [Feature Request]: Ability to detect (and maybe reload ?) corrupted models when using the RunInference API
Posted by GitBox <gi...@apache.org>.
tvalentyn commented on issue #22402:
URL: https://github.com/apache/beam/issues/22402#issuecomment-1201774123
I wonder if the file the pipeline is trying to read is still around in GCS or its location. I wonder if we can check whether the file didn't get modified at rest.
[GitHub] [beam] tvalentyn commented on issue #22402: [Feature Request]: Ability to detect (and maybe reload ?) corrupted models when using the RunInference API
Posted by GitBox <gi...@apache.org>.
tvalentyn commented on issue #22402:
URL: https://github.com/apache/beam/issues/22402#issuecomment-1201800686
Thanks for trying RunInference. Feel free to pass any other feedback you have.
[GitHub] [beam] fabito commented on issue #22402: [Feature Request]: Ability to detect (and maybe reload ?) corrupted models when using the RunInference API
Posted by GitBox <gi...@apache.org>.
fabito commented on issue #22402:
URL: https://github.com/apache/beam/issues/22402#issuecomment-1191926135
.add-labels run-inrefence
[GitHub] [beam] fabito commented on issue #22402: [Feature Request]: Ability to detect (and maybe reload ?) corrupted models when using the RunInference API
Posted by GitBox <gi...@apache.org>.
fabito commented on issue #22402:
URL: https://github.com/apache/beam/issues/22402#issuecomment-1201757286
> If you're using GPU, can you also provide information on your setup such as the type of GPU, # of GPUs, type of machine?
I am not using a GPU. The embedding extraction pipeline is running on Dataflow using:
```
--region=europe-west4
--flexrs_goal=COST_OPTIMIZED
--machine_type=n1-standard-2
```
[GitHub] [beam] yeandy commented on issue #22402: [Feature Request]: Ability to detect (and maybe reload ?) corrupted models when using the RunInference API
Posted by GitBox <gi...@apache.org>.
yeandy commented on issue #22402:
URL: https://github.com/apache/beam/issues/22402#issuecomment-1201665976
Thanks @fabito for the report. Assuming file integrity is not the issue, then we may need to investigate our model-loading logic. A few questions:
1. Did you test to see if this also occurs with `DirectRunner` (or other runners)? Does it only occur with `DataflowRunner`?
2. Can you please provide the values for `self.model_name`, `self.pretrained`, `self._device`? If you're using GPU, can you also provide information on your setup such as the type of GPU, # of GPUs, type of machine?
@pranavbhandari24
Did you specify anything for the `--pickle_library` pipeline option? Or are you just using default?
[GitHub] [beam] tvalentyn closed issue #22402: [Feature Request]: Ability to detect (and maybe reload ?) corrupted models when using the RunInference API
Posted by GitBox <gi...@apache.org>.
tvalentyn closed issue #22402: [Feature Request]: Ability to detect (and maybe reload ?) corrupted models when using the RunInference API
URL: https://github.com/apache/beam/issues/22402
[GitHub] [beam] tvalentyn commented on issue #22402: [Feature Request]: Ability to detect (and maybe reload ?) corrupted models when using the RunInference API
Posted by GitBox <gi...@apache.org>.
tvalentyn commented on issue #22402:
URL: https://github.com/apache/beam/issues/22402#issuecomment-1201773102
Yes, they are 4 retry attempts and every time they fail with the same `RuntimeError: unexpected EOF, expected 1610513 more bytes.`
[GitHub] [beam] tvalentyn commented on issue #22402: [Feature Request]: Ability to detect (and maybe reload ?) corrupted models when using the RunInference API
Posted by GitBox <gi...@apache.org>.
tvalentyn commented on issue #22402:
URL: https://github.com/apache/beam/issues/22402#issuecomment-1201535338
cc: @yeandy @ryanthompson591
[GitHub] [beam] fabito commented on issue #22402: [Feature Request]: Ability to detect (and maybe reload ?) corrupted models when using the RunInference API
Posted by GitBox <gi...@apache.org>.
fabito commented on issue #22402:
URL: https://github.com/apache/beam/issues/22402#issuecomment-1201754823
Hi @yeandy ,
> Can you please provide the values for self.model_name, self.pretrained, self._device?
Here is my custom implementation of `ModelHandler` (nearly a copy of the existing TorchModelHandler):
```python
class OpenClipTorchModelHandlerTensor(ModelHandler[torch.Tensor, PredictionResult, torch.nn.Module]):

    def __init__(self, model_name: str = 'ViT-B-32-quickgelu', pretrained: str = 'laion400m_e32', device: str = 'CPU'):
        self.pretrained = pretrained
        self.model_name = model_name
        if device == 'GPU' and torch.cuda.is_available():
            self._device = torch.device('cuda')
        else:
            self._device = torch.device('cpu')
        if model_name in open_clip.factory._MODEL_CONFIGS:
            logging.info(f'Loading {model_name} model config.')
            self.model_cfg = deepcopy(open_clip.factory._MODEL_CONFIGS[model_name])
        else:
            raise ValueError('Invalid open clip model name')

    def load_model(self) -> torch.nn.Module:
        model = open_clip.create_model(self.model_name, pretrained=self.pretrained, device=self._device)
        return model.visual

    def run_inference(
        self,
        batch: Sequence[torch.Tensor],
        model: torch.nn.Module,
        inference_args: Optional[Dict[str, Any]] = None
    ) -> Iterable[PredictionResult]:
        batched_tensors = torch.stack(batch)
        batched_tensors = _convert_to_device(batched_tensors, self._device)
        with torch.no_grad():
            predictions = model(batched_tensors)
        return [PredictionResult(x, y) for x, y in zip(batch, predictions)]

    def get_num_bytes(self, batch: Sequence[torch.Tensor]) -> int:
        """
        Returns:
          The number of bytes of data for a batch of Tensors.
        """
        return sum((el.element_size() for tensor in batch for el in tensor))

    def get_metrics_namespace(self) -> str:
        """
        Returns:
          A namespace for metrics collected by the RunInference transform.
        """
        return 'RunInferenceOpenClipTorch'

    def preprocess_transform(self) -> transforms.Compose:
        image_size = self.model_cfg['vision_cfg']['image_size']
        return open_clip.transform.image_transform(image_size, False)
```
[GitHub] [beam] fabito commented on issue #22402: [Feature Request]: Ability to detect (and maybe reload ?) corrupted models when using the RunInference API
Posted by GitBox <gi...@apache.org>.
fabito commented on issue #22402:
URL: https://github.com/apache/beam/issues/22402#issuecomment-1201761644
> @fabito do you have a Dataflow Job ID we could investigate?
Yes, I have it. But how do you plan to access it?
[GitHub] [beam] fabito commented on issue #22402: [Feature Request]: Ability to detect (and maybe reload ?) corrupted models when using the RunInference API
Posted by GitBox <gi...@apache.org>.
fabito commented on issue #22402:
URL: https://github.com/apache/beam/issues/22402#issuecomment-1201799780
Makes sense! Thanks for helping investigate.