Posted to dev@bluemarlin.apache.org by GitBox <gi...@apache.org> on 2022/02/21 13:45:44 UTC
[GitHub] [incubator-bluemarlin] Bimlesh759-AI opened a new issue #51: [BLUEMARLIN-28] : For DIN-Lookalike model, train.py runs only if line 40 in model.py is commented.
Bimlesh759-AI opened a new issue #51:
URL: https://github.com/apache/incubator-bluemarlin/issues/51
Training fails at startup unless line 40 in model.py is commented out. With that line commented out, train.py runs successfully.
In the code below, taken from model.py, training succeeds once the user_emb_w line is commented out.
hidden_units = 128
user_emb_w = tf.get_variable("user_emb_w", [user_count, hidden_units])
item_emb_w = tf.get_variable("item_emb_w", [item_count, hidden_units // 2])
item_b = tf.get_variable("item_b", [item_count],
initializer=tf.constant_initializer(0.0))
cate_emb_w = tf.get_variable("cate_emb_w", [cate_count, hidden_units // 2])
cate_list = tf.convert_to_tensor(cate_list, dtype=tf.int64)
Below is the error displayed.
------------------------------------------------------------------------------------------------------------------------------------------------
2022-02-21 17:42:43.558189: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at random_op.cc:76 : Resource exhausted: OOM when allocating tensor with shape[94315979,128] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Traceback (most recent call last):
File "/usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
return fn(*args)
File "/usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
target_list, run_metadata)
File "/usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[94315979,128] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node user_emb_w/Initializer/random_uniform/RandomUniform}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "lookalike_model/trainer/train.py", line 179, in <module>
sess.run(tf.global_variables_initializer())
File "/usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 956, in run
run_metadata_ptr)
File "/usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1180, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
run_metadata)
File "/usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[94315979,128] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[node user_emb_w/Initializer/random_uniform/RandomUniform (defined at usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
Original stack trace for 'user_emb_w/Initializer/random_uniform/RandomUniform':
File "/algorithm/lookalike_model/trainer/train.py", line 178, in <module>
model = Model(user_count, item_count, cate_count, cate_list, predict_batch_size, predict_ads_num)
File "/algorithm/lookalike_model/trainer/model.py", line 40, in __init__
user_emb_w = tf.get_variable("user_emb_w", [user_count, hidden_units])
File "usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/ops/variable_scope.py", line 1500, in get_variable
aggregation=aggregation)
File "usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/ops/variable_scope.py", line 1243, in get_variable
aggregation=aggregation)
File "usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/ops/variable_scope.py", line 567, in get_variable
aggregation=aggregation)
File "usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/ops/variable_scope.py", line 519, in _true_getter
aggregation=aggregation)
File "usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/ops/variable_scope.py", line 933, in _get_single_variable
aggregation=aggregation)
File "usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/ops/variables.py", line 258, in __call__
return cls._variable_v1_call(*args, **kwargs)
File "usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/ops/variables.py", line 219, in _variable_v1_call
shape=shape)
File "usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/ops/variables.py", line 197, in <lambda>
previous_getter = lambda **kwargs: default_variable_creator(None, **kwargs)
File "usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/ops/variable_scope.py", line 2519, in default_variable_creator
shape=shape)
File "usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/ops/variables.py", line 262, in __call__
return super(VariableMetaclass, cls).__call__(*args, **kwargs)
File "usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/ops/variables.py", line 1688, in __init__
shape=shape)
File "usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/ops/variables.py", line 1818, in _init_from_args
initial_value(), name="initial_value", dtype=dtype)
File "usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/ops/variable_scope.py", line 905, in <lambda>
partition_info=partition_info)
File "usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/ops/init_ops.py", line 533, in __call__
shape, -limit, limit, dtype, seed=self.seed)
File "usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/ops/random_ops.py", line 245, in random_uniform
rnd = gen_random_ops.random_uniform(shape, dtype, seed=seed1, seed2=seed2)
File "usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/ops/gen_random_ops.py", line 822, in random_uniform
name=name)
File "usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
op_def=op_def)
File "usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
attrs, op_def, compute_device)
File "usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
op_def=op_def)
File "usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
self._traceback = tf_stack.extract_stack()
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: dev-unsubscribe@bluemarlin.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [incubator-bluemarlin] jimmylao commented on issue #51: [BLUEMARLIN-28] : For DIN-Lookalike model, train.py runs only if line 40 in model.py is commented.
Posted by GitBox <gi...@apache.org>.
jimmylao commented on issue #51:
URL: https://github.com/apache/incubator-bluemarlin/issues/51#issuecomment-1046956459
@Bimlesh759-AI
It's OK to comment out line 40, since the variable it defines is not currently used in the model.
The variable "user_emb_w" is reserved in the model in case more user-specific features are added later. It is declared with shape user_count x hidden_units (128). In your case, I guess "user_count" (the number of users) is huge, so allocating the variable triggers an out-of-memory error on the GPU.
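As a rough check on that guess, the size of the tensor can be estimated from the shape reported in the OOM message. The sketch below is only illustrative: the helper name tensor_bytes is made up for this example, and it assumes that "type float" in the log means 4-byte float32.

```python
# Rough size estimate for the user_emb_w tensor named in the OOM message.
# Assumption: "type float" in the log means 4-byte float32.
# tensor_bytes is a hypothetical helper, not part of the lookalike model.

def tensor_bytes(shape, bytes_per_element=4):
    """Return the number of bytes needed to hold a dense tensor of this shape."""
    count = 1
    for dim in shape:
        count *= dim
    return count * bytes_per_element

user_emb_shape = (94315979, 128)  # shape reported in the error log
size_bytes = tensor_bytes(user_emb_shape)
size_gib = size_bytes / 2**30

print(f"user_emb_w: {size_bytes} bytes (~{size_gib:.1f} GiB)")
# prints: user_emb_w: 48289781248 bytes (~45.0 GiB)
```

At roughly 45 GiB for a single variable, before any optimizer state, the allocation would not fit on most single GPUs, which is consistent with the failure happening during variable initialization.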