Posted to dev@bluemarlin.apache.org by GitBox <gi...@apache.org> on 2022/02/21 13:45:44 UTC

[GitHub] [incubator-bluemarlin] Bimlesh759-AI opened a new issue #51: [BLUEMARLIN-28] : For DIN-Lookalike model, train.py runs only if line 40 in model.py is commented.

Bimlesh759-AI opened a new issue #51:
URL: https://github.com/apache/incubator-bluemarlin/issues/51


   1. Training fails at startup unless line 40 in model.py is commented out. With that line commented out, train.py runs successfully.
   In the code below, taken from model.py, line 40 is the one that creates user_emb_w:
       hidden_units = 128

       user_emb_w = tf.get_variable("user_emb_w", [user_count, hidden_units])
       item_emb_w = tf.get_variable("item_emb_w", [item_count, hidden_units // 2])
       item_b = tf.get_variable("item_b", [item_count],
                                initializer=tf.constant_initializer(0.0))
       cate_emb_w = tf.get_variable("cate_emb_w", [cate_count, hidden_units // 2])
       cate_list = tf.convert_to_tensor(cate_list, dtype=tf.int64)

   The following error is displayed:
   ------------------------------------------------------------------------------------------------------------------------------------------------
   2022-02-21 17:42:43.558189: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at random_op.cc:76 : Resource exhausted: OOM when allocating tensor with shape[94315979,128] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
   Traceback (most recent call last):
     File "/usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
       return fn(*args)
     File "/usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
       target_list, run_metadata)
     File "/usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
       run_metadata)
   tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[94315979,128] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
   	 [[{{node user_emb_w/Initializer/random_uniform/RandomUniform}}]]
   Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
   
   
   During handling of the above exception, another exception occurred:
   
   Traceback (most recent call last):
     File "lookalike_model/trainer/train.py", line 179, in <module>
       sess.run(tf.global_variables_initializer())
     File "/usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 956, in run
       run_metadata_ptr)
     File "/usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1180, in _run
       feed_dict_tensor, options, run_metadata)
     File "/usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
       run_metadata)
     File "/usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
       raise type(e)(node_def, op, message)
   tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[94315979,128] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
   	 [[node user_emb_w/Initializer/random_uniform/RandomUniform (defined at usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
   Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
   
   
   Original stack trace for 'user_emb_w/Initializer/random_uniform/RandomUniform':
     File "/algorithm/lookalike_model/trainer/train.py", line 178, in <module>
       model = Model(user_count, item_count, cate_count, cate_list, predict_batch_size, predict_ads_num)
     File "/algorithm/lookalike_model/trainer/model.py", line 40, in __init__
       user_emb_w = tf.get_variable("user_emb_w", [user_count, hidden_units]) 
     File "usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/ops/variable_scope.py", line 1500, in get_variable
       aggregation=aggregation)
     File "usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/ops/variable_scope.py", line 1243, in get_variable
       aggregation=aggregation)
     File "usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/ops/variable_scope.py", line 567, in get_variable
       aggregation=aggregation)
     File "usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/ops/variable_scope.py", line 519, in _true_getter
       aggregation=aggregation)
     File "usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/ops/variable_scope.py", line 933, in _get_single_variable
       aggregation=aggregation)
     File "usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/ops/variables.py", line 258, in __call__
       return cls._variable_v1_call(*args, **kwargs)
     File "usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/ops/variables.py", line 219, in _variable_v1_call
       shape=shape)
     File "usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/ops/variables.py", line 197, in <lambda>
       previous_getter = lambda **kwargs: default_variable_creator(None, **kwargs)
     File "usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/ops/variable_scope.py", line 2519, in default_variable_creator
       shape=shape)
     File "usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/ops/variables.py", line 262, in __call__
       return super(VariableMetaclass, cls).__call__(*args, **kwargs)
     File "usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/ops/variables.py", line 1688, in __init__
       shape=shape)
     File "usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/ops/variables.py", line 1818, in _init_from_args
       initial_value(), name="initial_value", dtype=dtype)
     File "usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/ops/variable_scope.py", line 905, in <lambda>
       partition_info=partition_info)
     File "usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/ops/init_ops.py", line 533, in __call__
       shape, -limit, limit, dtype, seed=self.seed)
     File "usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/ops/random_ops.py", line 245, in random_uniform
       rnd = gen_random_ops.random_uniform(shape, dtype, seed=seed1, seed2=seed2)
     File "usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/ops/gen_random_ops.py", line 822, in random_uniform
       name=name)
     File "usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
       op_def=op_def)
     File "usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func
       return func(*args, **kwargs)
     File "usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
       attrs, op_def, compute_device)
     File "usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
       op_def=op_def)
     File "usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
       self._traceback = tf_stack.extract_stack()
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@bluemarlin.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-bluemarlin] jimmylao commented on issue #51: [BLUEMARLIN-28] : For DIN-Lookalike model, train.py runs only if line 40 in model.py is commented.

Posted by GitBox <gi...@apache.org>.
jimmylao commented on issue #51:
URL: https://github.com/apache/incubator-bluemarlin/issues/51#issuecomment-1046956459


   @Bimlesh759-AI 
   It's OK to comment out line 40, since the model does not currently use it.
   The variable "user_emb_w" is reserved in case more user-specific features are added later. It is declared with shape user_count x hidden_units (128). In your case, I guess "user_count" (the number of users) is huge, so allocating the variable triggers the out-of-memory error.
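
   As a rough check on that guess, the size of the failing allocation can be worked out from the shape in the OOM message. This is a minimal sketch, assuming 4 bytes per element (the log reports type float, i.e. float32). embedding_bytes is a hypothetical helper for illustration, not part of the project, and it counts only the variable itself.

```python
def embedding_bytes(rows, cols, bytes_per_element=4):
    """Bytes needed for a dense [rows, cols] table (float32 by default)."""
    return rows * cols * bytes_per_element


user_count = 94_315_979  # first dimension of the shape in the OOM message
hidden_units = 128       # value set in model.py

size = embedding_bytes(user_count, hidden_units)
print(f"user_emb_w: {size:,} bytes = {size / 2**30:.2f} GiB")
# prints: user_emb_w: 48,289,781,248 bytes = 44.97 GiB
```

   Roughly 45 GiB for this single variable likely exceeds the memory of the GPU in use, which would be consistent with the ResourceExhaustedError raised during variable initialization.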

