Colab GPU: Resource exhausted
I added allow_growth, but the run still cannot finish. How can I solve this?
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
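(For reference, a minimal sketch of how this option is normally wired into the Keras session under TF 1.15; the session setup here is illustrative. Note that allow_growth only stops TensorFlow from grabbing all GPU memory up front; it does not reduce how much memory the model itself needs.)
import tensorflow as tf
from tensorflow.keras import backend as K

config = tf.ConfigProto()
config.gpu_options.allow_growth = True    # allocate GPU memory on demand instead of all at once
K.set_session(tf.Session(config=config))  # make Keras use this session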
Epoch 50/50
5/5 [==============================] - 7s 1s/step - loss: 72.5174 - val_loss: 70.2399
Unfreeze all of the layers.
Train on 90 samples, val on 10 samples, with batch size 16.
Epoch 51/100
---------------------------------------------------------------------------
ResourceExhaustedError Traceback (most recent call last)
in ()
70 epochs=100,
71 initial_epoch=50,
---> 72 callbacks=[logging, checkpoint, reduce_lr, early_stopping])
73 model.save_weights(log_dir + 'trained_weights_final.h5')
6 frames
/tensorflow-1.15.2/python3.6/tensorflow_core/python/client/session.py in __call__(self, *args, **kwargs)
1470 ret = tf_session.TF_SessionRunCallable(self._session._session,
1471 self._handle, args,
-> 1472 run_metadata_ptr)
1473 if run_metadata:
1474 proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)
ResourceExhaustedError: 2 root error(s) found.
(0) Resource exhausted: OOM when allocating tensor with shape[16,105,105,128] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node zero_padding2d_3/Pad}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
(1) Resource exhausted: OOM when allocating tensor with shape[16,105,105,128] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node zero_padding2d_3/Pad}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[[loss_1/add_74/_5299]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
0 successful operations.
0 derived errors ignored.
Answers
-
2020/04/13 05:32 PM · 胡連福
I have run into this problem before. The resource exhaustion starts right at epoch 51, because at that point all layers have been unfrozen, which consumes much more GPU memory. I suggest you check:
Have you already run many training sessions in this Colab session, leaving too much memory occupied? Try shutting down the runtime, reconnecting to Colab, and training again (you can check current GPU memory as shown below).
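Before re-training, you can see how much GPU memory is actually free in the current runtime by running nvidia-smi from a Colab cell:
# Run in a Colab cell: shows current GPU memory usage and any leftover allocations
!nvidia-smi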
-
2020/10/01 07:41 PM · Patrick Ruan
You can try reducing the batch size to resolve the OOM (out of memory) error.
Going from SGD to mini-batch training, beyond seeing how mini-batches help gradient descent, it is also worth learning that when GPU memory is limited, shrinking the mini-batch size helps a great deal; the trade-off is longer training time.
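As a minimal sketch of what that change looks like for the fit call shown in the traceback (the data arguments x_train/y_train/x_val/y_val are placeholders, since the full training script is not shown; if the script feeds data through a generator, pass the smaller batch size to the generator instead):
batch_size = 8  # was 16; each training step now needs roughly half the activation memory

model.fit(x_train, y_train,
          validation_data=(x_val, y_val),
          batch_size=batch_size,
          epochs=100,
          initial_epoch=50,
          callbacks=[logging, checkpoint, reduce_lr, early_stopping])
model.save_weights(log_dir + 'trained_weights_final.h5')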