Colab GPU: Resource exhausted
I added allow_growth, but the run still cannot finish. How can I solve this?
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
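(For reference, a minimal sketch of how this option is normally wired into the Keras session under TF 1.15; the session setup here is illustrative. Note that allow_growth only stops TensorFlow from grabbing all GPU memory up front; it does not reduce how much memory the model itself needs.)
import tensorflow as tf
from tensorflow.keras import backend as K

config = tf.ConfigProto()
config.gpu_options.allow_growth = True    # allocate GPU memory on demand instead of all at once
K.set_session(tf.Session(config=config))  # make Keras use this session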
Epoch 50/50
5/5 [==============================] - 7s 1s/step - loss: 72.5174 - val_loss: 70.2399
Unfreeze all of the layers.
Train on 90 samples, val on 10 samples, with batch size 16.
Epoch 51/100
---------------------------------------------------------------------------
ResourceExhaustedError Traceback (most recent call last)
in ()
70 epochs=100,
71 initial_epoch=50,
---> 72 callbacks=[logging, checkpoint, reduce_lr, early_stopping])
73 model.save_weights(log_dir + 'trained_weights_final.h5')
6 frames
/tensorflow-1.15.2/python3.6/tensorflow_core/python/client/session.py in __call__(self, *args, **kwargs)
1470 ret = tf_session.TF_SessionRunCallable(self._session._session,
1471 self._handle, args,
-> 1472 run_metadata_ptr)
1473 if run_metadata:
1474 proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)
ResourceExhaustedError: 2 root error(s) found.
(0) Resource exhausted: OOM when allocating tensor with shape[16,105,105,128] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node zero_padding2d_3/Pad}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
(1) Resource exhausted: OOM when allocating tensor with shape[16,105,105,128] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node zero_padding2d_3/Pad}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[[loss_1/add_74/_5299]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
0 successful operations.
0 derived errors ignored.
Answers
-
2020/04/13 05:32 PM · 胡連福
I have run into this problem before. The resource exhaustion starts right at epoch 51, because at that point all layers have been unfrozen, which consumes much more GPU memory. I suggest you check:
Have you already run many training sessions in this Colab session, leaving too much memory occupied? Try shutting down the runtime, reconnecting to Colab, and training again (you can check current GPU memory as shown below).
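Before re-training, you can see how much GPU memory is actually free in the current runtime by running nvidia-smi from a Colab cell:
# Run in a Colab cell: shows current GPU memory usage and any leftover allocations
!nvidia-smi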
-
2020/10/01 07:41 PM · Patrick Ruan
You can try reducing the batch size to resolve the OOM (out of memory) error.
Going from SGD to mini-batch training, beyond seeing how mini-batches help gradient descent, it is also worth learning that when GPU memory is limited, shrinking the mini-batch size helps a great deal; the trade-off is longer training time.
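As a minimal sketch of what that change looks like for the fit call shown in the traceback (the data arguments x_train/y_train/x_val/y_val are placeholders, since the full training script is not shown; if the script feeds data through a generator, pass the smaller batch size to the generator instead):
batch_size = 8  # was 16; each training step now needs roughly half the activation memory

model.fit(x_train, y_train,
          validation_data=(x_val, y_val),
          batch_size=batch_size,
          epochs=100,
          initial_epoch=50,
          callbacks=[logging, checkpoint, reduce_lr, early_stopping])
model.save_weights(log_dir + 'trained_weights_final.h5')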