Keras/TensorFlow OOM issue

Similar issues:

How to make sure the training phase won't be facing an OOM?

https://*.com/questions/58366819/how-to-make-sure-the-training-phase-wont-be-facing-an-oom

Just some side notes: based on my experience there are two cases of OOM. The first is when the memory needed for your model plus one mini-batch is larger than the memory you have. In that case the training phase never starts, and the fix is to use a smaller batch size. It would be nice if I could calculate the largest batch size my hardware can handle for a particular model, but even if I cannot find it on the first try, I can always find it with some trial and error (since the process fails right away).

The second scenario is when training starts and runs for some time, maybe even a few epochs, and then hits an OOM for no apparent reason. For me this is the frustrating one: it can happen at any time, and you never know whether an ongoing training run will actually finish. So far I have lost days of training time while thinking everything was going along just fine.

Resource exhausted error in the middle of training

https://github.com/tensorflow/tensorflow/issues/4735

 

Solutions:

  • gc.collect()
  • Clear the Keras session with tf.keras.backend.clear_session() (the first two items are combined in the sketch after this list).
  • Use a smaller batch size (at the cost of longer training time).
  • Use a smaller input size or fewer parameters (probably at the cost of worse performance).
  • Set TF_CUDNN_WORKSPACE_LIMIT_IN_MB to a smaller number.
  • Restart the process periodically.
  • Improve the GPU block allocator to be more efficient.
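A minimal sketch of the first two items, assuming a tf.keras workflow; GarbageCollectCallback, fresh_run, and build_model are illustrative names, not Keras APIs. Note that clear_session() throws away the current model/graph state, so it belongs between independent runs (e.g. a hyper-parameter sweep), not in the middle of fit():

```python
import gc

import tensorflow as tf


class GarbageCollectCallback(tf.keras.callbacks.Callback):
    """Free Python-side garbage at the end of every epoch."""
    def on_epoch_end(self, epoch, logs=None):
        gc.collect()


def fresh_run(build_model, x, y, batch_size):
    """Train one model from a clean slate, dropping state from previous runs."""
    tf.keras.backend.clear_session()  # release graph/model state held by Keras
    gc.collect()                      # release lingering Python references
    model = build_model()             # assumed to return a compiled model
    return model.fit(x, y, batch_size=batch_size, epochs=1,
                     callbacks=[GarbageCollectCallback()])
```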

 

TF_CUDNN_WORKSPACE_LIMIT_IN_MB controls how much memory cuDNN may use as temporary scratch space for individual ops. If you set it to 12000, you will have no space left on your GPU to store tensors persistently or for TensorFlow to use as its own scratch space.
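Setting it is just an environment variable; one pattern is to export it in the launching shell, or to set it before TensorFlow is imported so it is in place when cuDNN initializes (the value 1024 below is only an example):

```python
import os

# Cap cuDNN per-op scratch space at roughly 1 GB (example value, tune for your GPU).
# Set before importing TensorFlow so the limit is in place when cuDNN initializes.
os.environ["TF_CUDNN_WORKSPACE_LIMIT_IN_MB"] = "1024"

import tensorflow as tf  # imported only after the variable is set
```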

You are at the cusp of running out of memory. There is likely memory fragmentation going on, which means that as your process runs it eventually becomes unable to allocate temporary buffers anymore. Lowering TF_CUDNN_WORKSPACE_LIMIT_IN_MB reduces the scratch space, which reduces the chance that you will see an out-of-memory error. There are several things you can try:

  1. Use a smaller batch size. You said 128 worked but is twice as slow; what about 192? Try a binary search to find the largest batch size that fits (see the sketch after this list).

  2. Restart the process periodically. Checkpoint every 10000 steps, stop the training process, and restart it. This resets the memory fragmentation (a checkpoint-and-resume sketch follows the discussion below).

  3. Improve the GPU block allocator to be more efficient. That memory allocator tries to avoid fragmentation. Insert instrumentation into it, measure the fragmentation, and verify that it is indeed your problem. Then try to devise a better allocator algorithm that reduces fragmentation without reducing performance.
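
For point 1, a sketch of the binary search, assuming build_model() returns a compiled tf.keras model and x, y are in-memory arrays with at least `high` samples (all names are illustrative). It treats a ResourceExhaustedError during a couple of training steps as "too big":

```python
import gc

import tensorflow as tf


def fits_in_memory(build_model, x, y, batch_size):
    """Return True if a couple of training steps survive at this batch size."""
    tf.keras.backend.clear_session()
    gc.collect()
    try:
        model = build_model()
        for _ in range(2):  # a few steps are enough to allocate the working set
            model.train_on_batch(x[:batch_size], y[:batch_size])
        return True
    except tf.errors.ResourceExhaustedError:
        return False


def largest_batch_size(build_model, x, y, low=1, high=1024):
    """Binary search for the largest batch size that does not OOM."""
    best = None
    while low <= high:
        mid = (low + high) // 2
        if fits_in_memory(build_model, x, y, mid):
            best = mid
            low = mid + 1   # fits, try bigger
        else:
            high = mid - 1  # OOM, go smaller
    return best
```

Because fragmentation builds up over time, it can help to train with a batch size somewhat below the value this search returns.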

Obviously, 1 and 2 are easy solutions and 3 is a long one. Computers simply do not work well when pushed to the edge of memory utilization. Our memory allocators could be conservative, add a buffer, and try to stop you at training step 0 before you get anywhere, but that would prevent some safe models from working and would not be guaranteed anyway. Memory allocation patterns are not deterministic in a parallel environment, unfortunately.
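
For point 2, a minimal checkpoint-and-resume sketch, assuming TF 2.x-style weight checkpoints and an outer loop (shell script or scheduler) that relaunches the script after each run; the path and the 10000-batch frequency are only examples:

```python
import os

import tensorflow as tf

CKPT_PREFIX = "checkpoints/ckpt"   # example path (TF checkpoint format)
SAVE_EVERY_N_BATCHES = 10000       # example frequency


def train_one_run(build_model, x, y, batch_size, epochs_per_run=1):
    os.makedirs(os.path.dirname(CKPT_PREFIX), exist_ok=True)
    model = build_model()          # assumed to return a compiled model
    if os.path.exists(CKPT_PREFIX + ".index"):
        model.load_weights(CKPT_PREFIX)   # resume from the previous run
    checkpoint_cb = tf.keras.callbacks.ModelCheckpoint(
        CKPT_PREFIX, save_weights_only=True, save_freq=SAVE_EVERY_N_BATCHES)
    model.fit(x, y, batch_size=batch_size, epochs=epochs_per_run,
              callbacks=[checkpoint_cb])
    # The outer loop then kills and relaunches this process, which resets
    # any GPU memory fragmentation accumulated during the run.
```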

 

For observing GPU memory usage:

https://forums.fast.ai/t/show-gpu-utilization-metrics-inside-training-loop-without-subprocess-call/26594

https://github.com/anderskm/gputil
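
Both links amount to the same idea: poll the GPU from inside the training loop. A sketch using GPUtil (pip install gputil) as a Keras callback:

```python
import GPUtil
import tensorflow as tf


class GPUMemoryLogger(tf.keras.callbacks.Callback):
    """Print per-GPU memory usage at the end of every epoch."""
    def on_epoch_end(self, epoch, logs=None):
        for gpu in GPUtil.getGPUs():
            print(f"epoch {epoch}: GPU {gpu.id} using "
                  f"{gpu.memoryUsed:.0f}/{gpu.memoryTotal:.0f} MB")
```

Note that TensorFlow reserves most of the GPU memory up front by default, so these numbers are mainly informative when memory growth is enabled (tf.config.experimental.set_memory_growth).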

 
