Shared memory problems when using PyTorch in Docker

ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm)

Problem

ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm)

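This error shows up when training code runs inside a Docker container on a server and the batch size is set too large, so the DataLoader workers run out of shared memory. Docker limits the container's shm (64 MB by default).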
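
To confirm that shared memory really is the bottleneck, it can help to look at how much of it the container has. The snippet below is a minimal sketch using only the standard library. The helper name `shm_usage_mib` is made up for illustration, and it assumes shared memory is mounted at `/dev/shm`, as it is in a typical Linux container.

```python
import shutil


def shm_usage_mib(path="/dev/shm"):
    """Return (total, used, free) for the filesystem at `path`, in MiB.

    `path` defaults to /dev/shm, which is an assumption about where
    the container mounts its shared memory.
    """
    usage = shutil.disk_usage(path)
    mib = 1024 * 1024
    return usage.total / mib, usage.used / mib, usage.free / mib
```

Calling `shm_usage_mib()` inside the container before starting training gives a quick reading. In a container started without `--shm-size`, the total will typically be Docker's default of 64 MiB.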

According to the PyTorch README:

Please note that PyTorch uses shared memory to share data between processes, so if torch multiprocessing is used (e.g. for multithreaded data loaders) the default shared memory segment size that container runs with is not enough, and you should increase shared memory size either with --ipc=host or --shm-size command line options to nvidia-docker run.

Solutions

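  1. The README passage explains that PyTorch's IPC relies on shared memory, so the shared memory segment must be large enough. Increase it with docker run --shm-size.
  2. Alternatively, pass --ipc=host to docker run.
  3. Set the DataLoader's num_workers to 0, at the cost of slower training.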
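
When the container's flags cannot be changed, a middle ground between the configured worker count and `num_workers=0` is to cap the number of workers by the shared memory that is free. The sketch below is a rough heuristic, not PyTorch's own accounting. It assumes each worker keeps up to `prefetch_factor` batches in shared memory (2 is the DataLoader default when workers are used). The function name `max_workers_for_shm` and the `free_bytes` override are hypothetical.

```python
import shutil


def max_workers_for_shm(requested_workers, batch_bytes, prefetch_factor=2,
                        free_bytes=None, shm_path="/dev/shm"):
    """Cap `requested_workers` so the estimated in-flight batches fit in shm.

    Assumption (heuristic): every worker holds up to `prefetch_factor`
    batches of `batch_bytes` bytes each in shared memory at once.
    """
    if free_bytes is None:
        try:
            free_bytes = shutil.disk_usage(shm_path).free
        except FileNotFoundError:
            # No shm mount to inspect, so leave the setting unchanged.
            return requested_workers
    per_worker = batch_bytes * prefetch_factor
    if per_worker <= 0:
        return requested_workers
    # A result of 0 means: load data in the main process.
    return max(0, min(requested_workers, free_bytes // per_worker))


# 64 MiB of shm, 16 MiB batches, 2 prefetched batches per worker:
print(max_workers_for_shm(8, 16 * 2**20, free_bytes=64 * 2**20))  # prints 2
```

The returned value can be passed as `num_workers` when constructing the DataLoader. The estimate ignores other processes using `/dev/shm`, so treat it as a starting point rather than a guarantee.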

yolov3 issue#283

PyTorch on K8S: tracking down a shared memory problem

12 PyTorch pitfalls