sbatch
#!/bin/bash
#SBATCH --job-name=myjobname
#SBATCH --nodes=1
#SBATCH --time=00:10:00
#SBATCH --mail-user=useremail@kaust.edu.sa
#SBATCH --mail-type=ALL
#SBATCH --error=JobName.%J.err
#SBATCH --output=JobName.%J.out
#SBATCH --partition=batch
#SBATCH --gpus-per-node=v100:1
#SBATCH --cpus-per-gpu=12
#SBATCH --mem=64G
#Go to your working directory
cd /my_working_dir/
#Load the desired application module if necessary
module load module_name
#Replace the line below with your launch command:
your_command_goes_here
To submit a job
sbatch myjobscript.sh
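On success, sbatch prints the ID assigned to your job; note it down, since scancel and srun --jobid (below) both need it. The ID shown here is illustrative:
Submitted batch job 12345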
To cancel a job
scancel jobid
To check the status of your jobs
squeue -u username
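Typical squeue output looks like the following (job ID, node name, and time are illustrative). The ST column shows the job state, e.g. PD for pending and R for running:
JOBID PARTITION     NAME     USER ST   TIME  NODES NODELIST(REASON)
12345     batch myjobnam username  R   2:31      1 gpu214-02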
srun
srun allows you to use the cluster interactively, just like a terminal on your local machine, which is very useful when you want to debug your code. srun is convenient, but the session dies as soon as you lose your connection to Ibex. Use tmux to protect the session: if your connection drops, log back into Ibex and reattach to the tmux session to get back onto the node.
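A minimal sketch of this pattern (the session name, time limit, and resource flags are illustrative; adjust them to your needs):
# on the Ibex login node, start a named tmux session
tmux new -s debug
# inside tmux, request an interactive shell on a compute node
srun --time=01:00:00 --gpus=1 --cpus-per-gpu=12 --mem=64G --pty bash
# detach with Ctrl-b d; after a dropped connection, log back in and run:
tmux attach -t debug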
You can also srun into a node you have already allocated:
srun --jobid=yourjobid --pty bash
To do that, first use sbatch to request the resources and start your training there; this srun then just acts as a tunnel into the running job's node. Once on the node, you can check your GPU memory usage with nvidia-smi, etc.
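For example, assuming the job submitted earlier received ID 12345 (illustrative), the full workflow looks like:
sbatch myjobscript.sh           # prints: Submitted batch job 12345
srun --jobid=12345 --pty bash   # open a shell on the node running job 12345
nvidia-smi                      # check GPU utilization and memory usage
exit                            # leave the node; the batch job keeps running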