A batch job is a non-interactive (no user input is possible) way to run an application in a pre-determined way. What happens during the batch job is controlled by the job script (sometimes known as a "submit script"). When a batch job is submitted to the system, it is put in a queue and started at a later time (sometimes immediately). An obvious advantage of this approach is that you can queue many batch jobs at the same time; they will start automatically once resources are available, i.e. you do not need to sit in front of the computer to start your calculations.
Preparing a batch job:
Submitting a batch job:
Load any modules needed to run your job. The environment in the shell where you run "sbatch" will be saved and recreated when the job starts. This includes the current working directory. You can also place the "module load" commands in your job script, and they will then be run automatically when the job starts.
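As a sketch, the beginning of a job script that loads its own modules could look like this (the module and application names below are placeholders; use "module avail" on the cluster to see what is actually installed):

```shell
#!/bin/bash
#SBATCH -J myjobname
#SBATCH -t 00:30:00
#SBATCH -N 1
#
# Load the modules the application needs (hypothetical module name).
module load buildenv-intel/2018a
#
# Run the application (placeholder name).
./myapp
```

Loading modules inside the script makes the job self-contained, so it runs the same way regardless of what was loaded in the shell you submitted from.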
Once in the queue, the job might be started immediately (if enough idle compute resources are available) or it might have to wait in the queue for a while (minutes, hours, days or in extreme cases even longer).
Different NSC systems have very different scheduling policies and utilization, so queue times vary significantly between systems and projects. See the system documentation for more details.
If you don't understand why your job won't start, please contact NSC Support.
You can monitor all your jobs, both batch and interactive, using the "squeue" command (e.g.
squeue -u $USER to see your jobs).
When the job has started, the standard output and standard error from the job script (which will contain output from your application if you have not redirected it elsewhere) will be written to a file named
slurm-NNNNN.out in the directory where you submitted the job (NNNNN is replaced with the job ID).
If you need all the details about a pending or running job, use
scontrol show job JOBID. Use
squeue to find the job ID you need.
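For example (the job ID below is just an illustration; squeue prints the real IDs in its first column):

```shell
# List your own jobs; the first column is the job ID.
squeue -u $USER

# Show full details (nodes, time limits, the reason a job is
# still waiting, ...) for one specific job.
scontrol show job 1817147
```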
If you want to cancel (end) a queued or running job, use the
scancel command and provide the job ID (e.g. scancel JOBID).
The environment (current working directory and environment variables such as $PATH) that was set when you submitted the job is recreated on the node where the job will be started.
The job script starts executing on the first node allocated to the job. If you have requested more than one node, your job script is responsible for starting your processes on all nodes in the job, e.g. by using srun, ssh or an MPI launcher.
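A minimal sketch of this behaviour, using srun to start a command on all allocated nodes (this script only illustrates where commands run; it does no useful work):

```shell
#!/bin/bash
#SBATCH -N 2
#SBATCH -t 00:10:00
#
# This line runs only on the first node allocated to the job.
hostname
#
# srun starts one copy of the command on every allocated node,
# so this prints the name of each node in the job.
srun hostname
```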
If the job is still running when its wall time limit is reached (set with e.g. sbatch -t HH:MM:SS), the job will be killed automatically.
You can now fetch the output files generated by your job.
Sample job script: run an MPI application "mympiapp" on two "exclusive" (not shared with others) nodes
#!/bin/bash
#
#SBATCH -J myjobname
#SBATCH -t 00:30:00
#SBATCH -N 2
#SBATCH --exclusive
#
mpprun ./mympiapp
#
# Script ends here
Sample job script: run a single-threaded application on a single core and allocate 2GB RAM (the node might be shared with other jobs). Also send an email when the job starts and ends.
Note: jobs using parts of a node are only supported on certain NSC systems (e.g. Triolith).
#!/bin/bash
#
#SBATCH -J myjobname
#SBATCH -t 00:30:00
#SBATCH --mem=2000
#SBATCH -n 1
#SBATCH --mail-type=ALL
#SBATCH --mail-user=email@example.com
#
# Run a single task in the foreground.
./myapp --flags
#
# Script ends here
Hint: most of our clusters have a few nodes reserved for test and development (see the system documentation for details). Use these nodes to quickly check your job script before submitting it to the normal queue (where you might need to wait for hours or days before your job starts, only to find out that you made a simple error in the job script).
You can also use the
interactive command to get an interactive login session on a compute node. From there you can test your application and job script interactively in an environment that is almost identical to the one the real batch job will run in.
interactive takes the same command line options as sbatch (e.g. -t, -N and --reservation).
The advantage of testing batch jobs in an interactive session is that you can quickly fix a bug, re-run the script, find another bug, fix it, ... This can speed up the process of debugging job scripts significantly compared to submitting them normally.
[x_makro@triolith1 ~]$ interactive -t 00:10:00 -n2 --reservation=devel
Waiting for JOBID 1817147 to start
[x_makro@n1 ~]$ bash myjob.sh
myjob.sh: line 2: badspell: command not found
Now I edit myjob.sh and fix the problem, and run it again:
[x_makro@n1 ~]$ bash myjob.sh
1
2
3
Here I press Control-C to stop the job, as it seems to be working now.
Great, now all that remains is to end the interactive session (type
exit) and submit the job normally:
[x_makro@n1 ~]$ exit
[x_makro@triolith1 ~]$ sbatch -t 3-00:00:00 -N 128 --exclusive myjob.sh
Submitted batch job 1817151
[x_makro@triolith1 ~]$
The "wall time" limit (set with the
-t D-HH:MM:SS option to sbatch/interactive) determines how long your job may run (in actual hours, not core hours) before it is terminated by the system.
If your job ends before the time limit is up, your project will only be charged for the actual time used.
However, there are a few reasons for not always asking for the maximum allowed time. For example, the scheduler can often start a job with a shorter wall time limit earlier, by backfilling it into gaps in the schedule.
On the other hand, we recommend adding a margin to the wall time setting, to prevent jobs from failing if they for some reason run slightly slower than expected (e.g. due to high load on the disk storage system).
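As an illustration, a small helper function (hypothetical, not an NSC tool) that takes an estimated runtime in minutes and prints an sbatch -t value with a 20% margin added:

```shell
#!/bin/bash
# add_margin MINUTES: print a wall time limit in HH:MM:SS format
# consisting of the estimated runtime plus a 20% safety margin.
add_margin() {
    local est=$1
    # Round the margin up so very short jobs also get some slack.
    local total=$(( est + (est * 20 + 99) / 100 ))
    printf '%02d:%02d:00\n' $(( total / 60 )) $(( total % 60 ))
}

add_margin 100   # 100 min + 20% margin = 120 min, prints 02:00:00
```

You would then use the printed value as the argument to sbatch -t (or put it in the #SBATCH -t line of the job script).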
Please read the man pages (e.g. run
man sbatch) on the cluster or read them online.