The Stanford AI Lab cluster aggregates research compute nodes from various groups within the lab and controls them via a central batch queueing system that coordinates all jobs running on the cluster. The nodes should not be accessed directly, as the scheduler allocates resources such as CPU, memory, and GPU exclusively to each job.
Once you have access to the cluster, you can submit, monitor, and cancel jobs from the headnode, sc.stanford.edu. This machine should not be used for any compute-intensive work; however, you can get a shell on a compute node simply by starting an interactive job. You may also monitor (read-only) your jobs and the status of the cluster using the web-based dashboard at https://sc.stanford.edu.
You can use the cluster by starting batch jobs or interactive jobs. Interactive jobs give you access to a shell on one of the nodes, from which you can execute commands by hand, whereas batch jobs run from a given shell script in the background and automatically terminate when finished.
If you encounter any problems using the cluster, please send us a request via http://support.cs.stanford.edu and be as specific as you can when describing your issue.
To gain access to the cluster, please submit a request via http://support.cs.stanford.edu and state the following: (i) your CS login ID, (ii) the name of the professor you're working with (and cc them on the form).
If we have any trouble with your job, we will try to get in touch with you, but we reserve the right to kill your jobs at any time.
If you have questions about the cluster, send us a request at http://support.cs.stanford.edu.
Use of the cluster is coordinated by a batch queue scheduler, which assigns compute nodes to jobs in an order that depends on the time submitted, the number of nodes requested, and the availability of the resources being requested (e.g. GPU, memory).
You can submit two kinds of jobs to the cluster: interactive and batch.
Interactive jobs give you access to a shell on one of the nodes, from which you can execute commands by hand, whereas batch jobs run from a given shell script in the background and automatically terminate when finished.
Generally speaking, interactive jobs are used for building, prototyping and testing, while batch jobs are used thereafter.
Batch jobs are the most common way to interact with the cluster, and are useful when you do not need to interact with the shell to perform the desired task. Two clear advantages are that your job will be managed automatically after submission, and that placing your setup commands in a shell script lets you efficiently dispatch multiple similar jobs. To start a simple batch job on a partition (group you work with, see bottom of the page), ssh into sc and type:
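As a minimal sketch of the batch workflow described above, the following writes a small batch script and submits it with sbatch. The file name hello_batch.sh, the job name, and the resource values are illustrative assumptions, not cluster defaults; replace mypartition with a partition your group owns.

```shell
# Write a minimal batch script (file name and values are illustrative).
cat > hello_batch.sh <<'EOF'
#!/bin/bash
# Partition to run in; replace with one your group owns.
#SBATCH --partition=mypartition
#SBATCH --job-name=hello
# %j expands to the job ID, so each run gets its own output file.
#SBATCH --output=hello-%j.out
# Request one GPU; drop this line for CPU-only work.
#SBATCH --gres=gpu:1
# Wall-clock limit (HH:MM:SS).
#SBATCH --time=00:10:00

echo "Running on $(hostname)"
EOF

# Submit from the headnode; skipped if sbatch is not installed here.
if command -v sbatch >/dev/null 2>&1; then
    sbatch hello_batch.sh
else
    echo "sbatch not found; run this on sc.stanford.edu"
fi
```

The job's output should land in hello-<jobid>.out in the directory you submitted from.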
There are many parameters you can define based on your requirements. You can refer to a sample submit script at /sailhome/software/sample-batch.sh.
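To illustrate dispatching multiple similar jobs from one script, here is a hedged sketch of a small parameter sweep. The script train.sh, the variable LR, and the values are hypothetical placeholders. By default the loop only prints the sbatch commands it would run, so you can check them before submitting anything.

```shell
# Hypothetical sweep: train.sh, LR, and the values below are placeholders.
# SBATCH defaults to "echo sbatch", so the loop only prints the commands.
# Set SBATCH=sbatch on the headnode to actually submit.
SBATCH=${SBATCH:-echo sbatch}

submit_sweep() {
    for lr in 0.1 0.01 0.001; do
        # --export=ALL,LR=... passes LR into the job's environment.
        $SBATCH --partition=mypartition --job-name="sweep-lr$lr" \
                --export=ALL,LR="$lr" train.sh
    done
}

submit_sweep
```

Inside train.sh the value would then be available as $LR.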
For further documentation on submitting batch jobs via Slurm, see the online sbatch documentation via SchedMD.
Our friends at the Stanford Research Computing Center, who run the Sherlock cluster with Slurm, also have a wonderful write-up, and most of it applies to us too: Sherlock Cluster
Interactive jobs are useful for compiling and prototyping code intended to run on the cluster, performing one-time tasks, and executing software that requires runtime feedback. To start an interactive job, ssh into sc and type:
srun --partition=mypartition --pty bash
The above will allocate a node in mypartition and drop you into a bash shell. You can also add other parameters as necessary.
srun --partition=mypartition --nodelist=node1 --gres=gpu:1 --pty bash
The above will allocate node1 in mypartition with 1 GPU and drop you into a bash shell.
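Once inside the interactive shell, it can help to confirm what the scheduler actually gave you. The sketch below assumes Slurm's usual environment variables (SLURM_JOB_ID, SLURM_JOB_NODELIST) and that the GPU plugin sets CUDA_VISIBLE_DEVICES, which depends on how the cluster is configured. show_allocation is a made-up helper name.

```shell
# Print what the scheduler allocated to the current shell.
# Falls back to a placeholder when a variable is unset (e.g. on the headnode).
show_allocation() {
    echo "job id:  ${SLURM_JOB_ID:-none (not inside a Slurm job)}"
    echo "nodes:   ${SLURM_JOB_NODELIST:-unknown}"
    echo "gpus:    ${CUDA_VISIBLE_DEVICES:-none visible}"
}

show_allocation
```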
For further documentation on the srun command, see the online srun documentation via SchedMD.
You can view a list of all jobs running on the cluster by typing:
squeue
Or use the online dashboard at http://sc.stanford.edu.
You can view detailed information for a specific job by typing:
scontrol show job jobid
Or open the online dashboard at http://sc.stanford.edu and click on the job.
To cancel a job you started, type:
scancel jobid
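The monitoring and cancellation steps can be wrapped in small helpers around the standard Slurm client commands squeue, scontrol, and scancel. This is only a sketch: the helper names and the DRY_RUN switch are assumptions made for illustration, and the job ID is made up.

```shell
# Set DRY_RUN=1 to print each command instead of executing it.
run() {
    if [ -n "${DRY_RUN:-}" ]; then echo "+ $*"; else "$@"; fi
}

# List only your own jobs (plain `squeue` lists every job on the cluster).
my_jobs() { run squeue -u "${USER:-$(id -un)}"; }

# Show scheduler details for one job, then cancel it.
# $1 is a job ID as printed by sbatch or shown by squeue.
inspect_and_cancel() {
    run scontrol show job "$1"
    run scancel "$1"
}

# Dry run with a made-up job ID; nothing is sent to the scheduler.
DRY_RUN=1 inspect_and_cancel 12345
```

On the headnode, drop DRY_RUN=1 and pass a real job ID.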
For a good comparison of Torque/PBS commands and their Slurm equivalents, see https://www.sdsc.edu/~hocks/FG/PBS.slurm.html
There are several storage options for the scail cluster:
Home directory: /sailhome/csid
All sc cluster nodes mount a common network volume for your home directory. This is a good place for submission scripts, outputs, etc. Each user has a quota of 20GB.
Scratch Storage via NFS:
/scail/scratch and /scail/data - old/general network filesystem across SAIL, to be deprecated soon.
/atlas - Prof. Stefano Ermon
/cvgl, /cvgl2 - Prof. Silvio Savarese
/deep - Prof. Andrew Ng
/next - Prof. Emma Brunskill
/vision - Prof. Fei-Fei Li and Juan Carlos Niebles
All NLP NFS filesystems, including /u/nlp, are automounted on sc and the NLP nodes - Prof. Chris Manning
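Since the home directory has a 20GB quota, a quick usage check before writing large outputs can save a failed job. The sketch below uses plain du. The helper name home_usage is hypothetical, and treating the quota as 20 * 1024 * 1024 KB is an assumption about how it is counted.

```shell
# Assumption: the 20GB quota equals 20 * 1024 * 1024 blocks of 1 KB.
QUOTA_KB=$((20 * 1024 * 1024))

# Print the usage of a directory (default: $HOME) as a percentage of the quota.
home_usage() {
    dir=${1:-$HOME}
    used_kb=$(du -sk "$dir" 2>/dev/null | cut -f1)
    used_kb=${used_kb:-0}   # treat an unreadable directory as empty
    echo "$dir: ${used_kb} KB used, $((used_kb * 100 / QUOTA_KB))% of quota"
}
```

Call home_usage with no argument to check your home directory, or pass a path to check a subdirectory. Large intermediate data is usually better placed on your group's scratch filesystem.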
These are the partitions currently enabled on sc (the list will grow soon as we migrate more production GPU nodes). Please only submit jobs to the partitions that your group owns. More details about the nodes are at https://cs.stanford.edu/csdcf/sail-compute-cluster/hardware
atlas - atlas[1-10] - GPU nodes for Prof. Stefano Ermon, various types of GPUs
deep - deep[1-16] - GPU nodes for Prof. Andrew Ng, each with 4 GTX 1070
jag - jagupard[14-20] - each node has 4 TitanV (Volta) GPUs
macondo - macondo - GPU nodes for Fei-Fei/Juan Carlos, each with 9 GTX 1080ti
napoli-gpu - napoli[111,112] - GPU nodes for CVGL, various types of GPUs
next - next[1-2] - GPU nodes for Prof. Emma Brunskill, each with 9 GTX 1080ti
tibet - tibet[10-15] - each node has 4 K40 GPUs
CPU-only nodes (shared by all users within the cluster)
gorgon - gorgon[50-69] - CPU-only nodes for Prof. Andrew Ng and the Deep Learning group
napoli-cpu - napoli[1-7,9-16] - CPU-only nodes for CVGL
visionlab - visionlab[1-25] - CPU-only nodes for Fei-Fei/Juan Carlos, docker/docker-compose available.
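To see the live state of a partition from the headnode instead of relying on the static list above, Slurm's sinfo command can be used. The sketch below is only an illustration: partition_info is a hypothetical helper name, and deep is just an example partition taken from the list.

```shell
# Show partition name, node count, node list, and generic resources (GPUs).
# $1 is a partition name, e.g. one of those listed above.
partition_info() {
    if command -v sinfo >/dev/null 2>&1; then
        sinfo --partition="$1" --format="%P %D %N %G"
    else
        echo "sinfo not found; run this on sc.stanford.edu" >&2
        return 1
    fi
}

partition_info deep || true
```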