R is a language and environment for statistical computing and graphics. R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, …) and graphical techniques, and is highly extensible.
R is an integrated suite of software facilities for data manipulation, calculation and graphical display. It includes
The R installations at NSC are generally maintained by Johan Raber (firstname.lastname@example.org), but please use our support address (email@example.com) for questions, problem reports etc.
The installed R versions before 3.0 were compiled with GCC and without optimised linear algebra libraries. Versions 3.0 and above were compiled with the Intel compiler and the MKL/OpenBLAS linear algebra libraries, and outperform the previous installations by as much as a factor of three in some benchmarks. If you are interested in seeing how it was built, check out the file /software/apps/R/3.1.1/build.txt.
Take care when running the parallelised parts of R, because their parallelisation model may conflict with that of the more standards-based BLAS libraries. A case in point is the R “parallel” package, which does not use OpenMP for parallelisation (as the BLAS library does). If you start as many threads as there are cores in the server and also set the OMP_NUM_THREADS environment variable higher than one (the default), overall performance will degrade. However, if you run a non-parallelised part of R that performs taxing linear algebra operations, e.g. large matrix inversions, setting OMP_NUM_THREADS higher than one is highly beneficial. You will need to experiment with this setting to achieve optimal performance, but there is no point in going above 16, as the Triolith servers have no more CPU cores than that.
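As a starting point for that experimentation, the following is a minimal sketch of one possible rule of thumb: keep the number of R “parallel” workers multiplied by the number of BLAS threads at or below the 16 cores of a node. The rule itself and the variable names CORES_PER_NODE and R_WORKERS are assumptions for illustration, not an NSC recommendation.

```shell
#!/bin/bash
# Hypothetical helper: split the cores of one node between R "parallel"
# workers and BLAS (OpenMP) threads so they do not oversubscribe the CPU.
CORES_PER_NODE=16   # from the text: a Triolith node has 16 cores
R_WORKERS=4         # assumption: number of workers your R script starts

# Integer division; never go below one BLAS thread per worker.
THREADS=$(( CORES_PER_NODE / R_WORKERS ))
if [ "$THREADS" -lt 1 ]; then
    THREADS=1
fi

export OMP_NUM_THREADS="$THREADS"
echo "R workers: $R_WORKERS, OMP_NUM_THREADS: $OMP_NUM_THREADS"
```

With R_WORKERS=16 this falls back to OMP_NUM_THREADS=1, and with R_WORKERS=1 it gives all 16 cores to the linear algebra library. Measure your own workload before settling on a value.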
The IDE Rstudio has become quite popular among R developers and is therefore available on Triolith. Load it with, for instance,
module load rstudio/0.98.1028
We recommend using it together with the VNC solution ThinLinc rather than via X forwarding, although X forwarding also works.
Load the R module corresponding to the version you want to use. To see which versions are available, run
module avail R
We strongly recommend using the latest version of R when you have a choice. For instance
module load R/3.0.1
For doing interactive R work, first allocate a node for your work
interactive -N 1 --exclusive -t 8:00:00 -A <your_project_account>
This allocates one node exclusively for you for 8 hours. The <your_project_account> string is the SNIC or local project name you want to use. If you have only one project, this can be omitted. The “projinfo” command will give you a list of the projects you belong to. Note that it may take a while to get a node allocated, depending on your priority and the available resources. Your priority is a function of how much of your allocation you have used in the last 30 days, relative to the priority of everybody else in the batch queue.
If you only plan a shorter interactive session, you can use the development nodes of Triolith. They have a wall time limit of only one hour, but are usually less busy and therefore easier to get allocated. This is a good way to do some quick debugging. Allocate one like this:
interactive -N 1 --exclusive -t 1:00:00 -A <your_project_account> --reservation=devel
After you get a node allocated, either launch R on the command line
R
or load the R IDE Rstudio module and launch Rstudio like
module load rstudio/0.97.551
rstudio
Using Rstudio requires you to have either logged in to the login nodes with X forwarding or, better yet, used the VNC solution ThinLinc. A very important flag to “interactive” (and sbatch) is the “-C” flag, which can be used to allocate a “fat” node, i.e. a node with substantially more memory than the baseline 32 GB of Triolith. On Triolith the fat nodes are currently equipped with 128 GB RAM. To get a fat node, add the option “-C fat” to “interactive” or to your batch processing script.
A minimum batch script for running R looks like this:
#!/bin/bash
#SBATCH -N 1
#SBATCH -t 4:00:00
#SBATCH -J jobname
#SBATCH --exclusive
#SBATCH -A SNIC-xxx-yyy
module load R/<desired_version>
R CMD BATCH [options] R_script_name.R
Note that you should edit the job name, account number and desired R version before submitting. The bracketed options are optional and should be removed if you don’t use them. To get a fat node, add the line “#SBATCH -C fat” to the above script.
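As an illustration, here is a sketch of a variant of that script which requests a fat node and lets the threaded linear algebra library use all 16 cores. It writes the job script to a file; the file name r_job.sh is an assumption, and the job name, account and R version are the same placeholders as above and must be edited before use.

```shell
# Write the job script to a file (r_job.sh is a hypothetical name).
cat > r_job.sh <<'EOF'
#!/bin/bash
#SBATCH -N 1
#SBATCH -t 4:00:00
#SBATCH -J jobname
#SBATCH --exclusive
#SBATCH -C fat
#SBATCH -A SNIC-xxx-yyy

# Placeholder version; pick one listed by "module avail R".
module load R/<desired_version>

# Let the threaded BLAS use all 16 cores of the allocated node.
export OMP_NUM_THREADS=16

R CMD BATCH R_script_name.R
EOF

# Submit it with: sbatch r_job.sh
```

Setting OMP_NUM_THREADS to 16 only makes sense when the R script is dominated by linear algebra and does not start its own parallel workers.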
There are some “gotchas” to be aware of:
From version 3.0, the R installations use the threaded versions of Intel MKL for the linear algebra routines. When you load the R module, the environment variable OMP_NUM_THREADS is set to 1 if it was unset at module load time. If you allocate a full node for your work, you may want to set OMP_NUM_THREADS to 16 to make good use of the resources: for bash, run “export OMP_NUM_THREADS=16”, and for csh, run “setenv OMP_NUM_THREADS 16”, before launching R (or Rstudio). As mentioned above, the best value for OMP_NUM_THREADS also depends on what R work you do besides linear algebra. Do not run on the login node with OMP_NUM_THREADS=16!
From version 3.0, the R installations at NSC were built with the Intel compilers, and old packages are unlikely to be compatible. A quick way to recompile your old packages for the new version of R is to launch the old R and run
> my_packages <- as.vector(installed.packages(lib.loc = .libPaths())[,1])
> q("yes")
Now load the new R version module and launch R to do
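The command that followed here is not preserved in this copy of the page. A minimal sketch of what such a step could look like, assuming the old session's q("yes") saved my_packages to .RData in the current directory, is given below. The script name reinstall.R and the CRAN mirror URL are assumptions, not part of the original instructions.

```shell
# Write a small helper script (reinstall.R is a hypothetical name).
cat > reinstall.R <<'EOF'
# Restore the workspace written by q("yes") in the old R session.
# Assumes .RData is in the current directory.
load(".RData")

# Skip packages the new R already provides (base and recommended ones).
todo <- setdiff(my_packages, rownames(installed.packages()))

# The CRAN mirror below is an assumption; use the mirror you prefer.
if (length(todo) > 0) install.packages(todo, repos = "https://cloud.r-project.org")
EOF

# With the new R module loaded, run: Rscript reinstall.R
```

Typing the same lines at the interactive R prompt should work as well, and an interactive session can offer to create a personal library if none exists yet.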
In general, R packages are compatible between bugfix releases but not feature releases, i.e. compatibility can be expected within the Z series in R version X.Y.Z, but not between different X and Y releases.
To share packages within a group, you can create a shared folder under your project storage (e.g. /proj/name) and install the R packages in that folder:
# Create a shared folder
mkdir /proj/name/rlibrary
# Start R
module load R/3.3.2
R
# Install packages in the shared folder
install.packages('igraph', lib='/proj/name/rlibrary')
install.packages('biomaRt', lib='/proj/name/rlibrary')
...
When loading an R package from the shared folder, you then need to tell R where to find the packages, either by putting this near the top of your R scripts:
or by specifying the path while loading the packages:
or by setting the environment variable R_LIBS before running R/Rscript:
env R_LIBS=/proj/name/rlibrary Rscript somescript.R
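The snippets for the first two options are missing from this copy of the page. The sketch below shows one way each option could look, using the standard R functions .libPaths() and library() with its lib.loc argument. The script name use_shared.R is hypothetical, and the path and package names are the examples used above.

```shell
# Options 1 and 2, written into a hypothetical R script.
cat > use_shared.R <<'EOF'
# Option 1: put the shared folder first in the library search path,
# then load packages as usual.
.libPaths(c("/proj/name/rlibrary", .libPaths()))
library(igraph)

# Option 2: name the shared folder when loading a single package.
library(biomaRt, lib.loc = "/proj/name/rlibrary")
EOF

# Option 3: set R_LIBS for the whole shell session rather than per command.
export R_LIBS=/proj/name/rlibrary
# Then run, for example: Rscript somescript.R
```

Exporting R_LIBS affects every R process started from that shell, whereas the two in-script options keep the setting with the script itself.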