This article documents the use of the Anaconda data science platform at NSC and some related concepts, e.g., the conda command, conda-forge, mamba, etc. It includes advanced topics related to integrating compiled and Anaconda-provided software. More specific help on using Python at NSC (via Anaconda or other ways) is found at the NSC software page for Python.
The Anaconda data science platform is maintained by Anaconda Inc. It provides a way to distribute and run software primarily for scientific computing and data analysis. The emphasis is on Python and R with various supporting modules. However, much other software is also covered (in particular, software and libraries that the Python and R modules depend on). In practice, Anaconda works similarly to other container technologies: one can run a program (e.g., written in Python, R, a binary, etc.) in an Anaconda runtime environment - or "conda environment" for short - created from Anaconda packages that provide precise versions of a set of supporting software and libraries. These environments help ensure reproducible behavior across different systems.
The container-like nature of conda environments poses challenges for HPC clusters. Some software and libraries need to interact properly with supercomputing hardware to avoid breakage or degraded performance. Furthermore, the Anaconda-provided versions of some system programs may behave differently from those provided by NSC's systems, which can lead to unexpected behavior that is difficult to diagnose.
Conda-forge is a community-driven library of packages that work with the Anaconda system. These packages provide a rich library of additional software beyond the packages maintained by Anaconda Inc. However, since these are provided by the community of users, they may have undergone less testing and security review than packages provided by Anaconda Inc.
Mamba is an alternative open-source implementation of the conda tool used to set up and maintain conda environments. It is meant to address performance issues with the standard conda tool.
Check the availability of NSC's Anaconda modules using the module avail command:
$ module avail Anaconda
...
Anaconda/2021.05-nsc1   Anaconda/2022.05-nsc1
...
The default location for conda environment installations is in
~/.conda in your home directory. This location can be problematic since these environments can become very large. Therefore, it is suggested to redirect this directory using a symbolic link to a project space. For example if the
~/.conda directory already exists:
$ mv ~/.conda /proj/ourprojname/users/x_abcde/conda
$ ln -s /proj/ourprojname/users/x_abcde/conda ~/.conda
(If you do not already have a
~/.conda directory, just skip the first command.)
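The relocation can also be wrapped in a small helper that refuses to act when the directory is already a symbolic link or when the target already exists. This is only a sketch: the function name relocate_conda_dir is made up for illustration, and the project path in the usage note is the same placeholder as above.

```shell
#!/bin/bash
# Hypothetical helper (not an NSC-provided tool): move a conda directory to
# project storage and leave a symbolic link in its place.
relocate_conda_dir() {
    local src="$1"    # e.g. ~/.conda
    local dest="$2"   # e.g. a directory in your project space (placeholder)

    if [ -L "$src" ]; then
        echo "$src is already a symbolic link; nothing to do"
        return 0
    fi
    if [ -e "$dest" ]; then
        echo "$dest already exists; refusing to overwrite it" >&2
        return 1
    fi
    if [ -d "$src" ]; then
        mv "$src" "$dest"     # existing environments move to project storage
    else
        mkdir -p "$dest"      # no conda directory yet: just create the target
    fi
    ln -s "$dest" "$src"
}
```

It would be called as relocate_conda_dir ~/.conda /proj/ourprojname/users/x_abcde/conda, and running it a second time is harmless.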
Loading an NSC Anaconda module only gives you access to the
conda command (i.e., it does not alter your environment to enable a "base" conda environment). After loading an Anaconda module, you can issue
conda create to create a customized Python environment with exactly the packages (and versions) you need. A basic example for creating a conda environment called
myenv with Python 3.8 with the pandas and seaborn packages:
$ module load Anaconda/2022.05-nsc1
$ conda create -n myenv python=3.8 pandas seaborn
$ conda activate myenv
Now the command python will refer to the python installed in this environment and provide access to the packages installed there, e.g.:
$ which python
~/.conda/envs/myenv/bin/python
$ python
Python 3.8.13 | packaged by conda-forge | (default, Mar 25 2022, 06:04:18)
[GCC 10.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas
>>> pandas.__version__
'1.4.2'
You can, of course, run a python program using this installed python as usual, e.g.:
$ python my_pandas_python_program.py
You do not have to create the environment again when you log in next time. You simply reactivate the same environment using:
$ module load Anaconda/2022.05-nsc1
$ conda activate myenv
and then, e.g., run a Python program in this environment:
$ python my_scipy_python_program.py
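In batch job scripts it can be useful to stop early if the expected environment is not active. The following sketch relies on the CONDA_DEFAULT_ENV variable, which conda activate sets to the name of the active environment; the function name require_conda_env is hypothetical.

```shell
#!/bin/bash
# Hypothetical guard: succeed only if the named conda environment is active.
# "conda activate" exports CONDA_DEFAULT_ENV with the environment's name.
require_conda_env() {
    local wanted="$1"
    if [ "${CONDA_DEFAULT_ENV:-}" != "$wanted" ]; then
        echo "expected conda environment '$wanted', found '${CONDA_DEFAULT_ENV:-none}'" >&2
        return 1
    fi
}
```

A job script could then use, for example, require_conda_env myenv && python my_scipy_python_program.py.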
You can see what environments you have created by:
$ conda env list
Once you have activated an environment, you can modify it, e.g., by installing additional packages:
$ conda install cython
However, the dependency resolution for a request like this can sometimes lead to unforeseen and destructive changes to the environment, even going as far as removing packages or downgrading the Python version. Hence, we recommend that environments intended for use "in production" are not altered this way but instead recreated with all desired software constraints specified in a single go using a (possibly very long) conda create command, or by listing the requirements in a file and running conda env create -f environment.yml. Furthermore, you may want to try the mamba alternative to conda (see the section Mamba and Mambaforge), which tends to be more reliable and faster, and to give clearer output when figuring out how to alter an environment.
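As a sketch of the file-based approach, the myenv example from earlier could be described in an environment.yml file as follows. The contents are illustrative only: the package list simply mirrors that example, and no channels are listed, so conda would use whichever channels are configured.

```shell
# Write an illustrative environment.yml (contents mirror the myenv example).
cat > environment.yml <<'EOF'
name: myenv
dependencies:
  - python=3.8
  - pandas
  - seaborn
EOF

# Then, with an Anaconda module loaded, create the environment from the file:
#   conda env create -f environment.yml
```

Keeping this file under version control gives a record of the constraints the environment was created with, which makes it easier to recreate the environment instead of modifying it in place.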
To find packages to install, you can use
conda search, which lists all available versions of matching packages, e.g.:
$ conda search sympy
Loading channels: done
# Name                       Version           Build  Channel
sympy                          1.1.1          py27_0  pkgs/main
sympy                          1.1.1  py27hc28188a_0  pkgs/main
[...]
Alternatively, there is also an online list of packages available.
To list the packages installed in an activated environment use
conda list. You can check for the presence of a specific package using grep:
$ conda list | grep -i scipy
scipy                     1.8.0            py38h56a6a73_1    conda-forge
If you find yourself doing much work in the same conda environment, activating that environment automatically on every login may seem attractive. However, NSC strongly recommends against this: having every login run inside a conda environment may have far-reaching side effects that can be difficult to diagnose. It is fine, though, to load an appropriate Anaconda module automatically (without also activating an environment). You can do so by adding the following lines to your .bashrc:
export NSC_MODULE_SILENT=1
module load Anaconda/2021.05-nsc1
unset NSC_MODULE_SILENT
Here, the steps with
NSC_MODULE_SILENT hide the verbose output when loading the module. With this in your
.bashrc you only have to issue
conda activate <name> to activate your desired environment after logging in.
The Conda-forge community-driven library of packages greatly extends the software available for installation with
conda. Since these are provided by the community of users they may have undergone less testing and security review than packages provided by Anaconda Inc.
To instruct the conda command to also locate packages in the conda-forge repository, add the flag -c conda-forge to the install or create commands, e.g., to install the package ase:
$ conda install -c conda-forge ase
You can also add
conda-forge to the channels that are automatically considered for all installations in an active environment by:
$ conda config --add channels conda-forge
In this setup, it is generally recommended also to turn on "strict channel priority," which alters the behavior of conda so that packages available from multiple sources are only considered from the first source where they are available. This behavior avoids mixing packages from multiple sources in ways that may result in unintended behavior. To activate this setting, do:
$ conda config --set channel_priority strict
Mamba is a drop-in alternative to the conda command; one of its central aims is to address performance issues with the standard conda. The mamba command is available via the Mambaforge miniforge distribution, with which users can create, alter and activate conda environments. For example, to set up a customized environment and run a Python program my_scipy_python_program.py in it:
$ module load Mambaforge/4.12.0-0-nsc1
$ mamba create -n myenv python=3.8 scipy=1.5.2
$ mamba activate myenv
$ python my_scipy_python_program.py
To run the python program in the same environment when logging in the next time:
$ module load Mambaforge/4.12.0-0-nsc1
$ mamba activate myenv
$ python my_scipy_python_program.py
mamba works interchangeably with exactly the same conda environment directories as the conda command from the Anaconda modules. Hence, it is generally safe to swap between loading Anaconda and Mambaforge modules, working with the same conda environments.
Mambaforge modules have conda-forge as the default channel for packages. Unless configured to do so, they will not install the packages provided by Anaconda Inc.
You should be able to replace more or less any occurrence of conda in this document with mamba.
As mentioned above, software using MPI for parallelization provides additional challenges since it must interact correctly with NSC hardware. There are two main strategies to get Anaconda-installed MPI software to work at NSC:
1. Use conda to install both the MPI software and MPI-supporting libraries that are compatible with NSC hardware.
2. Install placeholder ("external") MPI packages to satisfy the conda dependencies, making the conda software use the usual NSC-provided binaries and libraries.
Alternative 1 tends to happen automatically if one installs MPI software without carefully considering the dependencies. However, there is a risk that one ends up with a conda-installed MPI configuration and libraries that are incompatible with the NSC setup or that give degraded MPI communication performance. Nevertheless, such degraded performance may be acceptable for software where the performance of the inter-process communication is not critical. This situation appears somewhat common for Python programs using MPI via, e.g., the mpi4py Python module.
The chance of alternative 1 working well is improved by asking conda to install specific MPI packages compatible with the versions provided by NSC: either a carefully selected version of OpenMPI or, possibly easier, a version of MPICH compatible with the Intel MPI provided by NSC via the MPICH ABI compatibility. A good choice on Tetralith/Sigma is the 4.<something> version series of OpenMPI and MPICH version 3.3.2.
For alternative 2 one instead asks
conda to install packages named as
mpich=<something>=external_*, corresponding as closely as possible to the versions available at NSC. These packages do not install any MPI binaries or libraries into the conda environment.
However, for this to work, one also needs to provide a way for the conda-installed software to find the NSC MPI libraries. The most direct way to do so is to manually create a symbolic link to the appropriate NSC MPI library into the lib directory of the conda environment. For example:
$ conda install "openmpi=4.1.4=external_*" -c conda-forge
$ ln -s /software/sse/easybuild/prefix/software/OpenMPI/4.0.5-GCC-10.2.0/lib/libmpi.so.40 ~/.conda/envs/myenv/lib/.
$ conda install "mpich=3.4.3=external_*" -c conda-forge
$ ln -s /software/sse/easybuild/prefix/software/impi/2018.1.163-GCC-6.4.0-2.28/intel64/lib/release/libmpi.so.12 ~/.conda/envs/myenv/lib/.
Depending on the details of how the software integrates with MPI, this may or may not work. If you get errors referring to "missing symbols", etc., feel free to contact NSC support for help. Alternatively, it may be easier to get software with tricky dependencies on MPI to work by following the below instructions for building a pip-provided version of the software instead of installing a pre-built conda version.
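One way to see which MPI library a conda-installed binary or extension module will pick up is to inspect its resolved dynamic dependencies with ldd. The following is a sketch; the helper name resolved_mpi_libs is hypothetical, and the file you pass in depends on the software you installed.

```shell
#!/bin/bash
# Hypothetical helper: list the MPI libraries that the dynamic linker
# resolves for a given executable or shared object.
resolved_mpi_libs() {
    ldd "$1" 2>/dev/null | grep -i 'libmpi' \
        || echo "no libmpi dependency found for $1"
}
```

With an environment active and, for example, mpi4py installed, one could pass the path of its compiled MPI extension module under "$CONDA_PREFIX"/lib. Note that ldd reports the path where the library was found; if that path lies inside the environment's lib directory, readlink -f on it shows whether it is a symbolic link to an NSC-provided library.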
If you cannot locate a package for a particular software via conda (either from the Anaconda Inc. default channel or conda-forge), it may be possible to use an alternative Python-oriented software package manager, pip. The pip command is installable as an Anaconda package. Packages installed via this pip command are placed into your active conda environment and will thus not affect your Python environment when the environment is not activated. (Hence, never use the --user flag to pip inside a conda environment since that flag will override this useful behavior.)
Some packages that can be installed with
pip require compilation (i.e., they are not pure Python). Such installations introduce extra complications that are addressed in the sections below.
The pip and conda commands will try to independently maintain their own sets of dependencies, which can lead to major issues with version conflicts. A good strategy is to add all packages you need with conda first (preferably in a single go with one conda create command) and then only add packages with pip. Avoid going back and forth between conda and pip. For pip packages that do not require compilation: only install via pip if you cannot find the corresponding package in conda. (For packages that require compilation, see below for other constraints to consider.)
To install a package with
pip, first make sure you have the Anaconda version of
pip in your environment, then use the PyPI search function to find the package name and install the package:
$ conda install pip
$ pip install python-hostlist
There are two main options for compiling software in relation to Anaconda: either use the NSC-provided compilers or the compilers provided by the conda-forge package compilers.
In general, a binary executable and all its library dependencies (linked or dynamically loaded) should use a single compiler; one may otherwise see "missing symbols", version conflicts, or other errors. One may thus run into such issues if the software is built using the NSC compilers if it links to, dynamically loads, or is dynamically loaded by conda-provided software provided pre-built in binary format.
In particular, interpreters, e.g., Python, R, Perl, Octave, etc., have some modules/packages that dynamically load compiled binary libraries. Hence, if, for example, the
PyYAML Python package is built using NSC compilers, one may encounter problems if it is imported into a conda-provided
python. Likewise, a scientific software package downloaded from GitHub and compiled with NSC compilers inside a conda environment, linked with the conda-provided
libxml2 library, may also break. On the other hand, there is no issue if the scientific software instead executes the conda-provided
python as a separate process and runs a script using the conda-provided
libxml2 Python bindings.
Using the NSC compilers is the recommended way to build software integrating with other software in the conda environment by executing binaries (i.e., not via linked or dynamically loaded libraries). The following example shows how to set up a conda environment and use NSC-provided compilers to build a C program with source code in my_example_program.c that expects to be able to execute python with scipy available:
$ module load Anaconda/2022.05-nsc1
$ conda create -n example_env python=3 numpy scipy
$ conda activate example_env
$ module load buildenv-intel/2018a-eb
$ icc my_example_program.c -o my_example_program
If the program being compiled integrates more tightly with conda-installed libraries (e.g., via linking or dynamic loading) or you are building, e.g., a Python module to be dynamically loaded by the conda-provided
python, the recommended strategy is to install and use the conda-provided compilers. Here is an example of how to set up a conda environment and build
my_example_program.c using compilers provided by the compilers package from conda-forge:
$ module load Anaconda/2022.05-nsc1
$ conda create -n example_buildenv -c conda-forge python=3 compilers
$ conda activate example_buildenv
$ PATH="$CONDA_PREFIX/bin:/usr/local/bin:/usr/bin" cc my_example_program.c -o my_example_program
In the last line, we modify the
PATH environment variable just while executing
cc to circumvent helper scripts in the NSC environment meant to aid compilation with the NSC-provided compilers.
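The VAR=value command form changes the variable only for that one command; the PATH of the shell itself is left untouched afterwards. A minimal demonstration of this scoping follows. The directory /opt/example/bin is a made-up placeholder, and the example prepends to PATH rather than replacing it so that it runs on any system.

```shell
#!/bin/bash
before="$PATH"

# The assignment prefix applies only to the environment of this one command.
inside="$(PATH="/opt/example/bin:$PATH" sh -c 'printf %s "$PATH"')"

after="$PATH"

echo "first PATH entry seen by the command: ${inside%%:*}"
[ "$before" = "$after" ] && echo "shell PATH unchanged"
```

This is why the compile line can bypass the NSC helper scripts without affecting later commands in the same session.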
Some packages that can be installed via pip require compilation. It is possible to instruct pip to use NSC-provided compiler commands for such builds. For example, the following instructions set up a conda environment and use the NSC compilers to build the PyYAML Python module using pip:
$ module load Anaconda/2022.05-nsc1
$ conda create -n yaml_env -c conda-forge python=3 pip
$ conda activate yaml_env
$ module load buildenv-intel/2018a-eb
$ CC=icc CXX=icpc pip install pyyaml
The above can work for relatively simple pip packages (and at the time of writing this, it seems to work for pyyaml). Nevertheless, as has been discussed above, this build may lead to problems (see Mixing conda and software compiled from source code or via pip) since we end up with a binary YAML library built with the NSC compilers that will be dynamically loaded into the conda-provided Python.
Hence, similar to when compiling software from source, the recommended strategy is to instead install and use the conda-provided compilers. Here is an example of how to do so to build the
pyyaml pip package:
$ module load Anaconda/2022.05-nsc1
$ conda create -n yaml_buildenv -c conda-forge python=3 pip compilers
$ conda activate yaml_buildenv
$ PATH="$CONDA_PREFIX/bin:/usr/local/bin:/usr/bin" pip install pyyaml --no-cache-dir --global-option=build_ext --global-option='--rpath=$ORIGIN/../../..'
In the last line, we modify the PATH environment variable just while executing pip install pyyaml to circumvent helper scripts in the NSC environment meant to aid compilation with NSC-provided compilers. We also add --no-cache-dir and the two --global-option parameters to ensure the rpath feature is set up for the compiled library consistently with how other conda-installable software is built. The rpath value is single-quoted so that the shell passes $ORIGIN through literally instead of expanding it as a (normally unset) shell variable.
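Because $ORIGIN is a token interpreted by the dynamic linker at run time, it has to reach the build command literally, which in a POSIX shell means single quotes or an escaped dollar sign. The following sketch only shows how the shell treats the two quoting styles; it does not invoke pip.

```shell
#!/bin/bash
# ORIGIN is normally not defined as a shell variable; an empty value is
# assigned here so the demonstration behaves the same everywhere.
ORIGIN=""

double="--rpath=$ORIGIN/../../.."    # expanded by the shell: $ORIGIN vanishes
single='--rpath=$ORIGIN/../../..'    # passed through literally

echo "double quotes: $double"
echo "single quotes: $single"
```

After a build, running readelf -d on the compiled shared object and looking for the RPATH or RUNPATH entry is one way to confirm that the literal $ORIGIN token ended up in the library.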
Software that simultaneously integrates with conda-installed packages and uses MPI for parallelization adds another layer of complexity. This situation is addressed in the next section.
Just as for non-MPI software, if the software that is being compiled only interacts with software in the conda environment by invoking binaries, the recommended way to build it is to use the NSC toolchains. For example, let's consider a program using MPI, which is built using the
make command and, when run, invokes
python as a binary expecting
scipy to be available. This software can be built with the following steps:
$ module load Anaconda/2022.05-nsc1
$ conda create -n example_mpi_env python=3 numpy scipy
$ conda activate example_mpi_env
$ module load buildenv-intel/2018a-eb
$ CC=mpicc make
Here, we assume that setting the environment variable CC to mpicc is how to tell the Makefile to build the software using that compiler (other software may use other means of configuration).
As a counter-example, the following tries to create a conda environment in which to build and install the asap3 pip package, compiling it using an NSC toolchain:
$ module load Anaconda/2022.05-nsc1
$ conda create -n asap3_env -c conda-forge python=3 numpy ase
$ conda activate asap3_env
$ module load buildenv-intel/2018a-eb
$ CC=mpiicc CXX=mpiicpc pip install asap3
However, the resulting
asap3 library does not work:
$ python3 ./md.py
...
ImportError: /software/sse/easybuild/prefix/software/GCCcore/6.4.0/lib64/libstdc++.so.6: version `GLIBCXX_3.4.26' not found (required by /home/rar/.conda/envs/asap3_env/lib/python3.11/site-packages/scipy/linalg/_matfuncs_sqrtm_triu.cpython-311-x86_64-linux-gnu.so)
As discussed in the previous sections, asap3 provides a library to be loaded dynamically into the python provided by conda. In this case, this leads to a mismatch between versions of the C++ standard library (libstdc++): the NSC-provided libstdc++ picked up at run time is older than the one the conda-provided SciPy was built against. However, even if this clash had not occurred, asap3 would have to interact with ase, which also comes with MPI support but is compiled using a different set of MPI libraries, which would likely have caused further issues.
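To diagnose an error like the one above, it can help to check which copies of libstdc++ actually contain the requested symbol version. The following is a minimal sketch that uses only grep on the library file. The helper name provides_symbol_version is hypothetical, and since it is a plain substring search it should be treated as a heuristic.

```shell
#!/bin/bash
# Hypothetical helper: report whether a shared library file contains a given
# symbol version string such as GLIBCXX_3.4.26.
provides_symbol_version() {
    local lib="$1" version="$2"
    # -a treats the binary as text, -F matches the string literally,
    # -q suppresses output (only the exit status is used).
    if grep -aqF "$version" "$lib"; then
        echo "$lib provides $version"
    else
        echo "$lib lacks $version"
        return 1
    fi
}
```

For the error shown, one would compare the library named in the message with the copy that a conda environment typically carries in "$CONDA_PREFIX/lib/libstdc++.so.6", passing GLIBCXX_3.4.26 as the second argument in both cases.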
Two strategies for resolving this issue are explored in the subsections below.
The most straightforward strategy to build MPI software that integrates tightly with conda-provided packages is to combine the conda-provided compilers in the compilers package with the MPI libraries provided by NSC. Let's consider a program using MPI, which is built using the make command and links to the libnetcdf library provided by the conda environment. This software can be built with the following steps:
$ module load Anaconda/2022.05-nsc1
$ conda create -n example_mpi_env -c conda-forge python=3 compilers libnetcdf
$ conda activate example_mpi_env
$ module load buildenv-intel/2018a-eb
$ PATH="$CONDA_PREFIX/bin:/usr/local/bin:/usr/bin:/software/sse/easybuild/prefix/software/impi/2018.1.163-iccifort-2018.1.163-GCC-6.4.0-2.28/bin64" CC=mpigcc CXX=mpigxx I_MPI_CC="cc" I_MPI_CXX="c++" make
Here, we assume that the Makefile is set up to use the environment variables CC and CXX to build the software using those compilers (other software may use other means of configuration). On the line executing the make command, we modify the PATH environment variable to circumvent helper scripts in the NSC environment meant to aid compilation with NSC-provided compilers, and to include the directory holding the Intel MPI compiler wrappers. We also set CC and CXX to point at specific versions of the Intel MPI wrappers (mpigcc and mpigxx) suitable for combination with the conda-provided GNU cc and c++ compilers, which are selected via the I_MPI_CC and I_MPI_CXX environment variables.
Similarly, to build the
asap3 pip package, the steps would be as follows:
$ module load Anaconda/2022.05-nsc1
$ conda create -n buildenv_mpi -c conda-forge python=3 "mpich=3.4.3=external_*" compilers numpy ase
$ conda activate buildenv_mpi
$ module load buildenv-intel/2018a-eb
$ PATH="$CONDA_PREFIX/bin:/usr/local/bin:/usr/bin:/software/sse/easybuild/prefix/software/impi/2018.1.163-iccifort-2018.1.163-GCC-6.4.0-2.28/bin64" CC=mpigcc CXX=mpigxx I_MPI_CC="cc" I_MPI_CXX="c++" pip install asap3 --no-cache-dir --global-option=build_ext --global-option='--rpath=$ORIGIN/../../..'
As an alternative to the above strategy, it is possible to set up a conda environment with the
compilers package and a conda-provided MPI package that is sufficiently compatible with the NSC MPI setup. The following steps build
asap3 this way:
$ module load Anaconda/2022.05-nsc1
$ conda create -n buildenv_mpi -c conda-forge python=3 "mpich=3.4.3" compilers numpy ase
$ conda activate buildenv_mpi
$ PATH="$CONDA_PREFIX/bin:/usr/local/bin:/usr/bin" CC=mpicc CXX=mpic++ MPICH_CC="" MPICH_CXX="" pip install asap3 --no-cache-dir --global-option=build_ext --global-option='--rpath=$ORIGIN/../../..'
Here, on the line executing the build command pip install asap3, we modify the PATH environment variable to circumvent helper scripts in the NSC environment meant to aid compilation with NSC-provided compilers. We set CC and CXX to tell pip to use the MPI-wrapped versions of the C and C++ compilers, i.e., mpicc and mpic++. Finally, we set MPICH_CC and MPICH_CXX to empty values to ensure these wrappers use the standard (i.e., the conda-provided) C and C++ compilers.
The above build and environment often work due to the MPICH ABI compatibility with Intel MPI.
When the above build is completed with the conda-provided compilers and MPI, one may want to explore replacing the conda-provided MPICH MPI package with one referencing the external system MPI software in an attempt to end up with the type of environment discussed as recommended in Conda packages and MPI, which could potentially improve MPI performance. The steps to do so are as follows:
$ conda install -c conda-forge "mpich=3.4.3=external_*"
$ ln -s /software/sse/easybuild/prefix/software/impi/2018.1.163-GCC-6.4.0-2.28/intel64/lib/release/libmpi.so.12 "$CONDA_PREFIX/lib/."
However, depending on the details of how the software integrates with the MPI library, this may not work. In the case of
asap3, this strategy does not work, and trying to start a parallelized MPI run now results in a runtime "No MPI" error.