AlphaFold is a deep-learning-based protein structure prediction program developed by DeepMind. It uses a neural network to predict the 3D structure of a protein from its amino acid sequence. The first version of AlphaFold was released in 2018 and was considered a breakthrough in protein structure prediction. In 2020, AlphaFold2 won CASP14, a biennial competition that evaluates state-of-the-art methods in protein structure prediction. AlphaFold2 predicted protein structures with remarkable accuracy, which has implications for drug discovery and for understanding diseases at the molecular level.
We specify the paths for the AlphaFold database, the AlphaFold installation and the results.

```bash
ALPHAFOLD_DB=/proj/common-datasets
ALPHAFOLD_DIR=/proj/nsc_testing/xuan/alphafold_2.3.1
ALPHAFOLD_RESULTS=/proj/nsc_testing/xuan/alphafold_results_2.3.1
mkdir -p ${ALPHAFOLD_DB} ${ALPHAFOLD_DIR} ${ALPHAFOLD_RESULTS}
mkdir -p ${ALPHAFOLD_RESULTS}/output ${ALPHAFOLD_RESULTS}/input
```
We have a copy of the AlphaFold database on Berzelius at `/proj/common-datasets`.
The aria2 module is used for downloading the AlphaFold database. We first fetch the AlphaFold source, which provides the download scripts, and then download the database.

```bash
module load aria2/1.36.0-gcc-8.5.0
wget -O /tmp/v2.3.1.tar.gz https://github.com/deepmind/alphafold/archive/refs/tags/v2.3.1.tar.gz
tar -xf /tmp/v2.3.1.tar.gz -C ${ALPHAFOLD_DIR} --strip-components=1
cd ${ALPHAFOLD_DIR}
scripts/download_all_data.sh ${ALPHAFOLD_DB}
```
The test input T1050.fasta can be found on this page. Download and save it to `${ALPHAFOLD_RESULTS}/input`.
Our patch adds two new input arguments to `run_alphafold.py`:

- `--n_parallel_msa`: with `--n_parallel_msa=1` the MSA searches are not parallelized; with `--n_parallel_msa=3` the searches are all parallelized. This new flag has been wrapped as `-P` in the wrapper.
- `--run_feature_only`: with `--run_feature_only=true` only the MSA and template searches are run. This new flag has been wrapped as `-F` in the wrapper.
The patch also provides the flexibility to choose the number of threads used for the MSA searches. Read the Optimization section for more details.
On Berzelius, we have AlphaFold 2.3.1 preinstalled as a module. On a compute node, we load the AlphaFold module.

```bash
module load AlphaFold/2.3.1-hpc1
```
We run an example.

```bash
run_alphafold.sh \
  -d ${ALPHAFOLD_DB} \
  -o ${ALPHAFOLD_RESULTS}/output \
  -f ${ALPHAFOLD_RESULTS}/input/T1050.fasta \
  -t 2021-11-01 \
  -g true \
  -P 3 \
  -F false
```

Please run `run_alphafold.sh -h` to check the usage.
To remove the module:

```bash
module rm AlphaFold/2.3.1-hpc1
```
We first load the Anaconda module.

```bash
module load Anaconda/2021.05-nsc1
```

We create a conda environment from a yml file.

```bash
git clone https://gitlab.liu.se/xuagu37/berzelius-alphafold-guide /tmp/berzelius-alphafold-guide
conda env create -f /tmp/berzelius-alphafold-guide/alphafold_2.3.1.yml
conda activate alphafold_2.3.1
```
To download AlphaFold:

```bash
wget -O /tmp/v2.3.1.tar.gz https://github.com/deepmind/alphafold/archive/refs/tags/v2.3.1.tar.gz
tar -xf /tmp/v2.3.1.tar.gz -C ${ALPHAFOLD_DIR} --strip-components=1
```
To apply the OpenMM patch, from your conda environment's `site-packages` directory:

```bash
cd /home/xuan/.conda/envs/alphafold_2.3.1/lib/python3.8/site-packages/
patch -p0 < ${ALPHAFOLD_DIR}/docker/openmm.patch
```
To download chemical properties:

```bash
wget -q -P ${ALPHAFOLD_DIR}/alphafold/common/ https://git.scicore.unibas.ch/schwede/openstructure/-/raw/7102c63615b64735c4941278d92b554ec94415f8/modules/mol/alg/src/stereo_chemical_props.txt
```
To install the patch:

```bash
git clone https://gitlab.liu.se/xuagu37/berzelius-alphafold-guide /tmp/berzelius-alphafold-guide
cd /tmp && bash berzelius-alphafold-guide/patch/patch_2.3.1.sh ${ALPHAFOLD_DIR}
```
```bash
cd ${ALPHAFOLD_DIR}
bash run_alphafold.sh \
  -d ${ALPHAFOLD_DB} \
  -o ${ALPHAFOLD_RESULTS}/output \
  -f ${ALPHAFOLD_RESULTS}/input/T1050.fasta \
  -t 2021-11-01 \
  -g true \
  -P 3 \
  -F false
```
Please check the input arguments in `run_alphafold.py`. A complete list of input arguments is attached here for reference.
There is an Apptainer image of AlphaFold 2.3.1 at `/software/sse/containers`.
```bash
apptainer exec --nv alphafold_2.3.1.sif bash -c "cd /app/alphafold && bash run_alphafold.sh \
  -d ${ALPHAFOLD_DB} \
  -o ${ALPHAFOLD_RESULTS}/output \
  -f ${ALPHAFOLD_RESULTS}/input/T1050.fasta \
  -t 2021-11-01 \
  -g true \
  -P 3 \
  -F false"
```
The three independent, sequential MSA searches can be arranged in parallel to accelerate the job. Parallelisation is enabled by setting the flag `-P 3`.
Ref 1: AlphaFold PR 399 Parallel execution of MSA tools.
Ref 2: Zhong et al. 2022, ParaFold: Paralleling AlphaFold for Large-Scale Predictions.
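The idea behind `-P 3` can be sketched in plain shell: the three searches are independent processes, so they can be launched in the background and awaited together. The `run_search` function below is a placeholder of our own, not AlphaFold's actual code.

```shell
#!/bin/bash
# Illustration of the idea behind -P 3: the three MSA searches are
# independent, so they can run as concurrent background processes.
# run_search is a placeholder standing in for jackhmmer/hhblits calls.
run_search () {  # $1 = search name, $2 = output file
    echo "result of $1" > "$2"
}

run_search jackhmmer_uniref90 uniref90.sto &
run_search jackhmmer_mgnify   mgnify.sto   &
run_search hhblits_bfd        bfd.a3m      &
wait   # block until all three background searches have finished
```

With real search binaries, the total feature time drops to roughly the duration of the slowest search instead of the sum of all three.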
AlphaFold 2.3.1 uses a default of 8, 8 and 4 threads for the three MSA searches, which is not always optimal. The HHblits search is the most time-consuming, so we can manually assign it more threads. The number of threads used for the three searches can be set in `alphafold/data/pipeline.py` at lines 131 to 134.
For multimer models, the Jackhmmer (uniprot) search starts when the first three searches finish. Its number of threads can be set in `alphafold/data/pipeline_multimer.py` at line 179.
We recommend using `n_cpu = 8, 8, 16, 32` on Berzelius for Jackhmmer (uniref90), Jackhmmer (mgnify), HHblits (bfd) and Jackhmmer (uniprot), respectively.
A flag `--run_feature_only` has been added to separate the CPU and GPU parts. AlphaFold uses GPUs only for the prediction part of the modelling, which can be a small fraction of the total running time; most of the operations (the MSA and template searches) are CPU-based. We therefore strongly suggest running the CPU part on Tetralith and the GPU part on Berzelius.
Each compute node has a local scratch file system for temporary storage while a job is running; the data are deleted when the job finishes. On Berzelius this disk, mounted at `/scratch/local`, provides 15 TB of NVMe SSD storage. We can copy the AlphaFold database to `/scratch/local` at the beginning of a job for better I/O performance. However, in our experiments on Berzelius, copying the database to node-local storage did not significantly reduce the job running time.
To make the best use of the GPU resources on Berzelius, we strongly suggest separating the CPU and GPU parts when running AlphaFold jobs: run the CPU part on Tetralith or on your local computer, and then run the GPU part on Berzelius.
You need to set `-F true` in the command to run the MSA and template searches only.
Also set `-P 1` to run the MSA searches sequentially, since parallelisation on Tetralith gives no improvement due to the I/O bottleneck. The CPU part needs a large amount of memory, so make sure you have requested enough CPU cores. An example sbatch script has been prepared for you here.
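As a rough illustration only (not the linked example script): a CPU-part job could be shaped as below. The job name, core count, walltime and environment setup are assumptions to adjust for your own project.

```shell
#!/bin/bash
#SBATCH -J alphafold_cpu
#SBATCH -n 1
#SBATCH -c 32
#SBATCH -t 24:00:00
# Hypothetical sketch of the CPU-part job on Tetralith. Activate your own
# AlphaFold environment (module or conda) before the run_alphafold.sh call.

run_alphafold.sh \
  -d /proj/common_datasets/alphafold/v2.3.1 \
  -o ${ALPHAFOLD_RESULTS}/output \
  -f ${ALPHAFOLD_RESULTS}/input/T1050.fasta \
  -t 2021-11-01 \
  -g false \
  -F true \
  -P 1
```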
On Tetralith, the AlphaFold database can be found at `/proj/common_datasets/alphafold/v2.3.1`.
Transfer the CPU part results from Tetralith to Berzelius via your local computer.
Run the GPU part job on Berzelius. You need to set `-F false` in the command; the MSA and template searches will then be skipped and the run jumps directly to the predictions.
To achieve better GPU utilisation, you can run several AlphaFold GPU part jobs concurrently. See the example sbatch script in which 5 GPU part jobs are executed concurrently.
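The concurrent pattern can be sketched as follows; `predict` is our placeholder for the real `run_alphafold.sh` invocation, and the target file names are made up.

```shell
#!/bin/bash
# Sketch of running five GPU-part jobs concurrently in one allocation.
# predict is a placeholder standing in for run_alphafold.sh.
predict () {  # $1 = fasta file
    # The real call would be roughly:
    #   run_alphafold.sh -d ${ALPHAFOLD_DB} -o ${ALPHAFOLD_RESULTS}/output \
    #       -f "$1" -t 2021-11-01 -g true -F false
    echo "prediction for $1" > "${1%.fasta}.done"
}

for fasta in target1.fasta target2.fasta target3.fasta target4.fasta target5.fasta; do
    predict "$fasta" &   # launch each prediction in the background
done
wait   # wait for all five concurrent predictions to finish
```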