AlphaFold is a deep-learning-based protein structure prediction program developed by DeepMind. It uses a neural network to predict the 3D structure of a protein from its amino acid sequence. The first version of AlphaFold was released in 2018 and was considered a breakthrough in protein structure prediction. In 2020, AlphaFold2 won CASP14, a biennial competition that evaluates state-of-the-art methods in protein structure prediction. AlphaFold2 predicted protein structures with remarkable accuracy, which has implications for drug discovery and for understanding diseases at the molecular level.
We specify the paths for the AlphaFold database, the AlphaFold installation and the results.

```bash
ALPHAFOLD_DB=/proj/common-datasets
ALPHAFOLD_DIR=/proj/nsc_testing/xuan/alphafold_2.3.1
ALPHAFOLD_RESULTS=/proj/nsc_testing/xuan/alphafold_results_2.3.1
mkdir -p ${ALPHAFOLD_DB} ${ALPHAFOLD_DIR} ${ALPHAFOLD_RESULTS}
mkdir -p ${ALPHAFOLD_RESULTS}/output ${ALPHAFOLD_RESULTS}/input
```
We have a copy of the AlphaFold database on Berzelius at `/proj/common-datasets`.
The aria2 module is used for downloading the AlphaFold database. We first fetch the AlphaFold source, which provides the download scripts, and then download the database.

```bash
module load aria2/1.36.0-gcc-8.5.0
wget -O /tmp/v2.3.1.tar.gz https://github.com/deepmind/alphafold/archive/refs/tags/v2.3.1.tar.gz
tar -xf /tmp/v2.3.1.tar.gz -C ${ALPHAFOLD_DIR} --strip-components=1
cd ${ALPHAFOLD_DIR}
scripts/download_all_data.sh ${ALPHAFOLD_DB}
```
The test input T1050.fasta can be found on this page. Download and save it to `${ALPHAFOLD_RESULTS}/input`.
Our patch adds two new input arguments to `run_alphafold.py`:

- `--n_parallel_msa`: with `--n_parallel_msa=1` the MSA searches are not parallelized; with `--n_parallel_msa=3` the searches are all parallelized. This new flag has been wrapped as `-P` in the wrapper.
- `--run_feature_only`: with `--run_feature_only=true` only the MSA and template searches are run. This new flag has been wrapped as `-F` in the wrapper.
The patch also provides the flexibility to choose the number of threads used for the MSA searches. Read the Optimization section for more details.
On Berzelius, we have AlphaFold 2.3.1 preinstalled as a module. On a compute node, we load the AlphaFold module.

```bash
module load AlphaFold/2.3.1-hpc1
```
We run an example.

```bash
run_alphafold.sh \
  -d ${ALPHAFOLD_DB} \
  -o ${ALPHAFOLD_RESULTS}/output \
  -f ${ALPHAFOLD_RESULTS}/input/T1050.fasta \
  -t 2021-11-01 \
  -g true \
  -P 3 \
  -F false
```

Please run `run_alphafold.sh -h` to check the usage.
To remove the module:

```bash
module rm AlphaFold/2.3.1-hpc1
```
We first load the Anaconda module.

```bash
module load Anaconda/2021.05-nsc1
```

We create a conda environment from a yml file.

```bash
git clone https://gitlab.liu.se/xuagu37/berzelius-alphafold-guide /tmp/berzelius-alphafold-guide
conda env create -f /tmp/berzelius-alphafold-guide/alphafold_2.3.1.yml
conda activate alphafold_2.3.1
```
To download AlphaFold:

```bash
wget -O /tmp/v2.3.1.tar.gz https://github.com/deepmind/alphafold/archive/refs/tags/v2.3.1.tar.gz
tar -xf /tmp/v2.3.1.tar.gz -C ${ALPHAFOLD_DIR} --strip-components=1
```
To apply the OpenMM patch, from your conda environment's `site-packages` directory:

```bash
cd /home/xuan/.conda/envs/alphafold_2.3.1/lib/python3.8/site-packages/
patch -p0 < ${ALPHAFOLD_DIR}/docker/openmm.patch
```
To download chemical properties:

```bash
wget -q -P ${ALPHAFOLD_DIR}/alphafold/common/ https://git.scicore.unibas.ch/schwede/openstructure/-/raw/7102c63615b64735c4941278d92b554ec94415f8/modules/mol/alg/src/stereo_chemical_props.txt
```
To install the patch:

```bash
git clone https://gitlab.liu.se/xuagu37/berzelius-alphafold-guide /tmp/berzelius-alphafold-guide
cd /tmp && bash berzelius-alphafold-guide/patch/patch_2.3.1.sh ${ALPHAFOLD_DIR}
```
```bash
cd ${ALPHAFOLD_DIR}
bash run_alphafold.sh \
  -d ${ALPHAFOLD_DB} \
  -o ${ALPHAFOLD_RESULTS}/output \
  -f ${ALPHAFOLD_RESULTS}/input/T1050.fasta \
  -t 2021-11-01 \
  -g true \
  -P 3 \
  -F false
```
Please check the input arguments in `run_alphafold.py`. A complete list of input arguments is attached here for reference.
There is an Apptainer image of AlphaFold 2.3.1 at `/software/sse/containers`.
```bash
apptainer exec --nv alphafold_2.3.1.sif bash -c "cd /app/alphafold && bash run_alphafold.sh \
  -d ${ALPHAFOLD_DB} \
  -o ${ALPHAFOLD_RESULTS}/output \
  -f ${ALPHAFOLD_RESULTS}/input/T1050.fasta \
  -t 2021-11-01 \
  -g true \
  -P 3 \
  -F false"
```
The three independent, sequential MSA searches can be arranged in parallel to accelerate the job. Parallelisation is enabled by setting the flag `-P 3`.
Ref 1: AlphaFold PR 399 Parallel execution of MSA tools.
Ref 2: Zhong et al. 2022, ParaFold: Paralleling AlphaFold for Large-Scale Predictions.
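The idea behind `-P 3` can be sketched in plain shell: the three searches are independent processes, so they can be launched in the background and awaited together. The `run_search` function below is a placeholder of our own, not AlphaFold's actual code.

```shell
#!/bin/bash
# Illustration of the idea behind -P 3: the three MSA searches are
# independent, so they can run as concurrent background processes.
# run_search is a placeholder standing in for jackhmmer/hhblits calls.
run_search () {  # $1 = search name, $2 = output file
    echo "result of $1" > "$2"
}

run_search jackhmmer_uniref90 uniref90.sto &
run_search jackhmmer_mgnify   mgnify.sto   &
run_search hhblits_bfd        bfd.a3m      &
wait   # block until all three background searches have finished
```

With real search binaries, the total feature time drops to roughly the duration of the slowest search instead of the sum of all three.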
AlphaFold 2.3.1 uses a default of 8, 8 and 4 threads for the three MSA searches, which is not always optimal. The HHblits search is the most time-consuming, so we can manually assign it more threads. The number of threads used for the three searches can be set in `alphafold/data/pipeline.py` at lines 131 to 134.
For multimer models, the Jackhmmer (uniprot) search starts when the first three searches finish. Its number of threads can be set in `alphafold/data/pipeline_multimer.py` at line 179.
We recommend using `n_cpu = 8, 8, 16, 32` on Berzelius for Jackhmmer (uniref90), Jackhmmer (mgnify), HHblits (bfd) and Jackhmmer (uniprot), respectively.
A flag `--run_feature_only` has been added to separate the CPU and GPU parts. AlphaFold uses GPUs only for the prediction part of the modelling, which can be a small fraction of the total running time; most of the operations (the MSA and template searches) are CPU-based. We therefore strongly suggest running the CPU part on Tetralith and the GPU part on Berzelius.
Each compute node has a local scratch file system for temporary storage while a job is running; the data are deleted when the job finishes. On Berzelius this disk, mounted at `/scratch/local`, provides 15 TB of NVMe SSD storage. We can copy the AlphaFold database to `/scratch/local` at the beginning of a job for better I/O performance. However, in our experiments on Berzelius, copying the database to node-local storage did not significantly reduce the job running time.
To make the best use of the GPU resources on Berzelius, we strongly suggest separating the CPU and GPU parts when running AlphaFold jobs: run the CPU part on Tetralith or on your local computer, and then run the GPU part on Berzelius.
You need to set `-F true` in the command to run the MSA and template searches only.
Also set `-P 1` to run the MSA searches sequentially, since parallelisation on Tetralith gives no improvement due to the I/O bottleneck. The CPU part needs a large amount of memory, so make sure you have requested enough CPU cores. An example sbatch script has been prepared for you here.
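As a rough illustration only (not the linked example script): a CPU-part job could be shaped as below. The job name, core count, walltime and environment setup are assumptions to adjust for your own project.

```shell
#!/bin/bash
#SBATCH -J alphafold_cpu
#SBATCH -n 1
#SBATCH -c 32
#SBATCH -t 24:00:00
# Hypothetical sketch of the CPU-part job on Tetralith. Activate your own
# AlphaFold environment (module or conda) before the run_alphafold.sh call.

run_alphafold.sh \
  -d /proj/common_datasets/alphafold/v2.3.1 \
  -o ${ALPHAFOLD_RESULTS}/output \
  -f ${ALPHAFOLD_RESULTS}/input/T1050.fasta \
  -t 2021-11-01 \
  -g false \
  -F true \
  -P 1
```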
On Tetralith, the AlphaFold database can be found at `/proj/common_datasets/alphafold/v2.3.1`.
Transfer the CPU part results from Tetralith to Berzelius via your local computer.
Run the GPU part job on Berzelius. You need to set `-F false` in the command; the MSA and template searches will then be skipped and the run jumps directly to the predictions.
To achieve better GPU utilisation, you can run several AlphaFold GPU part jobs concurrently. See the example sbatch script in which 5 GPU part jobs are executed concurrently.
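The concurrent pattern can be sketched as follows; `predict` is our placeholder for the real `run_alphafold.sh` invocation, and the target file names are made up.

```shell
#!/bin/bash
# Sketch of running five GPU-part jobs concurrently in one allocation.
# predict is a placeholder standing in for run_alphafold.sh.
predict () {  # $1 = fasta file
    # The real call would be roughly:
    #   run_alphafold.sh -d ${ALPHAFOLD_DB} -o ${ALPHAFOLD_RESULTS}/output \
    #       -f "$1" -t 2021-11-01 -g true -F false
    echo "prediction for $1" > "${1%.fasta}.done"
}

for fasta in target1.fasta target2.fasta target3.fasta target4.fasta target5.fasta; do
    predict "$fasta" &   # launch each prediction in the background
done
wait   # wait for all five concurrent predictions to finish
```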