Berzelius Common Datasets

To avoid data duplication and save disk space, we provide access to a selection of public datasets frequently used in AI/ML research. The datasets are available read-only under COMMON_DATASETS=/proj/common_datasets. For version and license information, see the List of Common Datasets on Berzelius.
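As a minimal sketch, the dataset paths can be resolved from Python as shown below. It assumes that COMMON_DATASETS may or may not be set as an environment variable in your session, so it falls back to /proj/common_datasets as given above. The helper names (common_datasets_root, dataset_path, list_datasets) are hypothetical and are not part of any Berzelius tooling.

```python
import os
from pathlib import Path

# Location stated in this guide; used when $COMMON_DATASETS is not set.
DEFAULT_ROOT = "/proj/common_datasets"


def common_datasets_root(env=None):
    """Return the datasets root, preferring $COMMON_DATASETS over the default."""
    env = os.environ if env is None else env
    return Path(env.get("COMMON_DATASETS", DEFAULT_ROOT))


def dataset_path(name, env=None):
    """Build the path of a dataset directory such as 'MNIST' or 'COCO'."""
    return common_datasets_root(env) / name


def list_datasets(root):
    """List the sub-directories of root by name (empty list if root is missing)."""
    root = Path(root)
    if not root.is_dir():
        return []
    return sorted(p.name for p in root.iterdir() if p.is_dir())


if __name__ == "__main__":
    root = common_datasets_root()
    print(root, list_datasets(root))
```

Because the area is read-only, write any derived files (caches, preprocessed copies) to your own project storage rather than next to the datasets.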

Users are encouraged to contact us to request corrections, updates, or the addition of new datasets.

AlphaFold Genetic Databases

AlphaFold needs multiple genetic (sequence) databases to run:

  • BFD,
  • MGnify,
  • PDB70,
  • PDB (structures in the mmCIF format),
  • PDB seqres – only for AlphaFold-Multimer,
  • UniRef30 (formerly known as UniClust30),
  • UniProt – only for AlphaFold-Multimer,
  • UniRef90.

The dataset is available under $COMMON_DATASETS/AlphaFold.

COCO

The COCO (Common Objects in Context) dataset is a large-scale object detection, segmentation, and captioning dataset.

The dataset is available under $COMMON_DATASETS/COCO.

ImageNet

ImageNet is a large and widely used dataset in computer vision, particularly for image classification, object detection, and other visual recognition tasks. We provide the datasets for the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2012, including:

  • ILSVRC2012_img_train: training images (Task 1 & 2)
  • ILSVRC2012_img_train_t3: training images (Task 3)
  • ILSVRC2012_img_val: validation images (all tasks)
  • ILSVRC2012_img_test_v10102019: test images (all tasks)

To request access, open the ImageNet site and find the terms of use (http://image-net.org/download). Copy them, fill in your name where required, and email us the completed terms, thereby confirming that you agree to them. Once you have done this, we can grant you access to the copy of the dataset on the cluster.

The dataset is available under $COMMON_DATASETS/ImageNet.
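Since access to ImageNet is granted per user, it can be useful to check whether the directory is readable before starting a job. The following is a small sketch only: has_read_access is a hypothetical helper, and the check assumes that access is enforced through ordinary file permissions visible to os.access.

```python
import os


def has_read_access(path):
    """True if path is an existing directory the current user can list and enter."""
    return os.path.isdir(path) and os.access(path, os.R_OK | os.X_OK)


if __name__ == "__main__":
    # Falls back to the location given in this guide if the variable is unset.
    root = os.environ.get("COMMON_DATASETS", "/proj/common_datasets")
    imagenet = os.path.join(root, "ImageNet")
    if has_read_access(imagenet):
        print("ImageNet is readable:", imagenet)
    else:
        print("No access to", imagenet)
```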

MNIST

MNIST is a handwritten digit database used for image processing and machine learning algorithms.

Four files are available:

  • train-images-idx3-ubyte: training set images
  • train-labels-idx1-ubyte: training set labels
  • t10k-images-idx3-ubyte: test set images
  • t10k-labels-idx1-ubyte: test set labels

The dataset is available under $COMMON_DATASETS/MNIST.
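The four files use the IDX binary format: a big-endian header (a magic number followed by the dimension sizes) and then the raw unsigned bytes. A minimal sketch of a reader using only the standard library is shown below. It assumes the files are stored uncompressed under the names listed above; the function names are hypothetical.

```python
import struct
from pathlib import Path

IMAGE_MAGIC = 2051  # 0x00000803: unsigned bytes, 3 dimensions
LABEL_MAGIC = 2049  # 0x00000801: unsigned bytes, 1 dimension


def parse_idx_images(data):
    """Parse an IDX3 image buffer into (list of per-image bytes, rows, cols)."""
    magic, count, rows, cols = struct.unpack(">IIII", data[:16])
    if magic != IMAGE_MAGIC:
        raise ValueError(f"unexpected image magic number: {magic}")
    size = rows * cols
    pixels = data[16:]
    if len(pixels) != count * size:
        raise ValueError("image payload does not match the header")
    images = [pixels[i * size:(i + 1) * size] for i in range(count)]
    return images, rows, cols


def parse_idx_labels(data):
    """Parse an IDX1 label buffer into a list of integer labels."""
    magic, count = struct.unpack(">II", data[:8])
    if magic != LABEL_MAGIC:
        raise ValueError(f"unexpected label magic number: {magic}")
    labels = list(data[8:])
    if len(labels) != count:
        raise ValueError("label payload does not match the header")
    return labels


def load_mnist_split(directory, prefix):
    """Read one split; prefix is 'train' or 't10k' (assumes uncompressed files)."""
    directory = Path(directory)
    images, rows, cols = parse_idx_images(
        (directory / f"{prefix}-images-idx3-ubyte").read_bytes())
    labels = parse_idx_labels(
        (directory / f"{prefix}-labels-idx1-ubyte").read_bytes())
    return images, labels, rows, cols
```

For example, load_mnist_split("/proj/common_datasets/MNIST", "train") would read the training images and labels, provided the files are laid out as assumed.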

List of Common Datasets on Berzelius

Dataset                   Version Control     License
AlphaFold - BFD           No                  CC BY 4.0 Deed
AlphaFold - MGnify        2022_05             CC0
AlphaFold - PDB70         from_mmcif_200401   CC BY 4.0 Deed
AlphaFold - PDB           No                  CC0
AlphaFold - PDB seqres    No                  CC0
AlphaFold - UniRef30      2021_03             CC BY-SA 4.0 Deed
AlphaFold - UniProt       2022_05             CC BY 4.0
AlphaFold - UniRef90      2022_05             CC BY 4.0
AlphaFold - Parameters    2022-12-06          Apache License 2.0
COCO                      No                  CC BY 4.0
ImageNet                  No                  Terms of access
MNIST                     No                  CC BY-SA 3.0 Deed
