To avoid data duplication and save hard drive space, we provide access to a selection of public datasets frequently used in AI/ML research. The datasets are available read-only under COMMON_DATASETS=/proj/common-datasets.
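As a minimal sketch of how a script might locate these datasets, the helper below builds a dataset path from the COMMON_DATASETS environment variable and falls back to /proj/common-datasets when the variable is unset. The function names `dataset_path` and `is_available` are hypothetical and not part of any cluster tooling; whether the variable is already exported in your session is also an assumption.

```python
import os
from pathlib import Path

# Default root taken from the text above; an exported COMMON_DATASETS wins if set.
DEFAULT_ROOT = "/proj/common-datasets"


def dataset_path(name: str) -> Path:
    """Return the directory of a common dataset, e.g. dataset_path("COCO")."""
    root = Path(os.environ.get("COMMON_DATASETS", DEFAULT_ROOT))
    return root / name


def is_available(name: str) -> bool:
    """True if the dataset directory exists and the current user can list it."""
    path = dataset_path(name)
    return path.is_dir() and os.access(path, os.R_OK | os.X_OK)


if __name__ == "__main__":
    for name in ("COCO", "MNIST", "ImageNet"):
        status = "readable" if is_available(name) else "not accessible"
        print(f"{dataset_path(name)}: {status}")
```

A check like this is mainly useful for datasets that require an access request first, since their directories may exist without being readable by your account.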
Please refer to the List of Common Datasets on Berzelius for version and license information.
Users are encouraged to contact us to request corrections, updates, or the addition of new datasets.
AlphaFold needs multiple genetic (sequence) databases to run: BFD, MGnify, PDB70, PDB, PDB seqres, UniRef30, UniProt, and UniRef90.
The dataset is available under $COMMON_DATASETS/AlphaFold.
The COCO (Common Objects in Context) dataset is a large-scale object detection, segmentation, and captioning dataset.
The dataset is available under $COMMON_DATASETS/COCO.
ImageNet is a large, widely used computer vision dataset for image classification, object detection, and other visual recognition tasks. We provide the datasets for the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2012, including
We also provide the training and validation images in both LMDB and TFRecord formats.
To get access, open the ImageNet site (http://image-net.org/download), find the terms of use, copy them, and fill in your name where required. Then send us an email containing the completed terms, thereby confirming that you agree to them. Once you have done this, we can grant you access to the copy of the dataset on the cluster.
The dataset is available under $COMMON_DATASETS/ImageNet.
MNIST is a handwritten digit database used for image processing and machine learning algorithms.
Four files are available:
The dataset is available under $COMMON_DATASETS/MNIST.
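MNIST is traditionally distributed in the IDX binary format. Assuming the copy on the cluster keeps that format (this page does not confirm it), a file can be read with the standard library alone, as in the sketch below. The function name `read_idx` is hypothetical, and gzip compression is detected only from a `.gz` suffix.

```python
import gzip
import struct
from pathlib import Path


def read_idx(path):
    """Read an IDX file of unsigned bytes and return (shape, data).

    `data` is a flat bytes object in row-major order.
    """
    path = Path(path)
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rb") as f:
        raw = f.read()

    # Header: two zero bytes, a type code, and the number of dimensions.
    zero, type_code, ndim = struct.unpack(">HBB", raw[:4])
    if zero != 0 or type_code != 0x08:
        raise ValueError(f"{path}: not an unsigned-byte IDX file")

    # Each dimension size is a 4-byte big-endian unsigned integer.
    shape = struct.unpack(f">{ndim}I", raw[4 : 4 + 4 * ndim])
    data = raw[4 + 4 * ndim :]

    # Sanity check: the payload must match the product of the dimensions.
    expected = 1
    for size in shape:
        expected *= size
    if len(data) != expected:
        raise ValueError(f"{path}: expected {expected} bytes, found {len(data)}")
    return shape, data
```

Calling `read_idx` on an image file should yield a three-element shape (count, rows, columns), and on a label file a one-element shape. Pass whichever file names actually exist in the MNIST directory; none are assumed here.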
This repository includes three datasets of plankton images manually annotated by phytoplankton experts at the Swedish Meteorological and Hydrological Institute (SMHI).
The dataset is available under $COMMON_DATASETS/SMHI-IFCB-Plankton.
The SYKE-plankton_IFCB_2022 dataset consists of approximately 63,000 images representing 50 different classes of phytoplankton, collected using the Imaging FlowCytobot (IFCB) from various locations in the Baltic Sea. These images were manually annotated by expert taxonomists and are used to develop and evaluate classification methods for phytoplankton recognition.
The dataset is available under $COMMON_DATASETS/SYKE-plankton_IFCB_2022.
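Before training a classifier on a dataset like this, it can help to check the class balance. The sketch below counts image files per class under the assumption that the directory holds one subdirectory per class. That layout and the list of image suffixes are guesses, not something this page documents, and the function name `images_per_class` is hypothetical.

```python
from collections import Counter
from pathlib import Path

# Assumed image file suffixes; adjust after inspecting the actual directory.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def images_per_class(dataset_dir):
    """Count image files under each immediate subdirectory of dataset_dir."""
    counts = Counter()
    for class_dir in sorted(Path(dataset_dir).iterdir()):
        if not class_dir.is_dir():
            continue  # skip loose files such as a README
        counts[class_dir.name] = sum(
            1
            for f in class_dir.rglob("*")
            if f.is_file() and f.suffix.lower() in IMAGE_SUFFIXES
        )
    return counts
```

For example, `images_per_class("/proj/common-datasets/SYKE-plankton_IFCB_2022").most_common()` would list the classes from largest to smallest, provided the layout assumption holds.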
The SYKE-plankton_IFCB_Utö_2021 dataset is a collection of approximately 150,000 images of phytoplankton, classified into 50 distinct categories, with an additional set of about 94,000 unclassifiable images. The dataset was collected using an Imaging FlowCytobot (IFCB) at the Utö Atmospheric and Marine Research Station in the Baltic Sea during 2021.
The dataset is available under $COMMON_DATASETS/SYKE-plankton_IFCB_Utö_2021.
The Waymo Open Dataset is a publicly available dataset provided by Waymo, focused on autonomous driving technology. This dataset is designed to advance research and development in the field of autonomous driving by providing high-quality, diverse, and large-scale data collected from Waymo's fleet of autonomous vehicles.
To get access to the dataset, you need to:
Once you do this, send us an email and we can grant you access to the copy of the dataset on the cluster.
The dataset is available under $COMMON_DATASETS/Waymo.
WHOI-Plankton is a comprehensive dataset of annotated plankton images developed by researchers at the Woods Hole Oceanographic Institution (WHOI). The dataset contains over 3.5 million images of microscopic marine plankton, categorized into 103 classes. These images are used primarily for developing and evaluating visual recognition models in plankton classification.
The dataset is available under $COMMON_DATASETS/WHOI-Plankton.
Dataset | Version Control | License |
---|---|---|
AlphaFold - BFD | No | CC BY 4.0 Deed |
AlphaFold - MGnify | 2022_05 | CC0 |
AlphaFold - PDB70 | from_mmcif_200401 | CC BY 4.0 Deed |
AlphaFold - PDB | No | CC0 |
AlphaFold - PDB seqres | No | CC0 |
AlphaFold - UniRef30 | 2021_03 | CC BY-SA 4.0 Deed |
AlphaFold - UniProt | 2022_05 | CC BY 4.0 |
AlphaFold - UniRef90 | 2022_05 | CC BY 4.0 |
AlphaFold - Parameters | 2022-12-06 | Apache License 2.0 |
COCO | No | CC BY 4.0 |
ImageNet | No | Terms of access |
MNIST | No | CC BY-SA 3.0 Deed |
SMHI IFCB Plankton | version 2 | CC BY 4.0 |
SYKE-plankton_IFCB_2022 | 20220201 | CC BY 4.0 |
SYKE-plankton_IFCB_Utö_2021 | 20220428 | CC BY 4.0 |
Waymo Open Dataset - Motion Dataset | 1.2.1 | License Agreement |
Waymo Open Dataset - Perception Dataset | 2.0.1 | License Agreement |
WHOI-Plankton | No | MIT License |