See this page for a basic guide on how to copy files to and from Centre Storage using scp, sftp, rsync, etc.
Note: the advanced multiple-pass rsync method described below might be useful when moving large amounts of data to and from Centre Storage.
When moving data within a project directory, you can use the normal Linux "mv" command.
mv -i /proj/someproject/users/x_abcde/dataset /proj/someproject/shared_datasets/
The -i option ensures that you won't accidentally overwrite existing files.
The above works regardless of whether
dataset is a single file, a directory, or a directory tree.
The destination directory
/proj/someproject/shared_datasets must exist, or mv will complain and refuse to move the files.
Inside a project directory,
mv is atomic and near-instantaneous (i.e. it will not copy files and then delete the originals).
The technology1 used to implement quota limits for project directories causes each project directory to appear as a separate file system to Linux. This means that when you move data between two project directories,
mv will not be atomic and near-instantaneous. Instead,
mv will copy each file and then delete the original copy (just as it would when moving files between physically separate disks).
This is significant when you move large amounts of data. If
mv is interrupted while it is running, any file not yet moved will remain in the original directory and any file already moved will remain in the destination directory. Restarting
mv after such an interruption is usually not possible, and you will have to recover manually, usually using cp+rm or rsync.
Due to this behavior, we recommend always using rsync when moving a large amount of data between two project directories. The example below will safely move (and rename to
dataset42) the directory tree
/proj/someproject/users/x_abcde/dataset.
Please note that you will need read access to all data that is being copied. If this is not the case, rsync will complain, but continue and copy the files that it can. So please check the rsync output carefully. If you do not have read access, or want to preserve file ownership, NSC will need to do the transfer for you. In this case, contact NSC Support to discuss your options.
In the example below, we will ask rsync to preserve as many properties of the moved files as possible. Please note that some things (e.g. file ownership) cannot be preserved unless you run rsync as root (which NSC would then have to do for you).
If you know that some of these options are not needed (e.g. if you know you have no hard links or sparse files), you can omit those options; this will speed up the transfer. You can find the full definition of these options in the rsync man page (run
man rsync to read it). Please note that
-a is shorthand for
-a, --archive       archive mode; equals -rlptgoD (no -H,-A,-X)
-r, --recursive     recurse into directories
-l, --links         copy symlinks as symlinks
-p, --perms         preserve permissions
-t, --times         preserve modification times
-g, --group         preserve group
-H, --hard-links    preserve hard links
-S, --sparse        handle sparse files efficiently
-v, --verbose       increase verbosity
-n, --dry-run       perform a trial run with no changes made
--delete            delete extraneous files from dest dirs
Check that the destination directory does not already exist
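One way to do this check is a small guard before running rsync (a sketch; the helper function name is our own, and the path is the example destination used below):

```shell
# check_dest DIR: succeed only if DIR does not exist yet, so rsync
# cannot accidentally merge into or overwrite an existing tree.
check_dest() {
    if [ -e "$1" ]; then
        echo "error: destination $1 already exists" >&2
        return 1
    fi
}

check_dest /proj/anotherproject/shared/dataset42 && echo "ok to proceed"
```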
First, create our rsync command and test it using the
--dry-run option (will not actually copy anything):
rsync -aHSv --dry-run /proj/someproject/users/x_abcde/dataset/ /proj/anotherproject/shared/dataset42
Please note the trailing slash ("/") in ".../dataset/". This is important when using rsync (as opposed to many other Linux commands). The reason is explained in the rsync man page.
If you see nothing strange in the output, run it without the
--dry-run option to actually copy the files.
rsync -aHSv /proj/someproject/users/x_abcde/dataset/ /proj/anotherproject/shared/dataset42
Now, have a quick look to see that the copied files are actually present in the destination directory. We also do a quick sanity check by comparing the total size of the directories.
ls -lR /proj/anotherproject/shared/dataset42
du -sh --apparent-size /proj/someproject/users/x_abcde/dataset/ /proj/anotherproject/shared/dataset42
If you feel happy with the result, now is the time to remove the original files:
rm -rvf /proj/someproject/users/x_abcde/dataset
If you are moving a large volume of data, it might take hours or days to copy it. If the directory tree is in active use and being written to, this can be a problem.
This method allows you to make an initial copy of the data, then stop all accesses to the original files and run one final rsync (which will be much faster since it only needs to copy data that has changed since the first rsync).
If you're the only person with access to the files, you can simply stop writing to it. :) If telling other people to stop writing to it is not an option, you can change the permissions of the top level directory so only you can access it (
chmod go= /proj/someproject/users/x_abcde/dataset). Another option is to rename the top level directory (
mv /proj/someproject/users/x_abcde/dataset /proj/someproject/users/x_abcde/dataset.hidden).
You can simply re-run the same rsync command you used for the initial copy. However, if files have been deleted in the original directory tree, those deletions will not be propagated to the destination side. You can ask rsync to delete files on the destination side that are not present in the original directory. THIS CAN BE DANGEROUS, so be very careful when using the --delete option.
To safely use
--delete in our example, first run once using
--dry-run:
rsync -aHSv --delete --dry-run /proj/someproject/users/x_abcde/dataset/ /proj/anotherproject/shared/dataset42
Then, if the output looks OK (i.e. no unexpected new files, updates or deletions), run again without
--dry-run to update the destination copy and delete any files removed in the original copy:
rsync -aHSv --delete /proj/someproject/users/x_abcde/dataset/ /proj/anotherproject/shared/dataset42
An example of how NOT to use
--delete (the destination directory
/proj/anotherproject/shared is wrong; the command would delete EVERYTHING in
/proj/anotherproject/shared except the files we are copying):
#DO NOT RUN THIS# rsync -aHSv --delete /proj/someproject/users/x_abcde/dataset/ /proj/anotherproject/shared
If this all sounds complicated and scary, NSC can help with large or complicated file transfers within Centre Storage. Contact NSC Support to discuss your options.
Rsync uses only a single thread to copy data. Sometimes this can be a bottleneck. Also, when copying data to and from the cluster over SSH, SSH will usually only use a single CPU core for encrypting the data, which can create a bottleneck if the network and the remote computer are fast enough.
fpsync tries to solve these two problems. It works by calling
fpart to split a directory tree into chunks. Each chunk is then handed over to a separate rsync process for transfer.
Since a chunk has to contain at least one file, this limits the possible speedup if you have very few files.
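Since each chunk must contain at least one file, the number of files caps the useful parallelism. A quick way to count the files in a tree (shown here on a throwaway /tmp directory so the commands are runnable as-is; point find at your own dataset instead):

```shell
# Build a small demo tree with three files.
tree=/tmp/countdemo
rm -rf "$tree"
mkdir -p "$tree/sub"
touch "$tree/a" "$tree/b" "$tree/sub/c"

# Count regular files in the tree.
find "$tree" -type f | wc -l    # 3
```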
However, fpsync has some issues you need to be aware of:
If you want to try fpsync, start by reading the man page (
man fpsync) and the example below.
If you're unsure about how to use fpsync or how many concurrent jobs it is safe to run, feel free to contact NSC Support.
In this example, we will use a compute node to copy a large directory tree from one project directory to another.
Note: This directory tree is not very large (126 GiB, 4018 files). In real life most people would probably just use rsync to copy it, but it is large enough to show the performance boost you can get from fpsync.
For this particular directory tree, the optimum number of concurrent rsync processes turned out to be just four.
If you are going to copy the same or similar directory trees many times, it might pay off to do some tests to determine what gives the best result for that specific tree. If not, we recommend using a low number (e.g. 4).
First, allocate a compute node:
[kronberg@tetralith0 ~]$ interactive -N1 --exclusive -t 24:00:00
salloc: Granted job allocation 10409514
srun: Step created for job 10409514
[kronberg@n190 ~]$
Copy the data using rsync and see how long it takes:
[kronberg@n190 ~]$ time rsync -aHS /proj/nsc/users/kronberg/fpsynctest/s1/ /proj/nsc-guest/users/kronberg/fpsynctest/rsync1

real    12m10.337s
While this is running, we can log in to the node using
jobsh and check CPU usage using
top. In this case we can see that significant amounts of CPU are used by two rsync processes (one reading, one writing, in total using between 0.2 and 1.5 CPU cores) and by "mmfsd" (a system process that communicates with the storage system, using less than 0.5 CPU cores).
The CPU usage and distribution between rsync and mmfsd will vary over time depending on e.g. the size of the files being copied.
Now, we run fpsync with 2, 4, and 8 workers/concurrent processes to see what the performance will be:
[kronberg@n190 ~]$ time fpsync -n 2 -o "-aHS" -O "-b" /proj/nsc/users/kronberg/fpsynctest/s1 /proj/nsc-guest/users/kronberg/fpsynctest/fpsync-2

real    9m33.082s

[kronberg@n190 ~]$ time fpsync -n 4 -o "-aHS" -O "-b" /proj/nsc/users/kronberg/fpsynctest/s1 /proj/nsc-guest/users/kronberg/fpsynctest/fpsync-4

real    6m4.310s

[kronberg@n190 ~]$ time fpsync -n 8 -o "-aHS" -O "-b" /proj/nsc/users/kronberg/fpsynctest/s1 /proj/nsc-guest/users/kronberg/fpsynctest/fpsync-8

real    17m22.176s
Note: there are several rsync options that can have a significant performance impact.
If you know that you don't have any hard links in your data (links created with "ln" rather than "ln -s"), you can skip the '-H' option.
If you know that you have no sparse files in your data, you can skip the "-S" option.
IBM Spectrum Scale "filesets"↩