Tape Storage Access Guide

1 Quick-Start Guide

Set up SSH agent forwarding. Presuming you start at your local workstation or laptop (which you should) do:

  1. $ ssh-keygen # Follow on-screen directions. Set a strong password
  2. $ scp $HOME/.ssh/id_rsa.pub your_username@target_server:
  3. $ ssh your_username@target_server
  4. your_username@target_server $ cat $HOME/id_rsa.pub >> $HOME/.ssh/authorized_keys && chmod 600 $HOME/.ssh/authorized_keys
  5. <logout>
  6. $ ssh-add -c -t 8h # Prompts for the key password and loads your key into the agent
  7. $ ssh -A your_username@target_server # You should be logged in automatically, without a password dialog
  8. your_username@target_server $ echo $SSH_AUTH_SOCK # Verifies that agent forwarding is working. Should return something like /tmp/ssh-XXXXXXXXXX/agent.XXXX

That should do it. Replace 'your_username' and 'target_server' as appropriate in the above. The '-A' switch to ssh is crucial for accessing the tape front-end. When this works, notify 'smhi-support@nsc.liu.se' and ask for access to the tape storage.
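Step 8 can be made slightly stricter before you write to support. The following is a minimal sketch, not an NSC-provided tool: the function name check_agent is hypothetical. Run on target_server, it checks that SSH_AUTH_SOCK points at a socket and that the agent behind it reports at least one key.

```shell
# check_agent: hypothetical helper, not part of any NSC tooling.
# Returns 0 only if an agent socket is available and holds at least one key.
check_agent() {
    # On a remote host, SSH_AUTH_SOCK is set when agent forwarding is in effect.
    if [ -z "${SSH_AUTH_SOCK:-}" ]; then
        echo "SSH_AUTH_SOCK is not set: agent forwarding is not active" >&2
        return 1
    fi
    # The variable must point at a Unix socket.
    if [ ! -S "$SSH_AUTH_SOCK" ]; then
        echo "$SSH_AUTH_SOCK is not a socket" >&2
        return 1
    fi
    # ssh-add -l lists loaded keys; it exits non-zero if there are none
    # or if the agent cannot be contacted.
    ssh-add -l >/dev/null 2>&1 || {
        echo "no keys loaded, or the agent is not responding" >&2
        return 1
    }
    echo "agent forwarding looks fine"
}
```

Calling check_agent right after logging in with 'ssh -A' should print the last message; any other message points at the step that needs to be redone.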

Once your access to the tape storage front-end has been set up, you can reach it like this (shown here from krypton):

your_username@krypton $ lftp -u your_username, sftp://hubble.nsc.liu.se #The comma is intentional

2 Physical Setup

The tape storage access at NSC is provided by means of a front-end server interfacing back-end servers directly communicating with the tape robot. The name of the front-end server is hubble.nsc.liu.se and this name is the only practical aspect of the physical setup you as the end user need to remember. More on that below. It is important, however, to know this setup in order to have realistic expectations on the accessibility of your stored data in terms of time of retrieval and upload.

Let's take an example: Suppose you want to retrieve a 20 GB file from the front-end. The retrieval request is handed to the back-end server, which tells the tape robot to fetch the tape containing the file and wind it to the correct offset. This can take up to 2 minutes, provided there are tape drives available. The tape drive then reads the tape at a nominal data rate of 80 MB/s – 140 MB/s (at the time of writing), depending on which technology generation of tape the data resides on. I write nominal because, if the data is fragmented on the tape, winding the tape backward and forward adds overhead. This overhead can be significant.

When the data is transferred to the front-end (hubble, remember) there is another bottleneck to consider: the network speed. The current 1 Gbit/s Ethernet cannot realistically move data faster than ~ 100 MB/s. So, to continue our example, let's assume you get a data transfer rate of 100 MB/s. Your 20 GB file (20 000 MB) will then be transferred to the front-end in ∼ 3 minutes. Once the data is fully retrieved to the front-end, it starts to be uploaded to its destination, most likely a cluster where you are working. From the NSC systems you can probably get a data transfer rate of 40 – 90 MB/s from the front-end, depending on your choice of encryption cipher. This adds a further lag of roughly 4 – 8 minutes before your 20 GB file arrives, for a grand total of about 9 – 14 minutes to get the file. Do note that this is a best-case scenario.
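The estimate can be redone for other file sizes with a few lines of shell arithmetic. This is only a rough sketch using the rates quoted in the example; the function name transfer_seconds is hypothetical, and integer division makes the results approximate.

```shell
# transfer_seconds SIZE_MB RATE_MB_PER_S
# Integer estimate of the time one stage needs to move SIZE_MB megabytes.
transfer_seconds() {
    echo $(( $1 / $2 ))
}

size_mb=20000   # the 20 GB example file
mount_s=120     # up to 2 minutes to fetch and wind the tape

to_frontend_s=$(transfer_seconds "$size_mb" 100)  # to the front-end at ~100 MB/s
fast_s=$(transfer_seconds "$size_mb" 90)          # front-end to cluster, 90 MB/s
slow_s=$(transfer_seconds "$size_mb" 40)          # front-end to cluster, 40 MB/s

best_total_s=$(( mount_s + to_frontend_s + fast_s ))
worst_total_s=$(( mount_s + to_frontend_s + slow_s ))

echo "tape to front-end: ${to_frontend_s} s"
echo "front-end to you:  ${fast_s} - ${slow_s} s"
echo "total:             ${best_total_s} - ${worst_total_s} s"
```

For the 20 GB example the totals come out at 542 s and 820 s, that is roughly 9 to 14 minutes, still assuming a free tape drive and unfragmented data.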

3 Data Integrity Considerations

In the storing of tape data there is no checksumming of your files except at the network transfer stages, meaning that you are on your own when it comes to protecting your data integrity on tape. Unfortunately there is precious little you can do in terms of data protection except to store your data in the doublecopy group. Effectively you store your data in two copies on different tapes this way. This should not be considered a backup solution since a disastrous event in the tape robot can very likely destroy both copies of your data. Possibly you can view this as a half insured car.

Regardless of how you choose to store your tape data, you will need a means to judge its integrity. This is best done using checksums: checksum all data files individually and store a text file listing the files and their checksums together with the data itself. The details of our recommended way to do this are outlined below.

3.1 Tape Archive File Creation

The instructions assume you have a directory hierarchy which is to be archived in its entirety. An alternative way to do this based on a file list will also be outlined. First create the list of checksums. This example uses 'sha1sum'. The command 'md5sum' works equally well and may be used instead.

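$ find . -maxdepth 1 -type f ! -name sha1sums.txt -exec sha1sum '{}' \; > sha1sums.txt

This assumes you are only interested in the files of the current directory. The '! -name sha1sums.txt' test keeps the checksum file, which the shell creates before 'find' starts, from being listed in itself with a checksum that is wrong by the time the command finishes. Omitting the '-maxdepth' option and argument will make 'find' traverse the entire file tree from this directory and down. Next create the archive, again excluding the output file itself.

$ find . -maxdepth 1 -type f ! -name archive.tar.gz | tar czf archive.tar.gz -T -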

Note that this archive will contain the sha1sums.txt file, thus making it self-contained with respect to integrity checking. It is, however, good practice to also keep a second copy of the sha1sums.txt file on the front-end as a backup in case the copy inside the archive gets damaged.
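The file-list variant mentioned at the start of this section could look like the sketch below. It is an illustration under assumptions, not a tested NSC procedure: the function name archive_from_list is hypothetical, and the list file is assumed to hold one path per line, end with a newline, and contain no file names with embedded newlines.

```shell
# archive_from_list LIST ARCHIVE
# LIST holds one file path per line (assumed input format).
# Writes sha1sums.txt in the current directory and packs it with the files.
archive_from_list() {
    list=$1
    archive=$2

    # Checksum each listed file; read -r keeps backslashes intact.
    while IFS= read -r f; do
        sha1sum "$f" || return 1
    done < "$list" > sha1sums.txt

    # Feed the listed files plus the checksum file to tar on stdin.
    { cat "$list"; echo sha1sums.txt; } | tar czf "$archive" -T -
}
```

A call such as 'archive_from_list filelist.txt archive.tar.gz' then produces the same kind of self-contained archive as the 'find' based commands.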

3.2 Tape Archive Integrity Checking

Suppose you have downloaded a crufty old archive created in accordance with the above instructions. How do you go about verifying the integrity of the archived files? This is where the checksum file comes in. The 'sha1sum' and 'md5sum' commands can take a checksum file, created in the above manner, as an argument and verify the integrity of the files listed in it. Following the above example, this is how:

$ cp sha1sums.txt sha1sums.txt.bak # Set aside your separately kept checksum file, if you have one
$ tar xzf archive.tar.gz # Extract the archive, including its own sha1sums.txt
$ diff sha1sums.txt sha1sums.txt.bak # The two checksum files should be identical
$ sha1sum -c sha1sums.txt # Verify every file against its checksum

Extracting the archive overwrites any sha1sums.txt already in the directory, so copy your second copy of the file aside first, as in the first command above. The checksum utility reports on each file checked and prints a warning summary at the end if any check failed.
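The same check can be wrapped in a function that extracts into a scratch directory, so nothing in the current directory is overwritten. This is a sketch under assumptions: the function name verify_archive is hypothetical, and the scratch directory (created by mktemp, which honours TMPDIR) must have room for the extracted data.

```shell
# verify_archive ARCHIVE CHECKSUM_COPY
# Extracts ARCHIVE into a scratch directory, compares the embedded
# sha1sums.txt with the separately kept copy, then verifies every file.
verify_archive() {
    archive=$1
    checksum_copy=$2
    scratch=$(mktemp -d) || return 1

    tar xzf "$archive" -C "$scratch" || return 1

    # A difference here means one of the two checksum files is damaged.
    if ! cmp -s "$scratch/sha1sums.txt" "$checksum_copy"; then
        echo "checksum file in archive differs from the kept copy" >&2
        return 2
    fi

    # sha1sum -c resolves relative paths, so run it inside the scratch directory.
    ( cd "$scratch" && sha1sum -c sha1sums.txt ) || return 3
    echo "verified files are in $scratch"
}
```

The distinct return codes (2 for a checksum-file mismatch, 3 for a failed file check) make it possible to tell a damaged checksum list from damaged data.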

4 Accessing the Tape Library Front-End

As mentioned in the "Physical Setup" section, the front-end server for accessing the tape library is called hubble. Hubble has a special SSH setup that only allows connections using the SFTP protocol; once connected, you land on what is, for all intents and purposes here, a normal FTP server. Users are separated on a Unix group basis (e.g. rossby, smhid) and cannot access the files of other groups. In the directory where you land there are two directories of note, 'nobackup' and 'dblcopy', under which you can upload and download your files. To lessen the risk of confusion, we recommend that you create a directory named after your NSC user name under each of these and use it for your data transfers. As the name suggests, the 'nobackup' directory is not backed up, whereas all content of the 'dblcopy' directory is automatically duplicated to different tapes. This is not a backup, though, since these tapes reside in the same physical tape library.

Authentication when accessing hubble is done by means of SSH key pairs, which need to be generated and verified to work before proceeding. The recommended FTP client is lftp, as detailed below.

4.1 SSH Agent forwarding

A prerequisite to use SSH agent forwarding is to set up SSH key pair logins to all places you connect to on your way to an NSC SMHI cluster. Starting to use SSH key pairs for authentication is very simple.

First create the key pair on your local laptop or workstation:
me@my_machine:~$ ssh-keygen

Follow the on-screen dialog that appears and accept the defaults. Choose a strong and unique password for the generated keys when that option appears. A good rule of thumb is to use at least twelve characters, mixing lower- and uppercase letters, digits and punctuation characters (.!?_- etc.).

If you follow the instructions later in this text, a hard-to-remember password will not be especially troublesome for you, so choose a really strong one. The keys are put in $HOME/.ssh/ by default and are called id_rsa and id_rsa.pub (also by default); the latter is the public part of the key pair. The private part should only be kept on your local laptop or workstation, to prevent anyone but you from getting their hands on it.

Then, of the two parts of the SSH key pair, put the public part in a specific file on all hosts you want to log in to using the key.

The following commands will do the right thing on all NSC systems, and most likely on the vast majority of SSH-enabled systems in the world. I will use Krypton as an example from here on. First copy the public key part to Krypton.

me@my_machine:~$ scp ~/.ssh/id_rsa.pub your_username@krypton.nsc.liu.se:
# Log in to Krypton and append the public key to your "authorized_keys" file
your_username@krypton:~$ cat id_rsa.pub >> ~/.ssh/authorized_keys

… and you are done. Next time you log in to Krypton via ssh you will be using your fresh keys.
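The append step can be made safe to repeat, so that running it twice does not duplicate the key. The following is a minimal sketch with a hypothetical function name, install_pubkey; the 700 and 600 permissions are the conventional restrictive settings for the .ssh directory and the authorized_keys file.

```shell
# install_pubkey PUBKEY_FILE [HOME_DIR]
# Appends the public key to HOME_DIR/.ssh/authorized_keys unless the exact
# line is already present, and applies restrictive permissions.
install_pubkey() {
    pubkey=$1
    home_dir=${2:-$HOME}
    ssh_dir=$home_dir/.ssh
    auth=$ssh_dir/authorized_keys

    mkdir -p "$ssh_dir" && chmod 700 "$ssh_dir" || return 1
    touch "$auth" && chmod 600 "$auth" || return 1

    # -F fixed string, -x whole line, -q quiet, -f read patterns from the key file.
    if grep -qxF -f "$pubkey" "$auth"; then
        echo "key already present in $auth"
    else
        cat "$pubkey" >> "$auth"
        echo "key added to $auth"
    fi
}
```

On Krypton, after the scp above, this would be called as 'install_pubkey ~/id_rsa.pub'.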

From a usability point of view, though, nothing much has changed yet: you will still be prompted for a password every time you log in, albeit the password to your private key. To increase your comfort you can have your laptop supply the key automatically using an "ssh-agent". On your laptop do:
me@my_machine:~$ ssh-add

You will be prompted for the password to your ssh keys. The unlocked key is then kept in your local machine's memory, and all subsequent logins to remote hosts (where you have prepared your $HOME/.ssh/authorized_keys file as above) will be automatic from your perspective, with no password required.

Now, if this convenience wasn't already compelling, there's more. It is called agent forwarding, and it is needed to access the tape storage front-end. Adding the option "-A" to all your ssh commands allows you to safely perform "chained" logins (very bad security practice without agent forwarding), for instance mymachine > gimle > vagn > ekman. Basically, agent forwarding tells ssh to look backwards in the chain and authenticate against the origin, in this case your local laptop or workstation. All this happens without you entering your password.

If you want to be a really good, security-minded ssh user you should also add the options "-c" and "-t" to your "ssh-add" command. The "-c" option will prompt you to acknowledge every use of the key so you can correlate it against your activities and the "-t" option takes an argument consisting of a time of validity for your session. For instance, the argument "9h" will let the ssh-agent work for you for nine hours.
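Typing "-A" on every command is easy to forget. As an alternative (our assumption, not something this guide prescribes), the OpenSSH client configuration file has a ForwardAgent option that can be enabled per host. The sketch below appends such an entry; the function name add_forward_agent is hypothetical, and forwarding is best limited to hosts you trust.

```shell
# add_forward_agent HOST [CONFIG_FILE]
# Appends a Host entry that enables agent forwarding for HOST only.
add_forward_agent() {
    host=$1
    config=${2:-$HOME/.ssh/config}

    # Do nothing if an entry for exactly this host already exists.
    if [ -f "$config" ] && grep -qxF "Host $host" "$config"; then
        echo "a Host entry for $host already exists in $config" >&2
        return 0
    fi

    mkdir -p "$(dirname "$config")" || return 1
    cat >> "$config" <<EOF

Host $host
    ForwardAgent yes
EOF
    chmod 600 "$config"
}
```

After 'add_forward_agent krypton.nsc.liu.se', a plain 'ssh your_username@krypton.nsc.liu.se' should behave as if "-A" had been given.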

4.2 FTP Client Access

The preferred FTP client on NSC clusters is lftp. This is a sophisticated FTP client capable of, for instance, recursive copying of directory trees. The features of lftp are beyond the scope of this guide; for those you will need to check the lftp man page.

Accessing hubble with lftp is done with:
$ lftp -u your_username, sftp://hubble.nsc.liu.se #The comma is intentional, leave it there

As a backup you can use
$ sftp hubble.nsc.liu.se

which will use the much less sophisticated sftp client shipped with the ssh software.
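For repeated uploads, the lftp commands can be collected in a command file. The sketch below only generates such a file and contacts nothing; the function name make_upload_script is hypothetical, the target directory follows the recommendation to use a directory named after your user name under 'dblcopy', and file names are assumed to contain no spaces.

```shell
# make_upload_script USERNAME FILE...
# Prints lftp commands that upload the given files to dblcopy/USERNAME on hubble.
make_upload_script() {
    user=$1
    shift
    echo "open -u $user, sftp://hubble.nsc.liu.se"   # the comma is intentional
    echo "mkdir -p dblcopy/$user"
    echo "cd dblcopy/$user"
    for f in "$@"; do
        echo "put $f"
    done
    echo "exit"
}

# Usage (not run here): write the commands to a file and hand it to lftp.
# make_upload_script your_username archive.tar.gz sha1sums.txt > upload.lftp
# lftp -f upload.lftp
```

Reviewing the generated file before running it is a cheap way to catch a wrong path before any data is moved.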






Page last modified: 2012-08-31 12:08
For more information contact us at info@nsc.liu.se.