FFV - an asynchronous file transfer tool

Description

ffv is an asynchronous file transfer tool for moving large amounts of data locally or remotely. It was developed by NSC to simplify moving data between Ekman and Vagn, but it also works between other hosts if available.

The asynchronous nature of ffv makes it possible to detach the process of moving data from other types of work.

The idea is simple: when a job is submitted, a script is created on disk containing all the code needed to move the data and handle errors (though it depends on external binaries). The script is constructed so that only one process at a time may execute certain critical functions within it, which allows it to be run from cron every minute without causing problems. This lets ffv execute the job without the submitting user being logged in, and it can also retry execution when it detects certain transient errors.
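
The "one process at a time" property can be obtained in several ways. The sketch below uses an atomic mkdir as a lock, which is a common shell idiom; it is not taken from the script that ffv generates, and the names run_exclusive, job-script, job.lock and critical_step are hypothetical.

```shell
#!/bin/sh
# Hypothetical sketch, not ffv's generated script: run a command only when
# no other invocation holds the lock. mkdir either creates the directory or
# fails atomically, so at most one caller gets past it.
run_exclusive() {
    _lockdir=$1
    shift
    if ! mkdir "$_lockdir" 2>/dev/null; then
        return 75   # lock is held; 75 is EX_TEMPFAIL in sysexits.h
    fi
    _status=0
    "$@" || _status=$?
    rmdir "$_lockdir"
    return "$_status"
}

# Example crontab entry (path is a placeholder), run every minute:
#   * * * * * /path/to/job-script
# where job-script wraps its critical step as:
#   run_exclusive /path/to/job.lock critical_step
```

If an invocation started by cron finds the lock taken, it simply exits and the next minute's run tries again. A production script would also need to deal with a stale lock left behind by a crashed process, which this sketch does not attempt.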

So you submit a job that describes what to move and where to move it, while also letting ffv authenticate on its own (see Security) without storing any credentials itself. After the job has been created, ffv takes care of the rest until the job is finished, an error occurs, or authentication is needed. In its simplest form, a job is created with:

  ffv submit /path/to/large/directory/tree otherhost:/path/to/dest/dir

(or the other way around, of course), resulting in a job id that is used as a handle for this job on this host. And that's it. After submission you just have to wait for a mail (optional) telling you what happened to the job, and act on it accordingly.

Features

How does it work?

ffv uses rsync for the file transfer, and jobs are started from cron. The normal workflow is roughly:

  1. At job submission: move source file/directory into a temporary spool directory within the same directory.

  2. Run rsync on the spool directory until it succeeds and no more files are transferred (state transfer).

  3. Run rsync on the spool directory using checksum information instead of file metadata until it succeeds and no more files are transferred (state verify).

  4. At the destination: move the spool directory to the target destination. At the source (unless --no-source-delete is given): remove the temporary spool directory (state cleanup).
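
The "repeat until successful and nothing more is transferred" rule of the transfer and verify states can be sketched as a small loop. This is an illustration under assumptions, not ffv's actual code: the function name sync_until_quiet and the retry limit are hypothetical, and the loop relies on rsync's --itemize-changes option printing one line per updated item (so empty output means nothing was sent).

```shell
#!/bin/sh
# Hypothetical sketch of the transfer and verify states; not ffv's code.
# Repeat a sync command until it exits successfully AND lists no changed
# items. The command must print one line per transferred item, as
# `rsync --itemize-changes` does.
sync_until_quiet() {
    _max_tries=$1   # retry limit is an assumption made for this sketch
    shift
    _try=0
    while [ "$_try" -lt "$_max_tries" ]; do
        _try=$((_try + 1))
        if _changes=$("$@") && [ -z "$_changes" ]; then
            return 0    # successful pass with nothing left to send
        fi
    done
    return 1            # gave up; a real tool would notify the user here
}

# Example wiring (SPOOL and DEST are placeholders):
#   state transfer, comparing file metadata:
#     sync_until_quiet 10 rsync -a --itemize-changes "$SPOOL/" "$DEST/"
#   state verify, comparing checksums:
#     sync_until_quiet 10 rsync -a --checksum --itemize-changes "$SPOOL/" "$DEST/"
```

Requiring a final pass that transfers nothing means the last comparison ran against a destination that already matched the spool directory, first by metadata and then by checksum.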

Restrictions

Requirements

Local requirements:

On the remote host (if any) ffv requires:

Security

Example

Example usage when moving a file tree from ekman to vagn. This requires passphrase-protected public key login to vagn; see http://www.nsc.liu.se/systems/snic/security.html. A man page with examples is also available (man ffv).

Job creation

  1. (On workstation) load private key into key agent:
          $ ssh-add -c
    
  2. (On workstation) log in to ekman with agent forwarding (-A) and forwarding of Kerberos tickets:
          $ ssh -A -K -o GSSAPIDelegateCredentials=yes -o GSSAPIKeyExchange=yes -o GSSAPIAuthentication=yes myuser@ekman.pdc.kth.se
    
  3. (On Ekman) load the default ffv module:
          $ module load beta-modules
          $ module load ffv
    
  4. (On Ekman) submit an ffv job that moves all files within ~/data/ to /home/vagnuser on vagn (user vagnuser). A job id is returned:
          $ ffv submit -u vagnuser ~/data vagn.nsc.liu.se:/home/vagnuser
          0.myuser.ekman
    

    This job id is a unique handle for the job at ekman for the user myuser. 0.myuser.ekman may be shortened to 0 or 0.myuser when using ffv as myuser on ekman.

  5. By default, progress is sent asynchronously by mail to myuser@ekman (which I redirect using ~/.forward) and written to the file ~/ffv.0.myuser.ekman.out.
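
The job id shortening described in step 4 can be illustrated with a small helper that expands a short id back to its full form. The function expand_jobid is hypothetical and only mirrors the NUMBER.USER.HOST pattern shown above; how ffv itself resolves short ids is not documented here.

```shell
#!/bin/sh
# Hypothetical helper: expand a shortened job id to NUMBER.USER.HOST.
# Assumes user and host names contain no dots, as in the ids shown above.
expand_jobid() {
    _id=$1 _user=$2 _host=$3
    case $_id in
        *.*.*) printf '%s\n' "$_id" ;;                          # already complete
        *.*)   printf '%s.%s\n' "$_id" "$_host" ;;              # NUMBER.USER
        *)     printf '%s.%s.%s\n' "$_id" "$_user" "$_host" ;;  # NUMBER only
    esac
}

# Example: expand_jobid 0 myuser ekman   prints 0.myuser.ekman
```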

Other operations

Note that while a job is ongoing, data is temporarily moved into a spool directory. Do not tamper with this directory, or you may end up with incomplete data at the destination host after the job has finished.

FAQ

  1. Q: When I issue a command to cancel jobs in state auth-revert I get a message that they are not cancellable. Why?

    A: The jobs are already being cancelled; you need to run 'ffv auth' on them to complete the cancellation. Authentication is needed so that ffv can remove the temporary (source or target) spool directory on the remote host.

    In general, jobs in states with prefix 'auth-' are paused until the user executes 'ffv auth', and jobs in states with prefix 'paused-' are paused until the user executes 'ffv resume'.
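
The prefix rule above maps directly onto a shell case statement. The helper below is a hypothetical illustration of that mapping only; it does not query ffv, and the state name paused-example used in the comment is invented for the example.

```shell
#!/bin/sh
# Hypothetical helper: name the ffv subcommand that unblocks a job,
# given its state. Prints nothing for states that need no user action.
unblock_subcommand() {
    case $1 in
        auth-*)   echo "auth" ;;     # e.g. auth-revert: needs 'ffv auth'
        paused-*) echo "resume" ;;   # e.g. paused-example: needs 'ffv resume'
        *)        : ;;               # running or finished: nothing to do
    esac
}
```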

  2. Q: How do I submit ffv jobs non-interactively, for example from within a batch queueing system, when access to a remote host is needed?

    A: You need to manually create an SSH control socket that is used by ffv. Example:

    $ SOCK=/tmp/$USER.$$.sock
    $ ssh -Nn -f -o ControlMaster=yes -o ControlPath="$SOCK" myuser@other.host
    # or, even shorter:
    $ ssh -MNnf -S "$SOCK" myuser@other.host
    $ ffv submit -S ControlPath="$SOCK" -u myuser /PATH/TO/DATA/SET other.host:/REMOTE/PATH
    

    As long as "$SOCK" is alive, new ffv jobs may be submitted without any manual confirmation (i.e. you don't need to acknowledge access to the local SSH key agent).

    Also read the man page for ssh_config about ControlMaster and ControlPath. The man page for ffv also has some more information about this.

    Warning: When using sockets this way you should be aware that if your account on this host is stolen, any such existing socket owned by you would be usable to get access to your account on other.host too. That's why you shouldn't let these sockets exist forever.
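
One way to keep such sockets short-lived is to close them explicitly once the submissions are done. The sketch below uses the OpenSSH control commands `-O check` and `-O exit`; the function name close_control_socket is hypothetical, and the socket path and host are the same placeholders as in the example above.

```shell
#!/bin/sh
# Sketch: close an SSH control socket so it does not outlive its use.
# `ssh -O check` tests whether the master connection is still running and
# `ssh -O exit` asks it to terminate, which also removes the socket file.
close_control_socket() {
    _sock=$1 _target=$2
    if ssh -S "$_sock" -O check "$_target" 2>/dev/null; then
        ssh -S "$_sock" -O exit "$_target"
    fi
}

# Example, using the placeholders from the answer above:
#   close_control_socket "$SOCK" myuser@other.host
```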

  3. Q: How do I submit ffv jobs non-interactively, from the batch queueing system on ekman?

    A: You need a control socket (see above), and you must forward Kerberos credentials correctly.

    Use ekman-rsync as the file transfer node. Log in to ekman-rsync and create a socket to vagn for ffv:

    $ ssh -A -K -o GSSAPIDelegateCredentials=yes -o GSSAPIKeyExchange=yes -o GSSAPIAuthentication=yes username@ekman-rsync.pdc.kth.se
    $ echo $$
    28881
    $ ssh -c arcfour -MNnf -S /tmp/"$USER".28881.sock myuser@other.host
    

    This socket will be available until the SSH session is torn down or the socket is removed. Please remove it when it is no longer used. Note: due to a bug in the SSH client on ekman-rsync, you need to specify an explicit cipher (in this case arcfour) that is not MT-AES-CTR, or else forking to the background will cause SSH clients to hang (see http://www.psc.edu/networking/projects/hpn-ssh/). Another workaround is to omit -f and background the process from the shell instead.

    Log in to ekman, then create and submit a batch script that will transfer /REMOTE/PATH from other.host to the /PATH/TO/DIR folder on ekman-rsync:

    $ ssh -A -K -o GSSAPIDelegateCredentials=yes -o GSSAPIKeyExchange=yes -o GSSAPIAuthentication=yes username@ekman.pdc.kth.se
    $ cat testrun.sh
    #!/bin/bash
    ssh -K ekman-rsync << EOF
    module load beta-modules
    module load ffv
    ffv submit -S /tmp/"$USER".28881.sock -u myuser other.host:/REMOTE/PATH /PATH/TO/DIR
    EOF
    
    $ esubmit -n 1 -t 1 $PWD/testrun.sh
    

    Submit as many batch scripts as you want; as long as the socket is still alive, the ffv jobs will start without doing an initial SSH handshake (i.e. there is no interactive step). If your job data is on a local file system, you first have to copy it to a global file system on ekman before creating the ffv job.

    Note 1: you must enable forwarding of Kerberos tickets to ekman-rsync (-K) in the batch script, or else the ffv transfer will not work.

    Note 2: testrun.sh could also be written as:

    #!/bin/bash
    ssh -K ekman-rsync 'module load beta-modules; module load ffv; ffv submit -S /tmp/"$USER".28881.sock -u myuser other.host:/REMOTE/PATH /PATH/TO/DIR'
    
  4. Q: I think my ffv jobs have "frozen" in state verify: they have not shown any progress for an hour. What has happened?

    A: If the jobs contain very large datasets, the underlying rsync process will take a long time before showing any progress to ffv. Be patient; if something goes wrong, ffv will notify you. You can check whether anything is happening by running the trace command 'strace' on the underlying rsync process (see the output of ffv status JOBID).

    If you are sure that it really has "frozen" you can restart it with:

    $ ffv pause JOBID
    $ ffv resume JOBID
    

    Replace JOBID above with the real job id.

Download