Overview of Publisher

The Publisher system in its current form allows you to make data on Bi/Accumulus available to users anywhere in the world.

The published data is a read-only copy of the original data. Published data cannot be changed, only deleted (automatically after a certain time, or manually by the person who published it). Published data is not updated when the original data changes.

You can publish data that is stored on any of the shared filesystems on Bi (e.g. /home and /nobackup/*, but not /scratch/local).

You always publish a directory tree with all its contents. If you need to publish a single file, create an empty directory and put the file in it, then publish the directory.
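
As a minimal sketch of the single-file case, the helper below wraps one file in a new directory so that the directory can be published. The function name wrap_for_publish and the file and directory names in the example are hypothetical; only the pcmd invocation shown in the comment follows this guide.

```shell
# Hypothetical helper: copy one file into a new, empty directory
# so that the directory can be handed to pcmd.
wrap_for_publish() {
    file=$1
    dir=$2
    # Refuse to reuse an existing directory, since everything in it
    # would be published along with the file.
    if [ -e "$dir" ]; then
        echo "error: $dir already exists" >&2
        return 1
    fi
    mkdir -p "$dir" && cp "$file" "$dir/"
}

# Example (file and directory names are placeholders):
#   wrap_for_publish results.nc results_pub && pcmd results_pub tmp_rossby
```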

The current system has a capacity of approximately 20TiB of published data (shared between all users, no quota).

The Publisher system is connected to the Internet and to NSC systems with a 1Gbps network (so the maximum combined in/out transfer speed will be ~100MB/s).
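
To put the ~100MB/s figure in perspective, the rough sketch below estimates the best-case transfer time for a dataset. The function name is hypothetical, and the estimate assumes that the full ~100MB/s is available to a single transfer, which will not hold when several transfers are running at once.

```shell
# Hypothetical helper: best-case transfer time in seconds for a dataset
# of the given size in MB, assuming the ~100MB/s figure quoted above.
estimate_transfer_seconds() {
    size_mb=$1
    rate_mb_per_s=100
    # Round up so that small datasets do not report 0 seconds.
    echo $(( (size_mb + rate_mb_per_s - 1) / rate_mb_per_s ))
}

# A 1TB dataset (1000000 MB) needs at least 10000 seconds,
# i.e. just under 3 hours.
estimate_transfer_seconds 1000000
```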

Quickstart

  • Put the data you want to publish into a directory. In this example we use the directory "mytestdata".
  • Choose the publication area to use (see table of available areas below). In this example we use the area "tmp_rossby".
  • Run pcmd mytestdata tmp_rossby. Sample output:
[sm_mkola@analys1 ~]$ pcmd mytestdata tmp_rossby
Checking dataset......
Generating sha1sum......
data
     f9910632ba63c554ee7ba95c4eb8f0618e4bd986
Checking dataset file sizes  --> OK

Publication created with ID: tmp_rossby.74
Export url: http://exporter.nsc.liu.se/b7b00058ad424381909938b0492ffb28, rsync://exporter.nsc.liu.se/b7b00058ad424381909938b0492ffb28

[sm_mkola@analys1 ~]$ 
  • Note the "Export URL" output. This is the address you should send to the recipient of the data.
  • Run "pcmd -j" until the job is no longer listed.
  • Access the published data using http or rsync. E.g open http://exporter.nsc.liu.se/b7b00058ad424381909938b0492ffb28 in your browser or tell your rsync client to download rsync://exporter.nsc.liu.se/b7b00058ad424381909938b0492ffb28
  • If you lose the export URL, you can always list all your published datasets using "pcmd -qv".
  • You can delete published data using pcmd -r DATASET_ID, where DATASET_ID is the identifier listed by e.g "pcmd -qv" or the "Publication ID" given when the data was published (e.g "tmp_rossby.74"). Some areas are configured to automatically delete datasets after a certain time.
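
The waiting step can be automated. This is a minimal sketch, assuming that "pcmd -j" prints each active job on its own line starting with the publication ID, as in the sample output later in this guide. The function name and the 30-second polling interval are arbitrary choices and not part of Publisher.

```shell
# Hypothetical helper: block until the given publication ID no longer
# appears in the "pcmd -j" job list.
wait_for_publication() {
    id=$1
    interval=${2:-30}   # seconds between polls (arbitrary default)
    # Assumption: active jobs are listed one per line, with the
    # publication ID as the first whitespace-separated field.
    while pcmd -j | awk -v id="$id" '$1 == id { found = 1 } END { exit !found }'; do
        sleep "$interval"
    done
    echo "$id is no longer in the job queue"
}

# Example:
#   wait_for_publication tmp_rossby.74
```

A job that has left the queue has not necessarily succeeded, so it is still worth confirming with "pcmd -qv" that the dataset is listed before deleting the original data.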

What happens when you publish data?

  1. You run "pcmd" on the Bi login node. You need to tell pcmd what directory you want to publish, and to what "publishing area" (e.g "tmp_rossby") you want to publish it. ("pcmd -h" will show all options available when using pcmd)
  2. The system verifies that you are allowed to publish data to the selected publishing area.
  3. pcmd creates a file containing the SHA1 checksum of all files that will be published. This checksum is used by the Publisher system to verify that data was correctly transferred and can also be used by the end-user who downloads published data to verify that all files are intact. (This step is optional, but most publishing areas will use checksumming.)
  4. "pcmd" queues a transfer job. You can check the status of all ongoing transfers using the command "pcmd -j".
  5. Note: the pcmd command will exit as soon as it has performed its checks and created the checksum file. At this point the data is not yet published! Do not delete the data until it has been successfully published (see below).
  6. The publisher system transfers the data to the export server using rsync.
  7. When all data has been transferred and the checksum has been verified (optional), the export server makes the data available to external users over one or more of the supported protocols (currently http and rsync).
  8. The transfer job is removed (no longer visible when you run "pcmd -j").
  9. You can check all your published data sets using "pcmd -qv".
  10. Remember that you need to notify the persons who will be downloading the data that data is available and what address (URL) to use, e.g http://exporter.nsc.liu.se/b7b00058ad424381909938b0492ffb28 for HTTP download or rsync://exporter.nsc.liu.se/b7b00058ad424381909938b0492ffb28 for rsync download.
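
The checksum step can be illustrated with standard tools. The sketch below is not pcmd's actual implementation; it assumes only what the download examples in this guide show, namely that the checksum file is named CHKSUM.SHA1 and is in a format that "sha1sum --check" accepts. The function name is hypothetical.

```shell
# Hypothetical helper: write SHA1 checksums for every regular file under
# a directory into CHKSUM.SHA1 at the top level of that directory.
make_checksums() {
    (
        cd "$1" || exit 1
        # Skip the checksum file itself; paths are stored relative
        # to the top of the directory tree.
        find . -type f ! -name CHKSUM.SHA1 -exec sha1sum {} + > CHKSUM.SHA1
    )
}

# Example: create the file, then verify it the same way a recipient would.
#   make_checksums mytestdata
#   (cd mytestdata && sha1sum --check CHKSUM.SHA1)
```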

Publication areas

  • Publishing to a publication area is always restricted - only members of a certain group can do it.
  • A publication area can be configured to delete published data after a certain time, or not at all. Data can always be manually deleted by the person that published it.
  • The URL can be either in the "public" format, where you choose a suitable name (e.g. http://exporter.nsc.liu.se/rossby/anamethatyouchoose), or in the "secret" format, e.g. http://exporter.nsc.liu.se/c690d383de8940308bf3c9f9cbd6e132. The advantage of the secret format is that it is very hard to find the address of data you are not supposed to have access to by guessing or brute force. However, this is still a weak security mechanism: anyone who finds out the URL (e.g. by checking your browser history or snooping on your network) can download the data.
  • A publication area can be accessible via http and/or rsync (note: the native rsync protocol on port tcp/873 is used, not rsync over SSH).
  • A publication area can be configured to limit the type of data that can be published on it. This feature is currently very limited; only these checks are supported:
    • minSize - all files must be bigger than N bytes
    • maxSize - all files must be smaller than N MB
    • netCDF - all files must be netCDF files (only the file suffix is checked)
    • README - all published datasets must contain a file named README
  • If you need a new publishing area, please contact smhi-support@nsc.liu.se to discuss this.
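
Before publishing, a dataset can be checked locally against limits of this kind. The following is a rough sketch only: the function name is hypothetical, both size limits are passed in by the caller in bytes rather than read from Publisher, the README is assumed to be expected at the top level of the dataset, and the real checks performed by pcmd may differ in detail.

```shell
# Hypothetical pre-flight check for the minSize, maxSize and README rules.
# Usage: preflight_check DIRECTORY MIN_BYTES MAX_BYTES
preflight_check() {
    dir=$1
    min_bytes=$2
    max_bytes=$3
    status=0
    # README rule (assumption: the file is expected at the top level).
    if [ ! -f "$dir/README" ]; then
        echo "no README in $dir" >&2
        status=1
    fi
    # minSize rule: report files that are not bigger than min_bytes.
    too_small=$(find "$dir" -type f -size "-$((min_bytes + 1))c")
    if [ -n "$too_small" ]; then
        echo "too small: $too_small" >&2
        status=1
    fi
    # maxSize rule: report files that are not smaller than max_bytes.
    too_big=$(find "$dir" -type f -size "+$((max_bytes - 1))c")
    if [ -n "$too_big" ]; then
        echo "too big: $too_big" >&2
        status=1
    fi
    return "$status"
}

# Example (limits are placeholders):
#   preflight_check mytestdata 0 1000000000000 && pcmd mytestdata tmp_rossby
```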

Available publication areas

You can always see the current list of publication areas using the command pcmd -l. The list below is not guaranteed to be up to date.

Name         Unix groups allowed  Protocols   URL type              Datasets automatically  Limits
             to publish                                             deleted after (days)
-----------  -------------------  ----------  --------------------  ----------------------  -----------------
tmp_foua     sm_foua              http,rsync  secret url            30                      max file size 1TB
tmp_foul     sm_foul              http,rsync  secret url            30                      max file size 1TB
tmp_fouo     sm_fouo              http,rsync  secret url            30                      max file size 1TB
tmp_foup     sm_foup              http,rsync  secret url            30                      max file size 1TB
tmp_bpom     sm_bpom              http,rsync  secret url            30                      max file size 1TB
tmp_ml       sm_ml                http,rsync  secret url            30                      max file size 1TB
tmp_mo       sm_mo                http,rsync  secret url            30                      max file size 1TB
tmp_misu     misu                 http,rsync  secret url            30                      max file size 1TB
tmp_rossby   rossby               http,rsync  secret url            30                      max file size 1TB
tmp_kthmech  kthmech              http,rsync  secret url            30                      max file size 1TB
tmp_miuu     miuu                 http,rsync  secret url            30                      max file size 1TB
rossby_sc    roadmin              http,rsync  secret url            no                      max file size 1TB
rossby_pr    roadmin              http,rsync  user-selectable name  7                       max file size 1TB

Using pcmd

Getting help:

[x_makro@analys1 ~]$ pcmd -h
usage:  pcmd publPath publArea or
        pcmd [options]

With no options, pcmd will publish publication at 'publPath' to 'publArea'.
With options, publPath and publArea should be omitted. Please see the
Publisher User Guide for more information:
http://www.nsc.liu.se/systems/publisher/

options:
  -h, --help            show this help message and exit
  -n, --version         Shows version
  -v, --verbose         Shows verbose info
  -s, --status          Shows the status of the supplied publicationId
  -j, --listjobs        Lists the ongoing jobs
  -p, --poll            Shows the status of published files
  -l, --list            Lists all publication areas
  -a, --area            Displays the area  definition of the supplied areaname
  -q, --query           Query datasets
  -u USER, --user=USER  Specifies the user to query (used together with -q)
  -d DATE, --date=DATE  Specifies the date to query (YYMMDD or YYMMDD-YYMMDD)
                        (used together with -q)
  -r, --remove          Removes a publication
[x_makro@analys1 ~]$ 

Exporting a directory:

[x_makro@analys1 ~]$ pcmd -l
Available publication areas
------------------------------------------------------------------------------
Name          Prot         Auth        Days   Url                               
tmp_misu      http, rsync  secret url  30     http://exporter.nsc.liu.se        
tmp_nsc       http, rsync  secret url  30     http://exporter.nsc.liu.se        
tmp_mo        http, rsync  secret url  30     http://exporter.nsc.liu.se        
publtest-1... http, rsync  secret url  1      http://exporter.nsc.liu.se        
tmp_rossby    http, rsync  secret url  30     http://exporter.nsc.liu.se        
publtest      http, rsync  None        0      http://exporter.nsc.liu.se        
tmp_foup      http, rsync  secret url  30     http://exporter.nsc.liu.se        
system        http, rsync  None        1      http://exporter.nsc.liu.se        
tmp_foua      http, rsync  secret url  30     http://exporter.nsc.liu.se        
tmp_foul      http, rsync  secret url  30     http://exporter.nsc.liu.se        
tmp_fouo      http, rsync  secret url  30     http://exporter.nsc.liu.se        
tmp_ml        http, rsync  secret url  30     http://exporter.nsc.liu.se        
bigtest       http, rsync  None        0      http://exporter.nsc.liu.se        
tmp_miuu      http, rsync  secret url  30     http://exporter.nsc.liu.se        
tmp_bpom      http, rsync  secret url  30     http://exporter.nsc.liu.se        
rossby_sc     http, rsync  secret url  0      http://exporter.nsc.liu.se        
rossby_pr     http, rsync  None        7      http://exporter.nsc.liu.se        
tmp_kthmech   http, rsync  secret url  30     http://exporter.nsc.liu.se        
nsctest-1d... http, rsync  secret url  1      http://exporter.nsc.liu.se        

[x_makro@analys1 ~]$ pcmd mydata tmp_misu
Checking dataset......
Generating sha1sum......
file1
        82251aabadb525ee709a4a04a30c0e07448ea314
bigfile1
        b1260761b0c32c95edd0f0f8d95322ceae96d0e7
dir1/file2
        9927b517a5710aa8bf6a9fbfba76e6722114f2f5
dir1/file3
        06b82608e19bfb693afe56730db4103dc987b076
Checking dataset file sizes  --> OK

Publication created with ID: tmp_misu.675
Export url: http://exporter.nsc.liu.se/f7b70d36e0504a2b93a9cf55ed10581e, rsync://exporter.nsc.liu.se/f7b70d36e0504a2b93a9cf55ed10581e

[x_makro@analys1 ~]$ pcmd -j
Active jobs
----------------------------------------------------------------------
tmp_misu.675   In transfer                 vagn:/home/x_makro/mydata
[x_makro@analys1 ~]$ pcmd -j
Active jobs
----------------------------------------------------------------------
tmp_misu.675   In transfer                 vagn:/home/x_makro/mydata
[x_makro@analys1 ~]$ pcmd -j
Active jobs
----------------------------------------------------------------------
tmp_misu.675   Performing checksum test    vagn:/home/x_makro/mydata
[x_makro@analys1 ~]$ pcmd -j
Active jobs
----------------------------------------------------------------------
tmp_misu.675   Exporting                   vagn:/home/x_makro/mydata
[x_makro@analys1 ~]$ pcmd -j
Active jobs
----------------------------------------------------------------------
[x_makro@analys1 ~]$ 

Checking all your published data:

[x_makro@analys1 ~]$ pcmd -q
Available Publications
----------------------------

tmp_misu.675 ---------------------------------------------------------
http://exporter.nsc.liu.se/f7b70d36e0504a2b93a9cf55... Exported                 

[x_makro@analys1 ~]$ 

[x_makro@analys1 ~]$ pcmd -q -v
Available Publications
----------------------------

tmp_misu.675 ---------------------------------------------------------
status    Exported                                                              
area      tmp_misu                                                              
url       http://exporter.nsc.liu.se/f7b70d36e0504a2b93a9cf55ed10581e, rsync://exporter.nsc.liu.se/f7b70d36e0504a2b93a9cf55ed10581e
prot      http, rsync                                                           
auth      secret url                                                            
source    vagn:/home/x_makro/mydata                                             
user      x_makro                                                               
time      Wed Mar 13 16:11:26 2013                                              

Download and verify exported data (from anywhere in the world) using rsync:

kronberg@ming ~/tmp $ rsync -av rsync://exporter.nsc.liu.se/f7b70d36e0504a2b93a9cf55ed10581e ./mydata_downloaded

receiving incremental file list
created directory ./mydata_downloaded
./
CHKSUM.SHA1
bigfile1
file1
dir1/
dir1/file2
dir1/file3

sent 156 bytes  received 209741472 bytes  27965550.40 bytes/sec
total size is 209715492  speedup is 1.00

kronberg@ming ~/tmp $ (cd mydata_downloaded && sha1sum --check CHKSUM.SHA1)
dir1/file3: OK
bigfile1: OK
file1: OK
dir1/file2: OK
kronberg@ming ~/tmp $ 

Download a single file using http:

kronberg@ming ~/tmp $ wget -q http://exporter.nsc.liu.se/f7b70d36e0504a2b93a9cf55ed10581e/bigfile1
kronberg@ming ~/tmp $ ls -l bigfile1
-rw-rw---- 1 kronberg kronberg 209715200 Mar 13 16:01 bigfile1

Recursively download and verify a dataset (from anywhere in the world) using HTTP:

kronberg@ming ~/tmp $ wget -q -e robots=off -r --no-host-directories --no-parent http://exporter.nsc.liu.se/f7b70d36e0504a2b93a9cf55ed10581e/
kronberg@ming ~/tmp $ (cd f7b70d36e0504a2b93a9cf55ed10581e && sha1sum --check CHKSUM.SHA1)
dir1/file3: OK
bigfile1: OK
file1: OK
dir1/file2: OK
kronberg@ming ~/tmp $ 

Deleting data

[x_makro@analys1 ~]$ pcmd -q
Available Publications
----------------------------

tmp_misu.675 ---------------------------------------------------------
http://exporter.nsc.liu.se/f7b70d36e0504a2b93a9cf55... Exported                 

[x_makro@analys1 ~]$ pcmd -r tmp_misu.675
Dataset tmp_misu.675 put on queue for deletion
[x_makro@analys1 ~]$ pcmd -q
Available Publications
----------------------------

tmp_misu.675 ---------------------------------------------------------
http://exporter.nsc.liu.se/f7b70d36e0504a2b93a9cf55... Deleted                  

[x_makro@analys1 ~]$ 

Limitations and unexpected behaviour

Permitted file types

A dataset may only contain regular files and directories. If the directory tree (dataset) that you try to publish contains any other type of data, such as symbolic links or sockets, pcmd will display an error message and exit.

This is a design choice and not a bug or technical limitation.
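
To find offending entries before running pcmd, a check along these lines can be used. It is a minimal sketch built on the standard find utility; the function name is hypothetical and the exact test that pcmd applies is not documented here.

```shell
# Hypothetical helper: list everything under a directory that is neither
# a regular file nor a directory (symbolic links, sockets, FIFOs, ...).
# find does not follow symbolic links by default, so links are reported
# as links regardless of what they point to.
find_unpublishable() {
    find "$1" ! -type f ! -type d
}

# Example: no output means the tree only contains files and directories.
#   find_unpublishable mytestdata
```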

Deleting published data

Published data can be deleted using "pcmd -r". When a dataset is deleted, it is no longer accessible to users.

However, information about the published dataset is retained in the database and can be displayed using e.g. "pcmd -qv" (the data set is displayed as "Deleted").

The URL used by a deleted data set cannot be reused for another data set (i.e. you cannot publish some data as http://server/area/my-latest-data and then replace it with updated data next week).

Publishing files that are not readable by "other"

Note: this is a bug or undocumented behaviour in Publisher and will change in a future version.

When you publish a directory tree, the permissions of the files are copied along with the files. If you export files or directories that are only accessible by "user" or "group" they will not be accessible after having been exported.

Workaround: make sure that all files and directories are accessible to "other" before publishing them, e.g by running chmod -R o+rX <DIRECTORY>
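
To see which entries would be affected before applying the workaround, the tree can be scanned with find. This is a minimal sketch; the function name is hypothetical, and it assumes that files need the "other" read bit while directories need both the read and the search bit.

```shell
# Hypothetical helper: list files that are not readable by "other" and
# directories that are not both readable and searchable by "other".
find_not_world_readable() {
    find "$1" \( -type f ! -perm -o=r \) -o \( -type d ! -perm -o=rx \)
}

# Example: no output means nothing needs to be fixed.
#   find_not_world_readable mytestdata
```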

Availability

Publisher is not designed to be a high-availability system. It can be considered to be approximately as reliable as an NSC cluster login node (e.g Gimle).

In practice, this means:

  • Publisher will be unavailable for a few minutes when operating system updates are applied (might happen anywhere from once a week to a few times per year)
  • Publisher will be unavailable for a few minutes when we add, remove or modify publication areas (might happen a few times per year).
  • Publisher might be unavailable from a few hours to a day or two if a major hardware problem occurs (there is only one Publisher server; if it fails we have to move the service to another server or repair the broken one).

Published, non-deleted datasets are backed up daily to tape for disaster recovery purposes.

The Publisher internal database that keeps track of all metadata is backed up to disk hourly and to tape daily.

If this level of availability is not enough for your needs, store your data elsewhere, or contact NSC to discuss how we can improve Publisher.

How to get help

If you need help using Publisher, if something does not work as expected, or if you have any other questions, please send an email to the normal support address.
