* System updates
** Learning about the need to update
*** Watch vendor announcement lists
*** Watch general security lists
*** Be part of a community that can afford to watch more specific resources
** Do we care?
*** Triage
Sample scale:
1. Update later (scheduled, or together with more important stuff)
2. Update as soon as we get an update in our preferred format
3. Take care of it now, one way or the other!
*** Caution: "Important" for the vendor, "Critical" for you
Many security advisories are not tuned for an environment like ours, with
lots of general users who have shell access.
A "local root exploit" is critical to us!
*** Updating the triage
**** Generally, things get more broken as time goes by
**** Sometimes: lucky breaks in the other direction
"Ah, but the bug is only triggered if you use feature X"
"But this is an effective workaround"
** Problems getting updates
*** Waiting for repackagers
Example: CentOS
*** Waiting for turn-key solutions
Example: proprietary Linux + cluster software repackaging
*** Waiting for third-party dependencies
Example: Mellanox OFED, proprietary file system clients
*** Waiting for binary downloads
Example: a lot of big scientific packages
*** Figuring out stuff built from source at the site
Example: a lot of big scientific packages
** Solutions for getting updates
*** If it hurts, don't do it
Limit dependencies on turn-key solutions and third-party components as far
as possible.
Make vendors understand our pain. Again and again. Until they understand.
*** Prepare for building your own packages
Example: for RHEL/CentOS, set up a mock environment and do test builds of
likely/hard components (kernel, glibc, ...).
*** Document source builds so they can be redone by others
Really good documentation.
Automatic build scripts/systems may help...
... but not if they are too complicated ("Yeah, I wanted to rebuild
FooBarLab last week, but I do not understand WizzbangAutoBuilder, so...").
*** Make sure you have the credentials, licenses etc. needed to download vendor updates when you need them
** Workarounds
Can we test that a workaround works? Do we trust it?
*** Workarounds for the specific vulnerability
*** General workaround tactics
**** systemtap
Example: the recent perf bug
** How to deploy updates
*** Login nodes and system servers
**** Reboot not needed
Updates can often be applied right away.
Make sure enough stuff is restarted...
**** Reboot needed
Multiple login servers help. Multiple system servers help too (if the
redundancy actually works).
Otherwise, make sure a quick reboot does not cause user problems (lost
jobs, failed file operations, ...).
*** Compute Nodes
**** Reboot not needed
Updates may be applied right away, but...
... do you care about OS jitter? Can we do the work in the job epilog?
... are you sure the update persists? (more on this later)
**** Reboot needed
Automate rolling updates (rebooting each node as soon as its current job
is done)!
Keep users out of nodes they do not run jobs on, if you allow user login
to nodes.
Can we accept the risk that user X can still log in to node Y that is not
yet updated? If not, shut off user login access to nodes.
**** Node health checking
Please do this in the prolog/epilog. It might not be all that security
related, but it saves a lot of other trouble...
We can add a test for "security problem X fixed" if we want (see the
sketch below)...
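As an illustration of the kind of epilog test meant here: a minimal sketch
only, assuming Slurm and a made-up "fixed in" kernel version; not any
site's actual script.

#+BEGIN_SRC python
#!/usr/bin/env python
# Minimal sketch of an epilog-style node health test: check that the
# running kernel is at least a version known to contain the fix for
# "security problem X", and drain the node for a rolling reboot if not.
# Assumes Slurm (scontrol in PATH); the version number and drain reason
# are made-up examples, not tied to any real advisory.
import os
import subprocess

MIN_KERNEL = (3, 10, 0, 327)   # hypothetical "fixed in" kernel version

def running_kernel():
    # "3.10.0-327.el7.x86_64" -> (3, 10, 0, 327)
    release = os.uname()[2]
    numbers = []
    for part in release.replace('-', '.').split('.'):
        if part.isdigit():
            numbers.append(int(part))
        else:
            break
    return tuple(numbers)

def drain(node, reason):
    # Mark the node as drained so the scheduler stops placing new jobs
    # on it; the actual reboot is handled elsewhere.
    subprocess.call(['scontrol', 'update', 'NodeName=' + node,
                     'State=DRAIN', 'Reason=' + reason])

if __name__ == '__main__':
    node = os.uname()[1].split('.')[0]
    if running_kernel() < MIN_KERNEL:
        drain(node, 'security: kernel older than required, reboot pending')
#+END_SRC

A node drained this way can then be picked up by whatever mechanism does
the actual rolling reboot.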
** What If We Cannot Get Updates or Workarounds?
Decide beforehand when it will be appropriate to shut down access.
Get management acceptance.
*** Levels
1. No new logins accepted, but do not kick out those who are logged in
2. No logins accepted, logged-in users kicked out, jobs keep running
3. Jobs killed too?
*** Keep users and management informed
Without clear information, "We have temporarily blocked logins to Triolith
while we are investigating how to secure the system against a serious
security vulnerability that was suddenly released on the Internet today"
becomes "I dunno, I heard they shut down Triolith because it was cracked
or something."
* System configuration management
How do we keep the system configuration consistent on nodes and servers?
This is important for security, but also for functionality and performance.
** Special problems for compute nodes
*** Clean reinstall on every boot (diskless or even with node disks)?
Good: you always know what you have on the nodes.
Bad: the nodes are already zapped when you want to do forensics. Store
logs elsewhere?
*** How to install nodes (or node images)?
**** Scripted install (kickstart etc.) on every boot
**** Scripted install (kickstart etc.) when needed, reboot from disk otherwise
**** Scripted install (kickstart etc.) for making a node image, use that when booting
**** Manually installed node image, use that when booting
**** (and some other combinations)
*** Deployment at scale
Letting thousands of servers wget/rsync from a few system servers might
not work. Solutions: BitTorrent? Multicast?
** Configuration consistency on servers and compute nodes (or images)
How do we make sure we do not forget vital configuration?
*** Checklists
*** Scripts
*** Configuration management tools
Puppet, Chef, CFEngine, Bcfg2, ...
Warning: abstracting too much may make systems administration harder.
Strive for the right balance!
*** Version control on top of this
Now you also know how it looked four months ago!
** Logging what you have done
Still useful, even with configuration management tools (but may partly be
the commit log).
May be as simple as a date-ordered text file on the system server.
** Package your own tools, scripts etc.
Example: instead of some slightly different scripts in /root/bin on four
clusters, we might aim for a versioned RPM package on an internal repo
server, with the source in git.
Caveat: this should not be overdone...
* The waterfall model of trust
Have a clear picture of the direction trust flows in. For example:
OK: admin desktop -> system server -> compute node
OK: admin desktop -> system server -> login node
Bad: admin desktop -> login node -> system server
Bad: console at login node -> vital infrastructure server
Explain the reasoning to all system staff. Yes, it may be inconvenient at
times. Yes, it may save your cluster at intrusion time.
Enforce it using account filters, firewalls, etc. More on that may be said
in a later session.
** Separate servers
Keep the users on the login nodes. Keep system servers separate from login
nodes.
On large systems, divide the system servers further (virtual or physical)
if possible.
Example: file system servers may require kernel versions with known local
root exploits for weeks or forever. Restrict access to them!
** Separate credentials
Do not use the same password (or similar) for different levels in the
waterfall. Do not use the same root password on different clusters.
Goal: nothing you steal at a lower level should gain you access at a
higher level. Hard to get to 100%, though!
Example: only staff can log in to the system servers, but staff homedirs
are mounted on login nodes and compute nodes. If you get root on those,
you can become a staff member, change their .profile, and get to run code
on the system server (or boobytrap "ssh", "su", ...).
Be careful with SSO!
* Hardening
There are a lot of things we can do. Some may conflict with ease-of-use,
some not. Let's discuss some examples:
** SUID stripping
Example: NSC antisuid, with a whitelist and a blacklist.
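The rough idea, as a minimal sketch (this is not the actual antisuid
implementation; the whitelist and the scanned directory are made-up
examples):

#+BEGIN_SRC python
#!/usr/bin/env python
# Rough sketch of the whitelist idea: find setuid/setgid binaries and
# strip the bits from anything not explicitly whitelisted.
# Dry-run unless --strip is given.
import os
import stat
import sys

WHITELIST = {
    '/bin/su',
    '/usr/bin/passwd',
    '/usr/bin/sudo',
}
DRY_RUN = '--strip' not in sys.argv[1:]

def suid_files(top):
    # Yield regular files that have the setuid or setgid bit set.
    for dirpath, dirnames, filenames in os.walk(top):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)
            except OSError:
                continue
            if (stat.S_ISREG(st.st_mode) and
                    st.st_mode & (stat.S_ISUID | stat.S_ISGID)):
                yield path, st.st_mode

for path, mode in suid_files('/usr'):
    if path in WHITELIST:
        continue
    print('not whitelisted: %s' % path)
    if not DRY_RUN:
        # Drop the setuid/setgid bits but keep the rest of the mode.
        os.chmod(path, stat.S_IMODE(mode) & ~(stat.S_ISUID | stat.S_ISGID))
#+END_SRC

Adding the blacklist mentioned above works the same way.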
** File system flags
*** read-only mounting
*** nosuid, nodev
*** root squash
*** atime - performance vs. forensic abilities
** Module loading
Can we turn it off completely? Can we at least turn off autoloading of
modules?
** Security frameworks/patches: SELinux and similar
Any success stories?
Introducing third-party dependencies may make updating harder.
* Help the users proactively
There are a lot of things we can do. We should not forbid ourselves from
doing them in our policies...
We defer log analysis, forensics etc. to a later session.
Examples of things you can do periodically (a combined example sketch is
at the end of these notes):
** Check for really bad filesystem permissions
** Check for bad SSH usage
*** Unencrypted private keys
*** Bad keys in authorized_keys (remember the Debian Debacle?)
* Preparing for monitoring, intrusion detection and forensics
I will let the people responsible for the later tracks pick up this thread :-)
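* Appendix: example sketch for the proactive user checks
The combined example referred to under "Help the users proactively" above.
A minimal sketch only: it assumes home directories under /home and only
recognizes classic PEM-format private keys.

#+BEGIN_SRC python
#!/usr/bin/env python
# Flag world-writable files in home directories and apparently
# unencrypted private keys under ~/.ssh. Only classic PEM-format keys
# are tested; keys in the newer OpenSSH format need a different check.
import os
import stat

HOME_ROOT = '/home'   # assumed home directory layout

def world_writable(st):
    return stat.S_ISREG(st.st_mode) and st.st_mode & stat.S_IWOTH

def unencrypted_pem_key(path):
    # Encrypted classic PEM keys carry a "Proc-Type: 4,ENCRYPTED" header.
    try:
        with open(path, 'rb') as f:
            head = f.read(4096)
    except (IOError, OSError):
        return False
    if b'BEGIN OPENSSH PRIVATE KEY' in head:
        return False          # newer format, needs a different test
    return b'PRIVATE KEY-----' in head and b'ENCRYPTED' not in head

for user in sorted(os.listdir(HOME_ROOT)):
    home = os.path.join(HOME_ROOT, user)
    for dirpath, dirnames, filenames in os.walk(home):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)
            except OSError:
                continue
            if world_writable(st):
                print('world-writable: %s' % path)
            if (os.path.basename(dirpath) == '.ssh' and
                    unencrypted_pem_key(path)):
                print('possibly unencrypted key: %s' % path)
#+END_SRC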