Shared File System

GPFS has been decommissioned!

Please see our page on the new filesystem: Lustre

The Steno cluster uses the General Parallel File System (GPFS) from IBM as the shared file system between all the nodes. Currently there is one GPFS file system for storing user home directories.

Backup

Backup is a best-effort, free-of-charge service. We only back up this file system once a month.

Quota

Each group has a limited amount of space, which is enforced by the file system. If this limit is exceeded, no member of the group will be able to write new data.

You can check the current use with:

% mmlsquota -g <your group> users
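
If you are unsure which group to query, the primary group of your account is usually the right one (a minimal sketch, assuming the quota is registered under that group):

% mmlsquota -g $(id -gn) users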

Local scratch

Jobs can use the local scratch space on the nodes to optimize execution. Each job has its own directory on each node. The path is:

$SCRATCH
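
For example, a job script can print the per-job directory and check how much space is available on the local disk (a minimal sketch):

echo "Local scratch for this job: $SCRATCH"
df -h $SCRATCH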

Best practice

The shared file system is a limited resource, and you might disrupt other users if you do not follow these guidelines. If your jobs cannot be serviced by the available resources on the shared file system, your group needs to invest in additional capacity.

Optimizing a job's use of the file system can be an annoying task. However, in most cases it will make your job run faster.

A basic example

Let's look at a very common example of a computation. We have a program that reads from an input file and produces output to another file:

./program input output

Obviously, there are many special cases for specific jobs, but you should be able to use the example as a template.

The program reads a small piece of the input file, computes on it, and writes a small piece of output. It repeats this procedure until it reaches the end of the input file. This generates many small I/O operations, which could be avoided by reading and writing large chunks instead.
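
As a rough illustration (the file names are placeholders), a shell loop that copies line by line issues one small write per line, whereas a single copy moves the same data in large chunks:

# Many small writes, one per line of input:
while read line; do
    echo "$line" >> output
done < input

# The same data moved in large chunks:
cp input output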

Often it is not possible to change the program to do the right thing. This is where the local scratch file system comes to the rescue:

cp input $SCRATCH

./program $SCRATCH/input $SCRATCH/output

cp $SCRATCH/output .

This way the input file is copied with a minimum of I/O operations, the shared file system is not touched during the computation, and the output is copied back with equally few operations.

Chances are that your input and output files will fit in the Linux page cache. This will speed up your job even further.
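
Putting the pieces together, a job script could look roughly like this (a minimal sketch; the program name and input file are placeholders for your own job):

#!/bin/bash
# Abort immediately if any step fails, so a failed copy does not
# lead to computing on missing data.
set -e

# Stage the input on node-local scratch.
cp input $SCRATCH/

# Run the program entirely against local scratch.
./program $SCRATCH/input $SCRATCH/output

# Copy the result back to the shared file system in one operation.
cp $SCRATCH/output .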

When to copy to local scratch

It is not easy to determine when it makes sense to copy to local scratch. Depending on the load on the shared file system, copying a 1 GB file can take anywhere from ten seconds to a few minutes.

In most cases it does not hurt to do the copy. However, if the job only needs small parts of a big file it might not make sense to copy it all. Then again, if you are reading from it at random positions, copying might be the way to go.
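
If in doubt, you can measure the copy as part of the job itself (the file name is a placeholder):

# Report how long staging the input takes; the timing is written
# to the job's standard error.
time cp big-input-file $SCRATCH/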

stdout and stderr

The standard output and error streams are usually stored on the shared file system. If your program writes a lot of output to them, please redirect it to local scratch.
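
For example, assuming the actual results go to a separate output file, the log output can be redirected like this:

# Send both stdout and stderr to a log file on local scratch
# instead of the shared file system.
./program $SCRATCH/input $SCRATCH/output > $SCRATCH/program.log 2>&1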

GPFS Performance

The GPFS file system is built for performance. And while we are able to get very impressive peak performance numbers from the current setup, there are severe limitations which users should be aware of and try to avoid where possible.

  • GPFS does not like small files. A lot of small files and deep directory structures require GPFS to cache a lot of data, which in turn takes up quite a lot of memory on the nodes. It also generates massive amounts of metadata network traffic. If your jobs produce many small files, consider packing them into a single archive on local scratch first (see the sketch after this list).
  • Multiple threads writing to the same GPFS directory structure seem to decrease performance dramatically. This appears to be due to locking and only happens when a significant number of threads write simultaneously.
  • The file systems are limited by the Gigabit Ethernet infrastructure between the nodes. This is especially a problem on the frontend machines where a large number of users are generally working at the same time.
  • GPFS does not like a lot of open files or recursive find commands through the file system. Please limit both where possible.
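
One way to avoid flooding GPFS with many small files is to let the job create them on local scratch and copy them back as a single archive (a minimal sketch; the output option and directory names are placeholders):

# Let the program write its many small files to local scratch.
./program --output-dir $SCRATCH/results

# Pack them into a single archive and place only that one file on GPFS.
tar -czf results.tar.gz -C $SCRATCH results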