Shared File System
GPFS has been decommissioned!
Please see our page on the new filesystem: Lustre
The Steno cluster uses the General Parallel File System (GPFS) from IBM as the shared file system between all the nodes. Currently there is one GPFS file system for storing user home directories.
Backup
Backup is a best-effort, free-of-charge service. We only back up this file system once a month.
Quota
Each group has a limited amount of space, which is enforced by the file system. If this limit is exceeded, no one in the group
will be able to write new data.
You can check the current usage with:
% mmlsquota -g <your group> users
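If you do not know your group name, you can let the shell fill it in (assuming id is available and your primary group is the one holding the quota):
% mmlsquota -g $(id -gn) users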
Local scratch
Jobs can use the local scratch space on the nodes to optimize execution. Each job has its
own directory on each node. The path is:
$SCRATCH
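For example, a job script can simply change into this directory before doing any file-heavy work:
cd $SCRATCH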
Best practice
The shared file system is a limited resource, and you might disrupt other users if you do not
follow these guidelines. If your jobs cannot be serviced
by the available resources on the shared file system, your group needs to make new investments.
Optimizing a job's use of the file system can be an annoying task. However, in most cases it will make your job run faster.
A basic example
Let us look at a very common example of computation. We have a program that reads from an input file and writes its output to another file:
./program input output
Obviously, there are many special cases for specific jobs, but you should be able to use the example as a template.
The program reads a small piece of the input file, computes on it, and produces a small piece of output.
It repeats this procedure until it reaches the end of the input file. This generates many small
I/O operations, which could be avoided by reading and writing large chunks instead.
Often it is not possible to change the program to do the right thing. This is where the local scratch file system comes to the rescue:
cp input $SCRATCH
./program $SCRATCH/input $SCRATCH/output
cp $SCRATCH/output .
This way you copy the input file with a minimum of I/O operations. The shared file system is not touched during
the computation, and the output is copied back to it with a minimum of I/O operations.
Chances are that your input and output files will fit in the Linux page cache, which will speed up your job even further.
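A slightly more defensive variant of the same staging pattern (a sketch; the early abort is our suggestion, not a site requirement):
#!/bin/sh
set -e                                   # stop immediately if a copy or the program fails
cp input $SCRATCH
./program $SCRATCH/input $SCRATCH/output
cp $SCRATCH/output .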
When to copy to local scratch
It is not easy to determine when it makes sense to copy to local scratch. Depending on the load on the shared
file system, it will take between 10 seconds and a few minutes to copy a 1 GB file.
In most cases it does not hurt to do the copy. However, if the job only needs small parts of a big file, it might not
make sense to copy it all. Then again, if you are reading from it at random positions, copying might be the way to go.
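If in doubt, you can time the copy once and weigh it against the time your job spends on I/O:
% time cp input $SCRATCH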
stdout and stderr
The standard output and error streams are usually stored on the shared file system. If your program generates a lot of
output on these streams, please redirect them to local scratch.
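A sketch of such a redirection, reusing the staging example from above (the log file names are only illustrative):
./program $SCRATCH/input $SCRATCH/output > $SCRATCH/out.log 2> $SCRATCH/err.log
cp $SCRATCH/out.log $SCRATCH/err.log .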
GPFS Performance
The GPFS file system is built for performance. While we are able to get very impressive peak performance numbers from the current setup, there are severe limitations which users should be aware of and try to avoid where possible:
- GPFS does not like small files. A lot of small files and deep directory structures require GPFS to cache a lot of data, which in turn takes up quite a lot of memory on the nodes. It also requires massive amounts of metadata network traffic (see the sketch after this list).
- Multiple threads writing to the same GPFS directory structure seems to decrease performance dramatically. This appears to be due to locking and only happens when a significant number of threads are doing it simultaneously.
- The file systems are limited by the Gigabit Ethernet infrastructure between the nodes. This is especially a problem on the frontend machines where a large number of users are generally working at the same time.
- GPFS does not like a lot of open files or recursive find runs through the file system. Please limit both where possible.
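One way to keep many small files and deep directory trees off GPFS is to store them as a single archive on the shared file system and unpack it on local scratch at job start. A sketch, assuming the data set has been packed into dataset.tar beforehand:
cp dataset.tar $SCRATCH
tar -xf $SCRATCH/dataset.tar -C $SCRATCH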