Troubleshooting full filesystems where df and du disagree
We have a recurring problem with disk space being exhausted on the root filesystem of a system. The root cause is gnome-terminal holding open file-handles to very large deleted temporary files in /tmp. I suspect there is a bug in gnome-terminal that leaves the handles to its scrollback buffer open (possibly only when set to unlimited scrollback, as some users have).
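The oddity with this is that df will show the filesystem as 100% full with, e.g., 31GB used, yet du -x will only be able to account for about half of the total, suggesting that only 16GB of files exist on the disk. The missing space belongs to files that have been deleted (so du cannot find them by walking the directory tree) but are still held open by a process (so the kernel cannot free their blocks).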
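To put a number on the discrepancy, the two tools can be compared directly. The following is a minimal sketch, assuming a POSIX shell and versions of df and du that accept the POSIX options -P, -k, -x and -s. The function name unaccounted_kb is hypothetical, chosen for illustration.

```shell
# Hypothetical helper: print how many KiB df counts as used on a
# filesystem that du cannot find by walking the directory tree.
unaccounted_kb() {
    mount_point="${1:-/}"
    # df -P forces one line per filesystem; with -k, column 3 is used space in KiB.
    df_used=$(df -Pk "$mount_point" | awk 'NR==2 {print $3}')
    # du -x stays on one filesystem, -s prints only the total, -k reports KiB.
    du_used=$(du -xsk "$mount_point" 2>/dev/null | awk '{print $1}')
    echo $((df_used - du_used))
}
```

Running unaccounted_kb / as root (so that du can read every directory) gives a rough figure for the space held by files du cannot see. A large positive result is consistent with open deleted files, though other causes, such as files hidden underneath a mount point, can also produce a gap.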
Diagnosing the stale file-handles requires breaking out lsof and grepping for open deleted files, like this:
lsof | grep ' (deleted)$'
The output tells you the following information (in column order):
- process name
- process id (PID)
- username
- file descriptor number and mode (‘r’ for read, ‘w’ for write and ‘u’ for read and write)
- type (usually ‘REG’ for regular file in this case)
- device numbers separated by commas
- size
- node number (the inode number for local files)
- name of the file
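Since size is the seventh column, the worst offenders can be brought to the top by sorting on it. This is only a sketch: it assumes the nine-column layout listed above (some lsof versions print extra columns for threads, which would shift the positions), and largest_deleted is a hypothetical name.

```shell
# Hypothetical helper: read lsof output on stdin and print only the
# open-but-deleted files, largest first.
largest_deleted() {
    # Keep lines ending in " (deleted)", then sort numerically,
    # in descending order, on column 7 (size).
    grep ' (deleted)$' | sort -k7,7 -n -r
}
```

A typical invocation would be lsof | largest_deleted | head, which shows the ten largest deleted files still held open.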
Once the offending files have been located, recovery can be effected by several methods - in increasing order of finesse (the last two were christened ‘axe’ and ‘scalpel’ by a colleague):
- sledgehammer - reboot the machine. Will close and reset all file handles.
- axe - kill (-9 if necessary) the offending process. Will probably annoy the user whose process is holding open the files but, in the case of gnome-terminal at least, very effectively releases the space.
- scalpel - use truncate to shrink the deleted file to zero size through its open file-handle. This may destabilise the program if it has internally cached information about the file-size and does not have very robust file access code. The basic form of the command is truncate -s0 /proc/$PID/fd/$FD where $PID and $FD can be found from columns 2 and 4 of the lsof output. Use only the numeric part of column 4, dropping the trailing mode letter.
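The scalpel can be wrapped in a small function that accepts the values exactly as lsof prints them. This is a minimal sketch, assuming Linux (for /proc), GNU truncate and a readlink command. The name truncate_fd is hypothetical, and the check for " (deleted)" is an added safety measure, not part of the basic command above.

```shell
# Hypothetical helper: truncate the deleted file behind one open descriptor.
# $1 = PID (lsof column 2), $2 = FD as lsof prints it (column 4, e.g. "25u").
truncate_fd() {
    pid="$1"
    # Strip the trailing mode letter (and any lock character) to leave the number.
    fd=$(printf '%s' "$2" | sed 's/[^0-9]*$//')
    target="/proc/$pid/fd/$fd"
    # Only proceed if the kernel reports the descriptor as a deleted file.
    case "$(readlink "$target")" in
        *' (deleted)') truncate -s 0 "$target" ;;
        *) echo "not a deleted file: $target" >&2; return 1 ;;
    esac
}
```

For example, truncate_fd 1234 25u would empty the file behind descriptor 25 of process 1234 (both values here are made up). It needs to be run as the owner of the process or as root.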