When a file system is running low on free space, it can cause all kinds of unexpected behavior, depending on which file system is filling up. When that happens, I usually first issue a disk free (df) to see which file system is almost full. Once I know which one it is, I search for the files that take up the most space on that file system and take action. Sometimes, however, df shows that a file system is almost full while summing up the space used by all files on it doesn’t even come near that value.
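As a quick sketch of that first step, the combination below shows the usage per mounted file system and then the biggest directories inside the one that is filling up (using /var here purely as an example):

df -h                                      # which file system is almost full?
sudo du -h --max-depth=1 /var | sort -h    # biggest directories inside it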
When the space reported by df differs from what you see with du, it’s usually caused by a file that was deleted but hasn’t been released by a certain process. This means that when you reboot the system or stop the process that’s holding the deleted file, the space is released. That sounds easy, and it is, but often you can’t just reboot the machine or even restart the offending process because that service has to stay available or simply can’t be missed at that moment.
When you can’t stop the process holding the file, one option could be to grow the file system using LVM, but at a certain point all your space is used up and you can’t keep growing the file system. Especially with XFS this is not recommended, since you can’t shrink it again after expanding.
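For completeness, growing a file system on LVM roughly comes down to the sketch below. The volume group and logical volume names match the ones used later in this post, but they are only an example; the last command assumes XFS (for ext4 you would use resize2fs instead):

sudo lvextend -L +1G /dev/vg_sys/lv_var    # add 1GB to the logical volume
sudo xfs_growfs /var                       # let the XFS file system use the new space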
The best solution, which can be used in most cases, is to find the file that is still kept on the file system but marked as deleted, and to empty its contents without removing the file descriptor. Be warned that some processes write data to a deleted file on purpose, so overwriting the contents of that file could potentially cause problems for the process owning the file descriptor.
Simulate the problem and create a difference between df and du
First, let’s have a look at what the initial problem situation looks like. To simulate a process holding on to a deleted file, I’ll create a small C program that opens a file for reading and then waits indefinitely.
To be able to compile the source of our program, we need a compiler:
For RHEL, CentOS and derivatives:
[jensd@cen1 ~]$ sudo yum install gcc
...
Complete!
For Debian, Ubuntu and derivatives:
jensd@deb1:~$ sudo aptitude install gcc
Next, we need to create a file, let’s say openfile.c, that contains the source of the small program. The program simply opens the file that is given as an argument, throws an error if there was a problem opening it, and then goes into an infinite loop until the program is stopped.
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

int main(int argc, char* argv[])
{
    int fd_in;

    /* open the file given as the first argument */
    fd_in = open(argv[1], O_RDONLY);
    if (fd_in == -1) {
        perror("Problem opening the file...");
    }

    /* infinite loop with sleep to not use too much cpu */
    while (1) {
        sleep(5);
    }
    return 0;
}
After creating the source file openfile.c, we need to compile it with our just-installed compiler (the command is the same for CentOS and Debian):
[jensd@cen1 ~]$ vi openfile.c
[jensd@cen1 ~]$ gcc openfile.c -o openfile
The output of the above command is the executable file openfile. Now that we have our test program, I’ll create a ~800MB file that will fill up my /var file system:
[jensd@cen1 ~]$ dd if=/dev/zero of=/var/tmp/testfile bs=1MB count=800
800+0 records in
800+0 records out
800000000 bytes (800 MB) copied, 3.63599 s, 220 MB/s
[jensd@cen1 ~]$ ls -lh /var/tmp/testfile
-rw-rw-r--. 1 jensd jensd 763M Oct 21 21:40 /var/tmp/testfile
Now that we have the file, I’ll open it with the openfile program and check the status of the /var file system:
[jensd@cen1 ~]$ ./openfile /var/tmp/testfile &
[1] 3183
[jensd@cen1 ~]$ df -h /var
Filesystem                 Size  Used Avail Use% Mounted on
/dev/mapper/vg_sys-lv_var  997M  941M   57M  95% /var
When we check where the space is used on the /var file system with du, we see what we would expect:
[jensd@cen1 ~]$ sudo du -h --max-depth=1 /var
38M     /var/lib
5.2M    /var/log
0       /var/adm
52M     /var/cache
16K     /var/db
0       /var/empty
0       /var/games
0       /var/gopher
0       /var/local
0       /var/nis
0       /var/opt
0       /var/preserve
28K     /var/spool
814M    /var/tmp
0       /var/yp
0       /var/kerberos
0       /var/var
0       /var/crash
908M    /var
When we delete the file now, it will still be in use by our C program, but we won’t be able to see where the space has gone with du:
[jensd@cen1 ~]$ rm /var/tmp/testfile
[jensd@cen1 ~]$ df -h /var
Filesystem                 Size  Used Avail Use% Mounted on
/dev/mapper/vg_sys-lv_var  997M  941M   57M  95% /var
[jensd@cen1 ~]$ sudo du -h --max-depth=1 /var
38M     /var/lib
5.2M    /var/log
0       /var/adm
52M     /var/cache
16K     /var/db
0       /var/empty
0       /var/games
0       /var/gopher
0       /var/local
0       /var/nis
0       /var/opt
0       /var/preserve
28K     /var/spool
51M     /var/tmp
0       /var/yp
0       /var/kerberos
0       /var/var
0       /var/crash
145M    /var
As you can see, after deleting the file, the space hasn’t been freed up: df still shows that the /var file system is 95% full. This is usually the point where you get alerted, and when checking with du, it reports that only 145MB of files exist in /var.
Investigate the problem and find the process which is keeping deleted files
Now that we have simulated the situation where df and du report different values, I can show how to find which process is holding the space.
A minimal RHEL or CentOS installation normally doesn’t contain the lsof program, which is key to solving this problem, so we’ll have to install it. For Debian this step shouldn’t be needed.
[jensd@cen1 ~]$ sudo yum install lsof
...
Complete!
Now let’s search for processes that keep files open which are marked as deleted:
[jensd@cen1 ~]$ lsof|grep deleted
openfile 3183 jensd 3r REG 253,2 800000000 3592 /var/tmp/testfile (deleted)
As you can see, lsof reports that the file /var/tmp/testfile, which is about 800MB in size, is marked as deleted but is still held open by the process openfile with pid 3183.
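As an alternative to grepping the full lsof output, lsof can also list only open files with a link count of less than one, which is exactly what a deleted-but-still-open file is. The output format is the same as above; run it with sudo if the process belongs to another user:

lsof +L1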
Now that we know which process it is, we can decide whether or not we can stop it to free up the space. In case you can safely stop or restart the process, the space should be freed.
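If restarting is acceptable, something as simple as the line below will release the space. The service name is purely hypothetical here; it would be whatever service owns the pid that lsof reported:

sudo systemctl restart someservice    # hypothetical service name owning the reported pid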
Solve the problem and free up the space without stopping the process
In case you can’t stop the program that is holding up the space, there is another “trick” to free it. As I warned before, this could potentially break things in the process or make you lose data. Still, in some cases it’s the only option: with a full file system the system will run into general problems, while there is a good chance that the process will just keep running fine.
In the above step, we got to know the pid of the process holding the file. With that information, we can try to find the file descriptor for the deleted file in the /proc file system:
[jensd@cen1 ~]$ ls -l /proc/3183/fd
total 0
lrwx------. 1 jensd jensd 64 Oct 21 21:42 0 -> /dev/pts/0
lrwx------. 1 jensd jensd 64 Oct 21 21:42 1 -> /dev/pts/0
lrwx------. 1 jensd jensd 64 Oct 21 21:42 2 -> /dev/pts/0
lr-x------. 1 jensd jensd 64 Oct 21 21:42 3 -> /var/tmp/testfile (deleted)
In the above command, you need to replace 3183 with the reported pid from lsof. In the output we can see that the deleted file is represented by file descriptor 3.
To free up the space, we can simply write null to that file descriptor entry in /proc. This truncates the file that is being kept to 0 bytes, and the space is freed without stopping the process:
[jensd@cen1 ~]$ cat /dev/null > /proc/3183/fd/3
[jensd@cen1 ~]$ pgrep openfile
3183
[jensd@cen1 ~]$ df -h /var
Filesystem                 Size  Used Avail Use% Mounted on
/dev/mapper/vg_sys-lv_var  997M  178M  820M  18% /var
The first command writes null to file descriptor 3 of the process with pid 3183. The second command shows that the process didn’t stop because of that action. The third command shows that the space has been freed up, just as we wanted.
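As a side note, the same truncation can also be done with a plain shell redirection or with the truncate command from coreutils; both should be equivalent to the cat /dev/null approach above, using the pid and file descriptor found earlier:

: > /proc/3183/fd/3
truncate -s 0 /proc/3183/fd/3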
As said before, this “trick” should be used as a last resort, but it can save you in some critical situations in an environment that can’t afford downtime.