Recently I was presented with an interesting problem in the Linux world. In ordinary cases, the standard solution involves two or three commands and the trouble ends instantly. To my surprise, however, Linux had no ready answer for this particular case. This post shares my solution.
The problem sounds almost too easy: delete the files in one directory. But not just some files – an insanely huge number of them, more than 10 million. The backstory of "why so many" may be interesting, but let's stick to the point. Linux cannot hold an unlimited number of files, but when a directory grows close to the limits, the system should still be able to deal with it. Ext3 and ext4 are advertised as handling large amounts of data, yet somehow the standard tools fail to manage it gracefully. The system can store the files and serve any one of them when asked specifically, but it cannot easily list or wipe them.
The first idea for a solution was pretty simple: /dir/$ rm -f *
But it failed, and not because of "rm" itself: the shell expands "*" into the full argument list before the command even starts, and the kernel rejects argument lists beyond its size limit. In my case it gave up after roughly 4000 files, reporting "Argument list too long".
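To see where that failure actually comes from: the ceiling belongs to the kernel, not to "rm", and it can be queried at runtime. A minimal illustration, assuming a POSIX system (the helper name `shell_arg_limit` is my own placeholder):

```cpp
// The "Argument list too long" error comes from the kernel, which caps
// the total size of the argument list (plus environment) handed to a
// new process. sysconf() reports that cap.
#include <unistd.h>

long shell_arg_limit() {
    // POSIX guarantees at least 4096 bytes; Linux usually allows far
    // more (often around 2 MB), but millions of expanded file names
    // still overflow it easily.
    return sysconf(_SC_ARG_MAX);
}
```

Printing `shell_arg_limit()` shows why `rm -f *` cannot work here: once the expanded file names exceed that many bytes, the kernel refuses to start the command at all.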
Hey, why not list all the files and then delete them one by one? Something like "ls | xargs rm" or "find /dir/ -type f -exec rm {} \;"? Sure, but "ls" tries to build the entire huge listing before producing any output, and "find" reads the entries, builds its own list, and applies its filtering machinery, making removal painfully slow. How much time and RAM does it take to hold the names of 10M+ files in a list and then loop over it again to remove them? Very, _very_ much of both.
I can't blame those tools; they are designed to build lists. But that means there is no standard tool for this situation: without reading the whole listing, I cannot limit the set of files handed to "rm" in any way. Google offered an alternative: just wipe the whole disk and filesystem and start over fresh. Yeah, right.
I came up with a quick and simple solution: a small C++ program that bypasses the listing step entirely and removes files in constant memory. The filesystem already maintains its own list of directory entries, so the tool just reads them one by one and deletes each as it goes. Looking at the "rm" source code, it ultimately calls the standard unlink() function to remove a file. Using those facts and the standard library, I built my own tool, which turned out to perform better than rm, find, and ls.
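The core of that idea can be sketched in a few lines, assuming a POSIX system. This is not the original wipedir.cpp, just an illustration: open the directory, read entries one at a time with readdir(), and unlink() each immediately, so memory use stays constant no matter how many files the directory holds. The function name `wipe_dir` is my own placeholder.

```cpp
// Illustrative sketch of the streaming-delete approach: no listing is
// ever built; each directory entry is unlinked as soon as it is read.
#include <cstring>
#include <string>
#include <dirent.h>
#include <unistd.h>

// Removes every entry in `path` except "." and "..".
// Returns the number of files unlinked, or -1 if the
// directory cannot be opened.
long wipe_dir(const char* path) {
    DIR* dir = opendir(path);
    if (dir == nullptr)
        return -1;
    long removed = 0;
    struct dirent* entry;
    while ((entry = readdir(dir)) != nullptr) {
        // Skip the directory's self and parent links.
        if (std::strcmp(entry->d_name, ".") == 0 ||
            std::strcmp(entry->d_name, "..") == 0)
            continue;
        std::string full = std::string(path) + "/" + entry->d_name;
        if (unlink(full.c_str()) == 0)
            ++removed;
    }
    closedir(dir);
    return removed;
}
```

One caveat: POSIX leaves it unspecified whether readdir() still reports entries that were unlinked mid-iteration, so a cautious caller can simply re-run `wipe_dir()` until it returns 0.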
My point is not to debate the weaknesses or strengths of the operating system's tools: I want to share this solution so people can rewrite my rough code into something even faster, or just use it as is. Tests with 100 000 and 500 000 generated files show my solution is not ideal for smaller file counts (<500 000). If you have more than 500 000 files, however, it is faster and remains stable.
Test results. The "wipe" command is my tool.
Case with 100 000 files in one dir:
$ time rm -f /test/*
$ time find /test -type f | xargs rm -f
$ time find /test -type f -delete
$ time ./wipe test
The 100 000 files test case produced interesting results even for me. The standard "rm -f /test/*" proved to be the worst option, taking three times as long as the fastest result, "find /test -type f -delete". My solution was only about half as bad.
Case with 500 000 files in one dir:
I also ran tests with 10 000 files, with insignificant results; the fastest and slowest tools did not change places. So for 500 000 files I only ran my solution against the fastest option from the 100 000-file test (generating the test files already took too much time :D).
$ time find /test -type f -delete
$ time ./wipe /test
Now the results look better for my solution. And since I know "find" produces no visible progress and just clogs up memory on 10M files, I suspect its running time grows even faster as the file count increases.
As for the original problem, this solution took about 24 hours to completely wipe the directory. In the first hours the outlook was a bit more optimistic: about 200 inodes were freed per second, promising the whole job would finish in roughly 16 hours. Later, however, the rate dropped to about 100 inodes per second. I suspect this has to do with process priority and CPU/disk contention from parallel processes; "wipe" itself never seemed to use more than 1% CPU and around 0.1% of 2 GB of RAM.
The source code compiles with a plain "g++ -o wipe wipedir.cpp" (tested on Xubuntu). This is not the original code used to solve the problem; it was rewritten from scratch and differs from the original, though it uses the same idea. Current usage: run the program with a directory name, relative to the program's location, as the argument.