Viedumu Vietne ar Sandi

2011-07-6

How to delete an insane amount of files in Linux

Filed under: Tech — Sandis @ 23:46

Recently I was presented with an interesting problem in the Linux world. In standard cases the standard solution involves two or three commands and the trouble ends instantly. However, to my surprise, the usual Linux tools were not prepared for this particular case. This post shares my solution.

The problem description sounds almost too easy: delete the files in one directory. But not just some files: an insanely huge number of them, more than 10 million. The backstory of why there were so many might be interesting, but let's keep to the point. Linux cannot hold an infinite number of files, but when a directory grows close to the limits, the system should still be able to deal with it. Ext3 and ext4 are advertised as handling large amounts of data, yet somehow the standard tools fail to manage it properly. The system can store the files and serve them when asked for by name, but cannot easily list or wipe them.

The first idea for a solution was pretty simple: /dir/$ rm -f *
But it failed. The * is expanded by the shell before rm is even started, and the kernel limits the total size of the argument list (ARG_MAX), so with this many files the command simply aborts with “Argument list too long”.
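
A quick way to see the actual limit is to ask the kernel directly. This is a small sketch of my own (not part of the original post's tooling), using the POSIX sysconf() call:

// Print the kernel's limit on the combined size of argv + environment.
// Exceeding it is what produces "Argument list too long".
#include <cstdio>
#include <unistd.h>

int main() {
    long arg_max = sysconf(_SC_ARG_MAX); // in bytes; often around 2 MB on Linux
    std::printf("ARG_MAX: %ld bytes\n", arg_max);
    return 0;
}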

Hey, why not list all the files and then delete them one by one? Something like “ls | xargs rm” or “find /dir/ -type f -exec rm”? Sure, but ls tries to build the whole huge listing before anything gets executed, and find reads the entries, builds its own list and applies some kind of filtering mechanism, making removal painfully slow. How much time and RAM does it take to hold the names of 10M+ files in a list and then loop over it again to remove them? Very, _very_ long, and far too much of it.
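
To make that cost concrete, here is roughly what any list-then-delete tool has to do internally. This is a simplified sketch of mine, not actual ls or find code:

// The list-then-delete pattern: every name is held in memory before a
// single file is removed. With 10M+ entries the list alone is enormous.
#include <cstring>
#include <string>
#include <vector>
#include <dirent.h>
#include <unistd.h>

void list_then_delete(const std::string &dir) { // hypothetical helper
    std::vector<std::string> names;

    DIR *dh = opendir(dir.c_str());
    if (dh == NULL) return;
    struct dirent *entry;
    while ((entry = readdir(dh)) != NULL) {   // pass 1: build the full list
        if (std::strcmp(entry->d_name, ".") != 0 &&
            std::strcmp(entry->d_name, "..") != 0)
            names.push_back(entry->d_name);
    }
    closedir(dh);

    for (size_t i = 0; i < names.size(); ++i) // pass 2: delete one by one
        unlink((dir + "/" + names[i]).c_str());
}

The streaming approach described next skips pass 1 entirely.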

I can’t blame those tools; they are supposed to create lists. But that means there is no standard tool that helps in this situation: without reading the whole listing, I cannot limit the set of files handed to rm in any way. Google offered an alternative solution: just wipe the whole disk and filesystem and start over fresh. Yeah, right.

I came up with a very quick and simple solution: a C++ program that bypasses the need for listing and removes files in constant memory. The filesystem already maintains its own list of directory entries; the tool just has to read them one by one and delete each file as it goes. The inode data is readable, so I can walk the list element by element. From the rm source code I learned that it ultimately calls the standard unlink() function to remove each file. Using those facts and the standard library, I managed to create my own tool that seems to perform better than rm, find and ls.
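
The original source is linked at the end of the post; in case that link goes stale, here is a minimal sketch of the same idea as I understand it (a reconstruction, not the original wipedir.cpp): read one directory entry at a time and unlink() it immediately, so no list is ever built.

// Streaming delete: one readdir(), one unlink(), repeat. Memory use is
// constant regardless of how many files the directory holds.
// A reconstruction of the idea, not the original wipedir.cpp.
#include <cstdio>
#include <cstring>
#include <string>
#include <dirent.h>
#include <unistd.h>

int main(int argc, char *argv[]) {
    if (argc != 2) {
        std::fprintf(stderr, "usage: %s <directory>\n", argv[0]);
        return 1;
    }
    std::string dir = argv[1];

    DIR *dh = opendir(dir.c_str());
    if (dh == NULL) {
        std::perror("opendir");
        return 1;
    }

    long removed = 0;
    struct dirent *entry;
    while ((entry = readdir(dh)) != NULL) {
        // Skip the special entries "." and "..".
        if (std::strcmp(entry->d_name, ".") == 0 ||
            std::strcmp(entry->d_name, "..") == 0)
            continue;

        std::string path = dir + "/" + entry->d_name;
        // unlink() is the same call rm ultimately makes for each file.
        if (unlink(path.c_str()) == 0 && ++removed % 100000 == 0)
            std::printf("%ld files removed\n", removed);
    }
    closedir(dh);
    std::printf("done: %ld files removed\n", removed);
    return 0;
}

Unlinking the entry you have just read while the directory stream is still open works in practice on ext3/ext4, which is what makes the single pass possible.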

My point is not to debate the weaknesses or strengths of the operating system's tools: I want to share this solution so people can rewrite my bad code into something even faster, or just use it as is. Test runs with 100 000 and 500 000 generated files show that my solution is not the best choice for smaller directories (under 500 000 files). However, if you have more than 500 000 files, it is faster and stable.

Test results. The "wipe" command is my tool.

Case with 100 000 files in one dir:

$ time rm -f /test/*
real 0m6.018s
user 0m0.550s
sys 0m2.160s

$ time find /test -type f | xargs rm -f
real 0m2.444s
user 0m0.140s
sys 0m2.100s

$ time find /test -type f -delete
real 0m1.833s
user 0m0.070s
sys 0m1.620s

$ time ./wipe test
real 0m2.930s
user 0m0.080s
sys 0m1.630s

The 100 000 files test case produced interesting results even for me. The standard “rm -f /test/*” proved to be the worst solution, taking about three times as long as the fastest one, “find /test -type f -delete”. My solution was only half as bad, finishing in roughly half of rm's time.

Case with 500 000 files in one dir:

Since tests with 10 000 files showed no significant differences and the fastest and slowest tools never traded places, I only ran my solution against the option that was fastest at 100 000 files (generating the test files already took too much time :D).

$ time find /test -type f -delete
real 4m52.405s
user 0m0.400s
sys 0m9.370s

$ time ./wipe /test
real 4m45.492s
user 0m0.280s
sys 0m9.630s

So at this size the results look better for my solution. And since I know “find” produces no visible progress and just clogs up memory on a 10M-file directory, I suspect its running time grows even faster than mine as the file count increases.

As for the original problem, this solution took ~24 hours to completely wipe all the files from that directory. In the first hours the results looked a bit more optimistic, with ~200 inodes freed per second, promising the whole job would take about 16 hours. Later, though, the speed dropped to ~100 inodes per second. I think it has something to do with process priority and the cpu/disk usage of parallel processes. “Wipe” itself did not seem to use more than 1% of cpu and around 0.1% of the 2G of ram.

The source code can be compiled with a standard “g++ -o wipe wipedir.cpp” (tested on xubuntu). This is not the original code used for solving the problem but a rewrite from scratch; it differs from the original while using the same idea. Current usage: run the program with a directory name, relative to the program's location, as its argument.
Source: http://dl.dropbox.com/u/34521211/wipedir.cpp

1 comment »

  1. I had too many files, far more than 500 000, and this was on a high-traffic production server. When I tried to use “find” I lost all my free memory after several minutes of its work. The only way I found was to write the file removal as a simple script:

    <?php

    $dir = "/path/to/dir/with/files";
    $dh  = opendir($dir);

    $i = 0;
    // Stream the directory: remove each regular file as soon as it is read,
    // so memory use stays flat no matter how many files there are.
    while (($file = readdir($dh)) !== false) {
        $file = "$dir/$file";
        if (is_file($file)) {          // skips ".", ".." and subdirectories
            unlink($file);
            if (!(++$i % 1000)) {
                echo "$i files removed\n";
            }
        }
    }
    closedir($dh);

    This does what I want and does not eat all my memory. If it overloads the CPU or disk, it can be tuned by adding sleep() or usleep() to the removal loop.

    Best regards,
    Mike

    Comment by Mikhus — 2012-03-5 @ 13:46

