Viedumu Vietne ar Sandi

2011-07-6

How to delete insane amount of files in Linux

Filed under: Tech — Sandis @ 23:46

A while ago I was presented with an interesting problem in the Linux world. In ordinary cases the standard solution involves two or three commands and the trouble ends instantly. To my surprise, however, Linux was not prepared with any solution for this particular case. This post shares mine.

The problem description sounds almost too easy: delete the files in one directory. But not just some files – an insanely huge number of them, more than 10 million. Maybe the backstory of “why so many” is interesting, but let’s keep to the point. The Linux tools intended for file management can hold and process only a limited list of files; when the list grows too large, the standard command line tools fail because of their internal limits. The filesystem itself (ext3, ext4, ufs) can hold a huge number of files – usually several million inodes – and you can access or move any of them by giving the full path to each individual file. But if the files have random names (maybe even at different path depths), there is no standard tool to manage them in bulk.

Let’s start from the beginning: the discovery of a huge number of small files and the immediate need to get rid of them.

The first idea for a solution was pretty simple: /dir/$ rm -f *
But it failed. The rm binary never even gets to run: the shell expands the * into an argument list, and that list is limited by the kernel (the ARG_MAX limit – in practice only a few thousand file names fit), so the command aborts with “Argument list too long”.

Hey, why not list all the files and then delete them one by one? Something like “ls | xargs rm” or “find /dir/ -type f -exec rm {} \;”? Sure, but “ls” tries to build the whole huge list before anything gets executed, and “find” reads files in arbitrary order, builds its own list and applies its filtering machinery, which makes the removal painfully slow. How much time and RAM does it take to store 10M+ file names in a list and then loop over it again to remove them? Very, _very_ long and far too much.

I can’t blame those tools; they are supposed to build lists. But that means there is no standard tool that helps in this situation: without reading the whole listing, there is no way to feed rm a limited batch of files. Google offered an alternative solution: just wipe the whole disk and filesystem and start over fresh. Yeah, right.

I came up with a very quick and simple solution: a small C++ program that bypasses the listing step entirely and removes files using a constant amount of memory. The idea is NOT to build a new custom list the way rm, find and ls do, because the filesystem already keeps its own list: the directory entries (inode references) can be read one by one, in order, straight from the given path. This is the same kind of access “rm” already uses when it unlinks (removes) one exact file.

Using the source code of “rm” as a starting point, I managed to create my own tool that seems to perform better on my million-file problem (at least better than rm, find and ls). It takes a path as its input argument, reads the file entries associated with that path one by one and issues an unlink call for each.
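To make the idea concrete, here is a minimal sketch of that loop (my illustration only, not the actual wipedir.cpp linked at the end of the post): open the directory, read its entries one by one and unlink each as it comes, so memory use stays constant no matter how many files there are.

    // Minimal sketch of the streaming approach (illustration only, not the
    // original wipedir.cpp): read directory entries one by one and unlink
    // each regular entry as soon as it is read.
    #include <cstdio>
    #include <cstring>
    #include <dirent.h>
    #include <unistd.h>

    int main(int argc, char* argv[]) {
        if (argc != 2) {
            std::fprintf(stderr, "usage: %s <directory>\n", argv[0]);
            return 1;
        }

        DIR* dp = opendir(argv[1]);
        if (dp == nullptr) {
            std::perror("opendir");
            return 1;
        }

        long removed = 0;
        struct dirent* entry;
        while ((entry = readdir(dp)) != nullptr) {
            // Skip the "." and ".." entries.
            if (std::strcmp(entry->d_name, ".") == 0 ||
                std::strcmp(entry->d_name, "..") == 0)
                continue;

            // unlinkat() with the directory's fd avoids building full path strings.
            if (unlinkat(dirfd(dp), entry->d_name, 0) == 0)
                ++removed;
        }

        closedir(dp);
        std::printf("%ld entries removed\n", removed);
        return 0;
    }

The point of the loop is that no list is ever built: each name is forgotten as soon as it has been unlinked, so the memory footprint does not depend on the number of files.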

If the task is not as straightforward as removing every file, it should be possible to implement some sort of filter by reading more information about each file from its inode before removal.
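As a purely hypothetical illustration (not part of wipedir.cpp), such a filter could consult the inode with fstatat() before unlinking, for example removing only regular files that have not been touched for an hour:

    // Hypothetical filter: read the inode first and decide whether to remove.
    // dir_fd is the descriptor from dirfd(dp) in the loop above; the
    // "regular file older than one hour" rule is only an example.
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <ctime>

    bool should_remove(int dir_fd, const char* name) {
        struct stat st;
        if (fstatat(dir_fd, name, &st, AT_SYMLINK_NOFOLLOW) != 0)
            return false;                       // cannot stat it, leave it alone
        bool is_regular = S_ISREG(st.st_mode);  // plain files only, not dirs/links
        bool is_old = st.st_mtime < std::time(nullptr) - 3600;  // over 1 hour old
        return is_regular && is_old;
    }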

The source code (cpp) file is linked at the end of this post if you want to see it. This post is intended to share that code, so if anyone needs to solve the same problem the way I did – here it is. Also, maybe somebody has ideas on how to improve it.

Performance-wise, removing files by walking the inodes this way works badly (slowly) when fewer than about half a million files (<500 000) are involved. To find out how the algorithm compares against the other tools, I ran some speed tests on a computer with 2 GB of RAM (my work computer at the time).

Here are the timings, measured with the time command, for removing 100 000 files from a path:

Method                                      real        user       sys
$ time rm -f /test/*                        0m6.018s    0m0.550s   0m2.160s
$ time find /test -type f | xargs rm -f     0m2.444s    0m0.140s   0m2.100s
$ time find /test -type f -delete           0m1.833s    0m0.070s   0m1.620s
$ time ./wipe test                          0m2.930s    0m0.080s   0m1.630s

This test case produced results that were interesting even to me. The standard “rm -f /test/*” proved to be the worst solution, taking roughly three times as long as the fastest result, “find /test -type f -delete”. My solution took about half the time of rm, but still lagged behind find.
As a note: I also ran the test with 10 000 files, and the methods ranked in the same order as with 100 000.

Now, with 500 000 files in one path:

Method                                      real         user       sys
$ time rm -f /test/*                        fail         fail       fail
$ time find /test -type f | xargs rm -f     not tested   –          –
$ time find /test -type f -delete           4m52.405s    0m0.400s   0m9.370s
$ time ./wipe test                          4m45.492s    0m0.280s   0m9.630s

The rm command did fail, and I did not test find + rm because it looked slower than find -delete. Since generating half a million test files takes long enough, I only pitted my program against the fastest solution so far.
I would call the result a tie: 5 minutes is 5 minutes, and a few seconds is not much of a difference.

With 10M files in one path there is no table to present.
“find /test -type f -delete” just clogged up memory and produced no visible results for quite a while, so I had to kill it; I attempted to rerun it several times. Perhaps on hardware with a very large amount of RAM the results are different, but in my case (or my computer’s case) 10M was too much for “find”.
The wipe tool did produce visible results while using about 1% of the CPU.
It took about 24 hours in total to remove all 10M files. In the first hours the results looked optimistic, with a removal rate of ~200 inodes/second, promising the whole job would finish in about 16 hours. After a few hours, however, the rate dropped to ~100 inodes removed per second. I think it has something to do with process priority and CPU/disk usage by parallel processes. On the positive side: “wipe” never seemed to use more than 1% of the CPU and around 0.1% of the 2 GB of RAM.

So there it is – a solution for removing 10M+ files without much impact on the rest of the system. It could be faster, but I will take what I have for now. Maybe someone has ideas?

The source code can be compiled with a standard “g++ -o wipe wipedir.cpp” (tested on Xubuntu). This is not the original code used to solve the problem; it was written from scratch and differs from the original, but it uses the same idea and is algorithm-wise the same. Current usage: run the program with a directory name, relative to the program’s location, as the argument.
Source: http://dl.dropbox.com/u/34521211/wipedir.cpp

2 comments »

  1. I had too many files, much more than 500 000, and this was on a high-traffic production server. When I tried to use “find” I lost all my free memory after several minutes of its work. The only way I found was to write the file removal as a simple script:

    <?php

    // Stream the directory and unlink files one by one instead of listing them.
    $dir = "/path/to/dir/with/files";
    $dh = opendir($dir);

    $i = 0;
    while (($file = readdir($dh)) !== false) {
        $file = "$dir/$file";
        if (is_file($file)) {
            unlink($file);
            if (!(++$i % 1000)) {
                echo "$i files removed\n";
            }
        }
    }
    closedir($dh);

    This does what I want and does not use up all my memory. If it overloads the CPU or the disk, it can be tuned by adding sleep() or usleep() into the removal loop.

    Best regards,
    Mike

    Comment by Mikhus — 2012-03-5 @ 13:46

  2. […] How to delete insane amount of files in Linux | Viedumu … – Jul 06, 2011 · I had too much file, much more than 500 000 and this is on production server which is high traffic. When I tried use “find” I’ve lost all my free …… […]

    Pingback from How To Fix 500 Failed To Delete The File Errors - Windows Vista, Windows 7 & 8 — 2014-11-2 @ 11:59

