Ashley Sheridan​.co.uk

Speed Testing the SPL Iterators for Fetching Files

In the world of PHP some techniques never really go away, despite there being better alternatives added to the core functionality. One of these areas is file iteration, particularly recursively through directories and their contents. It's fairly typical to see some code use a recursive function to:

  1. Scan a directory
  2. Loop through the scanned list
  3. Add entry to a list
  4. Check if an entry is a directory, if it is, go to step 1 with this directory as a start point
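For comparison, the four steps above might look roughly like the sketch below. The function name findFilesRecursively() and its parameters are made up for illustration, and unlike the generic steps it only keeps entries whose name matches a pattern. It avoids type declarations so that it should also run on older PHP 5.x versions.

```php
<?php
/**
 * Hypothetical helper (name and signature are illustrative only).
 * Recursively collects paths under $dir whose file name matches $pattern.
 */
function findFilesRecursively($dir, $pattern)
{
    $matches = [];

    // Step 1: scan the directory.
    $entries = scandir($dir);
    if ($entries === false) {
        return $matches; // unreadable directory: skip it
    }

    // Step 2: loop through the scanned list.
    foreach ($entries as $entry) {
        if ($entry === '.' || $entry === '..') {
            continue; // never recurse into the current or parent directory
        }

        $path = $dir . '/' . $entry;

        if (is_dir($path)) {
            // Step 4: a directory restarts the process from step 1.
            $matches = array_merge($matches, findFilesRecursively($path, $pattern));
        } elseif (preg_match($pattern, $entry)) {
            // Step 3: add the matching entry to the list.
            $matches[] = $path;
        }
    }

    return $matches;
}
```

Calling findFilesRecursively(__DIR__, '/\.sh$/') would return an array of matching paths, which is exactly the bookkeeping the SPL iterators take off your hands.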

However, since PHP 5.4, it's possible to achieve all of this more simply with the iterators in the SPL (Standard PHP Library). This removes the hassle of writing recursive functions, and unless your needs are very complex, you can do a lot with a few lines of code.

I wanted to test a few different ways of using SPL to fetch a list of shell scripts from my ~/bin directory and (very roughly) benchmark them against each other.

RegexIterator

The first technique is the simplest, and works perfectly for fetching files that match a particular pattern, such as using a specific file extension.

    $dir = new \RecursiveDirectoryIterator(__DIR__);
    $iterator = new \RecursiveIteratorIterator($dir);
    $files_iterator = new \RegexIterator($iterator, '/\.sh$/i');

    foreach ($files_iterator as $file) {
        echo $file->getPathname() . "\n";
    }

The RegexIterator checks each entry, cast to a string (which for these file objects is the full path), against a regular expression (in this case, checking that it ends with .sh). The resulting list of files is then looped over and output. The code is incredibly simple compared to its older predecessors, and will work for the majority of cases where you need to recursively fetch a list of files from a given directory.

RecursiveCallbackFilterIterator

    $dir = new \RecursiveDirectoryIterator(__DIR__);

    $filter = new \RecursiveCallbackFilterIterator($dir, function ($current, $key, $iterator) {
        // Always accept directories so that the iterator can descend into them
        if ($iterator->hasChildren()) {
            return true;
        }

        // Accept only regular files whose name ends in .sh
        return $current->isFile() && preg_match('/\.sh$/', $current->getFilename()) === 1;
    });

    $files_iterator = new \RecursiveIteratorIterator($filter);

    foreach ($files_iterator as $file) {
        echo $file->getPathname() . "\n";
    }

This is more complicated than the first method, as it sets up a filter to apply to the iterator before recursing over the object and generating the file list. The main advantage this has over the previous is that it allows for more complex filtering than a regular expression allows.
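As a rough illustration of filtering that a single regular expression cannot express, the sketch below combines three conditions: an extension check, a minimum file size, and skipping hidden directories entirely. The function name findLargeShellScripts() and the size rule are invented for this example, and it passes FilesystemIterator::SKIP_DOTS so that the . and .. entries never reach the callback.

```php
<?php
/**
 * Hypothetical helper: lists *.sh files of at least $minBytes bytes,
 * without descending into hidden directories (names starting with a dot).
 */
function findLargeShellScripts($root, $minBytes)
{
    $dir = new \RecursiveDirectoryIterator($root, \FilesystemIterator::SKIP_DOTS);

    $filter = new \RecursiveCallbackFilterIterator(
        $dir,
        function ($current, $key, $iterator) use ($minBytes) {
            if ($iterator->hasChildren()) {
                // Returning false for a directory prunes its whole subtree.
                return strpos($current->getFilename(), '.') !== 0;
            }

            // Files must have a .sh extension and meet the size threshold.
            return $current->isFile()
                && strtolower($current->getExtension()) === 'sh'
                && $current->getSize() >= $minBytes;
        }
    );

    $paths = [];
    foreach (new \RecursiveIteratorIterator($filter) as $file) {
        $paths[] = $file->getPathname();
    }

    return $paths;
}
```

Because the callback rejects hidden directories before the iterator descends into them, their contents are never scanned at all, which is something a filter applied only to the final file list cannot do.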

RecursiveFilterIterator

The next method I tested was the RecursiveFilterIterator abstract class. As it's abstract, you can't instantiate it directly: you have to extend it with your own class and use an instance of that subclass as the filter. This is overly complex for simply checking file names against a pattern, but it is very useful if your needs are more sophisticated.

    class ShellFilter extends \RecursiveFilterIterator
    {
        public function accept()
        {
            return $this->hasChildren()
                || preg_match('/\.sh$/', $this->current()->getFilename());
        }
    }

    $dir = new \RecursiveDirectoryIterator(__DIR__);
    $filter = new \ShellFilter($dir);
    $files_iterator = new \RecursiveIteratorIterator($filter);

    foreach ($files_iterator as $file) {
        echo $file->getPathname() . "\n";
    }

The class just needs to implement an accept() method that returns a boolean indicating whether to keep the current entry in the filtered list. Other than that, the logic you put inside your subclass is up to you.

Like the RecursiveCallbackFilterIterator approach, this allows for more complicated iteration filters. As it's a class, it lends itself to cleaner code more easily than a callback function does.
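One thing to watch for once a filter class takes its own constructor arguments: the inherited getChildren() only passes the child iterator to the constructor, so the extra arguments have to be forwarded by overriding it. The sketch below shows one way that might look. ExtensionFilter and its extension list are hypothetical names for this example, and the #[\ReturnTypeWillChange] lines are only there to keep PHP 8.1+ quiet about return types (older versions treat them as comments).

```php
<?php
/**
 * Hypothetical filter: keeps files whose extension is in a configurable list.
 */
class ExtensionFilter extends \RecursiveFilterIterator
{
    private $extensions;

    public function __construct(\RecursiveIterator $iterator, array $extensions)
    {
        parent::__construct($iterator);
        $this->extensions = array_map('strtolower', $extensions);
    }

    #[\ReturnTypeWillChange]
    public function accept()
    {
        if ($this->hasChildren()) {
            return true; // always descend into directories
        }

        $file = $this->current();

        return $file->isFile()
            && in_array(strtolower($file->getExtension()), $this->extensions, true);
    }

    #[\ReturnTypeWillChange]
    public function getChildren()
    {
        // Forward the extension list so sub-directories use the same rule.
        return new self($this->getInnerIterator()->getChildren(), $this->extensions);
    }
}
```

It would be used in the same way as ShellFilter, for example new ExtensionFilter($dir, ['sh', 'bash']) wrapped in a RecursiveIteratorIterator.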

RecursiveIteratorIterator Without a Filter

This approach is not recommended, as it has a serious downside: depending on the directory structure, it can potentially create an infinite loop in your code. More information is in a comment left on the PHP manual site a few years ago by a user named Sun.

    $dir = new \RecursiveDirectoryIterator(__DIR__);
    $iterator = new \RecursiveIteratorIterator($dir);

    foreach ($iterator as $file) {
        $filename = $file->getFilename();

        if (preg_match('/\.sh$/', $filename)) {
            echo $file->getPathname() . "\n";
        }
    }

I'm just using a regular expression on the file name, but any kind of check can be performed on the files. It's worth noting that here (and in most of the previous cases) I'm getting the file name using the getFilename() method of the file object. This is preferred over casting the file object to a string as that can have some serious memory implications, which Greg Sherwood highlighted in his blog article: An easy way to leak memory using DirectoryIterator.

Recursive opendir()

I've included this as a baseline to compare the new SPL methods to. It has the advantage of speed, by quite a long margin, but it does that at the expense of OOP practices, memory consumption, and testability. Note: I have not tested this on PHP 7, so the speed improvements that have been made in the PHP engine may well have lessened the difference in speed between opendir() and SPL.

    function listdir($dir, $pattern)
    {
        $files = [];
        $fh = opendir($dir);

        while (($file = readdir($fh)) !== false) {
            if ($file == '.' || $file == '..') {
                continue;
            }

            $filepath = $dir . '/' . $file;

            if (is_dir($filepath)) {
                $files = array_merge($files, listdir($filepath, $pattern));
            } else {
                if (preg_match($pattern, $file)) {
                    array_push($files, $filepath);
                }
            }
        }

        closedir($fh);

        return $files;
    }

    $files = listdir(__DIR__, '/\.sh$/');

    foreach ($files as $file) {
        echo $file . "\n";
    }

There is a lot more code to this than the other examples, mainly because you have to write from scratch what SPL gives you out of the box. The checks for . and .. are absolutely vital, as these are considered directories, and without the checks the code will attempt to recurse into them and your script will run into a maximum function nesting level fatal error.

The Speed Test Results

To test the scripts properly, I ran all 5 consecutively, 10,000 times each. To get good average readings, I repeated this process 20 times and took the mean for each script, giving 200,000 runs per script and a grand total of 1,000,000 runs! Overall, there are nearly 11 hours of benchmarking results, which should be more than enough to give a decent picture of what to expect in terms of performance when using these in your own programs.
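The harness used for these runs isn't shown here, but a minimal timing loop along the same lines might look like the sketch below. Both function names are hypothetical, and it measures wall-clock time with microtime(), so anything else running on the machine will affect the numbers.

```php
<?php
/**
 * Hypothetical timing helper: runs $task $iterations times and
 * returns the elapsed wall-clock time in seconds.
 */
function timeTask(callable $task, $iterations)
{
    $start = microtime(true);

    for ($i = 0; $i < $iterations; $i++) {
        $task();
    }

    return microtime(true) - $start;
}

/**
 * Hypothetical helper: repeats the timed batch $rounds times and
 * returns the mean number of seconds per batch.
 */
function meanBatchTime(callable $task, $iterations, $rounds)
{
    $total = 0.0;

    for ($round = 0; $round < $rounds; $round++) {
        $total += timeTask($task, $iterations);
    }

    return $total / $rounds;
}
```

With this sketch, meanBatchTime($task, 10000, 20) would correspond to one figure in the results table, where $task is a closure wrapping one of the five approaches.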

| Code Tested | Mean Seconds for 10,000 Iterations |
|---|---|
| RegexIterator | 416 |
| RecursiveCallbackFilterIterator | 422 |
| RecursiveFilterIterator | 428 |
| RecursiveIteratorIterator Without a Filter | 443 |
| Recursive opendir() | 244 |

As you can see, the traditional method is quite a bit faster, taking roughly 40 to 45% less time than the SPL variants, but this was running on PHP 5.6, so you can likely get much better results from PHP 7. What is not evident here is the memory consumption, which can often be a far more precious resource on a server, especially if your code is running on a shared server, where memory limits are not under your control. The SPL approach uses far less memory than putting your scanned filenames into an array, and allows more flexibility for navigating back and forth over the files.
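To make the memory point a little more concrete, the sketch below processes matching files one at a time as the iterator produces them, rather than collecting every path into an array first. The function name countMatchingFiles() is invented for this example, and how much memory this actually saves will depend on the size of the directory tree and the PHP version.

```php
<?php
/**
 * Hypothetical helper: counts matching files lazily, one entry at a time,
 * without building an array of paths first.
 */
function countMatchingFiles($root, $pattern)
{
    $dir = new \RecursiveDirectoryIterator($root, \FilesystemIterator::SKIP_DOTS);
    $files = new \RegexIterator(new \RecursiveIteratorIterator($dir), $pattern);

    $count = 0;
    foreach ($files as $file) {
        // Each entry can be processed here and then discarded.
        $count++;
    }

    return $count;
}
```

The counting is only a stand-in: any per-file work could go inside the loop, and only the current entry needs to be held at any moment.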

Using OOP methods does also give you the ability to more easily create good unit tests, which are a must if you're doing anything slightly complex. The OOP benefits mean your code will be cleaner, more readable, and overall of a better quality.

Addendum

If you're interested, the spec of the test machine is as follows:

  • 3.3GHz Core i3
  • 6GB RAM
  • PHP 5.6

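And the raw results for each of the 5 tests (in seconds per 10,000 iterations) are:

| Run | RegexIterator | RecursiveCallbackFilterIterator | RecursiveFilterIterator | RecursiveIteratorIterator Without a Filter | Recursive opendir() |
|---|---|---|---|---|---|
| 1 | 389 | 386 | 386 | 399 | 219 |
| 2 | 370 | 379 | 392 | 390 | 215 |
| 3 | 413 | 435 | 430 | 422 | 223 |
| 4 | 396 | 388 | 395 | 415 | 236 |
| 5 | 443 | 438 | 385 | 402 | 220 |
| 6 | 365 | 365 | 365 | 377 | 209 |
| 7 | 366 | 406 | 428 | 419 | 237 |
| 8 | 457 | 449 | 444 | 468 | 259 |
| 9 | 370 | 372 | 369 | 390 | 217 |
| 10 | 363 | 360 | 359 | 370 | 205 |
| 11 | 370 | 370 | 369 | 385 | 217 |
| 12 | 366 | 367 | 369 | 381 | 214 |
| 13 | 363 | 364 | 364 | 382 | 208 |
| 14 | 399 | 414 | 409 | 452 | 252 |
| 15 | 400 | 410 | 492 | 514 | 306 |
| 16 | 481 | 510 | 518 | 447 | 241 |
| 17 | 454 | 464 | 482 | 543 | 306 |
| 18 | 465 | 473 | 492 | 554 | 297 |
| 19 | 547 | 548 | 556 | 577 | 307 |
| 20 | 548 | 548 | 562 | 574 | 296 |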
