scandir, a better directory iterator and faster os.walk() ========================================================= .. image:: https://img.shields.io/pypi/v/scandir.svg :target: https://pypi.python.org/pypi/scandir :alt: scandir on PyPI (Python Package Index) .. image:: https://travis-ci.org/benhoyt/scandir.svg?branch=master :target: https://travis-ci.org/benhoyt/scandir :alt: Travis CI tests (Linux) .. image:: https://ci.appveyor.com/api/projects/status/github/benhoyt/scandir?branch=master&svg=true :target: https://ci.appveyor.com/project/benhoyt/scandir :alt: Appveyor tests (Windows) ``scandir()`` is a directory iteration function like ``os.listdir()``, except that instead of returning a list of bare filenames, it yields ``DirEntry`` objects that include file type and stat information along with the name. Using ``scandir()`` increases the speed of ``os.walk()`` by 2-20 times (depending on the platform and file system) by avoiding unnecessary calls to ``os.stat()`` in most cases. Now included in a Python near you! ---------------------------------- ``scandir`` has been included in the Python 3.5 standard library as ``os.scandir()``, and the related performance improvements to ``os.walk()`` have also been included. So if you're lucky enough to be using Python 3.5 (release date September 13, 2015) you get the benefit immediately, otherwise just `download this module from PyPI `_, install it with ``pip install scandir``, and then do something like this in your code: .. code-block:: python # Use the built-in version of scandir/walk if possible, otherwise # use the scandir module version try: from os import scandir, walk except ImportError: from scandir import scandir, walk `PEP 471 `_, which is the PEP that proposes including ``scandir`` in the Python standard library, was `accepted `_ in July 2014 by Victor Stinner, the BDFL-delegate for the PEP. This ``scandir`` module is intended to work on Python 2.6+ and Python 3.2+ (and it has been tested on those versions). Background ---------- Python's built-in ``os.walk()`` is significantly slower than it needs to be, because -- in addition to calling ``listdir()`` on each directory -- it calls ``stat()`` on each file to determine whether the filename is a directory or not. But both ``FindFirstFile`` / ``FindNextFile`` on Windows and ``readdir`` on Linux/OS X already tell you whether the files returned are directories or not, so no further ``stat`` system calls are needed. In short, you can reduce the number of system calls from about 2N to N, where N is the total number of files and directories in the tree. In practice, removing all those extra system calls makes ``os.walk()`` about **7-50 times as fast on Windows, and about 3-10 times as fast on Linux and Mac OS X.** So we're not talking about micro-optimizations. See more benchmarks in the "Benchmarks" section below. Somewhat relatedly, many people have also asked for a version of ``os.listdir()`` that yields filenames as it iterates instead of returning them as one big list. This improves memory efficiency for iterating very large directories. So as well as a faster ``walk()``, scandir adds a new ``scandir()`` function. They're pretty easy to use, but see "The API" below for the full docs. Benchmarks ---------- Below are results showing how many times as fast ``scandir.walk()`` is than ``os.walk()`` on various systems, found by running ``benchmark.py`` with no arguments: ==================== ============== ============= System version Python version Times as fast ==================== ============== ============= Windows 7 64-bit 2.7.7 64-bit 10.4 Windows 7 64-bit SSD 2.7.7 64-bit 10.3 Windows 7 64-bit NFS 2.7.6 64-bit 36.8 Windows 7 64-bit SSD 3.4.1 64-bit 9.9 Windows 7 64-bit SSD 3.5.0 64-bit 9.5 CentOS 6.2 64-bit 2.6.6 64-bit 3.9 Ubuntu 14.04 64-bit 2.7.6 64-bit 5.8 Mac OS X 10.9.3 2.7.5 64-bit 3.8 ==================== ============== ============= All of the above tests were done using the fast C version of scandir (source code in ``_scandir.c``). Note that the gains are less than the above on smaller directories and greater on larger directories. This is why ``benchmark.py`` creates a test directory tree with a standardized size. The API ------- walk() ~~~~~~ The API for ``scandir.walk()`` is exactly the same as ``os.walk()``, so just `read the Python docs `_. scandir() ~~~~~~~~~ The full docs for ``scandir()`` and the ``DirEntry`` objects it yields are available in the `Python documentation here `_. But below is a brief summary as well. scandir(path='.') -> iterator of DirEntry objects for given path Like ``listdir``, ``scandir`` calls the operating system's directory iteration system calls to get the names of the files in the given ``path``, but it's different from ``listdir`` in two ways: * Instead of returning bare filename strings, it returns lightweight ``DirEntry`` objects that hold the filename string and provide simple methods that allow access to the additional data the operating system may have returned. * It returns a generator instead of a list, so that ``scandir`` acts as a true iterator instead of returning the full list immediately. ``scandir()`` yields a ``DirEntry`` object for each file and sub-directory in ``path``. Just like ``listdir``, the ``'.'`` and ``'..'`` pseudo-directories are skipped, and the entries are yielded in system-dependent order. Each ``DirEntry`` object has the following attributes and methods: * ``name``: the entry's filename, relative to the scandir ``path`` argument (corresponds to the return values of ``os.listdir``) * ``path``: the entry's full path name (not necessarily an absolute path) -- the equivalent of ``os.path.join(scandir_path, entry.name)`` * ``is_dir(*, follow_symlinks=True)``: similar to ``pathlib.Path.is_dir()``, but the return value is cached on the ``DirEntry`` object; doesn't require a system call in most cases; don't follow symbolic links if ``follow_symlinks`` is False * ``is_file(*, follow_symlinks=True)``: similar to ``pathlib.Path.is_file()``, but the return value is cached on the ``DirEntry`` object; doesn't require a system call in most cases; don't follow symbolic links if ``follow_symlinks`` is False * ``is_symlink()``: similar to ``pathlib.Path.is_symlink()``, but the return value is cached on the ``DirEntry`` object; doesn't require a system call in most cases * ``stat(*, follow_symlinks=True)``: like ``os.stat()``, but the return value is cached on the ``DirEntry`` object; does not require a system call on Windows (except for symlinks); don't follow symbolic links (like ``os.lstat()``) if ``follow_symlinks`` is False * ``inode()``: return the inode number of the entry; the return value is cached on the ``DirEntry`` object Here's a very simple example of ``scandir()`` showing use of the ``DirEntry.name`` attribute and the ``DirEntry.is_dir()`` method: .. code-block:: python def subdirs(path): """Yield directory names not starting with '.' under given path.""" for entry in os.scandir(path): if not entry.name.startswith('.') and entry.is_dir(): yield entry.name This ``subdirs()`` function will be significantly faster with scandir than ``os.listdir()`` and ``os.path.isdir()`` on both Windows and POSIX systems, especially on medium-sized or large directories. Further reading --------------- * `The Python docs for scandir `_ * `PEP 471 `_, the (now-accepted) Python Enhancement Proposal that proposed adding ``scandir`` to the standard library -- a lot of details here, including rejected ideas and previous discussion Flames, comments, bug reports ----------------------------- Please send flames, comments, and questions about scandir to Ben Hoyt: http://benhoyt.com/ File bug reports for the version in the Python 3.5 standard library `here `_, or file bug reports or feature requests for this module at the GitHub project page: https://github.com/benhoyt/scandir