summaryrefslogtreecommitdiffstats
path: root/third_party/python/scandir/README.rst
blob: a5537517dda2590252257a74b664cc772c31e2f9 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
scandir, a better directory iterator and faster os.walk()
=========================================================

.. image:: https://img.shields.io/pypi/v/scandir.svg
   :target: https://pypi.python.org/pypi/scandir
   :alt: scandir on PyPI (Python Package Index)

.. image:: https://travis-ci.org/benhoyt/scandir.svg?branch=master
   :target: https://travis-ci.org/benhoyt/scandir
   :alt: Travis CI tests (Linux)

.. image:: https://ci.appveyor.com/api/projects/status/github/benhoyt/scandir?branch=master&svg=true
   :target: https://ci.appveyor.com/project/benhoyt/scandir
   :alt: Appveyor tests (Windows)


``scandir()`` is a directory iteration function like ``os.listdir()``,
except that instead of returning a list of bare filenames, it yields
``DirEntry`` objects that include file type and stat information along
with the name. Using ``scandir()`` increases the speed of ``os.walk()``
by 2-20 times (depending on the platform and file system) by avoiding
unnecessary calls to ``os.stat()`` in most cases.


Now included in a Python near you!
----------------------------------

``scandir`` has been included in the Python 3.5 standard library as
``os.scandir()``, and the related performance improvements to
``os.walk()`` have also been included. So if you're lucky enough to be
using Python 3.5 (release date September 13, 2015) you get the benefit
immediately, otherwise just
`download this module from PyPI <https://pypi.python.org/pypi/scandir>`_,
install it with ``pip install scandir``, and then do something like
this in your code:

.. code-block:: python

    # Use the built-in version of scandir/walk if possible, otherwise
    # use the scandir module version
    try:
        from os import scandir, walk
    except ImportError:
        from scandir import scandir, walk

`PEP 471 <https://www.python.org/dev/peps/pep-0471/>`_, which is the
PEP that proposes including ``scandir`` in the Python standard library,
was `accepted <https://mail.python.org/pipermail/python-dev/2014-July/135561.html>`_
in July 2014 by Victor Stinner, the BDFL-delegate for the PEP.

This ``scandir`` module is intended to work on Python 2.6+ and Python
3.2+ (and it has been tested on those versions).


Background
----------

Python's built-in ``os.walk()`` is significantly slower than it needs to be,
because -- in addition to calling ``listdir()`` on each directory -- it calls
``stat()`` on each file to determine whether the filename is a directory or not.
But both ``FindFirstFile`` / ``FindNextFile`` on Windows and ``readdir`` on Linux/OS
X already tell you whether the files returned are directories or not, so
no further ``stat`` system calls are needed. In short, you can reduce the number
of system calls from about 2N to N, where N is the total number of files and
directories in the tree.

In practice, removing all those extra system calls makes ``os.walk()`` about
**7-50 times as fast on Windows, and about 3-10 times as fast on Linux and Mac OS
X.** So we're not talking about micro-optimizations. See more benchmarks
in the "Benchmarks" section below.

Somewhat relatedly, many people have also asked for a version of
``os.listdir()`` that yields filenames as it iterates instead of returning them
as one big list. This improves memory efficiency for iterating very large
directories.

So as well as a faster ``walk()``, scandir adds a new ``scandir()`` function.
They're pretty easy to use, but see "The API" below for the full docs.


Benchmarks
----------

Below are results showing how many times as fast ``scandir.walk()`` is than
``os.walk()`` on various systems, found by running ``benchmark.py`` with no
arguments:

====================   ==============   =============
System version         Python version   Times as fast
====================   ==============   =============
Windows 7 64-bit       2.7.7 64-bit     10.4
Windows 7 64-bit SSD   2.7.7 64-bit     10.3
Windows 7 64-bit NFS   2.7.6 64-bit     36.8
Windows 7 64-bit SSD   3.4.1 64-bit     9.9
Windows 7 64-bit SSD   3.5.0 64-bit     9.5
CentOS 6.2 64-bit      2.6.6 64-bit     3.9
Ubuntu 14.04 64-bit    2.7.6 64-bit     5.8
Mac OS X 10.9.3        2.7.5 64-bit     3.8
====================   ==============   =============

All of the above tests were done using the fast C version of scandir
(source code in ``_scandir.c``).

Note that the gains are less than the above on smaller directories and greater
on larger directories. This is why ``benchmark.py`` creates a test directory
tree with a standardized size.


The API
-------

walk()
~~~~~~

The API for ``scandir.walk()`` is exactly the same as ``os.walk()``, so just
`read the Python docs <https://docs.python.org/3.5/library/os.html#os.walk>`_.

scandir()
~~~~~~~~~

The full docs for ``scandir()`` and the ``DirEntry`` objects it yields are
available in the `Python documentation here <https://docs.python.org/3.5/library/os.html#os.scandir>`_. 
But below is a brief summary as well.

    scandir(path='.') -> iterator of DirEntry objects for given path

Like ``listdir``, ``scandir`` calls the operating system's directory
iteration system calls to get the names of the files in the given
``path``, but it's different from ``listdir`` in two ways:

* Instead of returning bare filename strings, it returns lightweight
  ``DirEntry`` objects that hold the filename string and provide
  simple methods that allow access to the additional data the
  operating system may have returned.

* It returns a generator instead of a list, so that ``scandir`` acts
  as a true iterator instead of returning the full list immediately.

``scandir()`` yields a ``DirEntry`` object for each file and
sub-directory in ``path``. Just like ``listdir``, the ``'.'``
and ``'..'`` pseudo-directories are skipped, and the entries are
yielded in system-dependent order. Each ``DirEntry`` object has the
following attributes and methods:

* ``name``: the entry's filename, relative to the scandir ``path``
  argument (corresponds to the return values of ``os.listdir``)

* ``path``: the entry's full path name (not necessarily an absolute
  path) -- the equivalent of ``os.path.join(scandir_path, entry.name)``

* ``is_dir(*, follow_symlinks=True)``: similar to
  ``pathlib.Path.is_dir()``, but the return value is cached on the
  ``DirEntry`` object; doesn't require a system call in most cases;
  don't follow symbolic links if ``follow_symlinks`` is False

* ``is_file(*, follow_symlinks=True)``: similar to
  ``pathlib.Path.is_file()``, but the return value is cached on the
  ``DirEntry`` object; doesn't require a system call in most cases; 
  don't follow symbolic links if ``follow_symlinks`` is False

* ``is_symlink()``: similar to ``pathlib.Path.is_symlink()``, but the
  return value is cached on the ``DirEntry`` object; doesn't require a
  system call in most cases

* ``stat(*, follow_symlinks=True)``: like ``os.stat()``, but the
  return value is cached on the ``DirEntry`` object; does not require a
  system call on Windows (except for symlinks); don't follow symbolic links
  (like ``os.lstat()``) if ``follow_symlinks`` is False

* ``inode()``: return the inode number of the entry; the return value
  is cached on the ``DirEntry`` object

Here's a very simple example of ``scandir()`` showing use of the
``DirEntry.name`` attribute and the ``DirEntry.is_dir()`` method:

.. code-block:: python

    def subdirs(path):
        """Yield directory names not starting with '.' under given path."""
        for entry in os.scandir(path):
            if not entry.name.startswith('.') and entry.is_dir():
                yield entry.name

This ``subdirs()`` function will be significantly faster with scandir
than ``os.listdir()`` and ``os.path.isdir()`` on both Windows and POSIX
systems, especially on medium-sized or large directories.


Further reading
---------------

* `The Python docs for scandir <https://docs.python.org/3.5/library/os.html#os.scandir>`_
* `PEP 471 <https://www.python.org/dev/peps/pep-0471/>`_, the
  (now-accepted) Python Enhancement Proposal that proposed adding
  ``scandir`` to the standard library -- a lot of details here,
  including rejected ideas and previous discussion


Flames, comments, bug reports
-----------------------------

Please send flames, comments, and questions about scandir to Ben Hoyt:

http://benhoyt.com/

File bug reports for the version in the Python 3.5 standard library
`here <https://docs.python.org/3.5/bugs.html>`_, or file bug reports
or feature requests for this module at the GitHub project page:

https://github.com/benhoyt/scandir