Commit cd28db3

bpo-31650: PEP 552 (Deterministic pycs) implementation
Python now supports checking bytecode cache up-to-dateness with a hash of the
source contents rather than volatile source metadata. See the PEP for details.

While a fairly straightforward idea, quite a lot of code had to be modified due
to the pervasiveness of pyc implementation details in the codebase. Changes in
this commit include:

- The core changes to importlib to understand how to read, validate, and
  regenerate hash-based pycs.
- Support for generating hash-based pycs in py_compile and compileall.
- Modifications to our siphash implementation to support passing a custom
  key. We then expose it to importlib through _imp.
- Updates to all places in the interpreter, standard library, and tests that
  manually generate or parse pyc files to grok the new format.
- Support in the interpreter command line code for long options like
  --check-hash-based-pycs.
- Tests and documentation for all of the above.
1 parent c172fc5 commit cd28db3

33 files changed, +3260 -2510 lines changed

Doc/glossary.rst

Lines changed: 6 additions & 0 deletions
@@ -458,6 +458,12 @@ Glossary
       is believed that overcoming this performance issue would make the
       implementation much more complicated and therefore costlier to maintain.

+
+   hash-based pyc
+      A bytecode cache file that uses the hash rather than the last-modified
+      time of the corresponding source file to determine its validity. See
+      :ref:`pyc-invalidation`.
+
    hashable
       An object is *hashable* if it has a hash value which never changes during
       its lifetime (it needs a :meth:`__hash__` method), and can be compared to

Doc/library/compileall.rst

Lines changed: 32 additions & 3 deletions
@@ -83,6 +83,15 @@ compile Python sources.
    If ``0`` is used, then the result of :func:`os.cpu_count()`
    will be used.

+.. cmdoption:: --invalidation-mode [timestamp|checked-hash|unchecked-hash]
+
+   Control how the generated pycs will be invalidated at runtime. The default
+   setting, ``timestamp``, means that pyc files with the source timestamp and
+   size embedded will be generated. The ``checked-hash`` and ``unchecked-hash``
+   values cause hash-based pycs to be generated. Hash-based pycs embed a hash of
+   the source file contents rather than a timestamp. See :ref:`pyc-invalidation`
+   for more information on how Python validates bytecode cache files at runtime.
+
 .. versionchanged:: 3.2
    Added the ``-i``, ``-b`` and ``-h`` options.

@@ -91,6 +100,9 @@ compile Python sources.
    was changed to a multilevel value. ``-b`` will always produce a
    byte-code file ending in ``.pyc``, never ``.pyo``.

+.. versionchanged:: 3.7
+   Added the ``--invalidation-mode`` parameter.
+

 There is no command-line option to control the optimization level used by the
 :func:`compile` function, because the Python interpreter itself already
@@ -99,7 +111,7 @@ provides the option: :program:`python -O -m compileall`.
 Public functions
 ----------------

-.. function:: compile_dir(dir, maxlevels=10, ddir=None, force=False, rx=None, quiet=0, legacy=False, optimize=-1, workers=1)
+.. function:: compile_dir(dir, maxlevels=10, ddir=None, force=False, rx=None, quiet=0, legacy=False, optimize=-1, workers=1, invalidation_mode=py_compile.PycInvalidationMode.TIMESTAMP)

    Recursively descend the directory tree named by *dir*, compiling all :file:`.py`
    files along the way. Return a true value if all the files compiled successfully,
@@ -140,6 +152,10 @@ Public functions
    then sequential compilation will be used as a fallback. If *workers* is
    lower than ``0``, a :exc:`ValueError` will be raised.

+   *invalidation_mode* should be a member of the
+   :class:`py_compile.PycInvalidationMode` enum and controls how the generated
+   pycs are invalidated at runtime.
+
    .. versionchanged:: 3.2
       Added the *legacy* and *optimize* parameters.

@@ -156,7 +172,10 @@ Public functions
    .. versionchanged:: 3.6
       Accepts a :term:`path-like object`.

-.. function:: compile_file(fullname, ddir=None, force=False, rx=None, quiet=0, legacy=False, optimize=-1)
+   .. versionchanged:: 3.7
+      The *invalidation_mode* parameter was added.
+
+.. function:: compile_file(fullname, ddir=None, force=False, rx=None, quiet=0, legacy=False, optimize=-1, invalidation_mode=py_compile.PycInvalidationMode.TIMESTAMP)

    Compile the file with path *fullname*. Return a true value if the file
    compiled successfully, and a false value otherwise.
@@ -184,6 +203,10 @@ Public functions
    *optimize* specifies the optimization level for the compiler. It is passed to
    the built-in :func:`compile` function.

+   *invalidation_mode* should be a member of the
+   :class:`py_compile.PycInvalidationMode` enum and controls how the generated
+   pycs are invalidated at runtime.
+
    .. versionadded:: 3.2

    .. versionchanged:: 3.5
@@ -193,7 +216,10 @@ Public functions
       The *legacy* parameter only writes out ``.pyc`` files, not ``.pyo`` files
       no matter what the value of *optimize* is.

-.. function:: compile_path(skip_curdir=True, maxlevels=0, force=False, quiet=0, legacy=False, optimize=-1)
+   .. versionchanged:: 3.7
+      The *invalidation_mode* parameter was added.
+
+.. function:: compile_path(skip_curdir=True, maxlevels=0, force=False, quiet=0, legacy=False, optimize=-1, invalidation_mode=py_compile.PycInvalidationMode.TIMESTAMP)

    Byte-compile all the :file:`.py` files found along ``sys.path``. Return a
    true value if all the files compiled successfully, and a false value otherwise.
@@ -213,6 +239,9 @@ Public functions
       The *legacy* parameter only writes out ``.pyc`` files, not ``.pyo`` files
       no matter what the value of *optimize* is.

+   .. versionchanged:: 3.7
+      The *invalidation_mode* parameter was added.
+
 To force a recompile of all the :file:`.py` files in the :file:`Lib/`
 subdirectory and all its subdirectories::
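As an illustration of the *invalidation_mode* parameter documented in the hunks above, here is a minimal sketch that byte-compiles a directory tree into hash-based pycs. It assumes Python 3.7 or later; the helper name `build_hash_pycs` and its `checked` flag are made up for this example and are not part of the commit.

```python
import compileall
import py_compile


def build_hash_pycs(root, checked=True):
    """Byte-compile every .py file under root into hash-based pycs.

    Hypothetical helper: `checked` selects CHECKED_HASH or UNCHECKED_HASH.
    Returns True if every file compiled successfully.
    """
    mode = (py_compile.PycInvalidationMode.CHECKED_HASH if checked
            else py_compile.PycInvalidationMode.UNCHECKED_HASH)
    # quiet=1 suppresses the per-file listing; errors are still printed.
    return bool(compileall.compile_dir(str(root), quiet=1,
                                       invalidation_mode=mode))
```

According to the option text above, the command-line counterpart would be `python -m compileall --invalidation-mode checked-hash` followed by the directory.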

Doc/library/importlib.rst

Lines changed: 11 additions & 0 deletions
@@ -67,6 +67,9 @@ generically as an :term:`importer`) to participate in the import process.
     :pep:`489`
         Multi-phase extension module initialization

+    :pep:`552`
+        Deterministic pycs
+
     :pep:`3120`
         Using UTF-8 as the Default Source Encoding

@@ -1327,6 +1330,14 @@ an :term:`importer`.
    .. versionchanged:: 3.6
       Accepts a :term:`path-like object`.

+.. function:: source_hash(source_bytes)
+
+   Return the hash of *source_bytes* as a byte string. A hash-based pyc embeds the
+   :func:`source_hash` of the corresponding source file's contents in its
+   header.
+
+   .. versionadded:: 3.7
+
 .. class:: LazyLoader(loader)

    A class which postpones the execution of the loader of a module until the

Doc/library/py_compile.rst

Lines changed: 34 additions & 1 deletion
@@ -27,7 +27,7 @@ byte-code cache files in the directory containing the source code.
    Exception raised when an error occurs while attempting to compile the file.


-.. function:: compile(file, cfile=None, dfile=None, doraise=False, optimize=-1)
+.. function:: compile(file, cfile=None, dfile=None, doraise=False, optimize=-1, invalidation_mode=PycInvalidationMode.TIMESTAMP)

    Compile a source file to byte-code and write out the byte-code cache file.
    The source code is loaded from the file named *file*. The byte-code is
@@ -53,6 +53,9 @@ byte-code cache files in the directory containing the source code.
    :func:`compile` function. The default of ``-1`` selects the optimization
    level of the current interpreter.

+   *invalidation_mode* should be a member of the :class:`PycInvalidationMode`
+   enum and controls how the generated pycs are invalidated at runtime.
+
    .. versionchanged:: 3.2
       Changed default value of *cfile* to be :PEP:`3147`-compliant. Previous
       default was *file* + ``'c'`` (``'o'`` if optimization was enabled).
@@ -65,6 +68,36 @@ byte-code cache files in the directory containing the source code.
       caveat that :exc:`FileExistsError` is raised if *cfile* is a symlink or
       non-regular file.

+   .. versionchanged:: 3.7
+      The *invalidation_mode* parameter was added as specified in :pep:`552`.
+
+
+.. class:: PycInvalidationMode
+
+   An enumeration of possible methods the interpreter can use to determine
+   whether a bytecode file is up to date with a source file. The pyc indicates
+   the desired invalidation mode in its header. See :ref:`pyc-invalidation` for
+   more information on how Python invalidates pycs at runtime.
+
+   .. versionadded:: 3.7
+
+   .. attribute:: TIMESTAMP
+
+      The pyc should include the timestamp and size of the source file, which
+      Python will compare against the metadata of the source file at runtime to
+      determine if the pyc needs to be regenerated.
+
+   .. attribute:: CHECKED_HASH
+
+      The pyc should include a hash of the source file which Python will compare
+      against the source at runtime to determine if the pyc needs to be
+      regenerated.
+
+   .. attribute:: UNCHECKED_HASH
+
+      The pyc should include a hash of the source file but Python should not
+      validate it against the source file at runtime.
+

 .. function:: main(args=None)
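The following sketch shows how the three `PycInvalidationMode` members described above show up in a pyc header. It assumes the 16-byte header layout from PEP 552 (4-byte magic number, 4-byte little-endian flags word, 8 mode-specific bytes); the helper `pyc_header_fields` and the file name `demo.py` are invented for the example.

```python
import pathlib
import py_compile
import tempfile


def pyc_header_fields(source_bytes, mode):
    """Compile source_bytes with `mode`; return (flags, last 8 header bytes)."""
    with tempfile.TemporaryDirectory() as tmp:
        src = pathlib.Path(tmp) / "demo.py"  # hypothetical module name
        src.write_bytes(source_bytes)
        # py_compile.compile returns the path of the pyc it wrote.
        cfile = py_compile.compile(str(src), doraise=True,
                                   invalidation_mode=mode)
        header = pathlib.Path(cfile).read_bytes()[:16]
    # Assumed PEP 552 layout: magic (4), flags (4), then either the source
    # hash (8) or the source mtime (4) followed by the source size (4).
    return int.from_bytes(header[4:8], "little"), header[8:16]
```

Under that layout, bit 0 of the flags word marks a hash-based pyc and bit 1 marks it as checked, so `TIMESTAMP` yields 0, `UNCHECKED_HASH` yields 1, and `CHECKED_HASH` yields 3.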

Doc/reference/import.rst

Lines changed: 26 additions & 0 deletions
@@ -675,6 +675,32 @@ Here are the exact rules used:
    :meth:`~importlib.abc.Loader.module_repr` method, if defined, before
    trying either approach described above. However, the method is deprecated.

+.. _pyc-invalidation:
+
+Cached bytecode invalidation
+----------------------------
+
+Before Python loads cached bytecode from a ``.pyc`` file, it checks whether the
+cache is up-to-date with the source ``.py`` file. By default, Python does this
+by storing the source's last-modified timestamp and size in the cache file when
+writing it. At runtime, the import system then validates the cache file by
+checking the stored metadata in the cache file against the source's
+metadata.
+
+Python also supports "hash-based" cache files, which store a hash of the source
+file's contents rather than its metadata. There are two variants of hash-based
+pycs: checked and unchecked. For checked hash-based pycs, Python validates the
+cache file by hashing the source file and comparing the resulting hash with the
+hash in the cache file. If a checked hash-based cache file is found to be
+invalid, Python regenerates it and writes a new checked hash-based cache
+file. For unchecked hash-based pycs, Python simply assumes the cache file is
+valid if it exists. Hash-based pyc validation behavior may be overridden with the
+:option:`--check-hash-based-pycs` flag.
+
+.. versionchanged:: 3.7
+   Added hash-based pycs. Previously, Python only supported timestamp-based pyc
+   invalidation.
+

 The Path Based Finder
 =====================

Doc/using/cmdline.rst

Lines changed: 13 additions & 0 deletions
@@ -210,6 +210,19 @@ Miscellaneous options
    import of source modules. See also :envvar:`PYTHONDONTWRITEBYTECODE`.


+.. cmdoption:: --check-hash-based-pycs default|always|never
+
+   Control the validation behavior of hash-based pycs. See
+   :ref:`pyc-invalidation`. When set to ``default``, checked and unchecked
+   hash-based bytecode cache files are validated according to their default
+   semantics. When set to ``always``, all hash-based pycs, whether checked or
+   unchecked, are validated against their corresponding source file. When set to
+   ``never``, hash-based pycs are not validated against their corresponding
+   source files.
+
+   The semantics of timestamp-based pycs are unaffected by this option.
+
+
 .. cmdoption:: -d

    Turn on parser debugging output (for expert only, depending on compilation
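The validation rules in the `pyc-invalidation` section and the `--check-hash-based-pycs` option can be summarised in a few lines of Python. This is a simplified sketch, not the code in importlib: the function name `pyc_is_current` and its parameters are made up for illustration, and it assumes the 16-byte PEP 552 header layout (magic, flags, then hash or mtime plus size).

```python
import importlib.util


def pyc_is_current(header, source_bytes, source_mtime, check_mode="default"):
    """Return True if a 16-byte pyc header still matches its source.

    check_mode mirrors --check-hash-based-pycs: default, always or never.
    """
    if len(header) < 16 or header[:4] != importlib.util.MAGIC_NUMBER:
        return False
    flags = int.from_bytes(header[4:8], "little")
    if flags & 0b1:  # hash-based pyc
        checked = bool(flags & 0b10)
        if check_mode == "never" or (not checked and check_mode != "always"):
            return True  # trusted without looking at the source
        return header[8:16] == importlib.util.source_hash(source_bytes)
    # Timestamp-based pyc: compare stored mtime and size (32 bits each).
    # check_mode has no effect on this branch.
    stored_mtime = int.from_bytes(header[8:12], "little")
    stored_size = int.from_bytes(header[12:16], "little")
    return (stored_mtime == (int(source_mtime) & 0xFFFFFFFF)
            and stored_size == (len(source_bytes) & 0xFFFFFFFF))
```

The sketch only reports validity; the real import system additionally regenerates a checked hash-based pyc that fails validation, as the reference text above describes.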

Doc/whatsnew/3.7.rst

Lines changed: 26 additions & 0 deletions
@@ -197,6 +197,32 @@ variable is not set in practice.

 See :option:`-X` ``dev`` for the details.

+Hash-based pycs
+---------------
+
+Python has traditionally checked the up-to-dateness of bytecode cache files
+(i.e., pycs) by comparing the source metadata (last-modified timestamp and size)
+with source metadata saved in the cache file header when it was generated. While
+effective, this invalidation method has its drawbacks. When filesystem
+timestamps are too coarse, Python can miss source updates, leading to user
+confusion. Additionally, having a timestamp in the cache file is problematic for
+`build reproducibility <https://reproducible-builds.org/>`_ and content-based
+build systems.
+
+:pep:`552` extends the pyc format to allow the hash of the source file to be
+used for invalidation instead of the source timestamp. Such pycs are called
+"hash-based". By default, Python still uses timestamp-based invalidation and
+does not generate hash-based pycs at runtime. Hash-based pycs may be generated
+with :mod:`py_compile` or :mod:`compileall`.
+
+Hash-based pycs come in two variants: checked and unchecked. Python validates
+checked hash-based pycs against the source file at runtime but doesn't do so for
+unchecked hash-based pycs. Unchecked hash-based pycs are a useful performance
+optimization for environments where a system external to Python (e.g., the build
+system) is responsible for keeping pycs up-to-date.
+
+See :ref:`pyc-invalidation` for more information.
+

 Other Language Changes
 ======================

Include/internal/hash.h

Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
+#ifndef Py_INTERNAL_HASH_H
+#define Py_INTERNAL_HASH_H
+
+uint64_t _Py_KeyedHash(uint64_t, const char *, Py_ssize_t);
+
+#endif

Include/internal/import.h

Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
+#ifndef Py_INTERNAL_IMPORT_H
+#define Py_INTERNAL_IMPORT_H
+
+extern const char *_Py_CheckHashBasedPycsMode;
+
+#endif

Include/pygetopt.h

Lines changed: 8 additions & 1 deletion
@@ -12,7 +12,14 @@ PyAPI_DATA(wchar_t *) _PyOS_optarg;

 PyAPI_FUNC(void) _PyOS_ResetGetOpt(void);

-PyAPI_FUNC(int) _PyOS_GetOpt(int argc, wchar_t **argv, wchar_t *optstring);
+typedef struct {
+    const wchar_t *name;
+    int has_arg;
+    int val;
+} _PyOS_LongOption;
+
+PyAPI_FUNC(int) _PyOS_GetOpt(int argc, wchar_t **argv, wchar_t *optstring,
+                             const _PyOS_LongOption *longopts, int *longindex);
 #endif /* !Py_LIMITED_API */

 #ifdef __cplusplus
