Skip to content

Branching design for Tier 2 (uops) interpreter #106529

@gvanrossum

Description

@gvanrossum
Member

This issue is part of the larger epic of gh-104584. In PR gh-106393 I tried to implement branching, but it was premature. Here's a better design, following @markshannon's guidance.

We have the following jump instructions (not counting the instrumented versions):

Unconditional jumps:

  • JUMP_BACKWARD
    JUMP_BACKWARD_NO_INTERRUPT
    JUMP_FORWARD

Branches, a.k.a. conditional jumps:

  • POP_JUMP_IF_FALSE, POP_JUMP_IF_TRUE, POP_JUMP_IF_NONE, POP_JUMP_IF_NOT_NONE

  • FOR_ITER's specializations:

    • FOR_ITER_LIST
    • FOR_ITER_TUPLE
    • FOR_ITER_RANGE
  • FOR_ITER_GEN

    SEND

    Add counters to to POP_JUMP_IF_{FALSE,TRUE} to determine likeliness

The translation strategies could be as follows:

Unconditional jumps

JUMP_BACKWARD

  • If this jumps to exactly the top of the current trace, emit a Tier 2 JUMP_TO_TOP uop, and stop projecting (i.e., exit the trace generation loop). The JUMP_TO_TOP uop implementation should include a CHECK_EVAL_BREAKER call.
  • If this jumps anywhere else, emit a SAVE_IP uop with the destination of the jump, followed by an EXIT_TRACE uop, and stop projecting.

JUMP_BACKWARD_NO_INTERRUPT

  • Since this is typically only used in special circumstances, just emit a SAVE_IP instruction with the destination and an EXIT_TRACE uop, and stop projecting.
  • Alternatively, we could make CHECK_EVAL_BREAKER a separate UOP that is inserted for JUMP_BACKWARD but not for JUMP_BACKWARD_NO_INTERRUPT, and otherwise treat the two backward jumps the same.

JUMP_FORWARD

  • Emit a SAVE_IP uop with the destination of the jump, and continue projecting from there (i.e. set instr to the destination of the jump).

Conditional jumps (branches)

POP_JUMP_IF_FALSE and friends

Consider the following Python code:

if cond:
    block
rest

This translates roughly to the following Tier 1 bytecode (using B1, B2, ... to label Tier 1 instructions, and <cond>, <block> etc. to represent code blocks that evaluate or execute the corresponding Python fragments):

B1: <cond>
B2: POP_JUMP_IF_FALSE B4
B3: <block>
B4: <rest>
B5:

I propose the following translation into Tier 2 uops, assuming the branch is "unlikely":

    SAVE_IP B1
    <cond>
    SAVE_IP B2
    JUMP_IF_FALSE stub
    POP_TOP
    SAVE_IP B3
    <block>
    SAVE_IP B4
    EXIT_TRACE

stub:
    POP_TOP
    SAVE_IP B4
    EXIT_TRACE

Where JUMP_IF_FALSE inspects the top of stack but doesn't pop it, and has an argument that executes a jump in the Tier 2 uop instruction sequence.

If the branch is "likely", we do this instead:

    SAVE_IP B1
    <cond>
    SAVE_IP B2
    JUMP_IF_TRUE stub
    POP_TOP
    SAVE_IP B4
    <rest>
    SAVE_IP B5
    EXIT_TRACE

stub:
    POP_TOP
    SAVE_IP B3
    EXIT_TRACE

Note how in this case, <rest> is projected as part of the trace, while <block> is not, since the likely case is that we jump over <block> to <rest>.

For the other simple conditional jumps (POP_JUMP_IF_TRUE, POP_JUMP_IF_NONE, POP_JUMP_IF_NOT_NONE) we do the same: if the jump is unlikely, emit a JUMP_IF_XXX uop and a stub; if the jump is likely, emit the inverse JUMP_IF_NOT_XXX uop and a different stub, and continue projecting at the destination of the original jump bytecode.

I propose to have hand-written cases both in the superblock generator and in the Tier 2 interpreter for these, since the translations are too irregular to fit easily in the macro expansion data structure. The stub generation will require additional logic and data structures in translate_bytecode_to_trace() to keep track of the stubs required so far, the available space for those, and the back-patching required to set the operands for the JUMP_IF_XXX uops.

FOR_ITER and (especially) its specializations

The common case for these is not to jump. The bytecode definitions are too complex to duplicate in hand-written Tier 2 uops. My proposal is to change these in bytecodes.c so that, instead of using the JUMPBY(n) macro, they use JUMPBY_POP_DISPATCH(n), which in Tier 1 translates into just JUMPBY(n), but in Tier 2 translates into roughly

frame->prev_instr += (x);
PY_DECREF(stack_pointer[-1]);
stack_pointer -= 1;
goto exit;

thereby exiting the trace when the corresponding for-loop terminates.

I am assuming here that most loops have several iterations. I don't think it's worth special-casing the occasional for-loop that almost always immediately terminates.

SEND

Possibly we could treat this the same as FOR_ITER. But right now I propose to just punt here, and when we encounter it, stop projecting, as we do with any other unsupported bytecode instruction.

Linked PRs

Activity

gvanrossum

gvanrossum commented on Jul 7, 2023

@gvanrossum
MemberAuthor

Separate from all this is the design for the counters to be used to determine whether a particular POP_JUMP_IF_XXX is more likely to jump or not. @markshannon's suggestion is to add a 16 bit cache entry to those bytecode instructions, initialized with a pattern of alternating ones and zeros. Each time we execute the instruction we shift the value left by 1, shifting in a 1 for a jump taken or a 0 for a jump not taken. When the question is asked, "is this jump likely", we simply count the bits, and if it's mostly ones we assume it is, if it's mostly zeros we assume it isn't. If it's half-half, well, we have a tough choice.

UPDATE: See gh-109039.

gvanrossum

gvanrossum commented on Jul 9, 2023

@gvanrossum
MemberAuthor

Q. In Tier 2, should we have POP_JUMP_IF_XXX, which (like its Tier 1 counterpart) combines a pop and a branch, or should we emit separate POP_TOP instructions?

Emitting separate POP_TOP uops seems perhaps simpler, but it's less performant to interpret, since the explicit POP_TOP will call Py_DECREF() on the stack top, whereas POP_JUMP_IF_TRUE/FALSE won't need to do that at all (True an False are immortal), an POP_JUMP_IF_[NOT_]NONE can skip that in the None case.

I don't have much intuition yet about which will be easier for the optimizer to handle. It's easy to change.

This also reminds me of a question for @brandtbucher: assuming that at some point the copy-and-patch machinery will use a Tier 2 uop sequence as input, how would you want to handle hand-written uops like JUMP_IF_FALSE or SAVE_IP? (If the answer is long or complicated, we should take this to the faster-cpython/ideas tracker.)

gvanrossum

gvanrossum commented on Jul 9, 2023

@gvanrossum
MemberAuthor

Similarly, in his original guidance @markshannon suggests that POP_JUMP_IF_NONE be translated into LOAD_CONST None; IS_OP; POP_JUMP_IF_TRUE (and similar for POP_JUMP_IF_NOT_NONE). I think that if we decide to do that, it should be done in compile.c when generating Tier 1 bytecode, since in the optimizer it would be hard to ensure there is a co_consts entry for None. In the short term it might be simpler to just add aditional uops JUMP_IF_[NOT_]NONE with hand-written implementations and translations. If we're doing it in compile.c I'd like to enroll @iritkatriel's help. But it definitely feels like a pessimization to replace one bytecode instruction with three (and the same for uops in the Tier 2 interpreter).


UPDATE: Never mind, Mark did this in GH-106599 using a macro in bytecodes.c.

added
performancePerformance or resource usage
interpreter-core(Objects, Python, Grammar, and Parser dirs)
and removed
type-bugAn unexpected behavior, bug, or error
on Jul 10, 2023
markshannon

markshannon commented on Jul 10, 2023

@markshannon
Member

POP_JUMP_IF_[TRUE|FALSE] seems not only faster than JUMP_IF_[TRUE|FALSE], but more consistent with other operations. The VM is a stack machine; it is normal for an operation to consume its argument(s).

markshannon suggests that POP_JUMP_IF_NONE be translated into LOAD_CONST None; IS_OP; POP_JUMP_IF_TRUE

The point of my original suggestions was that using additional micro-ops can help us to reduce the number of branch micro-ops. As you point out None might not be present in co_consts, so we would want to translate POP_JUMP_IF_NONE as IS_NONE; POP_JUMP_IF_TRUE.
We can add as many micro-ops as we want. We just want to minimize the number of micro-ops that need special treatment, like branches, jumps and calls.

how would you want to handle hand-written uops like... SAVE_IP...

All micro-ops are hand written, so SAVE_IP would be handled like any other simple micro-op.

mdboom

mdboom commented on Jul 10, 2023

@mdboom
Contributor

Separate from all this is the design for the counters to be used to determine whether a particular POP_JUMP_IF_XXX is more likely to jump or not. @markshannon's suggestion is to add a 16 bit cache entry to those bytecode instructions, initialized with a pattern of alternating ones and zeros. Each time we execute the instruction we shift the value left by 1, shifting in a 1 for a jump taken or a 0 for a jump not taken. When the question is asked, "is this jump likely", we simply count the bits, and if it's mostly ones we assume it is, if it's mostly zeros we assume it isn't. If it's half-half, well, we have a tough choice.

How is this different from a counter, where you add for jump taken and subtract for jump not taken? I understand it records some recency information, but if that isn't used, isn't it better to get a more accurate number (with a memory larger than 16 entries?)

markshannon

markshannon commented on Jul 10, 2023

@markshannon
Member

A counter doesn't tell you anything about the recent history.
A branch that goes one way a 50 times then the other way a 50 times is different to one that alternates.
I suspect it won't matter much in practice, though, as long we have some hint as to the bias of the branch.

An alternative is to count the length and direction of the last two runs, but that's more complex and slower.

gvanrossum

gvanrossum commented on Jul 10, 2023

@gvanrossum
MemberAuthor

All micro-ops are hand written, so SAVE_IP would be handled like any other simple micro-op.

There seems to be some terminological confusion here. I'v been thinking of uops as anything that the Tier 2 interpreter can interpret. Most of those (over a 100 of them) are generated by the generator. There's only a handful hand-written ones (i.e., explicitly written down in the switch in _PyUopExecute) -- the OGs (EXIT_TRACE and SAVE_IP) and now the POP_JUMP_IF_XXX crowd.

But it doesn't matter, we seem to be in agreement.

gvanrossum

gvanrossum commented on Jul 10, 2023

@gvanrossum
MemberAuthor

To clarify, my concern about the few hand-written uops is that IIUC @brandbucher's copy-and-patch tooling uses as its input either bytecodes.c or one of its .c.h outputs in order to generate its templates. For the hand-written uops it will need a way to use the hand-written code as template source. Maybe we need to add them to bytecodes.c after all, marks as uops.

26 remaining items

added a commit that references this issue on Jul 15, 2023
added a commit that references this issue on Jul 17, 2023
gvanrossum

gvanrossum commented on Nov 15, 2023

@gvanrossum
MemberAuthor

The design for POP_JUMP_IF_FALSE etc. has changed again since gh-112045 -- there are now no longer any "branching" ops in Tier 2, everything is done through deoptimization.

However, we have another case: unspecialized FOR_ITER. According to the stats this is now by far the most frequent untranslated opcode, so we would like to translate it to Tier 2. FOR_ITER is already macro(FOR_ITER) = _SPECIALIZE_FOR_ITER + _FOR_ITER.

Alas, _FOR_ITER (the non-specializing part of the macro) is not straightforward to translate -- it does a bunch of work (calling tp_iternext) and if this returns NULL without setting an exception it jumps over over the corresponding END_FOR instruction. The best I can think of for Tier 2 is to make it deopt in that case to the END_FOR opcode (which will handle the stack pops). But this feels like a bit of a hack. @markshannon?

gvanrossum

gvanrossum commented on Nov 16, 2023

@gvanrossum
MemberAuthor

The best I can think of for Tier 2 is to make it deopt in that case to the END_FOR opcode (which will handle the stack pops).

That seems to work, except the deopt code must be duplicated (since we can't use target as the deopt destination) and we must deopt to the instruction past the END_FOR, since that instruction pops two stack items, but we only have one (the iterator). See gh-112134.

added 2 commits that reference this issue on Nov 17, 2023
added 2 commits that reference this issue on Feb 11, 2024
added 2 commits that reference this issue on Sep 2, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Metadata

Metadata

Assignees

No one assigned

    Labels

    interpreter-core(Objects, Python, Grammar, and Parser dirs)performancePerformance or resource usage

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

      Development

      No branches or pull requests

        Participants

        @mdboom@gvanrossum@markshannon

        Issue actions

          Branching design for Tier 2 (uops) interpreter · Issue #106529 · python/cpython