Implementing Pattern Matching in Python

Introduction and Motivation

If you want to analyse, or even optimise Python code, you probably want to do that on the basis of the Abstract Syntax Tree (AST). The AST is, as the name suggests, an abstract representation of the code in the form of a tree structure (see also my article on Implementing Code Transformations in Python).

Python provides nice tools for analysing, and manipulating the AST of any Python code. There is just one problem: if you have some specific structure in mind, which you want to optimise, say, your code can quickly get rather large, and ugly. Take a very simple example like --x, where we want to cancel out the two minus signs. Here is the code that achieves this:

import ast

class SignCancel(ast.NodeTransformer):

    def visit_UnaryOp(self, node: ast.UnaryOp):
        if isinstance(node.op, ast.USub) and \
            isinstance(node.operand, ast.UnaryOp) and \
            isinstance(node.operand.op, ast.USub):
            return node.operand.operand
        else:
            return self.generic_visit(node)

tree = ast.parse('--x')
print(ast.dump(tree))
sc = SignCancel()
tree = sc.visit(tree)
print(ast.dump(tree))

Even for this really simple example, we end up with an if statement that is hardly legible, anymore. If, however, we had some sort of pattern matching in Python, as we do in languages like Scala, we could write the method like this, say:

from ast import UnaryOp, USub

class SignCancel(ast.NodeTransformer):

    def visit_UnaryOp(self, node: UnaryOp):
        match node:
            case UnaryOp(USub(), UnaryOp(USub(), x)):
                return x
            case _:
                return self.generic_visit(node)

Is this not much clearer, and simpler? The syntax of the expression behind the case is exactly how you would construct --x, had you given x. But today’s article is not about the syntax of the case statement, but rather about an aspect of its implementation.

If you are familiar with C’s switch statement, then you might think of a case statement as something akin to an if statement. Indeed, the case statement in the code above does perform a test. But there is more to it: the case statement here also performs an assignment. If node is found to correspond to the specified structure, the inner operand is extracted, and assigned to the local variable x. This is a crucial part of pattern matching: without understanding the double nature of test and assignment, nothing wonderful can come of the story I am about to relay to you.

By the way: what I describe in this article is actually implemented in pyPMatch.

Implementing the Syntax

For the time being, Python does not support any syntax for pattern matching. If we want to have something as in the code above, then we need to do some preprocessing. Our preprocessor needs to scan the Python file for occurrences of case, and then replace it with something appropriate (yes, match statements need to be replaced, too, but let us for the sake of simplicity concentrate on case here).

To find case statements, I simply use the tokenize module of Python. If source is a variable containing the original source code, here is how this might look like:

case_stmts = []
colons = []
for token in tokenize.tokenize(io.BytesIO(source.encode('utf-8')).readline):
    if token.string == 'case':
        case_stmts.append(token.start)
    if token.string == ':' and len(case_stmts) > len(colons):
        colons.append(token.start)
for (a, b) in reversed(list(zip(case_stmts, colons))):
    source = source[:a] + compile_case(source[a:b]) + source[b:]

For each case statement, we need the position of both the case “keyword”, as well as the location of the following colon :. Of course, the real implementation needs to be more careful to find the actual corresponding colon, but, again, let us keep it simple here.

In order to not mess up the locations of the case we have discovered, we have to do replacements in reverse order. The really important part, however, is what happens inside the compile_case function. The juicy stuff of translating the UnaryOp(USub(), ...) to a series of isinstance() tests I leave for another time, though. Today, I want to elaborate on something else:

Objective: Replace the case X: statement with a valid Python equivalent, so that we do not have to change a single character outside the statement itself. To perform this surgery on our Python code, I want to do the absolute minimum required to achieve our goal. In particular, I do not want to change indentation anywhere, and I certainly don’t want to add, or remove any line.

One of the reasons for wanting to minimise the incision are error messages. In case of an error, the reported line, and position within the source code should still be valid. But, apart from this, it is a good idea not to mess up code, unless we really have to, anyway.

The Example

For the following discussion, we will use the following simple example of a case statement. We want to test if the given argument arg has the structure of --x, and extract the x if it does.

def foo(arg):
    case arg as ast.UnaryOp(ast.USub(), ast.UnaryOp(ast.USub(), x)):
        return x

Using FOR-Loops

A very reasonable, and simple solution is to use for-loops. They come with the benefit of assigning values to variables — if the respective values exist in the first place. Otherwise, we iterate over an “empty list”, which means that the loop’s body is just skipped. Thanks to Python’s generators, and the yield statement, the necessary code is rather short and neat:

def case1(value):
    if (isinstance(value, ast.UnaryOp) and isinstance(value.op, ast.USub) and
        isinstance(value.operand, ast.UnaryOp) and isinstance(value.operand.op, ast.USub)):
        yield value.operand.operand

def foo(arg):
    for x in case1(arg):
        return x

The yield in line 4 makes case1 a generator. If the passed value passes all the tests about its structure, the generator returns exactly one value, which is the x we want to extract. Otherwise, the generator is “empty”. Hence, the for-loop iterates either once, or is skipped altogether.

The objective clearly said that we do not want to introduce additional lines of code. The definition of case1 in this example qualifies as additional lines of code. However, later on, we can put this definition into a separate module, and dynamically import it into the code. The important thing is that the code inside foo remains untouched, except for the case statement.

It seems like the for-loop is indeed an ideal candidate as a replacement for case statements. There is just one detail I do not like about this approach. Using a for-loop means that any break, or continue inside the body of a case statement refer to the case statement itself. Because the loop is repeated exactly once (or never), break and continue both have the exact same effect.

Consider the following example (which is again far too trivial to be of any effective value in the real world). On first sight, this code looks perfectly fine. But if we replace the case statements by for-loops, we do not get the desired effect, but end up with an infinite loop instead. This “gotcha” is something we have to avoid at any cost: the semantics of a programming language must never surprise you.

count = 0
while True:
    match random.randint(0, 9):
        case 2 | 3 | 5 | 7:
            break
        case _:
            count += 1
print(f"There were {count} failed attempts before drawing a prime number")

Sure, C has a break statement inside cases itself. But, on the one hand, my goal is to implement pattern matching as found in Scala, and not the switch statement as found in C. On the other hand, the break in C’s switches has been repeatedly found to be a source of bugs, and confusions, and other languages removed it (or replaced it by an explicit “fall through” statement). Finally, even if we wanted to draw the parallels to C, the break in our case does not jump out of the match block, but only out of the case. Its semantics therefore differs from C, which means introducing another source of subtle bugs.

Bottomline is that using for-loops to emulate case statements leads to some unwanted side effects, such as changing the effects of break.

Using WITH-Statements

An alternative to using for-loops are with-statements. They also provide the means to assign values to (local) variables, and then have a block of code being executed. Since we do have to write a class, the additional code gets slightly larger:

class case1:

    def __init__(self, value):
        self.value = value

    def __enter__(self):
        value = self.value
        if (isinstance(value, ast.UnaryOp) and isinstance(value.op, ast.USub) and
            isinstance(value.operand, ast.UnaryOp) and isinstance(value.operand.op, ast.USub)):
            return value.operand.operand
        else:
            return None

    def __exit__(self, exc_type, exc_value, traceback):
        pass

def foo(arg):
    with case1(arg) as x:
        return x

But here is the problem: if the value does not match the pattern, then the block of the with-statement is still being executed. We even have to return a value like None to be assigned to the local variable x, even though there is nothing to be effectively assigned to x. While with-statements fit the bill in that they do not have any effect on break, say, they are obviously missing on all the important parts of the case statement. — Or do they?

What with-statements offer is the possibility to do some simplified error handling. If an error occurs inside the block of the with-statement, the __exit__ function is still called. Moreover, the __exit__ function can even consume, and suppress the error. In other words: we have a possibility to bail out of a with block. Unfortunately, if an error occurs inside the __enter__ function, there is no special treatment, and it just breaks.

If we have a look at Python’s language reference, we find the following remark about with-statements: if an error occurs during the assignment to the target list, it will be treated the same as an error occurring within the suite would be.

Let us pause here for a second, and think about what this means. If we could somehow control the assignment inside a with-statement to a degree where we could raise an error if we wanted to, then we could make Python skip the entire subsequent block of the with-statement. All we have to do is causing the assignment to fail, it seems.

How could an assignment actually fail? There are several possibilities we can take advantage of. First, we could try, and assign None to more than one variable. Second, we could try, and assign a value to a list index that does not exist. But the problem with either variant is the following: if we cause such an error, and then catch it in the __exit__ function later on to suppress it, any such error inside the with block will equally be silenced. This, again, changes the semantics, and is therefore not what we want (there is a good reason why these errors exist in the first place).

There is one more alternative for us to explore. The setter for a object’s property is basically just a regular method. It can execute arbitrary code, even check a flag, and raise an error if it needs to. And that’s how we can equip our with-statement with a conditional.

The following code introduces a class with a special field guard. If you assign something to this field, it checks if the new value evaluates to True. If not, an exception is raised.

class SkipMatchException(Exception): pass

class MatchGuard:

    def __setattr__(self, key, value):
        if key == 'guard':
            if not value:
                raise SkipMatchException()
        else:
            super().__setattr__(key, value)

matchGuard = MatchGuard()
matchGuard.guard = False  # raises an exception

In short: we have found the assignment we have been looking for: as assignment that fails if we want it to, and then raises an exception.

The code for our case statement then looks as follows:

class case1:

    def __init__(self, value):
        self.value = value

    def __enter__(self):
        value = self.value
        if (isinstance(value, ast.UnaryOp) and isinstance(value.op, ast.USub) and
            isinstance(value.operand, ast.UnaryOp) and isinstance(value.operand.op, ast.USub)):
            return True, value.operand.operand
        else:
            return False, None

    def __exit__(self, exc_type, exc_value, traceback):
        return exc_type is SkipMatchException or exc_type is None


def foo(arg):
    with case1(arg) as (mathGuard.guard, x):
        return x

In my pyPMatch library, this is how case statements are translated to regular Python code.

Using IF-Statements

As long as we are only checking if a value matches a specific pattern, an if-statement is probably the best option, and the way to go. Yet, the assignment aspect of pattern matching is a crucial part we do not want to miss. Python 3.8 is going to introduce assignment expressions, which allow the problem to be solved, and have an assignment inside the if-statement’s condition. However, this is still somewhere in the future, and it will take some time until Python 3.8 is universally adopted.

An alternative might be to use wildly dynamic features, and access the locals dictionary through frames (warning: not for the faint of heart):

import ctypes, inspect

def case1(value):
    if (isinstance(value, ast.UnaryOp) and isinstance(value.op, ast.USub) and
        isinstance(value.operand, ast.UnaryOp) and isinstance(value.operand.op, ast.USub)):
        frame = inspect.currentframe().f_back
        frame.f_locals['x'] = value.operand.operand
        ctypes.pythonapi.PyFrame_LocalsToFast(ctypes.py_object(frame), ctypes.c_int(0))
        return True
    else:
        return False

def foo(arg):
    if case1(arg):
        return x

This, however, does not work unless x is already a local variable in foo. In fact, Python assumes that the x to be returned in foo is a global variable, since there is no definition of x in foo itself. Even if we add x to the local variables later on, Python will still look for a global variable x, and not a local one (it uses the opcode LOAD_GLOBAL). Sure, you could go on, and try to modify the bytecode of foo, but are you sure this is going to pay off…?

Hence, even with advanced Python magic, coercing an if-statement to assign something to a local variable is practically impossible (up to Python 3.7, at least). Using if-statements for our pattern matching is therefore not really an option.

Concluding Remarks

My approach is most certainly not the first attempt to introduce pattern matching to Python. The discussion of this topic dates back years. However, while the discussion usually revolves around how the syntax should look like, or if having pattern matching in Python is a good idea in the first place, I am more concerned with the implementation. So, the intent of this article is neither to propose the best possible syntax, nor to say that Python really needs pattern matching (even though it would be awesome, of course), but to show that is feasible, and how it could be done.

Here are some articles on the matter, some of which contain various further links, and references. – Python-ideas: Pattern Matching Syntax – A more generalized switch statement for Python? – A pattern-matching case statement for Python – PyPatt: Python Pattern Matching