How to Write a Compiler #5 - The Visitor Pattern

If ever there was a design pattern that's perfectly suited to a job, it's the visitor pattern for working with an abstract syntax tree. But what is the visitor pattern and why is it so well suited?

How to Write a Compiler #5 - The Visitor Pattern
💡
This post is part of the series How to Write a Compiler based on my experience building C-minor - a strongly typed, garbage collected, direct to in-memory machine-code scripting language.

If you've been doing software development for any length of time you've almost certainly heard of the visitor pattern - it's even got its own Wikipedia page.

But it's also quite possible that you've never actually used it. Rather than try to explain it in theoretical terms, let's look at it from a practical point of view.

(Normally I wouldn't devote a whole post to a design pattern, but it really is central to everything else the compiler does. It's also a rarely used pattern so I wanted to make sure we're on the same page on how it works).

Working with the AST

Now that our compiler has parsed the source code into an AST we need a way to work with the AST. We might want to go over it to check for errors, evaluate a constant expression node, or generate code. Or perhaps we just want to print it out.

Let's take that last one - "print it out" - and see how to implement it.

The Traditional OO Approach

In order to print out the AST we need to start at the top-level element and print the parts related to that element while also recursively calling its connected elements to print themselves.

The traditional OO way to do this would be to add an abstract method to the base AstElement class named Print().

class AstElement
{
    // omitted

    public abstract void Print(IndentedTextwriter target);
}

(Let's assume a helper class IndentedTextWriter that we can print to).

Next, we'd implement the Print method on every class. For example, we'd add a Print method to the if statement like this:

public override void Print(IndentedTextWriter target)
{
    target.Write("if (");
    Condition.Print(target);
    target.WriteLine(")");

    TrueBlock.Print(target);

    if (FalseBlock != null)
    {
        target.WriteLine("else");
        FalseBlock.Print(target);
    }
}

To print the entire AST all we have to do is call Print on the top level AstCompilationUnit and the entire program can be printed out. Nice!

Need to generate code? Repeat the above adding a GenerateCode() method.

Need something else? Repeat again adding another method.

Very quickly a problem emerges: the code for each of these operations is spread over many different classes and files. Also, any state or context information for the operations (such as the target writer in the above example) needs to be passed around as parameters.

Containing an Operation to One Class

What we'd like to do is contain each operation on the AST to its own class.

Continuing with our example of printing out the AST, let's make a class called AstFormatter:

class AstFormatter
{
    public AstFormatter(IndentedTextWriter target)
    {
        _target = target;
    }

    public void Print(AstElement element)
    {
        // TODO
    }
}

and let's move the print methods from all the AstElement classes into our new AstFormatter . The function to print if statements now looks like this:

void PrintIfStatemnt(AstIfStatement stmt)
{
    _target.Write("if (");
    Print(stmt.Condition);
    _target.WriteLine(")");

    Print(stmt.TrueBlock);

    if (stmt.FalseBlock != null)
    {
        _target.WriteLine("else");
        Print(FalseBlock);
    }
}

This is looking good:

  1. We've moved all our printing code to one class
  2. We don't need to pass the target text writer around anymore since we can just store it as a member variable on the AstFormatter class.

But we've also created a mess. The worst kind of mess. A slow mess.

See that TODO comment in the Print method? It needs to inspect every element's type to work out which print method to call for each element:

if (element is AstIfStatement stmt_if)
  PrintIfStatement(stmt_if);
else if (element is AstForStatement stmt_for)
  PrintForStatement(stmt_for);
else if (element is AstWhileStatement stmt_while)
  PrintWhileStatement(stmt_while);
else if 

// ... for every ... type ... of ... element. Ugh

We've solved the original problem, but:

  • it's slow - we need to test every element for every type.
  • it's messy - just look at it!
  • it's unsafe - we won't get compiler errors if we forget to implement an element (or add a new element type).
  • it's painful - we need to re-type (or cut/copy/replace) that code for every different operation.

What we need is a type-safe, fast way to dispatch those PrintXXX() methods.

Using a Dispatch Interface

Imagine we've implemented all our operations using the above approach: a separate class for each operation, with a function that dispatches to methods for each element type.

What you'll notice is that you end up with each class having a method to handleif statements, a method to handle for statements, a method to handle while statement etc...

Let's force each of those classes to implement those methods by defining an interface IStatementProcesor that has a method for each statement type.

interface IStatementProcessor
{
    void ProcessIfStatement(AstIfStatement stmt);
    void ProcessForStatement(AstForStatement stmt);
    void ProcessWhileStatement(AstForStatement stmt);
    // etc.. for all statement types
}

By implementing this interface on an operation class we can ensure that it implements the functionality for all element types:

class AstFormatter : IStatementProcessor
{
    // If we don't implement a particular element type...
    // the compiler will complain
}

That solves the "unsafe" problem with the previous approach, but we can also use it to solve the "slow" and "messy" problems by adding a Process method to the element classes:

abstract class AstElement
{
    public abstract void Process(IStatementProcessor processor)
}

class AstIfStatement : AstElement
{
    public override void Process(IStatementProcessor processor)
    {
        processor.ProcessIfStatment(this);
    }
}

In other words, if you call Process on an element, passing an instance of IStatementProcessor you'll be called back through that interface on the correct method for the element's type.

Now the big slow mess can be fixed:

// GET RID OF THIS
/*
if (element is AstIfStatement stmt_if)
  PrintIfStatement(stmt_if);
else if (element is AstForStatement stmt_for)
  PrintForStatement(stmt_for);
else if (element is AstWhileStatement stmt_while)
  PrintWhileStatement(stmt_while);
else if ...
*/

element.Process(this);

Oh, and by the way, we've also solved the "painful" problem. Right clicking on the interface in Visual Studio there's a command "Implement Interface Explicitly" and stubs for every method will be automatically generated and we can just fill in the blanks.

What I've just described is the Visitor Pattern... so let's call it that.

The Visitor Pattern

Now that we understand what the visitor pattern is, let's look at how it's actually implemented in C-minor for the AST.

As before, there are two categories of visitors - statements and expressions and since we might want to have return values when processing elements the interfaces are defined with a generic type parameter T for the return type:

public interface IAstStatementVisitor<T>
{
  T Visit(AstIfStatement stmt);
  T Visit(AstForStatement stmt);
  T Visit(AstWhileStatement stmt);
  // etc
}

public interface IAstExprNodeVisitor<T>
{
  T Visit(AstExprNodeLiteral node);
  T Visit(AstExprNodeUnaryOp node);
  T Visit(AstExprNodeBinaryOp node);
  // etc...
}

(Note: we can use method overloading to name every method Visit() instead of VisitIfStatement(), VisitForStatement() etc...)

The AstStatement and AstExprNode bases classes both get an abstract Visit method:

public class AstStatement
{
    public abstract T Visit<T>(IAstStatementVisitor<T> visitor);
}

public class AstExprNode
{
    public abstract T Visit<T>(IAstExprNodeVisitor<T> visitor);
}

and every statement and expression node implements the Visit method with a simple one liner:

public class AstIfStatement
{
    public override T Visit<T>(IAstStatementVisitor<T> visitor) => visitor.Visit(this);
}

Revisiting (ha-ha) the AstFormatter

Now that we've formalized our visitor pattern, let's take a look at how our original AstFormatter class now looks:

class AstFormatter : 
    IAstStatementVisitor<bool>,
    IAstExprNodeVisitor<bool>
{
    public AstFormatter(IndentedTextWriter target)
    {
        _target = target;
    }

    public void Print(AstStatment stmt)
    {
        stmt.Visit(this);
    }

    bool IAstStatementVisitor<bool>.Visit(AstIfStatement stmt)
    {
        _target.Write("if (");
        stmt.Condition.Visit(this);
        _target.WriteLine(")");

        stmt.TrueBlock.Visit(this);

        if (stmt.FalseBlock != null)
        {
            _target.WriteLine("else");
            stmt.FalseBlock.Visit(this);
        }
        return true;
    }

    bool IAstStatementVisitor<bool>.Visit(AstForStatement stmt)
    {
        // TODO
    }

    // etc...
}

Note:

  • C# doesn't allow a void generic type parameter sobool is used and its return value is always ignored.
  • The interfaces are implemented explicitly which keeps the public API to the AstFormatter nice and clean.

Who Are These Mysterious Visitors?

To put into perspective just how important the visitor pattern is, here's the full list of visitors currently used in C-minor's compiler:

  • AstFormatter
  • CodeGenerator (C code)
  • ControlFlowAnalysis
  • HydrateTypes
  • NameResolver
  • ScaffoldTypes
  • StatementBinder
  • TopLevelStatements
  • ConstantExpressionEvaluator
  • SideEffects

Some of these classes are quite complex and without the visitor pattern their code would be scattered across many classes and files and we'd need to pass some sort of context class around to store the state of the operation.

The Down Side of the Visitor Pattern

There's only one small downside I've found with the visitor pattern - we can't pass parameters to the visit methods.

For example, in the ControlFlowAnalysis visitor there's an "entry state" that needs to be passed to each element as it's visited.

My solution is to store that information as a member of the operation class with a helper function that sets it before visiting the element. The Visit callback then picks it up again once it's been called.

// Helper function to visit a statement, passing an entryState
CFSTate Visit(AstStatement stmt, CFState entryState)
{
    _entryState = entryState;
    return stmt.Visit(this);
}

CFState Visit(AstIfStatement stmt)
{
    // Would be nicer if this was a parameter, but oh well.
    var entryState = _entryState;
}

It's a bit of a hacky solution, and I could update the visitor interfaces with at least one generic parameter, but I don't think it's worth it.

Enough Already!

Ok, so I think I've gone on enough about the visitor pattern, but just to sum up why it's used:

  • Code for each operation on the AST can be contained to a single class.
  • The AST elements know nothing about the operations being performed on them (this is good).
  • We can add operations without touching the AST classes.
  • It's fast since the dispatch is through virtual methods, not slow type tests.
  • The operations can implement the separate statement and/or expression node visitor patterns as required.
  • If we add a new AST element type, we have to add the associated method to the visitor interface in order to implement its Visit method... and then we're forced to implement the method on all classes. ie: it's type safe.

From now on, pretty much everything the C-minor compiler does centers around the AST - and the visitor pattern is the mechanism for doing it.