LLVM's Kaleidoscope in Swift - Part 0: Prelude
When deciding to write this story I was expecting to start by saying something along the lines of “I know there are already many tutorials for writing Kaleidoscope in Swift, but here’s another one.” - Turns out, there’s not. The only one I could find was this one by Harlan Haskins. And while Harlan’s tutorial is definitely worth checking out, it sometimes glosses over details which I had to figure out by using other resources. What I therefore hope to do with this series of posts, is to create a unified resource for learning how to implement LLVM’s tutorial language Kaleidoscope. The goal of this series is of course not just to create a working compiler for Kaleidoscope, but to learn how compilers can be built in general using the LLVM infrastructure.
Knowledge of the Swift programming language is not necessarily required, but will definitely be helpful for this specific tutorial. If you prefer another language, you probably won’t be hard pressed to find a Kaleidoscope tutorial for it as well.
As a short disclaimer - I’m no expert on the topics I will be writing about. I’m going to try to explain my approach to learning about them, but I’ll probably make mistakes. If you feel like something does not sound correct, do go and consult other resources! If you find mistakes, please report them on this post’s GitHub page. Thanks!
If you’ve stumbled upon this post without knowing what LLVM is, this section is for you - if you already know, skip to the next one. →
In the book The Architecture of Open Source Applications Chris Lattner, the original creator of LLVM, has written a chapter introducing the LLVM project to newcomers. In it he explains a fundamental problem with compiler design, that LLVM intends to solve:
Let’s assume you want to write a new programming language. The way you would traditionally go about this is by writing
- a frontend able to convert source code into some structured, internal representation
- an optimizer which changes the code to run more efficiently
- a backend which generates the machine code instructions for each target platform (x86, ARM, PowerPC, …)
While this design works well for any given language, it is rather inefficient when considering the space of all languages. The inefficiency really results from steps 2 and 3.
Writing a good optimizer is no trivial task. First of all, you have to be able to figure out what you can change about a program’s code, without changing its behavior. And once you know what you can safely change in a program’s code, you have to figure out how to change it in order to increase efficiency. This can be a deeply mathematical process and is therefore not well suited for casual programmers.
Writing compiler backends produces a different problem: there are many target architectures. If you want to create a programming language that will actually be relevant, your backend needs to be able to generate machine instructions for as many of them as possible. The following list of LLVM’s supported target architectures should give you an idea of why this might be difficult though:
marcus@~: llc --version LLVM (http://llvm.org/): LLVM version 10.0.0 Optimized build. Default target: x86_64-apple-darwin19.5.0 Host CPU: skylake Registered Targets: aarch64 - AArch64 (little endian) aarch64_32 - AArch64 (little endian ILP32) aarch64_be - AArch64 (big endian) amdgcn - AMD GCN GPUs arm - ARM arm64 - ARM64 (little endian) arm64_32 - ARM64 (little endian ILP32) armeb - ARM (big endian) bpf - BPF (host endian) bpfeb - BPF (big endian) bpfel - BPF (little endian) hexagon - Hexagon lanai - Lanai mips - MIPS (32-bit big endian) mips64 - MIPS (64-bit big endian) mips64el - MIPS (64-bit little endian) mipsel - MIPS (32-bit little endian) msp430 - MSP430 [experimental] nvptx - NVIDIA PTX 32-bit nvptx64 - NVIDIA PTX 64-bit ppc32 - PowerPC 32 ppc64 - PowerPC 64 ppc64le - PowerPC 64 LE r600 - AMD GPUs HD2XXX-HD6XXX riscv32 - 32-bit RISC-V riscv64 - 64-bit RISC-V sparc - Sparc sparcel - Sparc LE sparcv9 - Sparc V9 systemz - SystemZ thumb - Thumb thumbeb - Thumb (big endian) wasm32 - WebAssembly 32-bit wasm64 - WebAssembly 64-bit x86 - 32-bit X86: Pentium-Pro and above x86-64 - 64-bit X86: EM64T and AMD64 xcore - XCore
What LLVM therefore aims to do, is to remove steps 2 and 3 from your compiler design process. Neither do you need to be able to write an optimizer, nor a backend targeting over 30 different architectures. LLVM just does that for you:
You do need to help out LLVM a bit though, by passing your program to it using the LLVM Intermediate Representation (LLVM IR). Because, just as step 1 required us to create some “structured, internal representation” of our program, LLVM also needs some internal representation of a program in order to reason about it.
So all we have to do now in order to create a working compiler is to write a frontend and generate LLVM IR. And in the process we get the power of a top-of-the-line optimizer as well as a host of target architectures.
Not without reason do notable companies rely and work on LLVM.
This section begins with some rather theoretical aspects of programming languages. It is in no way required for the rest of the tutorial, so if you’d rather skip it, go right ahead. →
A programming language is what is known as a “formal language” in theoretical computer science. Wikipedia defines a formal language as:
… consist[ing] of words whose letters are taken from an alphabet and are well-formed according to a specific set of rules.
While this might sound a bit vague, it actually tells us a lot about the structure of formal languages. Let’s unpack it:
The alphabet of a formal language is the set of symbols (characters) that are allowed within the language. This should be very intuitive when comparing to spoken languages. For example, words in the English language will never contain the symbol
ö, because it is not part of the English alphabet. German on the other hand has
ö in its alphabet, which is why German words can use that symbol. It’s that simple.
Words are again very well explained by analogy to spoken language. They are sequences of those symbols contained in the language’s alphabet. Now this does not mean that every sequence of symbols will create a valid word. For example, the sequences
follow both consist entirely of symbols contained in the English alphabet. But only the sequence
follow is a word in the English language.
This brings us to the main difference between natural spoken languages and formal languages.
The fact that
follow is a word in the English language while
folge is not, is rather arbitrary. This is why we need long dictionaries to describe which words belong to the English language - there’s no system.
Formal languages on the other hand do have a system for describing which sequences of symbols are words in a language, and which aren’t. They use formation rules to describe how to generate each and every single word in a language, an not a single word more.
Let’s look at some examples using BNF notation:
<digit> ::= "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"
This is a formation rule which defines what a
digit is (syntactically). It states that a
digit can either be the symbol
2 … or
9. That’s it. In a way we have defined the language of digits, by simply listing all of its words. If someone would give us the sequence
123, we could deduce that it is not a valid digit, as the formation rule above does not allow us to generate
So what if we want to define the language of integers? We could build off of our previous formation rule for
digit and add the following:
<integer> ::= <digit> | <digit> <integer>
This formation rule now tells us that an
integer is either a
digit (as we have defined previously), or it is a
digit followed by another
integer. As you can see, formation rules can be recursive. This allows us to generate words of different lengths.
The formation rule for
digit constrained our words to just be single symbols. But now we can for example derive that
123 is part of the language of integers, by generating it the following way:
<integer> → <digit> <integer> → "1" <integer> → "1" <digit> <integer> → "1" "2" <integer> → "1" "2" <digit> → "1" "2" "3"
By combining multiple formation rules we have started building a language grammar. Programming languages are also defined by a grammar.
<prototype> ::= <identifier> "(" <params> ")" <params> ::= <identifier> | <identifier> "," <params> <definition> ::= "def" <prototype> <expr> ";" <extern> ::= "extern" <prototype> ";" <operator> ::= "+" | "-" | "*" | "/" | "%" <expr> ::= <binary> | <call> | <identifier> | <number> | <ifelse> | "(" <expr> ")" <binary> ::= <expr> <operator> <expr> <call> ::= <identifier> "(" <arguments> ")" <ifelse> ::= "if" <expr> "then" <expr> "else" <expr> <arguments> ::= <expr> | <expr> "," <arguments>
And just for completeness, I’m going to add the following rule:
<kaleidoscope> ::= <prototype> | <definition> | <expr> | <prototype> <kaleidoscope> | <definition> <kaleidoscope> | <expr> <kaleidoscope>
This grammar defines a language that is slightly different from the one defined in LLVM’s own tutorial. E.g. we do not have a
<operator in our variant.
kaleidoscope now defines the entirety of the Kaleidoscope language. Every valid Kaleidoscope program can be generated by starting with the
kaleidoscope symbol and performing substitutions according to the formation rules. (Some invalid programs can be generated as well, which we will deal with later.)
The rules below define the symbols
number, as they are not actually defined above.
<identifier> ::= <identifier-head> | <identifier> <identifier-body> <identifier-body> ::= <identifier-head> | <digit> <identifier-head> ::= #Foundation.CharacterSet.letter# | "_" <number> ::= <digits> | <digits> "." <digits> <digits> ::= <digit> | <digit> <digits> <digit> ::= "0" | "1" | ... | "9"
I’m assuming Harlan didn’t include these rules explicitly, because we usually have a pretty intuitive understanding of what identifiers and numbers are.
The reason I explicitly included these rules though, is because they roughly show how our compiler frontend will be divided.
Setting up the Project
We will be using Swift Package Manager for this project. If you are not familiar with SPM, I recommend checking it out. You can always just follow what I’m doing though, so an understanding of SPM is not required.
We will also be using the LLVM API via its C-bindings, so you will need to download and install LLVM. This will only become relevant once we start writing the IR Generator though, so I will defer explanation of how to set it up until then.
To setup a Swift package, create a new directory for the project, and initialize a new package from within the directory:
marcus@~: mkdir Kaleidoscope marcus@~: cd Kaleidoscope marcus@Kaleidoscope: swift package init Creating library package: Kaleidoscope Creating Package.swift Creating README.md Creating .gitignore Creating Sources/ Creating Sources/Kaleidoscope/Kaleidoscope.swift Creating Tests/ Creating Tests/LinuxMain.swift Creating Tests/KaleidoscopeTests/ Creating Tests/KaleidoscopeTests/KaleidoscopeTests.swift Creating Tests/KaleidoscopeTests/XCTestManifests.swift
If you want to use Xcode as your IDE, you can call
swift package generate-xcodeproj to generate a
.xcodeproj file for this project. As Xcode does not necessarily update according to the changes we’re about to make, I recommend doing this after we’ve completed the following steps.
We will split our package into two targets,
Kaleidoscope which will become an executable (the compiler), and
KaleidoscopeLib which will be the library containing almost all of the logic of the compiler. We need to do this, because testing an executable is currently not possible using SPM. So our
Package.swift manifest file has to look like this:
// swift-tools-version:5.1 import PackageDescription let package = Package( name: "Kaleidoscope", products: [ .executable( name: "Kaleidoscope", targets: ["Kaleidoscope"] ), ], dependencies: [ ], targets: [ .target( name: "KaleidoscopeLib", dependencies:  ), .target( name: "Kaleidoscope", dependencies: ["KaleidoscopeLib"] ), .testTarget( name: "KaleidoscopeLibTests", dependencies: ["KaleidoscopeLib"] ), ] )
We need to reflect the target structure, in the folder structure of our project. So in the package’s
Sources directory, we will need a
Kaleidoscope and a
marcus@Kaleidoscope: cd Sources/ marcus@Sources: mkdir KaleidoscopeLib marcus@Sources: ls Kaleidoscope KaleidoscopeLib
Satisfying Swift PM
If we tried to create an Xcode project from this package, we’d run into some errors. That’s because there are some subtle requirements we are yet to fulfill:
error: Source files for target KaleidoscopeLibTests should be located under 'Sources/KaleidoscopeLibTests', or a custom sources path can be set with the 'path' property in Package.swift
This error is rather self-explanitory. We need to change all of the references to “KaleidoscopeTests” to be “KaleidoscopeLibTests” instead:
marcus@Sources: cd ../Tests marcus@Kaleidoscope: mv KaleidoscopeTests KaleidoscopeLibTests marcus@Kaleidoscope: cd KaleidoscopeLibTests marcus@KaleidoscopeLibTests: mv KaleidoscopeTests.swift KaleidoscopeLibTests.swift
error: executable product 'Kaleidoscope' should have exactly one executable target
This error explains the problem pretty well, but not the solution. In a Swift package every executable product must contain exactly one executable target, so that Swift knows where to begin execution when running the product. But what makes a target executable? - The presence of a
Kaleidoscope target currently only contains
Kaleidoscope.swift, so again we simply have to rename a file to fix the problem:
marcus@KaleidoscopeLibTests: cd ../../Kaleidoscope marcus@Kaleidoscope: mv Kaleidoscope.swift main.swift
As mentioned in the section about Kaleidoscope’s grammar above, our compiler frontend will be divided into multiple components. Those are
- IR Generator
Accordingly we can create a source file for each of these components in the package’s
marcus@Sources: cd KaleidoscopeLib/ marcus@KaleidoscopeLib: touch Lexer.swift Parser.swift IRGenerator.swift
The lexer will be responsible for turning raw Kaleidoscope source code into syntactical units called tokens. Tokens will be things like number literals, keywords, and identifiers.
The parser will use those tokens to compose higher-level symbols such as expressions, prototype declarations and function definitions.
Lastly the IR Generator will use symbols produced by the parser to generate LLVM IR that fulfills the semantics associated with those symbols.
Don’t worry if you didn’t understand any of that yet - we will go into more detail about each of these components in their respective posts.
Until then, thanks for reading!
A full code listing for this post can be found here.