Ever typed python script.py or node index.js and wondered what magic happens under the bonnet? How does the computer understand your carefully written text and turn it into actions?
That “magic” is the work of an interpreter (or a compiler). It’s a program that reads your source code and executes it. And while it might seem like a dark art, the fundamental principles are surprisingly straightforward.
In this post, we’re going to demystify the process by building a simple scripting language interpreter from scratch, entirely in C. We’ll design our own language, “SimpleBASIC”, and then write the two most critical components: the Lexer and the Parser.
Let’s dive in.
🗺️ The Interpreter Roadmap
An interpreter’s job is to read text and do what it says. This process is typically broken into three stages.
- Lexical Analysis (Lexing/Scanning): The “lexer” (or “tokeniser”) scans the raw source code string and groups characters into meaningful chunks called tokens. For example, LET a = 10 becomes KEYWORD_LET, IDENTIFIER('a'), OPERATOR_EQUALS, NUMBER(10).
- Syntactical Analysis (Parsing): The “parser” takes this flat list of tokens and builds a data structure that represents the program’s logic and hierarchy. This is often an Abstract Syntax Tree (AST). The parser is also our “grammar police”: it ensures the tokens are in a valid order (e.g., LET = 10 a would be a syntax error).
- Evaluation (The Interpreter): The “evaluator” (or “runtime”) “walks” the AST and executes the logic. When it sees a PrintStatement node, it prints. When it sees a BinaryExpression node, it performs the maths.
We will focus on steps 1 and 2, as they are the heart of understanding the source code.
1. Defining Our Language: “SimpleBASIC”
Before we can interpret a language, we need to invent one. Let’s keep it minimal. We’ll call it SimpleBASIC.
A SimpleBASIC script will look like this:
LET name = "World"
LET a = 10
LET b = a + 20
PRINT "Hello, "
PRINT name
IF b > 25 THEN
PRINT "It is greater!"
ENDIF
Our language features:
- Keywords: LET, PRINT, IF, THEN, ENDIF
- Data Types: Numbers (integers) and Strings
- Variables: Assigned with LET
- Expressions: Simple arithmetic (+, -, *, /) and comparisons (>, <, ==)
- Statements: Each statement is on its own line.
2. The Lexer (Tokeniser)
The lexer’s job is to chop the source code string into tokens. We need two C structures to manage this: one for the TokenType (an enum) and one for the Token itself.
The Token Structure
A token needs to know its type, the actual text it came from (the “lexeme”), and potentially its value (like the number 10, or the string “Hello”).
// lexer.h
// All the possible token types our language supports
typedef enum {
// Single-character tokens
TOKEN_PLUS, TOKEN_MINUS, TOKEN_STAR, TOKEN_SLASH,
TOKEN_EQUALS, TOKEN_GREATER, TOKEN_LESS,
TOKEN_LPAREN, TOKEN_RPAREN, // ( )
TOKEN_NEWLINE, // We'll use this to separate statements
// Literals
TOKEN_IDENTIFIER, // variable names
TOKEN_NUMBER, // 10, 25, 3.14 (we'll stick to integers)
TOKEN_STRING, // "Hello"
// Keywords
TOKEN_LET, TOKEN_PRINT, TOKEN_IF, TOKEN_THEN, TOKEN_ENDIF,
// Special
TOKEN_ERROR, TOKEN_EOF // End Of File
} TokenType;
// The structure for a single token
typedef struct {
TokenType type;
const char* start; // Pointer to the beginning of the lexeme in the source
int length; // Length of the lexeme
int line;
} Token;
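A subtlety of this design: a Token doesn’t own a copy of its text. Its start pointer points into the original source buffer, and the lexeme is not NUL-terminated. Here is a minimal, self-contained sketch of pulling a lexeme out (token_lexeme is a hypothetical helper, not part of the lexer above, and the enum is trimmed to two members):

```c
#include <string.h>

typedef enum { TOKEN_NUMBER, TOKEN_IDENTIFIER } TokenType;

typedef struct {
    TokenType type;
    const char* start;  // Points into the source buffer; NOT NUL-terminated
    int length;
    int line;
} Token;

// Copy a token's lexeme into a caller-supplied, NUL-terminated buffer.
void token_lexeme(Token t, char* buf, int buf_size) {
    int n = t.length < buf_size - 1 ? t.length : buf_size - 1;
    memcpy(buf, t.start, (size_t)n);
    buf[n] = '\0';
}
```

With printf you can also print such a slice directly using the %.*s format, which takes the length as an extra argument: printf("%.*s", t.length, t.start).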
The Lexer Implementation
The lexer itself is a struct that keeps track of its position in the source code. The main function, scan_token(), acts like a state machine. It advances one character at a time and decides what kind of token it’s looking at.
This is a simplified example. A production lexer would be more robust.
// lexer.c (simplified)
#include <stdio.h>
#include <string.h>
#include <ctype.h>
#include "lexer.h"
typedef struct {
const char* start; // Start of the current lexeme
const char* current; // Current character being scanned
int line;
} Lexer;
Lexer lexer;
// --- Helper Functions ---
static Token make_token(TokenType type) {
Token token;
token.type = type;
token.start = lexer.start;
token.length = (int)(lexer.current - lexer.start);
token.line = lexer.line;
return token;
}
static Token error_token(const char* message) {
Token token;
token.type = TOKEN_ERROR;
token.start = message;
token.length = (int)strlen(message);
token.line = lexer.line;
return token;
}
static char advance() {
lexer.current++;
return lexer.current[-1];
}
static int is_at_end() {
return *lexer.current == '\0';
}
static char peek() {
return *lexer.current;
}
static void skip_whitespace() {
for (;;) {
char c = peek();
switch (c) {
case ' ':
case '\r':
case '\t':
advance();
break;
case '\n': // We treat newline as a meaningful token
lexer.line++;
return;
default:
return;
}
}
}
// --- Token Scanning Functions ---
static TokenType check_keyword(int length, const char* keyword, TokenType type) {
    if (lexer.current - lexer.start == length &&
        memcmp(lexer.start, keyword, length) == 0) {
        return type;
    }
    return TOKEN_IDENTIFIER;
}
static TokenType identifier_type() {
    // Check for keywords: dispatch on the first letter, then compare
    // the whole lexeme against the single candidate keyword.
    switch (lexer.start[0]) {
        case 'E': return check_keyword(5, "ENDIF", TOKEN_ENDIF);
        case 'I': return check_keyword(2, "IF", TOKEN_IF);
        case 'L': return check_keyword(3, "LET", TOKEN_LET);
        case 'P': return check_keyword(5, "PRINT", TOKEN_PRINT);
        case 'T': return check_keyword(4, "THEN", TOKEN_THEN);
    }
    return TOKEN_IDENTIFIER;
}
static Token scan_identifier() {
while (isalpha(peek()) || isdigit(peek()) || peek() == '_') advance();
return make_token(identifier_type());
}
static Token scan_number() {
while (isdigit(peek())) advance();
return make_token(TOKEN_NUMBER);
}
static Token scan_string() {
while (peek() != '"' && !is_at_end()) {
if (peek() == '\n') lexer.line++;
advance();
}
if (is_at_end()) return error_token("Unterminated string.");
advance(); // Consume the closing quote
return make_token(TOKEN_STRING);
}
// --- Public API ---
void init_lexer(const char* source) {
lexer.start = source;
lexer.current = source;
lexer.line = 1;
}
Token scan_token() {
skip_whitespace();
lexer.start = lexer.current;
if (is_at_end()) return make_token(TOKEN_EOF);
char c = advance();
// Check for identifiers first
if (isalpha(c) || c == '_') return scan_identifier();
// Check for numbers
if (isdigit(c)) return scan_number();
switch (c) {
case '(': return make_token(TOKEN_LPAREN);
case ')': return make_token(TOKEN_RPAREN);
case '+': return make_token(TOKEN_PLUS);
case '-': return make_token(TOKEN_MINUS);
case '*': return make_token(TOKEN_STAR);
case '/': return make_token(TOKEN_SLASH);
case '>': return make_token(TOKEN_GREATER);
case '<': return make_token(TOKEN_LESS);
case '"': return scan_string();
case '\n': return make_token(TOKEN_NEWLINE);
case '=':
    // Our simplified token set reuses TOKEN_EQUALS for both '=' and '=='.
    // A real lexer would emit a distinct TOKEN_ASSIGN for '=' and
    // TOKEN_EQUAL_EQUAL for '=='.
    if (peek() == '=') advance();
    return make_token(TOKEN_EQUALS);
default:
return error_token("Unexpected character.");
}
}
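The first-letter dispatch in identifier_type() is worth seeing in isolation. Here is the same idea as a standalone sketch (the K_* names and keyword_type are illustrative, not part of the lexer above):

```c
#include <string.h>

typedef enum { K_IDENTIFIER, K_LET, K_PRINT, K_IF } KeywordType;

// Same idea as identifier_type(): switch on the first letter, then
// compare the whole lexeme against the one candidate keyword.
KeywordType keyword_type(const char* lexeme, int length) {
    switch (lexeme[0]) {
        case 'I': if (length == 2 && memcmp(lexeme, "IF", 2) == 0) return K_IF; break;
        case 'L': if (length == 3 && memcmp(lexeme, "LET", 3) == 0) return K_LET; break;
        case 'P': if (length == 5 && memcmp(lexeme, "PRINT", 5) == 0) return K_PRINT; break;
    }
    return K_IDENTIFIER; // Anything else is a plain variable name
}
```

The length check matters: without it, an identifier like LETTER would match LET as a prefix and be mis-tokenised as a keyword.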
Now, if we feed our lexer the string LET a = 10\n, calling scan_token() repeatedly will produce:
{ type: TOKEN_LET, lexeme: "LET" }
{ type: TOKEN_IDENTIFIER, lexeme: "a" }
{ type: TOKEN_EQUALS, lexeme: "=" }
{ type: TOKEN_NUMBER, lexeme: "10" }
{ type: TOKEN_NEWLINE, lexeme: "\n" }
{ type: TOKEN_EOF, lexeme: "" }
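You can verify a sequence like this with a deliberately stripped-down, self-contained version of the scanning loop. The mini_* names below are illustrative, and this sketch handles only identifiers, the LET keyword, integers, '=' and newline:

```c
#include <ctype.h>
#include <string.h>

typedef enum { T_LET, T_IDENTIFIER, T_EQUALS, T_NUMBER, T_NEWLINE, T_EOF } MiniType;

typedef struct { MiniType type; const char* start; int length; } MiniToken;

static const char* cur;

void mini_init(const char* source) { cur = source; }

// A stripped-down scan_token(): skip blanks, then classify one token.
MiniToken mini_scan(void) {
    while (*cur == ' ' || *cur == '\t' || *cur == '\r') cur++;
    const char* start = cur;
    MiniToken t = { T_EOF, start, 0 };
    if (*cur == '\0') return t;
    char c = *cur++;
    if (isalpha((unsigned char)c)) {
        while (isalnum((unsigned char)*cur) || *cur == '_') cur++;
        t.length = (int)(cur - start);
        t.type = (t.length == 3 && memcmp(start, "LET", 3) == 0) ? T_LET : T_IDENTIFIER;
    } else if (isdigit((unsigned char)c)) {
        while (isdigit((unsigned char)*cur)) cur++;
        t.type = T_NUMBER;
        t.length = (int)(cur - start);
    } else if (c == '=') {
        t.type = T_EQUALS;
        t.length = 1;
    } else if (c == '\n') {
        t.type = T_NEWLINE;
        t.length = 1;
    }
    // (any other character falls through as T_EOF in this sketch)
    return t;
}
```

Feeding it "LET a = 10\n" yields exactly the token sequence shown above, minus the keywords and operators this miniature leaves out.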
3. The Parser (Building the AST)
The parser’s job is to take this flat list of tokens and build a tree that represents the code’s structure. This is the Abstract Syntax Tree (AST).
Why a tree? Because code is hierarchical. An IF statement contains other statements. An arithmetic operation 10 + 5 is a Binary node with two “children”: the literals 10 and 5.
The AST Structures
We need C structs to define the nodes of our tree. We’ll have two kinds: Statements (which do things) and Expressions (which produce values).
We use structs and pointers to build this tree. This is a classic application of tree data structures in C.
// ast.h
// --- Expression Nodes ---
// We need a base 'Expr' type. We can use a struct with an enum,
// or a more advanced 'visitor pattern'. We'll use a tagged union.
typedef struct Expr Expr; // Forward declaration
// An expression for a literal value (number, string)
typedef struct {
int is_number; // The tag: 1 means 'as.number' holds the value, 0 means 'as.string'
union {
double number; // Let's use doubles for numbers
char* string;
} as;
} LiteralExpr;
// An expression for a binary operation (e.g., a + b)
typedef struct {
Expr* left;
Token operator;
Expr* right;
} BinaryExpr;
// An expression for a variable (e.g., 'a')
typedef struct {
Token name;
} VariableExpr;
// Tagged union for all Expression types
typedef enum {
EXPR_LITERAL,
EXPR_BINARY,
EXPR_VARIABLE
} ExprType;
struct Expr {
ExprType type;
union {
LiteralExpr literal;
BinaryExpr binary;
VariableExpr variable;
} as;
};
// --- Statement Nodes ---
// Statements form a list (the program)
typedef struct Stmt Stmt; // Forward declaration
typedef struct {
Token name;
Expr* initializer; // The expression after '='
} LetStmt;
typedef struct {
Expr* expression;
} PrintStmt;
// Tagged union for all Statement types
typedef enum {
STMT_LET,
STMT_PRINT,
// We would add STMT_IF here
} StmtType;
struct Stmt {
StmtType type;
union {
LetStmt let;
PrintStmt print;
} as;
Stmt* next; // Statements can be a linked list
};
Note: This code requires careful memory management (malloc for each node) which is omitted for brevity. You’d also need functions to “free” the AST.
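To make the tagged union concrete, here is a self-contained miniature: build the tree for 10 + 5 by hand, then walk it. This trimmed version tags operators with a plain char instead of a Token, purely for brevity, and the helper names are illustrative:

```c
#include <stdlib.h>

typedef enum { EXPR_LITERAL, EXPR_BINARY } ExprType;

typedef struct Expr Expr;

typedef struct { double number; } LiteralExpr;                 // numbers only here
typedef struct { Expr* left; char op; Expr* right; } BinaryExpr;

struct Expr {
    ExprType type;
    union { LiteralExpr literal; BinaryExpr binary; } as;
};

Expr* new_number(double n) {
    Expr* e = malloc(sizeof(Expr));
    e->type = EXPR_LITERAL;
    e->as.literal.number = n;
    return e;
}

Expr* new_binary(Expr* left, char op, Expr* right) {
    Expr* e = malloc(sizeof(Expr));
    e->type = EXPR_BINARY;
    e->as.binary.left = left;
    e->as.binary.op = op;
    e->as.binary.right = right;
    return e;
}

// Walk the tree recursively -- a preview of what the evaluator does.
double eval_number(const Expr* e) {
    if (e->type == EXPR_LITERAL) return e->as.literal.number;
    double l = eval_number(e->as.binary.left);
    double r = eval_number(e->as.binary.right);
    return e->as.binary.op == '+' ? l + r : l - r;
}
```

Notice that the walker switches on the type tag and only then touches the matching union member; reading the wrong member of a union is undefined behaviour in C.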
The Parser Implementation
The parser consumes tokens from the lexer and builds the AST. We’ll write a Recursive Descent Parser. This just means we’ll have a set of functions that mirror our language’s grammar.
- parse_statement() decides if we’re looking at a LET or PRINT statement.
- parse_expression() handles arithmetic (this is where operator precedence is handled, e.g., * before +).
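One common way to handle precedence is to give each level of the grammar its own function, with tighter-binding levels called from looser ones. Here is a self-contained sketch that parses and evaluates arithmetic straight off a string rather than a token list (parse_expr_str, parse_term and parse_factor are illustrative names, not part of the article’s parser):

```c
#include <ctype.h>

// Grammar, one function per precedence level:
//   expression -> term (('+' | '-') term)*
//   term       -> factor (('*' | '/') factor)*
//   factor     -> NUMBER
static const char* p;

static double parse_factor(void) {
    while (*p == ' ') p++;
    double n = 0;
    while (isdigit((unsigned char)*p)) n = n * 10 + (*p++ - '0');
    return n;
}

static double parse_term(void) {
    double left = parse_factor();
    for (;;) {
        while (*p == ' ') p++;
        if (*p == '*') { p++; left *= parse_factor(); }
        else if (*p == '/') { p++; left /= parse_factor(); }
        else return left;
    }
}

double parse_expr_str(const char* source) {
    p = source;
    double left = parse_term();
    for (;;) {
        while (*p == ' ') p++;
        if (*p == '+') { p++; left += parse_term(); }
        else if (*p == '-') { p++; left -= parse_term(); }
        else return left;
    }
}
```

Because parse_expr_str can only combine whole terms, * and / automatically bind tighter than + and -. A Pratt parser achieves the same effect with a single loop and a precedence table instead of one function per level.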
Here is a simplified example of a parser. It would live in parser.c.
// parser.c (conceptual, simplified)
#include "lexer.h"
#include "ast.h"
#include <stdlib.h> // for malloc
// The parser holds the current and previous tokens
typedef struct {
Token current;
Token previous;
int had_error;
} Parser;
Parser parser;
// --- Parser Helper Functions ---
// (We need functions like advance(), check(), match(), consume(), error()...)
// ... these would call scan_token() from the lexer ...
// --- Grammar Functions (The Recursive Descent) ---
// Forward-declare so they can call each other
Expr* parse_expression();
// Handle literals, variables, and grouping
Expr* parse_primary() {
    if (match(TOKEN_NUMBER) || match(TOKEN_STRING)) {
        // Build a LiteralExpr node in place -- allocating a separate
        // LiteralExpr and copying it in would leak the temporary
        Expr* node = (Expr*)malloc(sizeof(Expr));
        node->type = EXPR_LITERAL;
        // ... (copy the token's value into 'node->as.literal') ...
        return node;
    }
    if (match(TOKEN_IDENTIFIER)) {
        // Build a VariableExpr node from the token we just matched
        Expr* node = (Expr*)malloc(sizeof(Expr));
        node->type = EXPR_VARIABLE;
        node->as.variable.name = parser.previous;
        return node;
    }
    // ... handle ( ) grouping ...
    return NULL; // Error
}
// A full expression parser handles operator precedence; this simple
// version only handles left-to-right chains of '+' and '-':
Expr* parse_expression() {
    Expr* left = parse_primary(); // e.g., 'a'
    while (match(TOKEN_PLUS) || match(TOKEN_MINUS)) {
        Token operator = parser.previous;
        Expr* right = parse_primary(); // e.g., 'b'
        // Build a BinaryExpr node in place
        Expr* node = (Expr*)malloc(sizeof(Expr));
        node->type = EXPR_BINARY;
        node->as.binary.left = left;
        node->as.binary.operator = operator;
        node->as.binary.right = right;
        left = node; // The result becomes the left operand of any next operator
    }
    return left; // A lone primary falls straight through
}
// --- Statement Parsers ---
Stmt* parse_print_statement() {
    // We already consumed 'PRINT'. Now we parse the expression after it.
    Expr* value = parse_expression();
    consume(TOKEN_NEWLINE, "Expect newline after expression.");
    // Build the Stmt node in place
    Stmt* node = (Stmt*)malloc(sizeof(Stmt));
    node->type = STMT_PRINT;
    node->as.print.expression = value;
    node->next = NULL; // The caller links statements together
    return node;
}
Stmt* parse_let_statement() {
    // We consumed 'LET'. Expect an identifier.
    Token name = consume(TOKEN_IDENTIFIER, "Expect variable name.");
    consume(TOKEN_EQUALS, "Expect '=' after variable name.");
    Expr* initializer = parse_expression();
    consume(TOKEN_NEWLINE, "Expect newline after expression.");
    // Build the Stmt node in place
    Stmt* node = (Stmt*)malloc(sizeof(Stmt));
    node->type = STMT_LET;
    node->as.let.name = name;
    node->as.let.initializer = initializer;
    node->next = NULL; // The caller links statements together
    return node;
}
Stmt* parse_statement() {
if (match(TOKEN_PRINT)) {
return parse_print_statement();
}
if (match(TOKEN_LET)) {
return parse_let_statement();
}
// ...
// If no match, it's an error or just an expression
return NULL;
}
// The main parse function
Stmt* parse(const char* source) {
init_lexer(source);
// ... setup parser ...
Stmt* first_stmt = NULL;
Stmt* last_stmt = NULL;
while (!match(TOKEN_EOF)) {
Stmt* stmt = parse_statement();
if (stmt) {
if (first_stmt == NULL) {
first_stmt = stmt;
} else {
last_stmt->next = stmt;
}
last_stmt = stmt;
}
}
return first_stmt; // Return the head of the linked list of statements
}
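The first_stmt/last_stmt dance above is ordinary tail-insertion into a singly linked list. Isolated into a self-contained sketch (stmt_new, stmt_append and stmt_count are hypothetical helpers, and the Stmt here is trimmed to its tag and link):

```c
#include <stdlib.h>

typedef enum { STMT_LET, STMT_PRINT } StmtType;

typedef struct Stmt {
    StmtType type;
    struct Stmt* next;
} Stmt;

Stmt* stmt_new(StmtType type) {
    Stmt* s = malloc(sizeof(Stmt));
    s->type = type;
    s->next = NULL;
    return s;
}

// Append at the tail, exactly as parse() does with first_stmt/last_stmt.
void stmt_append(Stmt** head, Stmt** tail, Stmt* s) {
    if (*head == NULL) *head = s;
    else (*tail)->next = s;
    *tail = s;
}

// The evaluator will walk the program with a loop just like this one.
int stmt_count(const Stmt* head) {
    int n = 0;
    for (; head != NULL; head = head->next) n++;
    return n;
}
```

Keeping a tail pointer makes each append O(1); without it, every new statement would require walking the whole list.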
4. The Evaluator (Running the Code)
We’ve done the hard part! We now have a list of Stmt structs (our AST). The final step is to “walk” this tree and execute it.
This is done with a function, let’s call it evaluate(), which takes an Expr or Stmt node and performs an action.
- To evaluate a LetStmt: It calls evaluate() on the initializer expression to get its value, then stores that value in a “symbol table” (like a hash map) with the variable’s name as the key.
- To evaluate a PrintStmt: It calls evaluate() on its expression, gets the resulting value (a number or string), and prints it to the console using printf().
- To evaluate a BinaryExpr: It recursively calls evaluate() on its left child, then its right child, and then performs the operation (e.g., adds the two results).
- To evaluate a LiteralExpr: It just returns the value (the number or string).
- To evaluate a VariableExpr: It looks up the variable’s name in the symbol table and returns its stored value.
This stage is beautifully recursive. In object-oriented languages it is often structured with the Visitor design pattern; in C, a switch over each node’s type tag does the same job.
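The “symbol table” can start out far simpler than a hash map. Here is a minimal sketch using a fixed-size array with linear scan (env_set and env_get are illustrative names; a real implementation would copy the name strings, report overflow, and eventually hash):

```c
#include <string.h>

#define MAX_VARS 64

typedef struct { const char* name; double value; } Var;

static Var vars[MAX_VARS];
static int var_count = 0;

// Define or update a variable (what a LetStmt does).
void env_set(const char* name, double value) {
    for (int i = 0; i < var_count; i++) {
        if (strcmp(vars[i].name, name) == 0) { vars[i].value = value; return; }
    }
    if (var_count < MAX_VARS) {
        vars[var_count].name = name;   // NOTE: borrows the caller's string
        vars[var_count].value = value;
        var_count++;
    }
}

// Look up a variable (what a VariableExpr does).
// Returns 0 on success, -1 if the name is undefined.
int env_get(const char* name, double* out) {
    for (int i = 0; i < var_count; i++) {
        if (strcmp(vars[i].name, name) == 0) { *out = vars[i].value; return 0; }
    }
    return -1;
}
```

Linear scan is O(n) per lookup, which is perfectly fine for a toy language; swap in a hash table once programs have more than a handful of variables.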
🎉 Conclusion
We’ve just built the core of a real programming language interpreter. We’ve seen how to go from raw text to meaningful tokens (the lexer) and then from a flat list of tokens to a hierarchical AST (the parser).
From here, you can see how to extend it:
- Add a WHILE loop (a new WhileStmt struct).
- Add functions (a FuncStmt and CallExpr).
- Improve the expression parser to handle operator precedence (a classic problem, often solved with a Pratt parser).
Writing an interpreter is one of the most rewarding projects in computer science. You touch on data structures, algorithms, memory management, and high-level design.