Ever typed python script.py or node index.js and wondered what magic happens under the bonnet? How does the computer understand your carefully written text and turn it into actions?
That “magic” is the work of an interpreter (or a compiler). It’s a program that reads your source code and executes it. And while it might seem like a dark art, the fundamental principles are surprisingly straightforward.
In this post, we’re going to demystify the process by building a simple scripting language interpreter from scratch, entirely in C. We’ll design our own language, “SimpleBASIC”, and then write the two most critical components: the Lexer and the Parser.
Let’s dive in.
🗺️ The Interpreter Roadmap
An interpreter’s job is to read text and do what it says. This process is typically broken into three stages.
- Lexical Analysis (Lexing/Scanning): The “lexer” (or “tokeniser”) scans the raw source code string and groups characters into meaningful chunks called tokens. For example, LET a = 10 becomes KEYWORD_LET, IDENTIFIER('a'), OPERATOR_EQUALS, NUMBER(10).
- Syntactical Analysis (Parsing): The “parser” takes this flat list of tokens and builds a data structure that represents the program’s logic and hierarchy. This is often an Abstract Syntax Tree (AST). The parser is also our “grammar police”: it ensures the tokens are in a valid order (e.g., LET = 10 a would be a syntax error).
- Evaluation (The Interpreter): The “evaluator” (or “runtime”) “walks” the AST and executes the logic. When it sees a PrintStatement node, it prints. When it sees a BinaryExpression node, it performs the maths.
We will focus on steps 1 and 2, as they are the heart of understanding the source code.
1. Defining Our Language: “SimpleBASIC”
Before we can interpret a language, we need to invent one. Let’s keep it minimal. We’ll call it SimpleBASIC.
A SimpleBASIC script will look like this:
LET name = "World"
LET a = 10
LET b = a + 20
PRINT "Hello, "
PRINT name
IF b > 25 THEN
PRINT "It is greater!"
ENDIF
Our language features:
- Keywords: LET, PRINT, IF, THEN, ENDIF
- Data Types: Numbers (integers) and Strings
- Variables: Assigned with LET
- Expressions: Simple arithmetic (+, -, *, /) and comparisons (>, <, ==)
- Statements: Each statement is on its own line.
2. The Lexer (Tokeniser)
The lexer’s job is to chop the source code string into tokens. We need two C structures to manage this: one for the TokenType (an enum) and one for the Token itself.
The Token Structure
A token needs to know its type, the actual text it came from (the “lexeme”), and potentially its value (like the number 10, or the string “Hello”).
// lexer.h
// All the possible token types our language supports
typedef enum {
// Single-character tokens
TOKEN_PLUS, TOKEN_MINUS, TOKEN_STAR, TOKEN_SLASH,
TOKEN_EQUALS, TOKEN_GREATER, TOKEN_LESS,
TOKEN_LPAREN, TOKEN_RPAREN, // ( )
TOKEN_NEWLINE, // We'll use this to separate statements
// Literals
TOKEN_IDENTIFIER, // variable names
TOKEN_NUMBER, // 10, 25, 3.14 (we'll stick to integers)
TOKEN_STRING, // "Hello"
// Keywords
TOKEN_LET, TOKEN_PRINT, TOKEN_IF, TOKEN_THEN, TOKEN_ENDIF,
// Special
TOKEN_ERROR, TOKEN_EOF // End Of File
} TokenType;
// The structure for a single token
typedef struct {
TokenType type;
const char* start; // Pointer to the beginning of the lexeme in the source
int length; // Length of the lexeme
int line;
} Token;
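A subtlety of this design: a Token doesn’t own a copy of its text. Its start pointer points into the original source buffer, and the lexeme is not NUL-terminated. Here is a minimal, self-contained sketch of pulling a lexeme out (token_lexeme is a hypothetical helper, not part of the lexer above, and the enum is trimmed to two members):

```c
#include <string.h>

typedef enum { TOKEN_NUMBER, TOKEN_IDENTIFIER } TokenType;

typedef struct {
    TokenType type;
    const char* start;  // Points into the source buffer; NOT NUL-terminated
    int length;
    int line;
} Token;

// Copy a token's lexeme into a caller-supplied, NUL-terminated buffer.
void token_lexeme(Token t, char* buf, int buf_size) {
    int n = t.length < buf_size - 1 ? t.length : buf_size - 1;
    memcpy(buf, t.start, (size_t)n);
    buf[n] = '\0';
}
```

With printf you can also print such a slice directly using the %.*s format, which takes the length as an extra argument: printf("%.*s", t.length, t.start).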
The Lexer Implementation
The lexer itself is a struct that keeps track of its position in the source code. The main function, scan_token(), acts like a state machine. It advances one character at a time and decides what kind of token it’s looking at.
This is a simplified example. A production lexer would be more robust.
// lexer.c (simplified)
#include <stdio.h>
#include <string.h>
#include <ctype.h>
#include "lexer.h"
typedef struct {
const char* start; // Start of the current lexeme
const char* current; // Current character being scanned
int line;
} Lexer;
Lexer lexer;
// --- Helper Functions ---
static Token make_token(TokenType type) {
Token token;
token.type = type;
token.start = lexer.start;
token.length = (int)(lexer.current - lexer.start);
token.line = lexer.line;
return token;
}
static Token error_token(const char* message) {
Token token;
token.type = TOKEN_ERROR;
token.start = message;
token.length = (int)strlen(message);
token.line = lexer.line;
return token;
}
static char advance() {
lexer.current++;
return lexer.current[-1];
}
static int is_at_end() {
return *lexer.current == '\0';
}
static char peek() {
return *lexer.current;
}
static void skip_whitespace() {
for (;;) {
char c = peek();
switch (c) {
case ' ':
case '\r':
case '\t':
advance();
break;
case '\n': // We treat newline as a meaningful token
lexer.line++;
return;
default:
return;
}
}
}
// --- Token Scanning Functions ---
static TokenType check_keyword(int length, const char* keyword, TokenType type) {
    if (lexer.current - lexer.start == length &&
        memcmp(lexer.start, keyword, length) == 0) {
        return type;
    }
    return TOKEN_IDENTIFIER;
}
static TokenType identifier_type() {
    // Check for keywords: dispatch on the first letter, then compare
    // the whole lexeme against the single candidate keyword.
    switch (lexer.start[0]) {
        case 'E': return check_keyword(5, "ENDIF", TOKEN_ENDIF);
        case 'I': return check_keyword(2, "IF", TOKEN_IF);
        case 'L': return check_keyword(3, "LET", TOKEN_LET);
        case 'P': return check_keyword(5, "PRINT", TOKEN_PRINT);
        case 'T': return check_keyword(4, "THEN", TOKEN_THEN);
    }
    return TOKEN_IDENTIFIER;
}
static Token scan_identifier() {
while (isalpha(peek()) || isdigit(peek()) || peek() == '_') advance();
return make_token(identifier_type());
}
static Token scan_number() {
while (isdigit(peek())) advance();
return make_token(TOKEN_NUMBER);
}
static Token scan_string() {
while (peek() != '"' && !is_at_end()) {
if (peek() == '\n') lexer.line++;
advance();
}
if (is_at_end()) return error_token("Unterminated string.");
advance(); // Consume the closing quote
return make_token(TOKEN_STRING);
}
// --- Public API ---
void init_lexer(const char* source) {
lexer.start = source;
lexer.current = source;
lexer.line = 1;
}
Token scan_token() {
skip_whitespace();
lexer.start = lexer.current;
if (is_at_end()) return make_token(TOKEN_EOF);
char c = advance();
// Check for identifiers first
if (isalpha(c) || c == '_') return scan_identifier();
// Check for numbers
if (isdigit(c)) return scan_number();
switch (c) {
case '(': return make_token(TOKEN_LPAREN);
case ')': return make_token(TOKEN_RPAREN);
case '+': return make_token(TOKEN_PLUS);
case '-': return make_token(TOKEN_MINUS);
case '*': return make_token(TOKEN_STAR);
case '/': return make_token(TOKEN_SLASH);
case '>': return make_token(TOKEN_GREATER);
case '<': return make_token(TOKEN_LESS);
case '"': return scan_string();
case '\n': return make_token(TOKEN_NEWLINE);
case '=':
    // Our simplified token set reuses TOKEN_EQUALS for both '=' and '=='.
    // A real lexer would emit a distinct TOKEN_ASSIGN for '=' and
    // TOKEN_EQUAL_EQUAL for '=='.
    if (peek() == '=') advance();
    return make_token(TOKEN_EQUALS);
default:
return error_token("Unexpected character.");
}
}
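The first-letter dispatch in identifier_type() is worth seeing in isolation. Here is the same idea as a standalone sketch (the K_* names and keyword_type are illustrative, not part of the lexer above):

```c
#include <string.h>

typedef enum { K_IDENTIFIER, K_LET, K_PRINT, K_IF } KeywordType;

// Same idea as identifier_type(): switch on the first letter, then
// compare the whole lexeme against the one candidate keyword.
KeywordType keyword_type(const char* lexeme, int length) {
    switch (lexeme[0]) {
        case 'I': if (length == 2 && memcmp(lexeme, "IF", 2) == 0) return K_IF; break;
        case 'L': if (length == 3 && memcmp(lexeme, "LET", 3) == 0) return K_LET; break;
        case 'P': if (length == 5 && memcmp(lexeme, "PRINT", 5) == 0) return K_PRINT; break;
    }
    return K_IDENTIFIER; // Anything else is a plain variable name
}
```

The length check matters: without it, an identifier like LETTER would match LET as a prefix and be mis-tokenised as a keyword.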
Now, if we feed our lexer the string LET a = 10\n, calling scan_token() repeatedly will produce:
{ type: TOKEN_LET, lexeme: "LET" }
{ type: TOKEN_IDENTIFIER, lexeme: "a" }
{ type: TOKEN_EQUALS, lexeme: "=" }
{ type: TOKEN_NUMBER, lexeme: "10" }
{ type: TOKEN_NEWLINE, lexeme: "\n" }
{ type: TOKEN_EOF, lexeme: "" }
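You can verify a sequence like this with a deliberately stripped-down, self-contained version of the scanning loop. The mini_* names below are illustrative, and this sketch handles only identifiers, the LET keyword, integers, '=' and newline:

```c
#include <ctype.h>
#include <string.h>

typedef enum { T_LET, T_IDENTIFIER, T_EQUALS, T_NUMBER, T_NEWLINE, T_EOF } MiniType;

typedef struct { MiniType type; const char* start; int length; } MiniToken;

static const char* cur;

void mini_init(const char* source) { cur = source; }

// A stripped-down scan_token(): skip blanks, then classify one token.
MiniToken mini_scan(void) {
    while (*cur == ' ' || *cur == '\t' || *cur == '\r') cur++;
    const char* start = cur;
    MiniToken t = { T_EOF, start, 0 };
    if (*cur == '\0') return t;
    char c = *cur++;
    if (isalpha((unsigned char)c)) {
        while (isalnum((unsigned char)*cur) || *cur == '_') cur++;
        t.length = (int)(cur - start);
        t.type = (t.length == 3 && memcmp(start, "LET", 3) == 0) ? T_LET : T_IDENTIFIER;
    } else if (isdigit((unsigned char)c)) {
        while (isdigit((unsigned char)*cur)) cur++;
        t.type = T_NUMBER;
        t.length = (int)(cur - start);
    } else if (c == '=') {
        t.type = T_EQUALS;
        t.length = 1;
    } else if (c == '\n') {
        t.type = T_NEWLINE;
        t.length = 1;
    }
    // (any other character falls through as T_EOF in this sketch)
    return t;
}
```

Feeding it "LET a = 10\n" yields exactly the token sequence shown above, minus the keywords and operators this miniature leaves out.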
3. The Parser (Building the AST)
The parser’s job is to take this flat list of tokens and build a tree that represents the code’s structure. This is the Abstract Syntax Tree (AST).
Why a tree? Because code is hierarchical. An IF statement contains other statements. An arithmetic operation 10 + 5 is a Binary node with two “children”: the literals 10 and 5.
The AST Structures
We need C structs to define the nodes of our tree. We’ll have two kinds: Statements (which do things) and Expressions (which produce values).
We use structs and pointers to build this tree. This is a classic application of tree data structures in C.
// ast.h
// --- Expression Nodes ---
// We need a base 'Expr' type. We can use a struct with an enum,
// or a more advanced 'visitor pattern'. We'll use a tagged union.
typedef struct Expr Expr; // Forward declaration
// An expression for a literal value (number, string)
typedef struct {
int is_number; // The tag: 1 means 'as.number' holds the value, 0 means 'as.string'
union {
double number; // Let's use doubles for numbers
char* string;
} as;
} LiteralExpr;
// An expression for a binary operation (e.g., a + b)
typedef struct {
Expr* left;
Token operator;
Expr* right;
} BinaryExpr;
// An expression for a variable (e.g., 'a')
typedef struct {
Token name;
} VariableExpr;
// Tagged union for all Expression types
typedef enum {
EXPR_LITERAL,
EXPR_BINARY,
EXPR_VARIABLE
} ExprType;
struct Expr {
ExprType type;
union {
LiteralExpr literal;
BinaryExpr binary;
VariableExpr variable;
} as;
};
// --- Statement Nodes ---
// Statements form a list (the program)
typedef struct Stmt Stmt; // Forward declaration
typedef struct {
Token name;
Expr* initializer; // The expression after '='
} LetStmt;
typedef struct {
Expr* expression;
} PrintStmt;
// Tagged union for all Statement types
typedef enum {
STMT_LET,
STMT_PRINT,
// We would add STMT_IF here
} StmtType;
struct Stmt {
StmtType type;
union {
LetStmt let;
PrintStmt print;
} as;
Stmt* next; // Statements can be a linked list
};
Note: This code requires careful memory management (malloc for each node) which is omitted for brevity. You’d also need functions to “free” the AST.
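To make the tagged union concrete, here is a self-contained miniature: build the tree for 10 + 5 by hand, then walk it. This trimmed version tags operators with a plain char instead of a Token, purely for brevity, and the helper names are illustrative:

```c
#include <stdlib.h>

typedef enum { EXPR_LITERAL, EXPR_BINARY } ExprType;

typedef struct Expr Expr;

typedef struct { double number; } LiteralExpr;                 // numbers only here
typedef struct { Expr* left; char op; Expr* right; } BinaryExpr;

struct Expr {
    ExprType type;
    union { LiteralExpr literal; BinaryExpr binary; } as;
};

Expr* new_number(double n) {
    Expr* e = malloc(sizeof(Expr));
    e->type = EXPR_LITERAL;
    e->as.literal.number = n;
    return e;
}

Expr* new_binary(Expr* left, char op, Expr* right) {
    Expr* e = malloc(sizeof(Expr));
    e->type = EXPR_BINARY;
    e->as.binary.left = left;
    e->as.binary.op = op;
    e->as.binary.right = right;
    return e;
}

// Walk the tree recursively -- a preview of what the evaluator does.
double eval_number(const Expr* e) {
    if (e->type == EXPR_LITERAL) return e->as.literal.number;
    double l = eval_number(e->as.binary.left);
    double r = eval_number(e->as.binary.right);
    return e->as.binary.op == '+' ? l + r : l - r;
}
```

Notice that the walker switches on the type tag and only then touches the matching union member; reading the wrong member of a union is undefined behaviour in C.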
The Parser Implementation
The parser consumes tokens from the lexer and builds the AST. We’ll write a Recursive Descent Parser. This just means we’ll have a set of functions that mirror our language’s grammar.
- parse_statement() decides if we’re looking at a LET or PRINT statement.
- parse_expression() handles arithmetic (this is where operator precedence is handled, e.g., * before +).
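One common way to handle precedence is to give each level of the grammar its own function, with tighter-binding levels called from looser ones. Here is a self-contained sketch that parses and evaluates arithmetic straight off a string rather than a token list (parse_expr_str, parse_term and parse_factor are illustrative names, not part of the article’s parser):

```c
#include <ctype.h>

// Grammar, one function per precedence level:
//   expression -> term (('+' | '-') term)*
//   term       -> factor (('*' | '/') factor)*
//   factor     -> NUMBER
static const char* p;

static double parse_factor(void) {
    while (*p == ' ') p++;
    double n = 0;
    while (isdigit((unsigned char)*p)) n = n * 10 + (*p++ - '0');
    return n;
}

static double parse_term(void) {
    double left = parse_factor();
    for (;;) {
        while (*p == ' ') p++;
        if (*p == '*') { p++; left *= parse_factor(); }
        else if (*p == '/') { p++; left /= parse_factor(); }
        else return left;
    }
}

double parse_expr_str(const char* source) {
    p = source;
    double left = parse_term();
    for (;;) {
        while (*p == ' ') p++;
        if (*p == '+') { p++; left += parse_term(); }
        else if (*p == '-') { p++; left -= parse_term(); }
        else return left;
    }
}
```

Because parse_expr_str can only combine whole terms, * and / automatically bind tighter than + and -. A Pratt parser achieves the same effect with a single loop and a precedence table instead of one function per level.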
Here is a simplified example of a parser. It would live in parser.c.
// parser.c (conceptual, simplified)
#include "lexer.h"
#include "ast.h"
#include <stdlib.h> // for malloc
// The parser holds the current and previous tokens
typedef struct {
Token current;
Token previous;
int had_error;
} Parser;
Parser parser;
// --- Parser Helper Functions ---
// (We need functions like advance(), check(), match(), consume(), error()...)
// ... these would call scan_token() from the lexer ...
// --- Grammar Functions (The Recursive Descent) ---
// Forward-declare so they can call each other
Expr* parse_expression();
// Handle literals, variables, and grouping
Expr* parse_primary() {
    if (match(TOKEN_NUMBER) || match(TOKEN_STRING)) {
        // Build a LiteralExpr node in place -- allocating a separate
        // LiteralExpr and copying it in would leak the temporary
        Expr* node = (Expr*)malloc(sizeof(Expr));
        node->type = EXPR_LITERAL;
        // ... (copy the token's value into 'node->as.literal') ...
        return node;
    }
    if (match(TOKEN_IDENTIFIER)) {
        // Build a VariableExpr node from the token we just matched
        Expr* node = (Expr*)malloc(sizeof(Expr));
        node->type = EXPR_VARIABLE;
        node->as.variable.name = parser.previous;
        return node;
    }
    // ... handle ( ) grouping ...
    return NULL; // Error
}
// A full expression parser handles operator precedence; this simple
// version only handles left-to-right chains of '+' and '-':
Expr* parse_expression() {
    Expr* left = parse_primary(); // e.g., 'a'
    while (match(TOKEN_PLUS) || match(TOKEN_MINUS)) {
        Token operator = parser.previous;
        Expr* right = parse_primary(); // e.g., 'b'
        // Build a BinaryExpr node in place
        Expr* node = (Expr*)malloc(sizeof(Expr));
        node->type = EXPR_BINARY;
        node->as.binary.left = left;
        node->as.binary.operator = operator;
        node->as.binary.right = right;
        left = node; // The result becomes the left operand of any next operator
    }
    return left; // A lone primary falls straight through
}
// --- Statement Parsers ---
Stmt* parse_print_statement() {
    // We already consumed 'PRINT'. Now we parse the expression after it.
    Expr* value = parse_expression();
    consume(TOKEN_NEWLINE, "Expect newline after expression.");
    // Build the Stmt node in place
    Stmt* node = (Stmt*)malloc(sizeof(Stmt));
    node->type = STMT_PRINT;
    node->as.print.expression = value;
    node->next = NULL; // The caller links statements together
    return node;
}
Stmt* parse_let_statement() {
    // We consumed 'LET'. Expect an identifier.
    Token name = consume(TOKEN_IDENTIFIER, "Expect variable name.");
    consume(TOKEN_EQUALS, "Expect '=' after variable name.");
    Expr* initializer = parse_expression();
    consume(TOKEN_NEWLINE, "Expect newline after expression.");
    // Build the Stmt node in place
    Stmt* node = (Stmt*)malloc(sizeof(Stmt));
    node->type = STMT_LET;
    node->as.let.name = name;
    node->as.let.initializer = initializer;
    node->next = NULL; // The caller links statements together
    return node;
}
Stmt* parse_statement() {
if (match(TOKEN_PRINT)) {
return parse_print_statement();
}
if (match(TOKEN_LET)) {
return parse_let_statement();
}
// ...
// If no match, it's an error or just an expression
return NULL;
}
// The main parse function
Stmt* parse(const char* source) {
init_lexer(source);
// ... setup parser ...
Stmt* first_stmt = NULL;
Stmt* last_stmt = NULL;
while (!match(TOKEN_EOF)) {
Stmt* stmt = parse_statement();
if (stmt) {
if (first_stmt == NULL) {
first_stmt = stmt;
} else {
last_stmt->next = stmt;
}
last_stmt = stmt;
}
}
return first_stmt; // Return the head of the linked list of statements
}
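The first_stmt/last_stmt dance above is ordinary tail-insertion into a singly linked list. Isolated into a self-contained sketch (stmt_new, stmt_append and stmt_count are hypothetical helpers, and the Stmt here is trimmed to its tag and link):

```c
#include <stdlib.h>

typedef enum { STMT_LET, STMT_PRINT } StmtType;

typedef struct Stmt {
    StmtType type;
    struct Stmt* next;
} Stmt;

Stmt* stmt_new(StmtType type) {
    Stmt* s = malloc(sizeof(Stmt));
    s->type = type;
    s->next = NULL;
    return s;
}

// Append at the tail, exactly as parse() does with first_stmt/last_stmt.
void stmt_append(Stmt** head, Stmt** tail, Stmt* s) {
    if (*head == NULL) *head = s;
    else (*tail)->next = s;
    *tail = s;
}

// The evaluator will walk the program with a loop just like this one.
int stmt_count(const Stmt* head) {
    int n = 0;
    for (; head != NULL; head = head->next) n++;
    return n;
}
```

Keeping a tail pointer makes each append O(1); without it, every new statement would require walking the whole list.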
4. The Evaluator (Running the Code)
We’ve done the hard part! We now have a list of Stmt structs (our AST). The final step is to “walk” this tree and execute it.
This is done with a function, let’s call it evaluate(), which takes an Expr or Stmt node and performs an action.
- To evaluate a LetStmt: It calls evaluate() on the initializer expression to get its value, then stores that value in a “symbol table” (like a hash map) with the variable’s name as the key.
- To evaluate a PrintStmt: It calls evaluate() on its expression, gets the resulting value (a number or string), and prints it to the console using printf().
- To evaluate a BinaryExpr: It recursively calls evaluate() on its left child, then its right child, and then performs the operation (e.g., adds the two results).
- To evaluate a LiteralExpr: It just returns the value (the number or string).
- To evaluate a VariableExpr: It looks up the variable’s name in the symbol table and returns its stored value.
This stage is beautifully recursive. In object-oriented languages it is often structured with the Visitor design pattern; in C, a switch over each node’s type tag does the same job.
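The “symbol table” can start out far simpler than a hash map. Here is a minimal sketch using a fixed-size array with linear scan (env_set and env_get are illustrative names; a real implementation would copy the name strings, report overflow, and eventually hash):

```c
#include <string.h>

#define MAX_VARS 64

typedef struct { const char* name; double value; } Var;

static Var vars[MAX_VARS];
static int var_count = 0;

// Define or update a variable (what a LetStmt does).
void env_set(const char* name, double value) {
    for (int i = 0; i < var_count; i++) {
        if (strcmp(vars[i].name, name) == 0) { vars[i].value = value; return; }
    }
    if (var_count < MAX_VARS) {
        vars[var_count].name = name;   // NOTE: borrows the caller's string
        vars[var_count].value = value;
        var_count++;
    }
}

// Look up a variable (what a VariableExpr does).
// Returns 0 on success, -1 if the name is undefined.
int env_get(const char* name, double* out) {
    for (int i = 0; i < var_count; i++) {
        if (strcmp(vars[i].name, name) == 0) { *out = vars[i].value; return 0; }
    }
    return -1;
}
```

Linear scan is O(n) per lookup, which is perfectly fine for a toy language; swap in a hash table once programs have more than a handful of variables.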
🎉 Conclusion
We’ve just built the core of a real programming language interpreter. We’ve seen how to go from raw text to meaningful tokens (the lexer) and then from a flat list of tokens to a hierarchical AST (the parser).
From here, you can see how to extend it:
- Add a WHILE loop (a new WhileStmt struct).
- Add functions (a FuncStmt and CallExpr).
- Improve the expression parser to handle operator precedence (a classic problem, often solved with a Pratt parser).
Writing an interpreter is one of the most rewarding projects in computer science. You touch on data structures, algorithms, memory management, and high-level design.