How it works

Overview

kemlang-py is a tree-walking interpreter. This section explains exactly how it turns a .jsk source file into running output - from character scanning all the way to executing statements.

What is a programming language, really?

A programming language is a convention. The source file you write is just text - a sequence of Unicode characters sitting on disk. Nothing in the hardware understands bhai bol. The interpreter is the program that reads that text and figures out what to do with it.

Every interpreter or compiler does the same fundamental job: transform source text into behavior. The strategies differ enormously in complexity and performance, but the goal is always the same.

The spectrum of language implementations

Different languages take different approaches to turning source into execution. The main strategies are compiled native code, bytecode + virtual machine, and tree-walking interpretation.

language implementation spectrum

  Source text
      │
      ▼
  ┌───────────────────────────────────────────────────────────────────┐
  │  COMPILED  (C, Rust, Go)                                          │
  │                                                                   │
  │  Source ──▶ Compiler ──▶ Machine code (.exe / .o) ──▶ CPU runs   │
  │                                                                   │
  │  + Fastest possible execution (direct CPU instructions)           │
  │  - Compilation takes time; separate step before running           │
  └───────────────────────────────────────────────────────────────────┘

  ┌───────────────────────────────────────────────────────────────────┐
  │  BYTECODE VM  (Python, Java, Lua)                                 │
  │                                                                   │
  │  Source ──▶ Compiler ──▶ Bytecode ──▶ Virtual Machine ──▶ output │
  │                          (.pyc)       (CPython, JVM)              │
  │                                                                   │
  │  + Faster than tree-walking; portable across platforms            │
  │  - VM adds complexity; bytecode is an intermediate layer          │
  └───────────────────────────────────────────────────────────────────┘

  ┌───────────────────────────────────────────────────────────────────┐
  │  TREE-WALKING  (kemlang-py, early Ruby, many scripting languages) │
  │                                                                   │
  │  Source ──▶ Lexer ──▶ Parser ──▶ AST ──▶ walk & execute          │
  │                                                                   │
  │  + Simplest implementation; easy to debug and extend              │
  │  - Slowest; each node is re-evaluated on every visit              │
  └───────────────────────────────────────────────────────────────────┘

kemlang-py is a tree-walking interpreter. It never produces compiled output - it reads your source file and executes it directly by walking the parsed tree. This makes the implementation small (~1000 lines across three core files), readable, and easy to modify.
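
To make the "each node is re-evaluated on every visit" trade-off concrete, here is a minimal sketch of a tree-walking evaluator for arithmetic. The node classes (Num, Add) and the visit counter are hypothetical and exist only for this illustration; they are not kemlang-py's actual classes.

```python
from dataclasses import dataclass
from typing import Union


# Hypothetical node types for illustration only.
@dataclass
class Num:
    value: int


@dataclass
class Add:
    left: "Expr"
    right: "Expr"


Expr = Union[Num, Add]

stats = {"visits": 0}  # counts how many nodes the walker touches


def evaluate(node: Expr) -> int:
    """Recursively walk the tree and compute its value."""
    stats["visits"] += 1
    if isinstance(node, Num):
        return node.value
    if isinstance(node, Add):
        return evaluate(node.left) + evaluate(node.right)
    raise TypeError(f"unknown node: {node!r}")


tree = Add(Num(1), Add(Num(2), Num(3)))  # represents 1 + (2 + 3), five nodes

# Simulate a loop body that runs three times: the same tree is walked again each time.
results = [evaluate(tree) for _ in range(3)]
print(results, stats["visits"])  # prints [6, 6, 6] 15
```

The tree is never translated into anything faster, so three passes over five nodes cost fifteen visits. A bytecode VM pays the translation cost once and then runs a flat instruction list instead.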

The pipeline

Every time you run kem run hello.jsk, the source file travels through three sequential stages. Each stage receives the output of the previous one. Nothing is shared between stages except that single transformed value.

the full pipeline

  ┌──────────────────────────────────────────────────────────────────┐
  │  Source file  (hello.jsk)                                        │
  │                                                                  │
  │  kem bhai                                                        │
  │    bhai bol "kem cho, duniya!"                                   │
  │  aavjo bhai                                                      │
  └────────────────────────────┬─────────────────────────────────────┘
                               │  raw UTF-8 text
                               ▼
  ┌──────────────────────────────────────────────────────────────────┐
  │  Stage 1: Lexer  (kemlang/lexer.py)                              │
  │                                                                  │
  │  Scans characters left-to-right. Groups them into tokens.        │
  │  Handles multi-word Gujarati keywords. Skips whitespace.         │
  └────────────────────────────┬─────────────────────────────────────┘
                               │  List[Token]
                               │
                               │  KEM_BHAI    'kem bhai'              1:0
                               │  BHAI_BOL    'bhai bol'              2:2
                               │  STRING      '"kem cho, duniya!"'    2:10
                               │  AAVJO_BHAI  'aavjo bhai'            3:0
                               │  EOF         ''                      4:0
                               ▼
  ┌──────────────────────────────────────────────────────────────────┐
  │  Stage 2: Parser  (kemlang/parser.py)                            │
  │                                                                  │
  │  Consumes tokens one at a time. Checks grammar rules.            │
  │  Builds a tree of dataclass nodes (the AST).                     │
  └────────────────────────────┬─────────────────────────────────────┘
                               │  Program (AST)
                               │
                               │  Program
                               │  └── Print
                               │      └── Literal("kem cho, duniya!")
                               ▼
  ┌──────────────────────────────────────────────────────────────────┐
  │  Stage 3: Interpreter  (kemlang/interpreter.py)                  │
  │                                                                  │
  │  Walks the AST recursively. Executes each node. Manages          │
  │  variable scope via Environment. Handles I/O and errors.         │
  └────────────────────────────┬─────────────────────────────────────┘
                               │
                               ▼
                       stdout: kem cho, duniya!
                       exit code: 0

Stage 1: Lexer

The lexer (also called a scanner or tokenizer) reads the source string one character at a time and groups characters into tokens. A token is the smallest meaningful unit of the language - a keyword, a number, a string literal, an operator, or an identifier.

kemlang-py's lexer handles something unusual: multi-word keywords. Most languages use single reserved words (if, while, print). kemlang-py uses natural Gujarati phrases like bhai bol (print) and aavjo bhai (end of program). The lexer checks for these multi-word sequences before checking single-word keywords, so a phrase is never split into separate tokens.

Input: raw source text as a Python str. Output: List[Token], each token carrying its type, lexeme (the original text), line number, and column offset.
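
The "multi-word keywords first" rule can be sketched as follows. This is a simplified illustration, not the code in kemlang/lexer.py: the keyword table contains only the three phrases shown in this document, the helper names (match_phrase, tokenize) are hypothetical, and the column bookkeeping is a plain 0-based offset that may differ from the real lexer's.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    type: str
    lexeme: str
    line: int
    column: int


# Assumed keyword table: only the three phrases that appear in this document.
MULTI_WORD_KEYWORDS = {
    "kem bhai": "KEM_BHAI",
    "bhai bol": "BHAI_BOL",
    "aavjo bhai": "AAVJO_BHAI",
}


def match_phrase(text: str, col: int):
    """Return (phrase, token_type) if a multi-word keyword starts at col, else None."""
    for phrase, tok_type in MULTI_WORD_KEYWORDS.items():
        end = col + len(phrase)
        # Require a word boundary so "kem bhaiya" is not read as "kem bhai".
        if text.startswith(phrase, col) and (end == len(text) or text[end].isspace()):
            return phrase, tok_type
    return None


def tokenize(source: str) -> List[Token]:
    tokens: List[Token] = []
    lines = source.split("\n")
    for line_no, text in enumerate(lines, start=1):
        col = 0
        while col < len(text):
            if text[col].isspace():
                col += 1  # whitespace is skipped, never emitted
                continue
            hit = match_phrase(text, col)  # multi-word keywords are tried first
            if hit:
                phrase, tok_type = hit
                tokens.append(Token(tok_type, phrase, line_no, col))
                col += len(phrase)
            elif text[col] == '"':
                end = text.find('"', col + 1)
                if end == -1:
                    raise SyntaxError(f"unterminated string on line {line_no}")
                tokens.append(Token("STRING", text[col:end + 1], line_no, col))
                col = end + 1
            else:
                # Anything else is treated as a bare word in this sketch.
                start = col
                while col < len(text) and not text[col].isspace() and text[col] != '"':
                    col += 1
                tokens.append(Token("IDENTIFIER", text[start:col], line_no, start))
    tokens.append(Token("EOF", "", len(lines), len(lines[-1])))
    return tokens


source = 'kem bhai\n  bhai bol "kem cho, duniya!"\naavjo bhai\n'
for tok in tokenize(source):
    print(tok.type, repr(tok.lexeme))
```

Trying the phrases before falling back to single words is what keeps bhai bol from being read as two unrelated identifiers.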

Deep dive: The Lexer

Stage 2: Parser

The parser takes the flat token stream and builds a tree structure called an Abstract Syntax Tree (AST). The tree represents the grammatical structure of your program - nesting, operator precedence, and the parent-child relationships between statements and expressions.

kemlang-py uses a hand-written recursive-descent parser. Each grammar rule has a corresponding method (statement(), if_statement(), expression(), etc.). These methods call each other recursively, naturally mirroring the nested structure of the grammar.

Input: List[Token] (with NEWLINE tokens stripped). Output: a Program dataclass containing a list of Stmt nodes, each of which may contain Expr nodes.
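
As a rough illustration of "one method per grammar rule", the sketch below parses just the hello-world program from the pipeline diagram. The grammar is cut down to a single print statement with a string literal, and the helper names (peek, expect, print_statement) are assumptions made for this example rather than the methods in kemlang/parser.py.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Token:
    type: str
    lexeme: str


@dataclass
class Literal:
    value: object


@dataclass
class Print:
    expression: Literal


@dataclass
class Program:
    statements: List[Print] = field(default_factory=list)


class Parser:
    def __init__(self, tokens: List[Token]):
        self.tokens = tokens
        self.pos = 0

    def peek(self) -> Token:
        return self.tokens[self.pos]

    def expect(self, token_type: str) -> Token:
        """Consume the next token, or fail if it is not the expected type."""
        tok = self.peek()
        if tok.type != token_type:
            raise SyntaxError(f"expected {token_type}, got {tok.type}")
        self.pos += 1
        return tok

    # program -> KEM_BHAI statement* AAVJO_BHAI EOF
    def parse(self) -> Program:
        self.expect("KEM_BHAI")
        program = Program()
        while self.peek().type != "AAVJO_BHAI":
            program.statements.append(self.statement())
        self.expect("AAVJO_BHAI")
        self.expect("EOF")
        return program

    # statement -> print_statement (the only statement in this sketch)
    def statement(self) -> Print:
        if self.peek().type == "BHAI_BOL":
            return self.print_statement()
        raise SyntaxError(f"unexpected token {self.peek().type}")

    # print_statement -> BHAI_BOL expression
    def print_statement(self) -> Print:
        self.expect("BHAI_BOL")
        return Print(self.expression())

    # expression -> STRING (the only expression in this sketch)
    def expression(self) -> Literal:
        tok = self.expect("STRING")
        return Literal(tok.lexeme[1:-1])  # strip the surrounding quotes


tokens = [
    Token("KEM_BHAI", "kem bhai"),
    Token("BHAI_BOL", "bhai bol"),
    Token("STRING", '"kem cho, duniya!"'),
    Token("AAVJO_BHAI", "aavjo bhai"),
    Token("EOF", ""),
]
ast = Parser(tokens).parse()
print(ast)
```

Because every rule is an ordinary method, supporting a new statement type is mostly a matter of adding one more branch in statement() and one more method.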

Deep dive: The Parser

Stage 3: Interpreter

The interpreter walks the AST recursively. For each node it visits, it calls the appropriate handler. Statement nodes (Print, Declaration, If, While) produce side effects - they print to stdout, define variables, branch, or loop. Expression nodes (Binary, Variable, Literal) return a KemValue, which is always one of five built-in Python types.
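
The split between statements (side effects) and expressions (values) might look roughly like this. It is a sketch under several assumptions: the node fields, the execute/evaluate method names, the operator table, and the flat variable dictionary are all invented for this example, and interpret() takes a plain list of statements instead of a Program node.

```python
import operator
from dataclasses import dataclass
from typing import Callable, Dict, List, Union


@dataclass
class Literal:
    value: object


@dataclass
class Variable:
    name: str


@dataclass
class Binary:
    left: "Expr"
    op: str
    right: "Expr"


@dataclass
class Print:
    expression: "Expr"


@dataclass
class Declaration:
    name: str
    initializer: "Expr"


Expr = Union[Literal, Variable, Binary]
Stmt = Union[Print, Declaration]

# Assumed operator set, chosen only for this illustration.
OPERATORS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "<": operator.lt}


class Interpreter:
    def __init__(self, write: Callable[[object], None] = print):
        self.variables: Dict[str, object] = {}
        self.write = write  # injectable so output can be captured

    def interpret(self, statements: List[Stmt]) -> int:
        """Run every statement; return 0 on success, 1 on a runtime error."""
        try:
            for stmt in statements:
                self.execute(stmt)
        except (NameError, TypeError) as exc:
            self.write(f"error: {exc}")
            return 1
        return 0

    def execute(self, stmt: Stmt) -> None:
        # Statements produce side effects and return nothing.
        if isinstance(stmt, Print):
            self.write(self.evaluate(stmt.expression))
        elif isinstance(stmt, Declaration):
            self.variables[stmt.name] = self.evaluate(stmt.initializer)
        else:
            raise TypeError(f"unknown statement: {stmt!r}")

    def evaluate(self, expr: Expr) -> object:
        # Expressions have no side effects here; they only return a value.
        if isinstance(expr, Literal):
            return expr.value
        if isinstance(expr, Variable):
            if expr.name not in self.variables:
                raise NameError(f"undefined variable '{expr.name}'")
            return self.variables[expr.name]
        if isinstance(expr, Binary):
            left = self.evaluate(expr.left)
            right = self.evaluate(expr.right)
            return OPERATORS[expr.op](left, right)
        raise TypeError(f"unknown expression: {expr!r}")


program = [
    Declaration("x", Literal(2)),
    Print(Binary(Variable("x"), "+", Literal(3))),
]
exit_code = Interpreter().interpret(program)  # prints 5
```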

Variable scope is managed through a chain of Environment objects. Each block (if body, while body) gets its own environment that holds a reference to its parent. When a variable lookup fails in the current environment, the lookup continues in the parent, walking up the chain until the name is found or the chain ends.
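
A parent-linked environment chain can be sketched in a few lines. The method names define and get are assumptions made for this example; the real Environment class may expose a different interface.

```python
from typing import Dict, Optional


class Environment:
    def __init__(self, parent: Optional["Environment"] = None):
        self.values: Dict[str, object] = {}
        self.parent = parent  # None for the outermost (global) scope

    def define(self, name: str, value: object) -> None:
        """Create or overwrite a variable in this scope only."""
        self.values[name] = value

    def get(self, name: str) -> object:
        """Look in this scope first, then walk up through the parents."""
        env: Optional[Environment] = self
        while env is not None:
            if name in env.values:
                return env.values[name]
            env = env.parent
        raise NameError(f"undefined variable '{name}'")


global_env = Environment()
global_env.define("x", 1)

block_env = Environment(parent=global_env)  # e.g. the body of an if or while
block_env.define("y", 2)

print(block_env.get("x"))  # found in the parent scope: prints 1
print(block_env.get("y"))  # found in the block itself: prints 2
```

The lookup only ever moves outward, so a name defined inside a block is invisible to the enclosing scope: global_env.get("y") raises NameError in this sketch.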

Input: Program (root AST node). Output: stdout text + an integer exit code (0 = success, 1 = error).

Deep dive: The Interpreter

What the CLI actually does

The kem run command in kemlang/cli.py calls these three stages in sequence. The entire pipeline is five lines:

kemlang/cli.py - kem run (simplified)
source    = Path(file).read_text(encoding="utf-8")
tokens    = Lexer(source).tokenize()          # str  -> List[Token]
ast       = Parser(tokens).parse()            # tokens -> Program
exit_code = Interpreter().interpret(ast)      # AST -> stdout + int
raise typer.Exit(exit_code)
