How it works
The Lexer
The lexer is the first stage of the pipeline. It reads raw source text one character at a time and produces a stream of tokens - the atoms from which the parser builds a program tree.
Input: str (raw source text)
Output: List[Token] (ordered token stream)
Source: lexer.py (256 lines)
What is a lexer?
A lexer (also called a scanner or tokenizer) converts a flat string of characters into a structured sequence of tokens. The parser cannot work directly on raw text - it needs discrete, labeled units it can reason about grammatically.
Think of it like reading a sentence. Before you can parse "the cat sat on the mat" into subject-verb-object structure, your brain first groups the characters into words. The lexer does the same thing for source code.
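The word-grouping analogy can be made concrete with a tiny standalone sketch (not kemlang-py code): it walks a flat string character by character and groups non-space runs into discrete units, exactly the way a lexer groups source text into tokens.

```python
# Minimal illustration (not the kemlang-py lexer): group a flat string
# of characters into discrete word units, one character at a time.
def group_words(text: str) -> list[str]:
    words: list[str] = []
    current: list[str] = []
    for ch in text:
        if ch.isspace():
            if current:                      # a word just ended
                words.append("".join(current))
                current = []
        else:
            current.append(ch)               # accumulate the current word
    if current:                              # flush the final word
        words.append("".join(current))
    return words

print(group_words("the cat sat on the mat"))
# → ['the', 'cat', 'sat', 'on', 'the', 'mat']
```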
What is a token?
A token is a small, labeled chunk of source text. In kemlang-py, every token carries four required pieces of information, plus an optional parsed literal:
@dataclass
class Token:
    type: TokenType      # what kind of thing this is
    lexeme: str          # the exact source text, e.g. "bhai bol"
    line: int            # 1-indexed line number in the source file
    col: int             # 0-indexed column offset on that line
    literal: Any = None  # parsed value for strings/numbers

The type field is a TokenType enum member. There are 36 token types total in kemlang-py: 16 keyword types, 5 literal types, 11 operator types, 4 delimiter types, and EOF.
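A runnable version of the dataclass above; the TokenType members here are a small assumed subset for illustration, not the full set of 36.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Any

class TokenType(Enum):
    # Assumed subset of kemlang-py's token types, for illustration only.
    KEM_BHAI = auto()
    BHAI_BOL = auto()
    STRING = auto()
    EOF = auto()

@dataclass
class Token:
    type: TokenType      # what kind of thing this is
    lexeme: str          # the exact source text, e.g. "bhai bol"
    line: int            # 1-indexed line number in the source file
    col: int             # 0-indexed column offset on that line
    literal: Any = None  # parsed value for strings/numbers

# A string token keeps its quotes in the lexeme but stores the parsed value.
tok = Token(TokenType.STRING, '"kem cho!"', 2, 10, literal="kem cho!")
print(tok.type.name, tok.lexeme, f"{tok.line}:{tok.col}")
# → STRING "kem cho!" 2:10
```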
How the scanning loop works
The lexer maintains a cursor (self.current) that advances through the source string. On each iteration of the main loop, it calls scan_token(), which reads the character at the cursor, decides what kind of token starts here, and advances the cursor past it.
def tokenize(self) -> list[Token]:
    while not self.is_at_end():
        self.start = self.current  # mark start of next token
        self.scan_token()          # consume chars, emit token
    self.tokens.append(Token(TokenType.EOF, "", self.line, self.col))
    return self.tokens

scan_token() reads the current character and dispatches. Whitespace (space, tab, carriage return) is silently skipped. Newlines increment the line counter. Everything else triggers token recognition.
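A stripped-down, self-contained toy version of this loop can make the cursor/start bookkeeping easier to follow. It mirrors the names in the excerpt above but recognizes only bare words, skips whitespace, tracks line and column, and appends EOF; the real scan_token() does much more.

```python
from dataclasses import dataclass

@dataclass
class Token:
    type: str
    lexeme: str
    line: int   # 1-indexed
    col: int    # 0-indexed

class MiniLexer:
    """Toy sketch of the tokenize() loop, not the real kemlang-py lexer."""
    def __init__(self, source: str):
        self.source = source
        self.current = 0   # cursor into the source string
        self.line = 1
        self.col = 0
        self.tokens: list[Token] = []

    def is_at_end(self) -> bool:
        return self.current >= len(self.source)

    def tokenize(self) -> list[Token]:
        while not self.is_at_end():
            self.start = self.current   # mark start of next token
            self.scan_token()           # consume chars, emit token
        self.tokens.append(Token("EOF", "", self.line, self.col))
        return self.tokens

    def scan_token(self) -> None:
        ch = self.source[self.current]
        if ch == "\n":                  # newline: bump line counter
            self.current += 1
            self.line += 1
            self.col = 0
        elif ch.isspace():              # other whitespace: skip silently
            self.current += 1
            self.col += 1
        else:                           # accumulate a bare word
            start_col = self.col
            while not self.is_at_end() and not self.source[self.current].isspace():
                self.current += 1
                self.col += 1
            self.tokens.append(
                Token("WORD", self.source[self.start:self.current], self.line, start_col)
            )

print([(t.type, t.lexeme) for t in MiniLexer("kem bhai\naavjo").tokenize()])
# → [('WORD', 'kem'), ('WORD', 'bhai'), ('WORD', 'aavjo'), ('EOF', '')]
```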
The multi-word keyword problem
Most programming languages use single-word reserved words: if, while, print. A character-at-a-time scanner can handle these easily: when it sees a letter, it accumulates an identifier, then checks if that identifier matches a keyword.
kemlang-py's Gujarati keywords are phrases: bhai bol (print), kem bhai (program start), aavjo bhai (program end), bapu tame bolo (read input). The word bhai alone is not a valid token - only the full phrase is.
kemlang-py solves this by checking for multi-word sequences at the start of every scan_token() call, before doing anything else. It uses Python's string startswith() to peek ahead without moving the cursor, then advances the cursor only if the full phrase matches.
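In isolation, the peek-and-match idea looks like the sketch below. It uses a shortened, assumed phrase list and plain strings in place of the real TokenType members; the actual lexer code appears in the excerpt that follows.

```python
# Sketch of multi-word keyword matching via startswith().
# Longest-first ordering matters whenever one phrase is a prefix of another.
MULTIWORD = [
    ("bapu tame bolo", "BAPU_TAME_BOLO"),
    ("aavjo bhai", "AAVJO_BHAI"),
    ("kem bhai", "KEM_BHAI"),
    ("bhai bol", "BHAI_BOL"),
]

def match_multiword(source: str, pos: int):
    """Return (token_type, new_pos) if a multi-word keyword starts at pos,
    else (None, pos) with the cursor unmoved."""
    remaining = source[pos:]            # peek ahead without moving the cursor
    for phrase, token_type in MULTIWORD:
        if remaining.startswith(phrase):
            return token_type, pos + len(phrase)  # advance past the phrase
    return None, pos

print(match_multiword("bhai bol x", 0))  # → ('BHAI_BOL', 8)
print(match_multiword("bhai x", 0))      # → (None, 0): "bhai" alone is no token
```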
# In __init__: multi-word keywords listed longest-first
self.multiword_keywords = [
    ("kem bhai", TokenType.KEM_BHAI),
    ("aavjo bhai", TokenType.AAVJO_BHAI),
    ("bhai bol", TokenType.BHAI_BOL),
    ("bapu tame bolo", TokenType.BAPU_TAME_BOLO),
    ("bhai chhe", TokenType.BHAI_CHHE),
    ("bhai nathi", TokenType.BHAI_NATHI),
    ("jya sudhi", TokenType.JYA_SUDHI),
    ("tame jao", TokenType.TAME_JAO),
    ("aagal vado", TokenType.AAGAL_VADO),
    ("nahi to", TokenType.ELSE),
]

# At scan time: try each multi-word keyword before anything else.
# scan_token() has already consumed one character, so the phrase
# begins at self.current - 1.
remaining = self.source[self.current - 1:]
for phrase, token_type in self.multiword_keywords:
    if remaining.startswith(phrase):
        # advance past the full phrase
        self.current += len(phrase) - 1
        self.add_token(token_type)
        return

Step-by-step scan trace
Here is exactly what the lexer does when it processes this three-line program:
kem bhai
  bhai bol "kem cho!"
aavjo bhai
Source (shown with cursor position ^ advancing left-to-right):
Line 1: k e m b h a i
^
try multi-word: source starts with "kem bhai" MATCH
emit KEM_BHAI 'kem bhai' 1:0
advance cursor 8 chars, skip '\n', increment line
Line 2: b h a i b o l " k e m c h o ! "
^
skip leading spaces (positions 0-1)
^
try multi-word: "bhai bol" MATCH at col 2
emit BHAI_BOL 'bhai bol' 2:2
advance cursor 8 chars
skip space (position 10)
^
character is '"' -> start string scan
accumulate chars until closing '"'
emit STRING '"kem cho!"' 2:11
advance cursor 10 chars
Line 3: a a v j o b h a i
^
try multi-word: "aavjo bhai" MATCH at col 0
emit AAVJO_BHAI 'aavjo bhai' 3:0
advance cursor 10 chars
End of source -> emit EOF '' 4:0
Final token stream:
┌──────────────┬──────────────────────┬───────┐
│ type │ lexeme │ pos │
├──────────────┼──────────────────────┼───────┤
│ KEM_BHAI │ 'kem bhai' │ 1:0 │
│ BHAI_BOL │ 'bhai bol' │ 2:2 │
│ STRING       │ '"kem cho!"'         │ 2:11  │
│ AAVJO_BHAI │ 'aavjo bhai' │ 3:0 │
│ EOF │ '' │ 4:0 │
└──────────────┴──────────────────────┴───────┘
Scanning priority order
When scan_token() starts on a new character, it checks candidates in this exact order. The first match wins.
On each new character at self.current:
1. whitespace? space / tab / carriage return
→ skip, advance
2. newline?
→ skip, increment line counter
3. multi-word keyword? "bhai bol", "kem bhai" → emit keyword token
(10 candidates, checked with startswith())
4. comment? // → skip to end of line
5. operator / punct? + - * / % ( ) { } → emit operator token
single characters, looked up in a dict
6. two-char operator? == != <= >= (peek next) → emit operator token
7. string literal? " → scan to closing "
8. digit? 0-9 → scan integer or float
9. letter? a-z A-Z _ → scan identifier or keyword
after accumulating word: check keywords dict
if matched: emit keyword token
if not matched: emit IDENTIFIER
10. (nothing matched) any other character → raise LexerError

What the lexer rejects
The lexer raises LexerError immediately when it encounters something it cannot tokenize. It does not try to recover - the error carries the exact line and column.
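An error type that carries the exact position might look like the following sketch. The real LexerError lives in lexer.py; the constructor shape and message format here are assumptions for illustration.

```python
class LexerError(Exception):
    """Sketch of a position-carrying lexer error (assumed shape,
    not the real kemlang-py class)."""
    def __init__(self, message: str, line: int, col: int):
        super().__init__(f"{message} at {line}:{col}")
        self.line = line  # 1-indexed line of the offending character
        self.col = col    # 0-indexed column of the offending character

try:
    raise LexerError("unexpected character '?'", 1, 10)
except LexerError as err:
    print(err)                 # → unexpected character '?' at 1:10
    print(err.line, err.col)   # → 1 10
```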
bhai bol x?2

The ? character is not part of any token type in kemlang-py.
bhai bol "hello

The lexer scans for a closing " on the same line. If end-of-line arrives first, it raises LexerError. Multi-line strings are not supported.
3.14.15

Numbers may contain at most one decimal point. The second . is not a digit and not the end of the number, so the lexer raises LexerError.
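The at-most-one-decimal-point rule can be sketched as a standalone helper (not the actual kemlang-py number-scanning code; it raises a plain ValueError where the real lexer raises LexerError):

```python
def scan_number(source: str, pos: int):
    """Scan a number starting at pos; reject a second decimal point.
    Returns (value, new_pos). Standalone sketch, not kemlang-py code."""
    start = pos
    seen_dot = False
    while pos < len(source) and (source[pos].isdigit() or source[pos] == "."):
        if source[pos] == ".":
            if seen_dot:
                # second '.' inside one number: the real lexer raises LexerError here
                raise ValueError(f"second '.' in number at col {pos}")
            seen_dot = True
        pos += 1
    lexeme = source[start:pos]
    return (float(lexeme) if seen_dot else int(lexeme)), pos

print(scan_number("3.14", 0))   # → (3.14, 4)
print(scan_number("42 x", 0))   # → (42, 2)
```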
Why hand-written over regex?
Many lexers use regular expressions to match token patterns. kemlang-py uses a hand-written scanner for three reasons:
Better error messages
A hand-written scanner knows exactly where it is in the source at all times. It can report the precise line and column of every error, not just a regex match failure.
Multi-word keyword support
Regex-based lexers split on word boundaries before checking for keywords. Matching 'bhai bol' as a two-word unit requires either a preprocessing step or a more complex tokenizer design. The hand-written approach handles it naturally with startswith().
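For comparison, a regex alternation can be made to match a multi-word phrase, but only if the phrase alternative is listed before the generic identifier rule; that ordering fragility is part of the trade-off described above. A minimal sketch (the pattern and names are illustrative, not from kemlang-py):

```python
import re

# The BHAI_BOL alternative must come before IDENT, or the regex engine
# would happily tokenize "bhai" and "bol" as two separate identifiers.
TOKEN_RE = re.compile(
    r"(?P<BHAI_BOL>bhai bol)|(?P<IDENT>[A-Za-z_]\w*)|(?P<WS>\s+)"
)

def regex_tokens(text: str):
    out = []
    for m in TOKEN_RE.finditer(text):
        if m.lastgroup != "WS":          # drop whitespace matches
            out.append((m.lastgroup, m.group()))
    return out

print(regex_tokens("bhai bol x"))
# → [('BHAI_BOL', 'bhai bol'), ('IDENT', 'x')]
```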
Full control over scanning
The scanner can implement context-sensitive behavior (like how string scanning differs from identifier scanning) without complex regex lookahead.