
How it works

The Lexer

The lexer is the first stage of the pipeline. It reads raw source text one character at a time and produces a stream of tokens - the atoms from which the parser builds a program tree.

Input:   str - raw source text
Output:  List[Token] - ordered token stream
Source:  lexer.py (256 lines)

What is a lexer?

A lexer (also called a scanner or tokenizer) converts a flat string of characters into a structured sequence of tokens. The parser cannot work directly on raw text - it needs discrete, labeled units it can reason about grammatically.

Think of it like reading a sentence. Before you can parse "the cat sat on the mat" into subject-verb-object structure, your brain first groups the characters into words. The lexer does the same thing for source code.

What is a token?

A token is a small, labeled chunk of source text. In kemlang-py, every token carries four required pieces of information, plus an optional parsed literal:

kemlang/types.py
@dataclass
class Token:
    type:    TokenType   # what kind of thing this is
    lexeme:  str         # the exact source text e.g. "bhai bol"
    line:    int         # 1-indexed line number in the source file
    col:     int         # 0-indexed column offset on that line
    literal: Any = None  # parsed value for strings/numbers

The type field is a TokenType enum member. There are 37 token types in total in kemlang-py: 16 keyword types, 5 literal types, 11 operator types, 4 delimiter types, and EOF.
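
The shape of this can be sketched in a self-contained way. Note this is an illustrative model, not the real kemlang-py source: the enum below lists only a small subset of the real members.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Any

class TokenType(Enum):
    # illustrative subset of the real enum's members
    KEM_BHAI = auto()
    BHAI_BOL = auto()
    STRING = auto()
    EOF = auto()

@dataclass
class Token:
    type: TokenType
    lexeme: str          # exact source text
    line: int            # 1-indexed
    col: int             # 0-indexed
    literal: Any = None  # parsed value for strings/numbers

# the first token of the trace below: 'kem bhai' at line 1, column 0
t = Token(TokenType.KEM_BHAI, "kem bhai", 1, 0)
print(t.type.name, t.lexeme)  # KEM_BHAI kem bhai
```

Because `literal` has a default, keyword tokens can omit it; only STRING, INTEGER, FLOAT, and BOOLEAN tokens fill it in.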

How the scanning loop works

The lexer maintains a cursor (self.current) that advances through the source string. On each iteration of the main loop, it calls scan_token(), which reads the character at the cursor, decides what kind of token starts here, and advances the cursor past it.

kemlang/lexer.py - main loop
def tokenize(self) -> list[Token]:
    while not self.is_at_end():
        self.start = self.current   # mark start of next token
        self.scan_token()           # consume chars, emit token

    self.tokens.append(Token(TokenType.EOF, "", self.line, self.col))
    return self.tokens

scan_token() reads the current character and dispatches. Whitespace (space, tab, carriage return) is silently skipped. Newlines increment the line counter. Everything else triggers token recognition.
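The bookkeeping in that loop can be modeled in isolation. This is a simplified standalone sketch (not the real lexer class): it walks the source one character at a time, skips whitespace, counts newlines, and records a (line, col) position for every other character.

```python
def positions(source: str) -> list[tuple[str, int, int]]:
    """Return (char, line, col) for every non-whitespace character."""
    line, col = 1, 0          # line is 1-indexed, col is 0-indexed
    out = []
    for ch in source:
        if ch == "\n":        # newline: bump line, reset column
            line += 1
            col = 0
        elif ch in " \t\r":   # whitespace: silently skipped
            col += 1
        else:
            out.append((ch, line, col))
            col += 1
    return out

print(positions("a\n b"))  # [('a', 1, 0), ('b', 2, 1)]
```

The real scanner does the same accounting, but groups runs of characters into tokens instead of emitting one entry per character.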

The multi-word keyword problem

Most programming languages use single-word reserved words: if, while, print. A character-at-a-time scanner can handle these easily: when it sees a letter, it accumulates an identifier, then checks if that identifier matches a keyword.
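That classic single-word strategy is a dictionary lookup after the word is accumulated. A minimal sketch, with an illustrative keyword table (the real kemlang-py table maps to TokenType members):

```python
# single-word keywords from the table below; values are illustrative labels
KEYWORDS = {"aa": "AA", "che": "CHE", "jo": "JO", "farvu": "FARVU"}

def classify_word(word: str) -> str:
    """Keyword if it's in the table, otherwise an identifier."""
    return KEYWORDS.get(word, "IDENTIFIER")

print(classify_word("jo"))     # JO
print(classify_word("score"))  # IDENTIFIER
```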

kemlang-py's Gujarati keywords are phrases: bhai bol (print), kem bhai (program start), aavjo bhai (program end), bapu tame bolo (read input). The word bhai alone is not a valid token - only the full phrase is.

kemlang-py solves this by checking for multi-word sequences at the start of every scan_token() call, before doing anything else. It uses Python's string startswith() to peek ahead without moving the cursor, then advances the cursor only if the full phrase matches.

kemlang/lexer.py - multi-word keyword check
# In __init__: the multi-word keyword phrases; order only matters
# when one phrase is a prefix of another (none of these are)
self.multiword_keywords = [
    ("kem bhai",       TokenType.KEM_BHAI),
    ("aavjo bhai",     TokenType.AAVJO_BHAI),
    ("bhai bol",       TokenType.BHAI_BOL),
    ("bapu tame bolo", TokenType.BAPU_TAME_BOLO),
    ("bhai chhe",      TokenType.BHAI_CHHE),
    ("bhai nathi",     TokenType.BHAI_NATHI),
    ("jya sudhi",      TokenType.JYA_SUDHI),
    ("tame jao",       TokenType.TAME_JAO),
    ("aagal vado",     TokenType.AAGAL_VADO),
    ("nahi to",        TokenType.ELSE),
]

# At scan time: try each multi-word keyword before anything else.
# scan_token() has already consumed one character, hence the -1 offsets.
remaining = self.source[self.current - 1:]
for phrase, token_type in self.multiword_keywords:
    if remaining.startswith(phrase):
        # advance past the full phrase
        self.current += len(phrase) - 1
        self.add_token(token_type)
        return

Step-by-step scan trace

Here is exactly what the lexer does when it processes this two-line program:

input source
kem bhai
  bhai bol "kem cho!"
aavjo bhai
scanning trace - cursor position and tokens emitted
  Source (shown with cursor position ^ advancing left-to-right):

  Line 1:  k e m   b h a i 

           ^
           try multi-word: source starts with "kem bhai"  MATCH
           emit  KEM_BHAI  'kem bhai'  1:0
           advance cursor 8 chars, skip '\n', increment line

  Line 2:     b h a i   b o l   " k e m   c h o ! " 

              ^
              skip leading spaces (positions 0-1)
              ^
              try multi-word: "bhai bol" MATCH at col 2
              emit  BHAI_BOL  'bhai bol'  2:2
              advance cursor 8 chars

              skip space (position 10)
                            ^
              character is '"' -> start string scan
              accumulate chars until closing '"'
              emit  STRING  '"kem cho!"'  2:11
              advance cursor 10 chars

  Line 3:  a a v j o   b h a i
           ^
           try multi-word: "aavjo bhai" MATCH at col 0
           emit  AAVJO_BHAI  'aavjo bhai'  3:0
           advance cursor 10 chars

  End of source -> emit  EOF  ''  4:0

  Final token stream:
  ┌──────────────┬──────────────────────┬───────┐
  │ type         │ lexeme               │ pos   │
  ├──────────────┼──────────────────────┼───────┤
  │ KEM_BHAI     │ 'kem bhai'           │ 1:0   │
  │ BHAI_BOL     │ 'bhai bol'           │ 2:2   │
  │ STRING       │ '"kem cho!"'         │ 2:11  │
  │ AAVJO_BHAI   │ 'aavjo bhai'         │ 3:0   │
  │ EOF          │ ''                   │ 4:0   │
  └──────────────┴──────────────────────┴───────┘

All token types

  token type        source text / meaning

  MULTI-WORD KEYWORDS
  KEM_BHAI          kem bhai - program start
  AAVJO_BHAI        aavjo bhai - program end
  BHAI_BOL          bhai bol - print statement
  BAPU_TAME_BOLO    bapu tame bolo - read input
  BHAI_CHHE         bhai chhe - boolean true
  BHAI_NATHI        bhai nathi - boolean false
  JYA_SUDHI         jya sudhi - while condition
  TAME_JAO          tame jao - break
  AAGAL_VADO        aagal vado - continue
  ELSE              nahi to - else

  SINGLE-WORD KEYWORDS
  AA                aa - variable declaration
  CHE               che - assignment
  JO                jo - if
  FARVU             farvu - loop body

  LITERALS
  INTEGER           42, 0, -1 (Python int)
  FLOAT             3.14, 0.5 (Python float)
  STRING            "hello" (Python str, double-quoted)
  BOOLEAN           bhai chhe / bhai nathi (Python bool)
  IDENTIFIER        x, score, myVar (variable names)

  OPERATORS
  PLUS / MINUS / MULTIPLY / DIVIDE / MODULO      + - * / %
  EQUAL / NOT_EQUAL                              == !=
  LESS / GREATER / LESS_EQUAL / GREATER_EQUAL    < > <= >=

  SPECIAL
  EOF               end of file - always the last token
  NEWLINE           line break - filtered out by the parser

Scanning priority order

When scan_token() starts on a new character, it checks candidates in this exact order. The first match wins.

scan_token() decision order
  On each new character at self.current:

  1.  whitespace?          space / tab / '\r'        → skip, advance
  2.  newline?             '\n'                      → emit NEWLINE, advance line
  3.  multi-word keyword?  "bhai bol", "kem bhai"   → emit keyword token
        (10 candidates, checked with startswith())
  4.  comment?             //                        → skip to end of line
  5.  operator / punct?    + - * / % ( ) { }        → emit operator token
        single characters, looked up in a dict
  6.  two-char operator?   == != <= >= (peek next)   → emit operator token
  7.  string literal?      "                         → scan to closing "
  8.  digit?               0-9                       → scan integer or float
  9.  letter?              a-z A-Z _                 → scan identifier or keyword
        after accumulating word: check keywords dict
        if matched: emit keyword token
        if not matched: emit IDENTIFIER
  10. (nothing matched)    any other character       → raise LexerError
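
The first-match-wins chain for a single character can be sketched as a standalone function. This is a simplified model of the decision order above (comment and two-character-operator handling omitted; names are illustrative):

```python
OPERATORS = set("+-*/%(){}")

def classify_char(ch: str) -> str:
    """First-match-wins classification of the character under the cursor."""
    if ch in " \t\r":
        return "skip"                 # 1. whitespace
    if ch == "\n":
        return "NEWLINE"              # 2. newline
    if ch in OPERATORS:
        return "operator"             # 5. single-char operator / punct
    if ch == '"':
        return "string-start"         # 7. string literal
    if ch.isdigit():
        return "number-start"         # 8. integer or float
    if ch.isalpha() or ch == "_":
        return "word-start"           # 9. identifier or keyword
    raise ValueError(f"unexpected character {ch!r}")  # 10. reject

print(classify_char("%"))  # operator
print(classify_char("7"))  # number-start
```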

What the lexer rejects

The lexer raises LexerError immediately when it encounters something it cannot tokenize. It does not try to recover - the error carries the exact line and column.

Unexpected character: bhai bol x?2

The ? character is not part of any token type in kemlang-py.

Unterminated string: bhai bol "hello

The lexer scans for a closing " on the same line. If end-of-line arrives first, it raises LexerError. Multi-line strings are not supported.
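
A sketch of that same-line scan (a standalone model, not the real kemlang-py code; LexerError here is an illustrative stand-in):

```python
class LexerError(Exception):
    pass

def scan_string(source: str, start: int) -> tuple[str, int]:
    """start points at the opening '"'; return (lexeme, next_pos)."""
    pos = start + 1
    while pos < len(source) and source[pos] not in ('"', "\n"):
        pos += 1
    # end of source or end of line before the closing quote: error
    if pos >= len(source) or source[pos] == "\n":
        raise LexerError("unterminated string")
    return source[start:pos + 1], pos + 1

print(scan_string('bhai bol "hello"', 9))  # ('"hello"', 16)
```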

Invalid number: 3.14.15

Numbers may contain at most one decimal point. The second . is not a digit and not the end of the number, so the lexer raises LexerError.

Why hand-written over regex?

Many lexers use regular expressions to match token patterns. kemlang-py uses a hand-written scanner for three reasons:

Better error messages

A hand-written scanner knows exactly where it is in the source at all times. It can report the precise line and column of every error, not just a regex match failure.

Multi-word keyword support

Regex-based lexers split on word boundaries before checking for keywords. Matching 'bhai bol' as a two-word unit requires either a preprocessing step or a more complex tokenizer design. The hand-written approach handles it naturally with startswith().

Full control over scanning

The scanner can implement context-sensitive behavior (like how string scanning differs from identifier scanning) without complex regex lookahead.