How it works

The Parser

The parser takes the flat token stream from the lexer and builds a tree that represents the structure of the program - nesting, operator precedence, and the relationships between statements and the expressions they contain.

Input

List[Token]

NEWLINE tokens filtered

Output

Program

root AST node

Source

parser.py

295 lines

Why a tree?

Source code has inherent structure. An if statement contains a condition and one or two blocks. A block contains statements. A statement might contain an expression. An expression might contain sub-expressions. This hierarchy of containment is exactly what a tree represents.

A flat token list loses this structure. You cannot look at a sequence of tokens [JO, IDENTIFIER, GREATER, INTEGER, LEFT_BRACE, ...]and know where the condition ends, where the then-block starts, and where the else-block begins without building a tree that makes those boundaries explicit.

The Abstract Syntax Tree (AST) is the parser's answer. Each node in the tree corresponds to a grammatical construct. The tree is "abstract" because it drops irrelevant syntax details (braces, the word "che", newlines) and keeps only the semantically meaningful structure.

The kemlang-py grammar (BNF)

A context-free grammar defines what sequences of tokens form a valid program. kemlang-py's grammar can be written in BNF (Backus-Naur Form). Each rule defines one grammatical construct in terms of simpler ones.

kemlang-py grammar in BNF notation

  program     ::= "kem bhai" statement* "aavjo bhai"

  statement   ::= print_stmt
                | declaration
                | assignment
                | if_stmt
                | while_stmt
                | break_stmt
                | continue_stmt

  print_stmt  ::= "bhai bol" expression
  declaration ::= "aa" IDENTIFIER "che" expression
  assignment  ::= IDENTIFIER "che" expression
  break_stmt  ::= "tame jao"
  continue_stmt::= "aagal vado"

  if_stmt     ::= "jo" expression block
                  ( "nahi to" block )?

  while_stmt  ::= "farvu" block "jya sudhi" expression

  block       ::= "{" statement* "}"

  expression  ::= comparison

  comparison  ::= term ( ( "==" | "!=" | "<" | ">" | "<=" | ">=" ) term )*

  term        ::= factor ( ( "+" | "-" ) factor )*

  factor      ::= unary ( ( "*" | "/" | "%" ) unary )*

  unary       ::= "-" unary
                | primary

  primary     ::= INTEGER | FLOAT | STRING | BOOLEAN
                | "bapu tame bolo"
                | IDENTIFIER
                | "(" expression ")"

The grammar is stratified into levels: expression at the top delegates to comparison, which delegates toterm, which delegates tofactor, which reachesunary and finallyprimary. This stratification is how operator precedence is encoded.

Operator precedence

Why does 1 + 2 * 3 equal 7and not 9? Because multiplication has higher precedence than addition. The grammar encodes this by making factor (which handles *,/, %) a lower-level rule than term(which handles +, -).

Lower in the grammar = tighter binding = higher precedence. The chain works like this:

operator precedence chain (lower = tighter binding)

  expression()          lowest precedence
    └── comparison()    == != < > <= >=
          └── term()    + -
                └── factor()   * / %
                      └── unary()    - (negation)
                            └── primary()   literals, variables, (expr)
                                            highest precedence

  Parsing 1 + 2 * 3:

    term() calls factor() to get left side
      factor() calls unary() -> primary() -> returns Literal(1)
    term() sees '+', so it calls factor() again for right side
      factor() calls unary() -> primary() -> gets Literal(2)
      factor() sees '*', so it calls unary() -> primary() -> Literal(3)
      factor() returns Binary(*, Literal(2), Literal(3))   <- binds tighter
    term() returns Binary(+, Literal(1), Binary(*, 2, 3))

  Result: 1 + (2 * 3) = 7   correct!

Recursive descent

kemlang-py uses a hand-written recursive-descent parser. "Recursive descent" means that each grammar rule has a corresponding Python method, and those methods call each other recursively as they parse nested constructs.

kemlang/parser.py - how parse_if and parse_expression call each other

def statement(self) -> Stmt | None:
    if self.match(TokenType.JO):
        return self.if_statement()       # dispatch on current token
    if self.match(TokenType.AA):
        return self.declaration()
    # ... etc.

def if_statement(self) -> If:
    condition = self.expression()        # parse the condition expression
    then_branch = self.block()           # parse the { ... } body
    else_branch = None
    if self.match(TokenType.ELSE):
        else_branch = self.block()       # parse optional nahi to { ... }
    return If(condition, then_branch, else_branch)

def expression(self) -> Expr:
    return self.comparison()             # delegate to next level

def comparison(self) -> Expr:
    left = self.term()                   # get left operand
    while self.match(TokenType.EQUAL, TokenType.NOT_EQUAL, ...):
        op = self.previous()
        right = self.term()
        left = Binary(left, op, right)   # build binary node
    return left

AST node types

Every node in the tree is a frozen Python @dataclass. Dataclasses give free __repr__ and __eq__ without boilerplate, and they are immutable once created - the parser builds the tree, then the interpreter reads it without modification.

node typefields

STATEMENTS (PRODUCE SIDE EFFECTS)

Programstatements: list[Stmt]

Blockstatements: list[Stmt]

Printexpression: Expr

Declarationname: str, initializer: Expr

Assignmentname: str, value: Expr

Ifcondition: Expr, then_branch: Block, else_branch: Block | None

Whilebody: Block, condition: Expr (body executes before condition is checked)

Break(no fields)

Continue(no fields)

EXPRESSIONS (RETURN A KEMVALUE)

Literalvalue: int | float | str | bool | None

Variablename: str

Binaryleft: Expr, operator: Token, right: Expr

Unaryoperator: Token, right: Expr

Input(no fields) - evaluates bapu tame bolo at runtime

Parsing a complete example

Here is what the parser produces for a short conditional program:

source

kem bhai
  aa x che 10
  jo x > 5 {
    bhai bol "big"
  } nahi to {
    bhai bol "small"
  }
aavjo bhai

resulting AST (shown as an indented tree)

  Program
  ├── Declaration
  │   ├── name:        "x"
  │   └── initializer: Literal(10)
  └── If
      ├── condition:   Binary
      │                 ├── left:     Variable("x")
      │                 ├── operator: Token(GREATER, ">")
      │                 └── right:    Literal(5)
      ├── then_branch: Block
      │   └── Print
      │       └── expression: Literal("big")
      └── else_branch: Block
          └── Print
              └── expression: Literal("small")

Notice that the AST contains no braces, no che, no jo, no nahi to. The parser consumed those tokens to understand structure, then discarded them. The tree captures only what the interpreter needs to execute the program.

Parse errors

The parser raises ParseError when the token at the current position does not match what the grammar expects. Unlike the lexer, which errors immediately, the parser reports the specific token it expected and the token it found instead.

Common causes: missing kem bhai or aavjo bhai, a block without braces, a declaration without an initializer (aa x che with nothing after), or a farvu loop without jya sudhi.

error format

ParseError: expected '{' after 'jo' condition at line 3, column 14
ParseError: expected 'jya sudhi' after loop body at line 7, column 0
ParseError: program must start with 'kem bhai' at line 1, column 0

The Lexer The Interpreter