How it works
The Parser
The parser takes the flat token stream from the lexer and builds a tree that represents the structure of the program - nesting, operator precedence, and the relationships between statements and the expressions they contain.
Input
List[Token]
NEWLINE tokens filtered
Output
Program
root AST node
Source
parser.py
295 lines
Why a tree?
Source code has inherent structure. An if statement contains a condition and one or two blocks. A block contains statements. A statement might contain an expression. An expression might contain sub-expressions. This hierarchy of containment is exactly what a tree represents.
A flat token list loses this structure. You cannot look at a sequence of tokens [JO, IDENTIFIER, GREATER, INTEGER, LEFT_BRACE, ...]and know where the condition ends, where the then-block starts, and where the else-block begins without building a tree that makes those boundaries explicit.
The Abstract Syntax Tree (AST) is the parser's answer. Each node in the tree corresponds to a grammatical construct. The tree is "abstract" because it drops irrelevant syntax details (braces, the word "che", newlines) and keeps only the semantically meaningful structure.
The kemlang-py grammar (BNF)
A context-free grammar defines what sequences of tokens form a valid program. kemlang-py's grammar can be written in BNF (Backus-Naur Form). Each rule defines one grammatical construct in terms of simpler ones.
program ::= "kem bhai" statement* "aavjo bhai"
statement ::= print_stmt
| declaration
| assignment
| if_stmt
| while_stmt
| break_stmt
| continue_stmt
print_stmt ::= "bhai bol" expression
declaration ::= "aa" IDENTIFIER "che" expression
assignment ::= IDENTIFIER "che" expression
break_stmt ::= "tame jao"
continue_stmt::= "aagal vado"
if_stmt ::= "jo" expression block
( "nahi to" block )?
while_stmt ::= "farvu" block "jya sudhi" expression
block ::= "{" statement* "}"
expression ::= comparison
comparison ::= term ( ( "==" | "!=" | "<" | ">" | "<=" | ">=" ) term )*
term ::= factor ( ( "+" | "-" ) factor )*
factor ::= unary ( ( "*" | "/" | "%" ) unary )*
unary ::= "-" unary
| primary
primary ::= INTEGER | FLOAT | STRING | BOOLEAN
| "bapu tame bolo"
| IDENTIFIER
| "(" expression ")"The grammar is stratified into levels: expression at the top delegates to comparison, which delegates toterm, which delegates tofactor, which reachesunary and finallyprimary. This stratification is how operator precedence is encoded.
Operator precedence
Why does 1 + 2 * 3 equal 7and not 9? Because multiplication has higher precedence than addition. The grammar encodes this by making factor (which handles *,/, %) a lower-level rule than term(which handles +, -).
Lower in the grammar = tighter binding = higher precedence. The chain works like this:
expression() lowest precedence
└── comparison() == != < > <= >=
└── term() + -
└── factor() * / %
└── unary() - (negation)
└── primary() literals, variables, (expr)
highest precedence
Parsing 1 + 2 * 3:
term() calls factor() to get left side
factor() calls unary() -> primary() -> returns Literal(1)
term() sees '+', so it calls factor() again for right side
factor() calls unary() -> primary() -> gets Literal(2)
factor() sees '*', so it calls unary() -> primary() -> Literal(3)
factor() returns Binary(*, Literal(2), Literal(3)) <- binds tighter
term() returns Binary(+, Literal(1), Binary(*, 2, 3))
Result: 1 + (2 * 3) = 7 correct!Recursive descent
kemlang-py uses a hand-written recursive-descent parser. "Recursive descent" means that each grammar rule has a corresponding Python method, and those methods call each other recursively as they parse nested constructs.
def statement(self) -> Stmt | None:
if self.match(TokenType.JO):
return self.if_statement() # dispatch on current token
if self.match(TokenType.AA):
return self.declaration()
# ... etc.
def if_statement(self) -> If:
condition = self.expression() # parse the condition expression
then_branch = self.block() # parse the { ... } body
else_branch = None
if self.match(TokenType.ELSE):
else_branch = self.block() # parse optional nahi to { ... }
return If(condition, then_branch, else_branch)
def expression(self) -> Expr:
return self.comparison() # delegate to next level
def comparison(self) -> Expr:
left = self.term() # get left operand
while self.match(TokenType.EQUAL, TokenType.NOT_EQUAL, ...):
op = self.previous()
right = self.term()
left = Binary(left, op, right) # build binary node
return leftAST node types
Every node in the tree is a frozen Python @dataclass. Dataclasses give free __repr__ and __eq__ without boilerplate, and they are immutable once created - the parser builds the tree, then the interpreter reads it without modification.
Parsing a complete example
Here is what the parser produces for a short conditional program:
kem bhai
aa x che 10
jo x > 5 {
bhai bol "big"
} nahi to {
bhai bol "small"
}
aavjo bhai Program
├── Declaration
│ ├── name: "x"
│ └── initializer: Literal(10)
└── If
├── condition: Binary
│ ├── left: Variable("x")
│ ├── operator: Token(GREATER, ">")
│ └── right: Literal(5)
├── then_branch: Block
│ └── Print
│ └── expression: Literal("big")
└── else_branch: Block
└── Print
└── expression: Literal("small")Notice that the AST contains no braces, no che, no jo, no nahi to. The parser consumed those tokens to understand structure, then discarded them. The tree captures only what the interpreter needs to execute the program.
Parse errors
The parser raises ParseError when the token at the current position does not match what the grammar expects. Unlike the lexer, which errors immediately, the parser reports the specific token it expected and the token it found instead.
Common causes: missing kem bhai or aavjo bhai, a block without braces, a declaration without an initializer (aa x che with nothing after), or a farvu loop without jya sudhi.
ParseError: expected '{' after 'jo' condition at line 3, column 14
ParseError: expected 'jya sudhi' after loop body at line 7, column 0
ParseError: program must start with 'kem bhai' at line 1, column 0