Precedence climbing is a variation of recursive descent parsing that's widely used to parse expressions in production compilers (including clang, zsh, etc..). Such a parser turns an expression into an Abstract Syntax Tree (AST) like the following:
There are many good introductions to the algorithm (this one and this other), but they only describe how to parse basic arithmetic expressions. This article shows how to extend them with some extra features needed for a general purpose programming language:
[]
operator)-x
and !x
As a result, we will be able to parse more interesting AST's like the following:
We start with a basic implementation that covers arithmetic expressions (largely taken from this article). For brevity, the lexer is replaced by an hardcoded list of tokens. The complete code for this exercise is here.
The main precedence climbing loop lives in parse_expression
. As in typical
implementations, it iterates over primary expressions (ie numbers, identifiers,
etc..) and operators until it finds one whose precedence is lower than the lower
bound.
fn parse_expression(&mut self, min_precedence: u8) -> Result<Expression, ParseError> {
let mut primary_lhs = self.parse_primary()?;
loop {
if let Some(peek) = self.tokens.peek()
&& let Some(op_info) = get_op_info(peek)
{
if op_info.precedence < min_precedence {
break;
}
// Consume the current operator token.
self.tokens.next();
let next_min_precedence = match op_info.associativity {
Associativity::Left => op_info.precedence + 1,
Associativity::Right => op_info.precedence,
};
let primary_rhs = self.parse_expression(next_min_precedence)?;
primary_lhs = Expression::FunctionCall {
name: op_info.name.to_string(),
args: vec![primary_lhs, primary_rhs],
};
} else {
break;
}
}
return Ok(primary_lhs);
}
parse_expression
calls into parse_primary
which can parse numbers,
identifiers, and parenthesized subexpressions (so called primary expressions):
fn parse_primary(&mut self) -> Result<Expression, ParseError> {
let mut expr = match self.tokens.peek() {
Some(Token::LParen) => {
self.expect(Token::LParen)?;
// Start parsing a subexpression inside parentheses, startin over from the
// minimum precedence.
let subexpr = self.parse_expression(PREC_MIN)?;
self.expect(Token::RParen)?;
Ok(subexpr)
}
Some(Token::Ident(s)) => {
let name = s.clone();
self.tokens.next();
Ok(Expression::Variable { name })
}
Some(Token::NumberLit(s)) => {
match s.parse::<i64>() {
Ok(x) => {
self.tokens.next(); // consume number
Ok(Expression::IntConst { value: x })
}
Err(_) => Err(ParseError::new(format!("Invalid number literal: {}", s))),
}
}
None => Err(ParseError::new(
"Unexpected end of input while parsing expression",
)),
_ => Err(ParseError::new(format!(
"Unexpected token in expression: {:?}",
self.tokens.peek()
))),
}?;
Ok(expr)
}
An interesting thing to note is that parse_primary
calls back into
parse_expression
to parse subexpressions inside parentheses, and it does so
with the lowest precedence. This effectively starts over the precedence climbing
process for the new subexpression.
The first thing we're going to add to our basic parser are function calls and the indexing operator. They're both postfix operators that have the highest precedence (ie bind tighter) than all other infix operators (at least in C-style languages).
We have two options for the implementation:
change parse_primary
to handle postfix operators after parsing an atom, and
"wrap" its AST node with the appropriate operations
add special logic to parse_expression
to check for left paren and brackets
before handling infix operators.
The choice comes down to a tradeoff between the complexity of parse_primary
and parse_expression
. I've chosen to add the new logic to parse_primary
because parse_expression
is already responsible for some tricky bits.
Since we want function calls to have very high precedence, we can attach them to
freshly parsed primary expressions by checking if they're followed by a left
paren. Afterwards, we parse each comma-separated argument as a separate
expression and create the appropriate AST node. Similarly to parenthesized
expressions above, each parse_expression
recursive call starts over with the
minimum precedence.
// Existing parse_primary code.
// ...
loop {
match self.tokens.peek() {
Some(Token::LParen) => {
if let Expression::Variable { name } = expr {
self.tokens.next(); // consume '('
let mut args = Vec::new();
if self.tokens.peek() != Some(&Token::RParen) {
loop {
args.push(self.parse_expression(PREC_MIN)?);
if self.tokens.peek() == Some(&Token::Comma) {
self.tokens.next(); // consume ','
} else {
break;
}
}
}
self.expect(Token::RParen)?;
expr = Expression::FunctionCall { name, args };
} else {
return Err(ParseError::new(
"Only identifiers can be called as functions",
));
}
}
_ => break,
}
}
Ok(expr)
Indexing with the bracket operator [
is implemented in the same way.
This article
shows an implementation of the second option where the function call logic is
shoved into parse_expression
.
We want the parser to also support unary -
and !
operators, respectively
negation for numbers and boolean values. These are prefix operators that also
bing very tightly. We handle them in parse_primary
before all other cases,
and call recursively into parse_expression
with a very high minimum precedence
to implement the high binding power. Since the main precedence climbing loop
allows precedence greater or equal than the minimum precedence, this has the
nice effect of correctly parsing --x
as -(-x)
.
fn parse_primary(&mut self) -> Result<Expression, ParseError> {
let mut expr = match self.tokens.peek() {
Some(Token::Minus) => {
self.tokens.next();
// Unary operator parsed with very high precedence.
let operand = self.parse_expression(PREC_MAX)?;
let args = vec![operand];
Ok(Expression::FunctionCall {
name: "__neg".to_string(),
args,
})
}
// All existing cases.
// ...
}
}
The negation operator can be implemented in the same way.
Finally, we would like to parse assignments of the form x = 10
, and have them
be expressions themselves that evaluate to their right hand side. We also need
=
to have low precedence so that the expression on the right hand side can
evaluate properly.
We achieve this by adding the =
token to the operator precedence table:
fn get_op_info(op: &Token) -> Option<OpInfo> {
match op {
Token::Assign => Some(OpInfo {
precedence: 2,
associativity: Associativity::Right,
name: "__assign",
}),
// Existing cases...
// ...
}
}
This single change implements the basic functionality but has two gotchas:
Assignments are parsed into an FunctionCall
AST node just like all other
operators. This is not especially pleasing.
Since =
has a very low precedence, the parser will accept nonsense
assignments like x + x = 12
(in C-speak, these are situations where the left
hand side is not an l-value).
We can fix both issues with a little bit of special-case code in
parse_expression
to swap the AST node type and check the type of the left hand
side of the assignment.
// Existing parse_expression.
// ...
let primary_rhs = self.parse_expression(next_min_precedence)?;
// Rewrite assignments into the proper AST node.
if op_info.name == "__assign" {
// Check that the LHS is a variable (ie lvalue).
if let Expression::Variable { name } = primary_lhs {
return Ok(Expression::Assignment {
name: name.to_string(),
value: Box::new(primary_rhs),
});
} else {
return Err(ParseError::new(format!(
"Invalid assignment: lhs is not a variable: {:?}",
primary_lhs
)));
}
} else {
primary_lhs = Expression::FunctionCall {
name: op_info.name.to_string(),
args: vec![primary_lhs, primary_rhs],
};
}
This article extended a basic precedence climbing parser with some more functionality: function calls, unary operators, and assignments. The full code with basic tests is here.