Nicolò Valigi | articles | talks

Parsing function calls and assignments with precedence climbing

Precedence climbing is a variation of recursive descent parsing that's widely used to parse expressions in production compilers (including clang, zsh, etc..). Such a parser turns an expression into an Abstract Syntax Tree (AST) like the following:

There are many good introductions to the algorithm (this one and this other), but they only describe how to parse basic arithmetic expressions. This article shows how to extend them with some extra features needed for a general purpose programming language:

As a result, we will be able to parse more interesting AST's like the following:

The starting point

We start with a basic implementation that covers arithmetic expressions (largely taken from this article). For brevity, the lexer is replaced by an hardcoded list of tokens. The complete code for this exercise is here.

The main precedence climbing loop lives in parse_expression. As in typical implementations, it iterates over primary expressions (ie numbers, identifiers, etc..) and operators until it finds one whose precedence is lower than the lower bound.

fn parse_expression(&mut self, min_precedence: u8) -> Result<Expression, ParseError> {
 let mut primary_lhs = self.parse_primary()?;

 loop {
  if let Some(peek) = self.tokens.peek()
   && let Some(op_info) = get_op_info(peek)
  {
   if op_info.precedence < min_precedence {
    break;
   }

   // Consume the current operator token.
   self.tokens.next();

   let next_min_precedence = match op_info.associativity {
    Associativity::Left => op_info.precedence + 1,
    Associativity::Right => op_info.precedence,
   };

   let primary_rhs = self.parse_expression(next_min_precedence)?;

   primary_lhs = Expression::FunctionCall {
    name: op_info.name.to_string(),
    args: vec![primary_lhs, primary_rhs],
   };
  } else {
   break;
  }
 }

 return Ok(primary_lhs);
}

parse_expression calls into parse_primary which can parse numbers, identifiers, and parenthesized subexpressions (so called primary expressions):

fn parse_primary(&mut self) -> Result<Expression, ParseError> {
 let mut expr = match self.tokens.peek() {
  Some(Token::LParen) => {
   self.expect(Token::LParen)?;
   // Start parsing a subexpression inside parentheses, startin over from the
   // minimum precedence.
   let subexpr = self.parse_expression(PREC_MIN)?;
   self.expect(Token::RParen)?;
   Ok(subexpr)
  }

  Some(Token::Ident(s)) => {
   let name = s.clone();
   self.tokens.next();
   Ok(Expression::Variable { name })
  }

  Some(Token::NumberLit(s)) => {
   match s.parse::<i64>() {
    Ok(x) => {
     self.tokens.next(); // consume number
     Ok(Expression::IntConst { value: x })
    }
    Err(_) => Err(ParseError::new(format!("Invalid number literal: {}", s))),
   }
  }

  None => Err(ParseError::new(
   "Unexpected end of input while parsing expression",
  )),

  _ => Err(ParseError::new(format!(
   "Unexpected token in expression: {:?}",
   self.tokens.peek()
  ))),
 }?;

 Ok(expr)
}

An interesting thing to note is that parse_primary calls back into parse_expression to parse subexpressions inside parentheses, and it does so with the lowest precedence. This effectively starts over the precedence climbing process for the new subexpression.

Function calls and indexing

The first thing we're going to add to our basic parser are function calls and the indexing operator. They're both postfix operators that have the highest precedence (ie bind tighter) than all other infix operators (at least in C-style languages).

We have two options for the implementation:

The choice comes down to a tradeoff between the complexity of parse_primary and parse_expression. I've chosen to add the new logic to parse_primary because parse_expression is already responsible for some tricky bits.

Since we want function calls to have very high precedence, we can attach them to freshly parsed primary expressions by checking if they're followed by a left paren. Afterwards, we parse each comma-separated argument as a separate expression and create the appropriate AST node. Similarly to parenthesized expressions above, each parse_expression recursive call starts over with the minimum precedence.

 // Existing parse_primary code.
 // ...

 loop {
  match self.tokens.peek() {
   Some(Token::LParen) => {
    if let Expression::Variable { name } = expr {
     self.tokens.next(); // consume '('
     let mut args = Vec::new();
     if self.tokens.peek() != Some(&Token::RParen) {
      loop {
       args.push(self.parse_expression(PREC_MIN)?);
       if self.tokens.peek() == Some(&Token::Comma) {
        self.tokens.next(); // consume ','
       } else {
        break;
       }
      }
     }
     self.expect(Token::RParen)?;
     expr = Expression::FunctionCall { name, args };
    } else {
     return Err(ParseError::new(
      "Only identifiers can be called as functions",
     ));
    }
   }
   _ => break,
  }
 }

 Ok(expr)

Indexing with the bracket operator [ is implemented in the same way.

This article shows an implementation of the second option where the function call logic is shoved into parse_expression.

Unary operators

We want the parser to also support unary - and ! operators, respectively negation for numbers and boolean values. These are prefix operators that also bing very tightly. We handle them in parse_primary before all other cases, and call recursively into parse_expression with a very high minimum precedence to implement the high binding power. Since the main precedence climbing loop allows precedence greater or equal than the minimum precedence, this has the nice effect of correctly parsing --x as -(-x).

fn parse_primary(&mut self) -> Result<Expression, ParseError> {
 let mut expr = match self.tokens.peek() {

  Some(Token::Minus) => {
   self.tokens.next();
   // Unary operator parsed with very high precedence.
   let operand = self.parse_expression(PREC_MAX)?;
   let args = vec![operand];
   Ok(Expression::FunctionCall {
    name: "__neg".to_string(),
    args,
   })
  }

  // All existing cases.
  // ...
 }
}

The negation operator can be implemented in the same way.

Variable assignment

Finally, we would like to parse assignments of the form x = 10, and have them be expressions themselves that evaluate to their right hand side. We also need = to have low precedence so that the expression on the right hand side can evaluate properly.

We achieve this by adding the = token to the operator precedence table:

fn get_op_info(op: &Token) -> Option<OpInfo> {
    match op {
        Token::Assign => Some(OpInfo {
            precedence: 2,
            associativity: Associativity::Right,
            name: "__assign",
        }),

  // Existing cases...
  // ...
 }
}

This single change implements the basic functionality but has two gotchas:

We can fix both issues with a little bit of special-case code in parse_expression to swap the AST node type and check the type of the left hand side of the assignment.

 // Existing parse_expression.
 // ...

 let primary_rhs = self.parse_expression(next_min_precedence)?;

 // Rewrite assignments into the proper AST node.
 if op_info.name == "__assign" {
  // Check that the LHS is a variable (ie lvalue).
  if let Expression::Variable { name } = primary_lhs {
   return Ok(Expression::Assignment {
    name: name.to_string(),
    value: Box::new(primary_rhs),
   });
  } else {
   return Err(ParseError::new(format!(
    "Invalid assignment: lhs is not a variable: {:?}",
    primary_lhs
   )));
  }
 } else {
  primary_lhs = Expression::FunctionCall {
   name: op_info.name.to_string(),
   args: vec![primary_lhs, primary_rhs],
  };
 }

Closure

This article extended a basic precedence climbing parser with some more functionality: function calls, unary operators, and assignments. The full code with basic tests is here.