An Antlr4-Based Expression Parser

Lou Mauget Java, Programming, Technology Snapshot Leave a Comment

In this blog, we’ll present a simple arithmetic expression parser implemented through an Antlr4 parser generator. It will be able to take in an input string (such as 2+4+-4+-2*10%9*7) to produce the result (-12.0).

You may be thinking, “Great, but what’s the point?” Well, to answer your question, as simple as this example may seem, the principles involved actually extend to use cases such as DSLs, transpilation, and anything else expressible by grammar rules.

This post has two parts. In part 1, we’ll discuss the background components of a parser. In part 2, we’ll cover building the demo and running it. If you already understand grammar parsing, you could skip part one.

Contents:

Part 1: Lexers, Parsers, & Grammars

Lexers
Parsers
Grammars

Part 2: The Demo

Install Antlr4 Environment

IDE
Command Line
Simple Expression Grammar

IntelliJ with Antlr4 Plugin

Command-Line Test Rig

Default Java Package

Mac OS or Linux Test Rig Wrapper
Windows Test Rig Wrapper

Using the Parse AST

Antlr4 Visitors

Execution

Unit Test

Summary
References

Part 1: Lexers, Parsers, & Grammars

Let’s start by skimming three concepts involving the recognition of a language of a grammar: lexer, parser, and grammar:

Lexers

A lexer carries out a lexical analysis that identifies or skips groups of input characters. It provides input to the parser that we’ll cover next. The lexer could be a hand-coded function or procedure. The Antlr4 application generates a lexer from a set of lexer rules, as we’ll see in our expression grammar momentarily. We specify rules; Antlr4 generates a lexer from them.

Rules resemble assignment statements where names begin with uppercase alpha. The lexer tries to match the right-hand side against the current inbound character sequence. If a match is found, the rule reduces to a token named by the left-hand side.

Multiple rules of a set could match, but the earliest rule wins. The right-side syntax uses a regular expression variant. For example, the explicit rule set for our expression lexer is:

NUMBER : ('0' .. '9') + ('.' ('0' .. '9') +)? ;
WS : [ \r\n\t] + -> skip ;

This lexer produces a NUMBER token when it matches a number regular expression. Notice the WS token that matches white space characters? Its right-arrow syntax sends all WS tokens to an ignored “skip channel.” Thus no white space travels to this lexer’s output. Some lexers define comment tokens that are skipped as well.

Consider an expression string 123.45 6.79. A lexer could emit the following tokens for that character string::

NUMBER:”123.45”
NUMBER:”6.79”

Antlr4 enables implicit literal token values sprinkled in grammar rules as well. For example, our demo lexer includes implicit tokens ‘+’, ‘*’, “%’, and ‘-’, that appear in the parser grammar.

A lexer, unlike a parser, doesn’t care about ordering of tokens. There is no inter-token context. The example string 123.45 6.79 makes our lexer happy, but it isn’t a valid arithmetic expression.

On the other hand, a string such as The quick brown fox would anger our lexer because it has no matching rules for that string.

Take-Aways

  1. Any lexer produces a collection of tokens for input to a parser
  2. A lexer-emitted token typically consists of at least a type and a value
  3. Our lexer skips white space according to a rule.
  4. A lexer is blind to the ordering of expected character groups (i.e. item context). For example, our lexer would be equally happy with 1 2 3 4 5, even though it is not a valid arithmetic expression.
  5. Our defined lexer would complain if fed some unknown grouping of characters, such as x1=999y+9

Parsers

A parser recognizes prescribed contextual orderings of tokens specified by a set of parsing rules. Antler4 assumes parser grammar names begin with lowercase alpha, as distinguished from a lexer grammar name. This enables combining the two kinds of rules within one grammar file.

A set of rules describes a source “language”. A hand-coded recursive function set could implement the parser, but Antlr4 generates its parser package from a parser rule text set.

The generated parser matches input tokens in turn to rules, emitting an abstract syntax tree (AST). It is the parser’s output product.

Out beginning rule for the demo expression parser is:

start : expr | <EOF> ;

The parser would quietly stop if the input were empty. Otherwise, it would recursively drill through this rule looking for the earliest matching expr rule, producing output as it goes.

If a token doesn’t match any rule, the parser emits a syntax error. In other words, the parser expects to match tokens in a grammar-defined language context sequence.

Here’s a parser grammar fragment that participates in recognizing tokens of an addition operation such as tokens for characters 3 + 4:

expr := expr addop expr

Some characteristics:

  • Notice that the rule is left-recursive, which is legal in Antlr4.
  • Each expr could be composed of one or more expr AST nodes.
  • As a result, AST has leaf nodes consisting of valid concrete tokens.

A parser caller could “walk” that AST to produce an output or action described by the source language. How?

Antlr4 provides optional callback visitor hooks during AST generation. Each callback would activate when its parent grammar node generates input to the AST. We’ll use visitor callbacks to implement our demo expression parser.

Take-Aways

  1. A lexer simplifies a parser’s task by reducing character groups into recognizable meanings, as well as omitting noise such as whitespace or defined comment sequences.
  2. Parsers expect lexer-emitted tokens to appear in a per-kind sequence defined by parser rules.
  3. A token of a sequence that doesn’t fit a rule causes a syntax error.
  4. An Antlr4 parser outputs an AST from a valid source token sequence.
  5. Antlr4 visitors enable a client application to tap into an AST tree-walk to produce events or result data.

Grammars

Let’s summarize the rule concept. Again, a set of lexer and parser rules define a grammar. Remember that Antlr4 optionally enables combining the lexer and parser into one composite grammar file. The two could be separate files as well. Additionally, we could even segment a large grammar (e.g. think Cobol) into many files via an include mechanism.

See Also:  Creating A Blockchain In JavaScript

Lexer item names begin with an uppercase alphabetic character. Parser character names start with a lowercase letter.

A parser grammar must have a start rule that acts as a kind of entry point. The client code passes the start rule name into Antlr.

Take-Aways

  1. Remember that lexer and parser rules consist of rules that resemble assignment statements.
  2. A rule reduces its right-hand side to a left-hand item.
  3. Items that begin with a lowercase character are parser items.
  4. Items that begin with an uppercase character are lexer items.
  5. A lexer grammar transforms character sequences into tokens, while a parser grammar transforms token sequences to an AST.

Part 2: The Demo

For this part, access the demo code GitHub project.

Antlr4 supports the following targets:

  1. Java
  2. C#
  3. Python (2 and 3)
  4. JavaScript
  5. Go
  6. C++
  7. Swift

Java is the traditional Antlr4 build target, so we’re generating Java code. Let’s install Antlr4, build the demo, generate an AST, and run a test that evaluates arithmetic expressions.

Install Antlr4 Environment

We require Maven 3 to build, package, and test the demo expression parser.

For Antlr installation, refer to this link. An IDE with an Antlr4 plugin is a satisfying environment for debugging grammars using syntax highlighting, and such.

IDE

There are IDE plugins for NetBeans, Eclipse, and IntelliJ. A plugin adds grammar syntax highlighting, code generation, and a grammar test environment. We used IntelliJ IDEA to create the calculator expression parser as a Maven module.

You could as well work with the project in Eclipse or NetBeans instead of IntelliJ.

Command Line

We recommend installing the full Antlr4 package as well as an Antlr4 IDE plugin. That way you could work with the stand-alone Antlr4 test rig and build from the command line using Maven.

We installed on two platforms: MacOS and Windows 10.

On Mac OS we installed full Antlr4 via Homebrew:
brew install antlr

On Windows 10, we installed the full Antlr4 jar in a user-local directory, placing its bin folder on the local path.

We didn’t create a Linux variant. If you build the demo on Linux, refer to this link.

Simple Expression Grammar

Let’s create an Antlr arithmetic expression grammar that exhibits operator precedence and parenthetical precedence hoisting. Binary multiplication, division, or modulus operations on factors will take evaluation precedence over addition or subtraction binary operations on terms. A unary minus on a term will have the highest priority of all.

A string of numbers and operators produces an AST if the parser recognizes a valid sequence of tokens of them produced by Antlr4’s lexical analyzer.

Our directly left-recursive grammar is sufficient to recognize a left-to-right arithmetic expression. The | is an “or” used to create rule variants in order. Separate rules would work as well.

Multiplication or division pairs have precedence over addition or subtraction pairs. A parenthesized group has elevated precedence. A unary minus has the highest precedence of all.

Annotations, such as # ADDOPGRP, decorate the AST. They don’t define the grammar. They’re used by Antlr4 to produce distinct visitor names. We’ll output the input expression result by coding simple callbacks from those annotations. Here’s the combined grammar:

grammar Calculator;

// parser

start : expr | <EOF> ;

expr : '-' expr     # UMINUS
   | expr mulop expr # MULOPGRP
   | expr addop expr # ADDOPGRP
   | '(' expr ')'   # PARENGRP
   | NUMBER      # DOUBLE
   ;

addop : '+' | '-' ;

mulop : '*' | '/' | '%' ;

// lexer

NUMBER : ('0' .. '9') + ('.' ('0' .. '9') +)? ;

WS : [ \r\n\t] + -> skip ;

This grammar combines the lexer and parser source rules. Remember that a lexer element is capitalized while a parser item starts with a lowercase alpha character. We’ve segregated the two source categories into a top and bottom, except that lexer literals such as ‘(‘ are distributed across the parser rules. Those become part of the lexer along with the explicit lexer rules.

Let’s send the following arithmetic expression through the ANTLR Preview test rig. We’ve told it that the grammar is file Calculator.g4 with starting rule “start”.
2 + 3 * 4 + (7-2)

First, we’ll exercise the grammar in an IDE via the IntelliJ IDEA Antlr4 plugin. Afterward, we’ll send it through the Antlr4 command-line test rig.

IntelliJ with Antlr4 Plugin

When using IntelliJ,

  1. Display the Calculator.g4 grammar in the editor;
  2. Set the grammar’s start rule by right-clicking the start rule in the editor;
  3. Select the plugin tab;
  4. Enter an arithmetic expression with the Input option selected

The Parse tree tab instantly renders the input’s recognized AST provided that the parser recognizes the expression as valid.

Command-Line Test Rig

You may need to incorporate grammar testing into a script. Use the Antlr4 Test Rig for testing several aspects of a grammar or just the lexer without an IDE. It assumes that the grammar sits in the default Java package. You can test by issuing grun from a command line. The test rig displays available option helpers when invoked sans-arguments. It has a GUI option that corresponds to the plugin’s tree view. A screenshot follows.

Notice that there are differences from the IDE plugin’s tree. Grammar annotations don’t render here (e.g. our MULOPGRP). Notice the tree view in the left-hand pane. Both panes rendering side-by-side are useful in walking the tree if we click expansion twisties while still viewing the graphic tree. The IDE plugin only displays each view separately.

Default Java Package

Generated code resides in the same package as the grammar. An Antlr4 aggravation is a mostly unwritten assumption that grammars are located in the Java default package. Antlr4 generates code to the grammar’s package. Yes, but we’re conditioned not to write Java code in the default package, correct?

Not to worry – much. If we locate the grammar/lexer in a package directory, then Antlr4 generates its Java code into that package. God is still not in heaven: the easy-to-use test rig wrapper, grun, has eyes only for the default package. We ditched it in favor of adding the grammar to the classpath and invoking TestRig directly from a script.

See Also:  Spring Boot and React: Happily Ever After

Following are examples of bash and Windows test rig scripts. Alas, paths vary with OS and local computers, so you would customize paths for your situation. We added comments to the two paths-of-interest in each script.

Mac OS or Linux test rig wrapper

#!/usr/bin/env bash
#
#-- Antlr4 path (here, to Mac Homebrew tree)
CP=/usr/local/Cellar/antlr/4.7.2/antlr-4.7.2-complete.jar
#
#-- Append Grammar path (here, to generated classes in Maven target)
CP="${CP}:/Users/mauget/IdeaProjects/calculator/parser/target/classes"
#
#-- Call TestRig with command line parameters
#-- E.g. ./gtest.sh com.rogersalumni.calculator.g4.Calculator start -gui <expression.txt
java -cp ${CP} org.antlr.v4.gui.TestRig "[email protected]"

Example invocation:
./gtest.sh com.rogersalumni.calculator.g4.Calculator start -gui <expression.txt

Windows test rig wrapper

echo off
rem -- Antlr4 path (here, to local "mauget" user's java directory)
set CP=c:\users\mauget\java\antlr-4.7.2-complete.jar

rem -- Append Grammar path (here, to generated classes in Maven target)
set CP=%CP%;c:\users\mauget\IdeaProjects\calculator\parser\target\classes

rem -- Call TestRig with command line parameters
rem -- E.g. gtest.cmd com.rogersalumni.calculator.g4.Calculator start -gui <expression.txt
java -cp %CP% org.antlr.v4.gui.TestRig %1 %2 %3 %4 %5

Example invocation:
.gtest.cmd com.rogersalumni.calculator.g4.Calculator start -gui <expression.txt

Using The Parse AST

So far, we can chop input text into tokens, identify correct contextual sequences of them, and then produce an abstract syntax tree (AST) – the parse tree. We can see it in a test view. How does that get us anywhere?

The AST is a collection of identified tokens recognized to be in a correct sequence for the language defined by the grammar.

For our grammar, the following are lexically correct token inputs arranged in a sentence:

- / 3 4

The lexer likes the inputs, but the parser complains of their sequence:

line 1:2 extraneous input '/' expecting {'-', '(', NUMBER}

The parser likes the following arrangement of the same token, producing a tree instead of an error message:

- 3 / 4

We, or something, must walk that AST, top-to-bottom, left-to-right, producing immediate actions or emitting an intermediate language (IL). Let’s walk this AST, as carried out by the Antlr4 test plugin:

Something needs to parse or interpret any emitted intermediate language (IL) to be useful, or else we could carry out actions during AST visitation.

Depending upon the task-at-hand, that IL could be something like JSON, XML, or YAML. Those have standard interpreters. Or, perhaps the parser is a transpiler front end for another language. Here, we could emit custom bytecode for a virtual stack machine.

Antlr4 visitors

So, Antlr4 can programmatically walk the AST, calling our code at nodes-of-interest. That’s our approach. Visitor callbacks supplied by the demo compute the expression on-the-fly during an AST walk. This is possible because we wrote the grammar rule ordering to cause building an AST in correct arithmetic operator precedence.

One of the demo visitor callbacks that evaluates a parsed add operation follows. Note that its name is derived from one of our grammar annotations, ADDOPGRP:

@Override
public Double visitADDOPGRP(CalculatorParser.ADDOPGRPContext ctx) {
	String op = ctx.addop().getText();
	Double left = visit(ctx.expr(0));
	Double right = visit(ctx.expr(1));

	return eval(op, left, right);
}

Execution

  1. The demo jar is target/parser-1.0-SNAPSHOT.jar.
  2. It takes the input stream as a command-line argument.
  3. The lexer consumes the argument, producing a token stream.
  4. The parser consumes the token stream, emitting a tree – the AST.
  5. Finally, the visitor callbacks walk the tree.

That sequence implemented in the main class:

public class CalculatorAppImpl implements CalculatorApp {</code>

private static final Logger Log = LogManager.getLogger(CalculatorAppImpl.class);

public static void main(String[] args) {
String arg = args.length &gt; 0 ? args[0] : "-1 / 2";

CalculatorAppImpl calculator = new CalculatorAppImpl();
Double result = calculator.calculate(arg);
Log.info(arg + " = " + result);
}

@Override
public Double calculate(String arithmeticExpression) {
CodePointCharStream input = CharStreams.fromString(arithmeticExpression);
return compile(input);
}

private Double compile(CharStream input) {
CalculatorParser parser = new CalculatorParser(new CommonTokenStream(new CalculatorLexer(input)));
ParseTree tree = parser.start();

CalculatorVisitorImpl calculatorVisitor = new CalculatorVisitorImpl();
return calculatorVisitor.visit(tree);
}
}

Running Java class com.rogersalumni.calculator.CalculatorAppImpl with the command-line argument: (5-4)*(12-11)/((((5-4)*(12-11)))) will emit the following on stdout:

(5-4)*(12-11)/((((5-4)*(12-11)))) = 1.0

Try pasting the expression left of the equal sign into a node JS repl. It, too, will evaluate the input to 1. Our subset of operator precedence rules mimics those of JavaScript.

Here’s another expression that illustrates the same operation priority as JavaScript arithmetic evaluation:

1000-500*2 = 0.0

That’s it. The parser sucks in an arithmetic expression argument, producing a numeric result on stdout while applying conventional operator precedence ordering,

Unit Test

Run mvn test to exercise the following test. It asserts that a series of expressions evaluate to correct results.

	public class CalculatorAppImplentatonTest {
	@Test
	public void testCalculate() {
  	CalculatorAppImpl app = new CalculatorAppImpl();
  	assertEquals("20.0", String.valueOf(app.calculate("-2+2*10+2")));
  	assertEquals("-12.0", String.valueOf(app.calculate("2+4+-4+-2*10%9*7")));
  	assertEquals("-0.5", String.valueOf(app.calculate("1/-2")));
  	assertEquals("6.0", String.valueOf(app.calculate("4-1+2+1")));
  	assertEquals("1.0", String.valueOf(app.calculate("(5-4)*(12-11)/((((5-4)*(12-11))))")));
  	assertEquals("1.0", String.valueOf(app.calculate("11%(2*5)")));
  	assertEquals("0.0", String.valueOf(app.calculate("1000-500*2")));
  	assertEquals("-100.0", String.valueOf(app.calculate("((((((-99))))))-1")));
  	assertEquals("-5.0", String.valueOf(app.calculate("1 + 2 *	-3")));
	}
}

Summary

In summary, in this post:

  1. We skimmed the background concepts of lexer, parser, and grammar.
  2. Our Antlr4 environment created a parser from a combined lexer-parser arithmetic expression grammar.
  3. An Antlr4 IDE plugin and an Antlr4 stand-alone test rig each exercised the grammar, to produce an AST graphic.
  4. We overrode the Antlr4-generated visitor implementation to compute each evaluated factor or term.
  5. Unit tests exercised the application, feeding it arithmetic expression examples.
  6. We enumerated the application execution steps.

Granted, this is not a profound case, but its technology applies to realistic-use cases where visitors would emit an intermediate language, such as JSON. It could feed a DSL, transpiler, or any other useful task describable by a grammar.

Give it a try, and if you do, let us know how it went via the comment section!

References

Antlr4 Wikipedia Page
Official Antlr4 Documentation
Demo Source Code

What Do You Think?