Handling Uncommon File Formats - Introducing Lexers

Every so often one encounters a file that is in some proprietary format not easily readable by the standard I/O routines. Well, readable it is, to be sure, line by line – but making sense of the content of the lines –documented as it may be– has to be automated, and the format seems to be different than any other one you've seen so far: not Properties, not CSV, not XML, not even HTML. No, someone has let their imagination run wild and created something that no class or library known to you can handle. What to do?

What's needed is a parser for this format, and while it's not hard to program one for a format for which example files are available, this isn't something that's fun to develop, and it's an error-prone process, too. Luckily, this is a recurring problem, and much time and effort has been expended to study it, and to come up with algorithms and tools to tackle it. It's the realm of lexers and parsers. Once the domain of the venerable lex and yacc tools of Unix lore, there is now no shortage of tools generating Java code. JFlex is the standard lexer, and Antlr, SableCC and JavaCC are the eminent parsers.

What's the advantage of using these tools? They allow us to specify the format not by hardcoding a series of if/then statements (like "if the last character was a '$' then read an integer, otherwise skip 5 bytes and read a string"), but as a set of rules that describe what is expected, and what should happen if the expected input is indeed read from the file. This is also much easier to adapt in case the file format ever gets extended in the future (as these proprietary formats inevitably tend to be). Arguably, these days XML should be used for these purposes, but it will be a long time until the existing formats have truly gone away. Until then, lexers and parsers can make some unpleasant coding tasks much easier to deal with.

This article introduces the JFlex library, and shows how to use it to read a format well-known to all Java developers: properties files. It starts out by recognizing just the core "key=value" lines, then we add comment lines, then continuation lines (where the value is spread out over several lines), and in the end we add an extension to the format – hierarchical properties that get mapped to Java objects. It uses a number of ready-to-run examples that can be found in this file which also contains the JFlex library.

A Basic Lexical Specification

The way lexers work is by creating a lexical specification. This is essentially a collection of regular expressions mixed with Java code that gets executed whenever one of the regular expressions matches part of the input file. One could describe step by step what goes into such a specification, but instead we'll look at the simplest-possible one that JFlex can use. Have a look at the PropertiesReader1.flex file. It eventually gets translated into a Java source file, so parts of it look like regular Java source code. The basic structure is like this:

Import statements, just like in Java, and copied verbatim to the generated source code
%% – a delimiter
Various declarations that determine specifics of the class about to be generated, e.g. the class name and visibility modifier.
${ – delimiter that indicates the start of the Java code block
Regular Java code that is copied verbatim to the generated source. In this case it contains a main method that opens and parses a file.
%} – delimiter that indicates the end of the Java code block
%% – a delimiter
a list of states and their associated regular expressions

#8 is the most interesting one, so let's look at it in detail. It contains only a single rule:

<YYINITIAL> {
    . {
        readCharacter(yytext());
    }
}

YYINITIAL is code for "start the parsing here". The rule starts by specifying the regular expression that should be matched; in this case it's the dot which means "match any character". If a character is matched, the code in the brackets is executed – the readCharacter method is called. A few lines above you can find its declaration, and you'll see that it takes a string as parameter. So where does the yytext method come from, and what does it do? It's a pre-declared method that returns whatever was matched by the regular expression. So in this case it will always return the single character matched before, which will then be printed to standard out by readCharacter.

So let's see what happens. You can run the code by typing "make run1" or

java -classpath .:JFlex.jar PropertiesReader1 example1.properties

if you don't have make installed.

Unfortunately, only parts of the file are displayed, and then an exception occurs: something about "could not match input". How can that be, given that the dot should match any given character? Turns out the dot matches all characters but newline (\n). So we need a second rule that specifically matches a newline. That's what PropertiesReader1a.flex does; it adds this rule:

    \n {
        readNewline();
    }

If you run it via "make run1a", it'll run to completion and print every character in the properties file.

On To The Properties

Now that we have a program that can parse a file (although not yet in a particularly interesting way), let's teach it about property keys and values. Take a look at the end of PropertiesReader2.flex. It introduces two important features. Firstly, it gives names to particular regular expressions. Here, "WhiteSpace" and "LineEnding" are defined. By defining these we can refer to them by name later on, which is nice if we want to use them several times. Plus, using a name makes it easier to figure what's happening compared to looking at the raw expression.

The second important features are states. The previous example only had a single one –YYINITIAL– and we didn't have to declare it. It is always present, so that the parser knows what to start with. In this case, two more are defined – KEY and VALUE. KEY is used for parsing the left side of a property line, and VALUE for the right side.

So the way to read the specification is the following:

Start in state YYINITIAL
If a white space or line ending is found, ignore it
For anything else, it's the start of a key, so we switch to the state that handles keys. That's what the yybegin method does.
In state KEY
- Read as many characters as possible NOT including an equal sign – that's the value of the key
- Then, switch to reading the value.
In state VALUE

Again, read as many characters as possible, not including a line ending – that's the value.
Upon encountering a line ending, switch back to the initial state.

Whew! It takes a bit of time to learn how to read such a specification, but once you understand the concept of named regular expressions, and how they are used to switch between different states it becomes much clearer. One feature I didn't mention, and that's the call to yypushback in the last rule of YYINITIAL. That rule matches if something significant was read – the first character of the key. If we simply switch to the KEY state, that character would be missing from the key, so this method tells the parser to take that one character and "push it back into the input stream". That way it will be read again in the next step, and the KEY state has a chance to see it. This is needed in situations where there is no distinct separator character between different parts of the input file.

If you run this code ("make run2" or its equivalent), it'll print out all the key/value bindings in the file, in this case:

{key1=value 1, key2.key3=value 2 and 3}

Comments

PropertiesReader3.flex adds support for comments to the parser – lines that start with "#". A new state is added –COMMENT– which reads all available characters until the line ending, and simply prints it to the console. We also need a new rule in the INITIAL state that switches to COMMENT upon encountering "#". Note that in this case we don't need to call yypushback, because that character isn't part to the comment – it acts as a delimiter only.

Continuation Lines

The last missing piece are continuation lines – property values that span more than one line. A line that is to be continued has a backslash ("\") at the end. It signifies that whatever is on the next line is still part of the value. This can go on for more than a single line, allowing for lengthy values without the need to include very long lines in the file. PropertiesReader4.flex supports this.

Remarkably, the only change required is one new rule in the VALUE state. In plain text it says "If there's a backslash in the value, with nothing but white space following it until the line ending, append the text (without the white space) to the value, AND stay in state VALUE." That way, the next line read is treated as part of the current value, and not the start of something new.

Beyond Standard Properties

The last example –PropertiesReader5.flex– adds support for hierarchical properties, something that's not part of standard properties. Properties with a numerical suffix like

hierarchical.0=element0
hierarchical.1=element1
hierarchical.2=element2

are stored in a Map with keys "0", "1" and "2". This isn't meant to be tremendously useful, just an example of how easy it is to extend the parser. All that's needed is one new rule in the KEY state that especially handles keys ending with a numerical suffix. Imagine the amount of code needed if you were to program this by hand.

That's All Folks

Reading an uncommonly formatted file (or other textual input) isn't something you need to do often, but if you do, it's made much easier by using the right tool for the job. Hand-crafted parsers are tricky to get right, and tend to be brittle during maintenance.

JFlex lexers have been built for C++, Groovy, Java, JavaScript, XML, HPGL, PxM images, CSV, RTF, Java Properties (just now :-) and probably much else. I've found JFlex easy to work with, and the generated code easily fast enough for my purposes. Your mileage may vary, of course, but it certainly merits checking out.