An Introduction to java.util.regex

This series of lessons covering regular expressions in Java was modeled after the tutorial created to teach the com.stevesoft.pat package. The com.stevesoft.pat package is available for download and use from JavaRegex.com. It's an excellent alternative package to harvest the power of regular expressions in Java.

This is the second part of a four part introduction to the java.util.regex package. Part one can be found in The September Newsletter.

Part 2: More Pattern Elements

Capturing Groups and Back References - (X) , Matcher's group(int)

One function of parentheses is to provide a grouping ability for parts of a regular expression. The quantifiers and operators introduced in the previous lesson, that were applied to a single character or character class, can then be applied to a group.

    String input =
      "Fee! Fie! Foe! Fum! " +
      "I smell the blood of an Englishman. " +
      "Be he 'live, or be he dead, " +
      "I'll grind his bones to make my bread." ;

    Pattern pattern = Pattern.compile( "(F[a-z]{2}! ){4}" );
    // Matches four occurrences of a pattern that begins 
    // with "F" followed by two lower case letters, a "!" 
    // and a space.

    Matcher matcher = pattern.matcher( input );

    System.out.println( matcher.find() );  // Prints true.
    System.out.println( matcher.group() ); // Prints "Fee! Fie! Foe! Fum! ".

Capturing groups are numbered according to their appearance in the regular expression. The first opening parenthesis is the start of the first capturing group; the second opening parenthesis is the start of the second capturing group; and so on. Each capturing group ends at the matching closing parenthesis. It is possible to have one capturing group embedded in another. So, the pattern "I (am (Sam))" has two capturing groups. The first capturing group is the pattern "am Sam" and the second capturing group is the pattern "Sam".

The capturing group count and corresponding matched subsequence data are maintained in the Matcher object. A Matcher object's String group( int ) method "returns the input subsequence captured by the given group during the previous match operation." A Matcher object's int groupCount() method "returns the number of capturing groups in this matcher's pattern5."

Group count number zero refers to the entire pattern match, so matcher.group( 0 ) returns the entire previously matched subsequence and is equivalent to matcher.group() . Note that capturing group number zero is not included in the total group count returned by the groupCount() method7.

Consider this mildly more involved example demonstrating capturing groups. Note that the group construct limits the scope of the OR operator.

    input =
      "Humpty Dumpty sat on a wall. " +
      "Humpty Dumpty had a great fall. " +
      "All the king's horses and all the king's men " +
      "Couldn't put Humpty together again! " ;

    pattern = Pattern.compile( "((H|D)(umpty) ){2}" );
    // Matches six characters ending in "umpty" and 
    // beginning with "H" or "D".  Three capturing 
    // groups are defined and remembered by the Matcher.

    matcher = pattern.matcher( input );

    System.out.println( matcher.find() );       // Prints true.
    System.out.println( matcher.groupCount() ); // Prints 3.
    System.out.println( matcher.group( 1 ) );   // Prints "Dumpty ".
    System.out.println( matcher.group( 2 ) );   // Prints "D".
    System.out.println( matcher.group( 3 ) );   // Prints "umpty".
    System.out.println( matcher.group( 0 ) );   // Prints "Humpty Dumpty ".

    // If it was expected that matcher.group( 1 ) should contain 
    // "Humpty", then remember that the group( int ) method 
    // returns the input subsequence captured by the specified 
    // group during the previous match operation.  This match 
    // operation was performed two times - the first time matching 
    // "Humpty" and the second time matching "Dumpty".

Each matched group maintained in the Matcher object is called a "back reference". Referencing a matched group as demonstrated above is one style of back referencing in Java regular expressions. A later lesson will introduce another style and use of back referencing.


Non-Capturing Groups - (?:X)

A slight performance cost is associated with maintaining back references (the group count and matched subsequence data) in the Matcher object. The non-capturing group construct provides the function of grouping pattern elements without the cost of remembering each matched group as a back reference. The syntax for a non-capturing group is simply "(?:X)". A non-capturing group functions much like a capturing group with the distinction that no capturing group specific data is maintained in the Matcher.

    input =
      "Humpty Dumpty sat on a wall. " +
      "Humpty Dumpty had a great fall. " +
      "All the king's horses and all the king's men " +
      "Couldn't put Humpty together again! " ;

    pattern = Pattern.compile( "((?:H|D)(?:umpty) ){2}" );
    // Matches six characters ending in "umpty" and 
    // beginning with "H" or "D".  Three groups 
    // are defined, one is a capturing group that 
    // will be remembered by the Matcher.

    matcher = pattern.matcher( input );

    System.out.println( matcher.find() );       // Prints true.
    System.out.println( matcher.groupCount() ); // Prints 1.
    System.out.println( matcher.group( 1 ) );   // Prints "Dumpty ".
    System.out.println( matcher.group( 0 ) );   // Prints "Humpty Dumpty ".

According to J-Sprint's memory profiler, when the previous two code examples (the searches for Humpty Dumpty) were performed one hundred thousand times each, the non-capturing group strategy demonstrated a performance improvement of roughly 0.0003% - not much to write home about. Alternatively, tests against a larger input character sequence (50KB) composed of one hundred groups per match, resulted in the non-capturing group test consuming approximately 40% as much memory as the capturing group test.


Look Ahead and Look Behind Constructs - (?=X) , (?!X) , (?<=X) , (?<!X)

Java regular expressions provide two "look ahead" constructs. These constructs allow the description of a pattern where a specified pattern only matches if it is followed by the pattern described in the look ahead construct. The pattern described in the look ahead construct is not part of any matched subsequence described by the Matcher object - it is only a requirement that must be met in order for the specified pattern to match. Though the look ahead construct is contained within matching opening and closing parentheses, it is a non-capturing group construct.

    input =
      "Today's specials are apple chocolate pie and cherry banana pie." ;

    pattern = Pattern.compile( "(apple|cherry)(?= chocolate)" );
    // Matches "apple" or "cherry" where the following pattern 
    // matches " chocolate".  " chocolate" is not a part of the 
    // resulting match, it follows it.

    matcher = pattern.matcher( input );

    System.out.println( matcher.find() );       // Prints true.
    System.out.println( matcher.groupCount() ); // Prints 1.
    System.out.println( matcher.group( 1 ) );   // Prints "apple".
    System.out.println( matcher.group() );      // Prints "apple".

    pattern = Pattern.compile( "(apple|cherry)(?! chocolate)" );
    // Matches "apple" or "cherry" where the following pattern 
    // does not match " chocolate".

    matcher = pattern.matcher( input );

    System.out.println( matcher.find() );       // Prints true.
    System.out.println( matcher.groupCount() ); // Prints 1.
    System.out.println( matcher.group( 1 ) );   // Prints "cherry".
    System.out.println( matcher.group() );      // Prints "cherry".

Two "look behind" constructs provide a similar function as the look ahead constructs, the distinction being that the look behind constructs try to match whatever precedes a specified pattern. The pattern described in the look behind construct is not part of any matched subsequence described by the Matcher object - it is only a requirement that must be met in order for the specified pattern to match. The look behind construct is also non-capturing.

    input =
      "Tomorrow's special is fried bananas with baked clam." ;

    pattern = Pattern.compile( "(?<=fried )(bananas|clam)" );
    // Matches "bananas" or "clam" if preceded by "fried ". 
    // "fried " is not part of the resulting match, it precedes it.

    matcher = pattern.matcher( input );

    System.out.println( matcher.find() );       // Prints true.
    System.out.println( matcher.groupCount() ); // Prints 1.
    System.out.println( matcher.group( 1 ) );   // Prints "bananas".
    System.out.println( matcher.group() );      // Prints "bananas".

    pattern = Pattern.compile( "(?<!fried )(bananas|clam)" );
    // Matches "bananas" or "clam" if not preceded by "fried ". 

    matcher = pattern.matcher( input );

    System.out.println( matcher.find() );       // Prints true.
    System.out.println( matcher.groupCount() ); // Prints 1.
    System.out.println( matcher.group( 1 ) );   // Prints "clam".
    System.out.println( matcher.group() );      // Prints "clam".

During the critiquing process of this lesson, Jim Yingst pointed out an important issue concerning look ahead and look behind constructs:

Normally when we use find() repeatedly to match a pattern several times within a given input, each find() "consumes" characters in the input string up to the last character matched by the find. This means that a subsequent find() will not normally "see" any characters which were already matched by previous finds. Lookahead and lookbehind are workarounds for this limitation. We can look ahead to match characters without consuming them, or we can look back to match characters which were already consumed.

This non-consumptive behavior of the look ahead and look behind constructs is also known as zero-width matching.

Consider this related example, also suggested by Jim (and adapted to fit a nursery rhyme).

    input =
      "John Jacob Jingleheimer Schmidt " +
      "His name is my name, too! " +
      "Whenever we go out, " +
      "The people always shout " +
      "There goes John Jacob Jingleheimer Schmidt!" ;

    pattern = Pattern.compile( "(J\\w+)(?=.+Schmidt )" );
    // Matches all words starting with "J" that 
    // precede "Schmidt " (note the space following the t). 
    // The ".+Schmidt " part of the regular 
    // expression is not consumed.

    matcher = pattern.matcher( input );
    while ( matcher.find() )                 // Prints 
    {                                        //   "John"
      System.out.println( matcher.group() ); //   "Jacob"
    }                                        //   "Jingleheimer"

Flags - (?idmsux-idmsux) , (?idmsux-idmsux:X) , and Pattern.compile(String, int)

As previously mentioned, the Pattern class contains two static factory methods that create and return references to a Pattern object. Pattern.compile( String ) creates a new Pattern object from the specified String. Pattern.compile( String , int ) creates a new Pattern object from the specified String with the specified flag. This integer flag changes the way a Pattern matches an input sequence, such as turning on or off case sensitivity or whether the "." meta character can match a line terminator.

Another way to specify flags that adjust the way a Pattern matches is with a flag construct embedded in the regular expression itself. This construct takes the form "(?flags)" and appears as part of the String used to construct a Pattern. The embedded flag construct is a non-capturing group construct (in other words, the opening parenthesis does not count towards the overall group count). The embedded flag construct can be used in combination with the flag passed to the Pattern.compile( String , int ) factory method.

Available embedded flag constructs and available flags to specify when constructing a new Pattern object include:

embedded flags construction flags meanings *
(?i) Pattern.CASE_INSENSITIVE Enables case-insensitive matching.
(?d) Pattern.UNIX_LINES Enables Unix lines mode.
(?m) Pattern.MULTILINE Enables multi line mode.
(?s) Pattern.DOTALL Enables "." to match line terminators.
(?u) Pattern.UNICODE_CASE Enables Unicode-aware case folding.
(?x) Pattern.COMMENTS Permits white space and comments in the pattern.
--- Pattern.CANON_EQ Enables canonical equivalence.
* Please refer to The Pattern Class Documentation for more complete descriptions.

Consider an example with case insensitivity turned on.

    input =
      "Hey, diddle, diddle, " +
      "The cat and the fiddle, " +
      "The cow jumped over the moon. " +
      "The little dog laughed " +
      "To see such sport, " +
      "And the dish ran away with the spoon." ;

    pattern = Pattern.compile( "the \\w+?(?=\\W)" , Pattern.CASE_INSENSITIVE );
    // Matches "the " followed by any word, regardless of case.

    matcher = pattern.matcher( input );

    while ( matcher.find() )                   // Prints 
    {                                          //  The cat
        System.out.println( matcher.group() ); //  the fiddle
    }                                          //  The cow
                                               //  the moon
                                               //  The little
                                               //  the dish
                                               //  the spoon

An equivalent Pattern could be constructed using the "(?i)" embedded flag. If embedded at the beginning of the regular expression, this embedded flag would affect the entire regular expression as would any integer flag specified in Pattern.compile( String , int ) . In the previous example, pattern = Pattern.compile( "(?i)the \\w+?(?=\\W)" ) would have resulted in the same matched subsequences.

Multiple flags can be specified as embedded flags or as the integer argument to Pattern.compile( String , int ) . To specify multiple embedded flags, simply list them, one after the other. "(?is)" would specify that the Pattern is to match regardless of case and the "." meta character can match line terminators. To specify multiple flags as the integer argument to Pattern.compile( String , int ) , OR together the integer constants of the Pattern class that represent the desired behavior.

Consider the following example that demonstrates equivalent use of multiple embedded flags and OR'ed together integer constants passed as the integer flag argument to Pattern.compile( String , int ) .

    input =
      "Green cheese,\n" +
      "Yellow laces,\n" +
      "Up and down\n" +
      "The market places." ;

    pattern = Pattern.compile( "(?is)[a-z]*,.[a-z]*" );
    // Regardless of case, matches consecutive letters 
    // followed by a comma, any character, then more 
    // consecutive letters where the meta character "." 
    // may match line terminators.

    matcher = pattern.matcher( input );
    while ( matcher.find() )                 // Prints 
    {                                        //   cheese,
      System.out.println( matcher.group() ); //   Yellow
    }                                        //   laces,
                                             //   Up

    int flags = Pattern.CASE_INSENSITIVE | Pattern.DOTALL ;
    pattern = Pattern.compile( "[a-z]*,.[a-z]*" , flags );
    // Regardless of case, matches consecutive letters 
    // followed by a comma, any character, then more 
    // consecutive letters where the meta character "." 
    // may match line terminators.

    matcher = pattern.matcher( input );
    while ( matcher.find() )                 // Prints 
    {                                        //   cheese,
      System.out.println( matcher.group() ); //   Yellow
    }                                        //   laces,
                                             //   Up

The embedded flag construct affects the parts of a regular expression that appear after the embedded flag. If an embedded flag appears in the middle of a regular expression, then only the half of the expression, that appears after the flag, is potentially affected.

Using an embedded flag construct, it is possible to specify that only a section (a group) of a regular expression be affected by the flag. The syntax for such a construct is "(?flags:X)" , where "X" is the regular expression to be affected by the flags. This limits the scope of the flags to within the closing parentheses.

The embedded flag construct can also be used to turn off flags. The syntax to turn off a flag or flags is "(?-flags)". So, it is possible to specify a flag be turned on for the entire regular expression by specifying the appropriate integer using Pattern.compile( String , int ) and that the same flag should be turned off for a specified group of the regular expression using "(?-flag:X)".

    input =
      "HARK! HARK! The dogs do bark, " +
      "The beggars are coming to town. " +
      "Some in rags, " +
      "And some in tags, " +
      "And one in a velvet gown!" ;

    pattern = Pattern.compile( "(?-i:[A-Z])[A-Z]*" , Pattern.CASE_INSENSITIVE );
    // Matches any word, regardless of case except 
    // the first letter which must be capitalized.

    matcher = pattern.matcher( input );        // Prints 
    while ( matcher.find() )                   //   HARK
    {                                          //   HARK
        System.out.println( matcher.group() ); //   The
    }                                          //   The
                                               //   Some
                                               //   And
                                               //   And

The embedded flag constructs are non-capturing. So, the enclosing parentheses do not contribute to the Matcher's group count.






Notes and Resources
1 The Regular Expressions Tutorial at JavaRegex.com
2 java.util.regex Package API Documentation
3 java.lang.CharSequence is a new Interface in Java 1.4. A CharSequence is a readable sequence of characters. This interface provides uniform, read-only access to many different kinds of character sequences. String and StringBuffer both implement CharSequence. -- CharSequence API Documentation
4 java.util.regex.Pattern API Documentation
5 java.util.regex.Matcher API Documentation
6 The text for the nursery rhymes in this lesson can be found at http://www.zelo.com/family/nursery/index.asp .
7 Jim Yingst notes:

The API documentation for the groupCount() method of Matcher is misleading - it should say

"Any non-negative integer smaller than or equal to the value returned by this method is guaranteed to be a valid group index for this matcher."

The italicized section does not appear in the current API. (It's fixed in 1.4.1 beta source code, but that's not what you see when you browse the API online.)
8 The Regular Expression Library
9 For further reading on regular expressions, take a look at Mastering Regular Expressions by Jeffrey Friedl. A sample chapter from O'Reilly is available on-line. Jeffrey Friedl maintains a website assiciated with his book at http://regex.info at which he has posted a nice list of alternative Java Regex Packages.


Last Updated: 2003.03.31 0135