An Introduction to java.util.regex - Lesson 1

by Dirk Schreckmann

This series of lessons covering regular expressions in Java was modeled after the tutorial created to teach the com.stevesoft.pat package [1]. The com.stevesoft.pat package is available for download and use from JavaRegex.com. It's an excellent alternative package for harnessing the power of regular expressions in Java.

Part 1: Basic Pattern Elements

What are regular expressions and what are they good for?

For the purposes of this tutorial, let's loosely define a regular expression as a pattern of text. For a more formal definition of regular expression, try a search on Google for Chomsky + hierarchy + regular + expression.

Regular expressions can be used to verify a data input format (such as a phone number, a zip code, an email address, a social security number, or even the JavaRanch naming policy), to search for text data that matches a specified pattern, or to find potential grammatical errors such as repetitious words (which a spell checking algorithm might not find).

The java.util.regex package provides two classes for comparing a regular expression against an input String (actually against any CharSequence object [2]). A Pattern object is a compiled representation of a regular expression [3]. A Matcher object is an engine that performs match operations on a character sequence by interpreting a Pattern [4]. The Pattern class contains two static factory methods, compile( String ) and compile( String , int ) , that create and return a Pattern object. The Pattern instance method matcher( CharSequence ) creates and returns a Matcher object based on the Pattern instance and the input CharSequence.

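Four methods are available to a Matcher object to match part or all of an input String against a regular expression pattern. From the Matcher class documentation [4]:

matches() Attempts to match the entire input sequence against the pattern.
lookingAt() Attempts to match the input sequence, starting at the beginning, against the pattern.
find() Attempts to find the next subsequence of the input sequence that matches the pattern.
find( int ) Resets the matcher and then attempts to find the next subsequence of the input sequence that matches the pattern, starting at the specified index.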

Each of these methods will return true if the pattern matched part or all of the input String as appropriate. As a side effect, the find methods set internal state information in the Matcher object that is used to track the last location where a match occurred.
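
The state that the find methods record can be read back with the Matcher methods start() and end(). The following is a minimal sketch of that idea, not part of the original lesson; the class name MatchStateDemo and its helper methods matchSpans and findFrom are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchStateDemo {

    // Returns the "start-end" offsets of every match of regex in input.
    static String matchSpans(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        StringBuilder spans = new StringBuilder();
        while (matcher.find()) {
            if (spans.length() > 0) {
                spans.append(' ');
            }
            // start() is the index of the first matched character,
            // end() is the index just past the last matched character.
            spans.append(matcher.start()).append('-').append(matcher.end());
        }
        return spans.toString();
    }

    // Returns the start offset of the first match at or after fromIndex,
    // or -1 if there is no such match.
    static int findFrom(String regex, String input, int fromIndex) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        // find( int ) resets the Matcher and searches from fromIndex.
        return matcher.find(fromIndex) ? matcher.start() : -1;
    }

    public static void main(String[] args) {
        String input = "She sells sea shells by the sea shore.";
        System.out.println(matchSpans("sea", input));   // Prints 10-13 28-31
        System.out.println(findFrom("sea", input, 14)); // Prints 28
        System.out.println(findFrom("sea", input, 29)); // Prints -1
    }
}
```

Each successful find() continues from the end of the previous match, which is why the loop in matchSpans visits both occurrences of "sea".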


Plain Text

The most basic regular expression is a literal text string. The alphanumeric characters are interpreted literally [1].

The word "shells" can be found in a String as follows:

    String input = "She sells sea shells by the sea shore." ;
    
    // Create a Pattern object.  A Pattern is a 
    // compiled representation of a regular expression.
    Pattern pattern = Pattern.compile( "shells" );
    
    // Create a Matcher object.  A Matcher is an engine that 
    // performs match operations on a character sequence by 
    // interpreting a Pattern.
    Matcher matcher = pattern.matcher( input );
    
    System.out.println( matcher.matches() );
    // Prints false.  matcher.matches() attempts to 
    // match the entire input sequence against the pattern.  
    // The match would have succeeded, if the pattern described 
    // the entire input String.
    
    System.out.println( matcher.lookingAt() );
    // Prints false.  matcher.lookingAt() attempts to 
    // match the input sequence, starting at the beginning, 
    // against the pattern.  The match would have succeeded, 
    // if the pattern were "She".
    
    System.out.println( matcher.find() );
    // Prints true.  matcher.find() attempts to find 
    // the next subsequence of the input sequence that 
    // matches the pattern.

Matcher and Pattern objects have methods to return the parts of an input String matched against a pattern. The String group() method of the Matcher class returns the input subsequence matched by the previous match [4]. The String[] split( CharSequence ) method of the Pattern class splits the specified input character sequence around matches of a pattern [3]. (Since Java 1.4, String objects also have two split methods that take a regular expression as a parameter and split the String object around matches of the regular expression, returning a String array of the result.)

    // After a failed match attempt, the Matcher holds no 
    // match information, and methods that query for such 
    // information (group(), for example) throw an 
    // IllegalStateException.  After a successful find(), 
    // the next find() continues from the end of that match 
    // rather than from the beginning of the input.  So, 
    // reset the Matcher to search from the beginning again.
    matcher.reset();
    
    System.out.println( matcher.find() );       // Prints true.
    
    System.out.println( matcher.group() );      // Prints shells.
    
    String[] splits = pattern.split( input );
    System.out.println( splits.length );        // Prints 2.
    
    for ( int i = 0 ; i < splits.length ; i++ ) // Prints 
    {                                           // She sells sea
        System.out.println( splits[ i ] );      //  by the sea shore.
    }
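
The two split methods of the String class mentioned above can be sketched as follows. This is an illustrative example only; the class name StringSplitDemo and its helper methods are hypothetical, while split( String ) and split( String , int ) are the actual String methods.

```java
class StringSplitDemo {

    // Splits input around every match of regex.
    static String[] splitAll(String input, String regex) {
        return input.split(regex);
    }

    // Splits input around matches of regex, producing at most limit pieces.
    static String[] splitAtMost(String input, String regex, int limit) {
        return input.split(regex, limit);
    }

    public static void main(String[] args) {
        String input = "She sells sea shells by the sea shore.";

        String[] all = splitAll(input, "sea");
        System.out.println(all.length);     // Prints 3.

        String[] limited = splitAtMost(input, "sea", 2);
        System.out.println(limited.length); // Prints 2.
        System.out.println(limited[1]);     // Prints " shells by the sea shore."
                                            // (without the quotes).
    }
}
```

With a limit of 2, the input is split around the first match only, so the second "sea" stays in the last piece.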

Note that regular expression patterns in Java are by default case sensitive. So, using the same input, a search for "Shells" fails.

    pattern = Pattern.compile( "Shells" );
    
    // Since a new Pattern is to be used, a new Matcher 
    // must also be used.
    matcher = pattern.matcher( input );
    
    System.out.println( matcher.find() ); // Prints false.
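
When case should not matter, the two-argument factory method compile( String , int ) accepts match flags such as Pattern.CASE_INSENSITIVE. The sketch below is one way to try it out; the class name CaseInsensitiveDemo and the helper firstMatch are hypothetical names chosen for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CaseInsensitiveDemo {

    // Returns the first subsequence of input matched by regex when compiled
    // with the given flags, or null if nothing matches.
    static String firstMatch(String regex, int flags, String input) {
        Matcher matcher = Pattern.compile(regex, flags).matcher(input);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        String input = "She sells sea shells by the sea shore.";

        // A flags value of 0 means no flags: the default, case sensitive.
        System.out.println(firstMatch("Shells", 0, input));
        // Prints null.

        System.out.println(firstMatch("Shells", Pattern.CASE_INSENSITIVE, input));
        // Prints shells.
    }
}
```

Note that group() returns the text as it appears in the input ("shells"), not as it is written in the pattern.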

Meta characters

Meta characters, in regular expressions, are characters that allow the description of more flexible text patterns than exact text (such as "shells" or "shore"). Meta characters are not interpreted literally. Meta characters used in describing Java regular expressions include "." , "?" , "+" , "*" , "{" , "}" , "(" , ")" , "[" , "]" , "$" , "^" , "|" , "&&" , "<" , "!" , "=" , ":" , and "\" . Some of these meta characters and their uses are introduced in this lesson. Further meta characters and their uses will be covered in following lessons.


Repetition Quantifiers

Characters such as "?" , "+" , "*" , and "{}" provide the ability to match a pattern element multiple (or zero) times.

? Matches the preceding element zero or one times.
+ Matches the preceding element one or more times.
* Matches the preceding element zero or more times.
{n} Matches the preceding element exactly n times.
{min,max} Matches the preceding element a specified number of times from min to max inclusive.
{min,} Matches the preceding element min or more times.

Each quantifier applies to the pattern element that immediately precedes it, so a quantifier cannot appear on its own.

    input = "Some say, ABBA is a grooooovy music group." ;
    
    pattern = Pattern.compile( "is?" );
    // Matches the "i" character or the "i" character 
    // followed by an "s".
    
    matcher = pattern.matcher( input );
    System.out.println( matcher.find() );  // Prints true.
    System.out.println( matcher.group() ); // Prints is.
    
    pattern = Pattern.compile( "x?" );
    // Matches the "x" character or nothing.
    
    matcher = pattern.matcher( input );
    System.out.println( matcher.find() );  // Prints true.
    System.out.println( matcher.group() ); // Prints the empty String.
    
    pattern = Pattern.compile( "AB+A" );
    // Matches the sequence of characters that begins 
    // and ends with "A" with one or more "B" characters 
    // in the middle.
    
    matcher = pattern.matcher( input );
    System.out.println( matcher.find() );  // Prints true.
    System.out.println( matcher.group() ); // Prints ABBA.
    
    pattern = Pattern.compile( "AB*C*A" );
    // Matches the sequence of characters that begins 
    // and ends with "A" and has zero or more "B" 
    // characters followed by zero or more "C" characters 
    // in the middle.
    
    matcher = pattern.matcher( input );
    System.out.println( matcher.find() );  // Prints true.
    System.out.println( matcher.group() ); // Prints ABBA.
    
    pattern = Pattern.compile( "gro{2,}vy" );
    // Matches the sequence of characters that begins 
    // "gr" , ends with "vy" , and has two or more "o" 
    // characters in the middle.
    
    matcher = pattern.matcher( input );
    System.out.println( matcher.find() );  // Prints true.
    System.out.println( matcher.group() ); // Prints grooooovy.
    
    pattern = Pattern.compile( "gro?vy" );
    // Matches the sequence of characters that begins 
    // "gr" , ends with "vy" , and has zero or one "o" 
    // characters in the middle.
    
    matcher = pattern.matcher( input );
    System.out.println( matcher.find() );  // Prints false.
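
The "{n}" and "{min,max}" forms are not exercised above, so here is a small sketch that uses them against the same input. The class name CountedQuantifierDemo and the helper firstMatch are hypothetical names for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CountedQuantifierDemo {

    // Returns the first subsequence of input matched by regex, or null.
    static String firstMatch(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        String input = "Some say, ABBA is a grooooovy music group.";

        // Exactly five "o" characters between "gr" and "vy".
        System.out.println(firstMatch("gro{5}vy", input));   // Prints grooooovy.

        // Two to four "o" characters are not enough: "grooooovy" has five.
        System.out.println(firstMatch("gro{2,4}vy", input)); // Prints null.

        // Without the surrounding text, the quantifier takes as many
        // "o" characters as it is allowed to, here four.
        System.out.println(firstMatch("o{2,4}", input));     // Prints oooo.
    }
}
```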

Character Classes

Square brackets, "[]" , allow the description of a set of characters (known as a character class), of which one and only one character must match.

    pattern = Pattern.compile( "[Rr]egular" );
    // Matches "Regular" or "regular".
    
    input = "I like regular expressions." ;
    matcher = pattern.matcher( input );
    System.out.println( matcher.find() );  // Prints true.
    System.out.println( matcher.group() ); // Prints regular.
    
    input = "Regular expressions are great." ;
    matcher.reset( input );
    // Since the same Pattern is to be used to match 
    // against a new input, the Matcher need only 
    // be reset using the new input sequence 
    // (a new Matcher need not be obtained.)
    
    System.out.println( matcher.find() );  // Prints true.
    System.out.println( matcher.group() ); // Prints Regular.

To match any single lowercase letter of the English alphabet, it's possible to specify such a pattern as:

    [abcdefghijklmnopqrstuvwxyz]

Fortunately, a few "shortcut" regular expression constructs are available. To match a range of characters, the "-" character can be used. So, "[a-z]" describes the same set of characters: every lowercase letter from "a" through "z".

The regular expression syntax allows two styles to describe the union of two or more character classes. "[a-c[f-h]]" and "[a-cf-h]" both describe the character class "[abcfgh]".
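
The equivalence of the two union styles can be checked with a short sketch like the one below. The class name CharClassUnionDemo and the helper matchedChars are hypothetical names used only for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CharClassUnionDemo {

    // Concatenates every match of regex found in input.  With a pattern
    // that is a single character class, each match is one character.
    static String matchedChars(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        StringBuilder matched = new StringBuilder();
        while (matcher.find()) {
            matched.append(matcher.group());
        }
        return matched.toString();
    }

    public static void main(String[] args) {
        String input = "abcdefghij";

        // All three patterns describe the same character class.
        System.out.println(matchedChars("[a-c[f-h]]", input)); // Prints abcfgh.
        System.out.println(matchedChars("[a-cf-h]", input));   // Prints abcfgh.
        System.out.println(matchedChars("[abcfgh]", input));   // Prints abcfgh.
    }
}
```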

Other "shortcut" constructs include these often used predefined character classes [3]:

. matches a single character (may or may not match line terminators)
\d matches a digit: [0-9]
\D matches a non-digit: [^0-9] *
\s matches a whitespace character: [ \t\n\x0B\f\r] (see footnote on characters)
\S matches a non-whitespace character: [^\s] *
\w matches a word character: [a-zA-Z_0-9]
\W matches a non-word character: [^\w] *
 * "^" is the NOT operator and is covered further down on this page.

Note that the backslash character is itself an escape character in Java String literals, so it must be escaped (quoted) with another backslash in order for the regular expression to receive a literal backslash. So, when specifying a predefined character class, don't forget to escape the backslash with another backslash. For example, to match the word eat surrounded by whitespace, the pattern \seat\s must be specified as "\\seat\\s".

    pattern = Pattern.compile( "\\seat\\s" );
    // Matches "eat" surrounded by whitespace.
    
    input = "When do we eat?" ;
    matcher = pattern.matcher( input );
    System.out.println( matcher.find() );  // Prints false.
    
    input = "We eat when we're hungry." ;
    matcher.reset( input );
    System.out.println( matcher.find() );  // Prints true.
    System.out.println( matcher.group() ); // Prints " eat ".
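
The predefined character classes combine naturally with the repetition quantifiers to verify an input format. The sketch below checks for a zip code, under the simplifying assumption that a zip code is exactly five digits; the class name ZipCodeDemo and the method isFiveDigitZip are hypothetical names for this example.

```java
import java.util.regex.Pattern;

class ZipCodeDemo {

    // Assumption for this sketch: a zip code is exactly five digits.
    // The pattern \d{5} is written "\\d{5}" in a Java String literal.
    private static final Pattern ZIP = Pattern.compile("\\d{5}");

    // matches() succeeds only if the entire input fits the pattern.
    static boolean isFiveDigitZip(String input) {
        return ZIP.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isFiveDigitZip("90210"));      // Prints true.
        System.out.println(isFiveDigitZip("9021"));       // Prints false.
        System.out.println(isFiveDigitZip("90210-1234")); // Prints false.
    }
}
```

Using matches() rather than find() matters here: find() would report a match for "90210-1234" because the input contains five digits in a row.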

The OR , NOT , and AND Operators

"|" is the OR operator. When used outside of a character class, it allows the matching of one or the other pattern. Inside of a character class, "|" is interpreted as a literal character and serves no meta character function.

    pattern = Pattern.compile( "apple|orange" );
    // Matches "apple" or "orange".
    
    input = "I ate my orange." ;
    matcher = pattern.matcher( input );
    System.out.println( matcher.find() );  // Prints true.
    System.out.println( matcher.group() ); // Prints orange.
    
    input = "I ate my apple." ;
    matcher.reset( input );
    System.out.println( matcher.find() );  // Prints true.
    System.out.println( matcher.group() ); // Prints apple.
    
    input = "I don't have any more fruit." ;
    matcher.reset( input );
    System.out.println( matcher.find() );  // Prints false.
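
The difference between "|" outside and inside a character class can be seen in a short sketch. The class name OrOperatorDemo and the helper firstMatch are hypothetical names used only here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OrOperatorDemo {

    // Returns the first subsequence of input matched by regex, or null.
    static String firstMatch(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        String input = "x|y";

        // Outside a character class, "|" means OR: match "a" or match "b".
        // The input contains neither.
        System.out.println(firstMatch("a|b", input));   // Prints null.

        // Inside a character class, "|" is a literal character: the class
        // contains "a", "|", and "b".
        System.out.println(firstMatch("[a|b]", input)); // Prints |.
    }
}
```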

"^" is the NOT operator when prefixed inside a character class. "[^b]" would match any character other than a "b". Note that it does not match the empty string. Outside a character class, "^" takes on a different meaning and is covered in a later lesson.

    pattern = Pattern.compile( "[^b]lop" );
    // Matches any four characters, of which the first 
    // cannot be "b" and the last three must be "lop".
    
    input = "blop" ;
    matcher = pattern.matcher( input );
    System.out.println( matcher.find() );  // Prints false.
    
    input = "flop" ;
    matcher.reset( input );
    System.out.println( matcher.find() );  // Prints true.
    System.out.println( matcher.group() ); // Prints flop.
    
    input = "lop" ;
    matcher.reset( input );
    System.out.println( matcher.find() );  // Prints false.
                                           // "[^b]" must match a 
                                           // character; it does not 
                                           // match the empty String.
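
The NOT operator also works with ranges, and a negated class matches any other character, including a space. The following sketch illustrates both points; the class name NegatedClassDemo and the helper firstMatch are hypothetical names for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NegatedClassDemo {

    // Returns the first subsequence of input matched by regex, or null.
    static String firstMatch(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        // A space is a character other than "b", so it satisfies "[^b]".
        // The match is " lop", including the leading space.
        System.out.println(firstMatch("[^b]lop", "a lop").length()); // Prints 4.

        // "[^a-c]" matches any character outside the range "a" to "c".
        System.out.println(firstMatch("[^a-c]lop", "clop")); // Prints null.
        System.out.println(firstMatch("[^a-c]lop", "flop")); // Prints flop.
    }
}
```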

"&&" is the AND operator. It allows the definition of two conditions for a pattern element to match.

    pattern = Pattern.compile( "[a-z&&[^aeiou]].{5}" );
    // Matches any six characters where the first character 
    // is a lower case letter that is not a vowel.
    
    input = "Every lamb counts fish." ;
    matcher = pattern.matcher( input );
    
    while ( matcher.find() ) 
    {
        System.out.println( matcher.group() ); // Prints "very l"
                                               //        "mb cou"
                                               //        "nts fi"
    }
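
To see which characters an intersection actually admits, it can help to run the character class alone against a known input. The sketch below does that; the class name IntersectionDemo and the helper matchedChars are hypothetical names used only for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class IntersectionDemo {

    // Concatenates every match of regex found in input.  With a pattern
    // that is a single character class, each match is one character.
    static String matchedChars(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        StringBuilder matched = new StringBuilder();
        while (matcher.find()) {
            matched.append(matcher.group());
        }
        return matched.toString();
    }

    public static void main(String[] args) {
        // A lower case letter AND one of "d", "e", or "f".
        System.out.println(matchedChars("[a-z&&[def]]", "abcdefgh"));
        // Prints def.

        // A lower case letter AND not in the range "m" to "p".
        System.out.println(matchedChars("[a-z&&[^m-p]]", "klmnopqr"));
        // Prints klqr.
    }
}
```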

Maximal and Minimal (Greedy and Reluctant) Matching Behavior

Note that the default behavior of the regular expression elements presented so far is to be greedy (hungry) and match as much of the input as possible (a maximal match).

    input = "1001 0101 0011 1100" ;
    
    pattern = Pattern.compile( "1.*1" );
    // Matches the sequence of characters where the first 
    // and last character is "1" with zero or more characters 
    // in between.
    
    matcher = pattern.matcher( input );
    System.out.println( matcher.find() );  // Prints true.
    System.out.println( matcher.group() ); // Prints 1001 0101 0011 11.

A minimal match, also known as a reluctant match, matches as little of the input sequence as possible while still satisfying the entire regular expression. Applied to the example above, a reluctant version of the pattern matches the input sequence four times, the first matching subsequence being "1001". To specify a minimal match, follow the appropriate repetition quantifier construct with a "?" . The "?" modifies the behavior of the repetition quantifier to match reluctantly.

    input = "1001 0101 0011 1100" ;
    
    pattern = Pattern.compile( "1.*?1" );
    // Matches the sequence of characters where the first 
    // and last character is "1" with as few as possible 
    // characters in between to make the pattern match.
    
    matcher = pattern.matcher( input );
    
    while ( matcher.find() )                 // Prints 
    {                                        //  1001
      System.out.println( matcher.group() ); //  101
    }                                        //  11
                                             //  11

?? Matches the preceding element up to one time, as few times as possible.*
+? Matches the preceding element one or more times, as few times as possible.*
*? Matches the preceding element zero or more times, as few times as possible.*
{min,max}? Matches the preceding element from min to max inclusive, as few times as possible.*
{min,}? Matches the preceding element min or more times, as few times as possible.*
  * Note that "as few times as possible" means "as few times as possible in order to satisfy the entire regular expression being matched."
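
The reluctant quantifiers listed above can be compared with their greedy counterparts in a sketch such as this one. The class name ReluctantDemo and the helper firstMatch are hypothetical names for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReluctantDemo {

    // Returns the first subsequence of input matched by regex, or null.
    static String firstMatch(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        String input = "grooooovy";

        System.out.println(firstMatch("o+", input));      // Prints ooooo.
        System.out.println(firstMatch("o+?", input));     // Prints o.

        System.out.println(firstMatch("o{2,4}", input));  // Prints oooo.
        System.out.println(firstMatch("o{2,4}?", input)); // Prints oo.

        // The rest of the pattern still has to match, so the reluctant
        // quantifier ends up taking all five "o" characters here.
        System.out.println(firstMatch("gro+?vy", input)); // Prints grooooovy.
    }
}
```

The last call shows what "as few times as possible in order to satisfy the entire regular expression" means in practice: a reluctant quantifier starts with the minimum and takes more only when the remainder of the pattern would otherwise fail.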





Notes and Resources
1 The Regular Expressions Tutorial at JavaRegex.com
2 java.lang.CharSequence is an interface introduced in Java 1.4. A CharSequence is a readable sequence of characters. This interface provides uniform, read-only access to many different kinds of character sequences. String and StringBuffer both implement CharSequence. -- CharSequence API Documentation
3 java.util.regex.Pattern API Documentation
4 java.util.regex.Matcher API Documentation
5 java.util.regex Package API Documentation
6 From The Pattern Class Documentation:

Valid Characters in regular expressions
x      - The character x 
\\     - The backslash character 
\0n    - The character with octal value 0n (0 <= n <= 7) 
\0nn   - The character with octal value 0nn (0 <= n <= 7) 
\0mnn  - The character with octal value 0mnn (0 <= m <= 3, 0 <= n <= 7) 
\xhh   - The character with hexadecimal value 0xhh 
\uhhhh - The character with hexadecimal value 0xhhhh 
\t     - The tab character ('\u0009') 
\n     - The newline (line feed) character ('\u000A') 
\r     - The carriage-return character ('\u000D') 
\f     - The form-feed character ('\u000C') 
\a     - The alert (bell) character ('\u0007') 
\e     - The escape character ('\u001B') 
\cx    - The control character corresponding to x 
7 The Regular Expression Library
8 For further reading on regular expressions, take a look at Mastering Regular Expressions by Jeffrey Friedl. A sample chapter from O'Reilly is available on-line.
9 The code examples in this article were formatted using the JavaCodeSyntaxHighlighter filter created by John Sun for use with Jive Software's Jive Forums application.



A special thank you to (in alphabetical order) Cindy Glass, Mapraputa Is, Thomas Paul, Matthew Phillips, Mark Spritzler, Janet Wilson, and Jim Yingst for their valuable feedback and help in improving this lesson from its first draft edition.


Last Updated: 2002.08.30 17:51