MySQL Regular Expression Searches

Revision as of 18:51, 19 October 2007 by Neil (Talk | contribs) (Matching by Text Position)

Revision as of 18:51, 19 October 2007 by Neil (Talk | contribs) (Matching by Text Position)

In previous chapters we have looked at some basic data retrieval filtering techniques using comparison operators and wildcards. Whilst this approaches to data filtering are fine for basic searching, more advanced searching quickly becomes cumbersome using these techniques.

In this chapter we will look at yet another filtering and searching technique supported by MySQL called regular expressions.

What are Regular Expressions?

The term regular expressions is one of those technical terms where the words do very little to explain what the technology does. regular expressions are a feature common to many programming languages and are a topic on which entire can, and indeed, have been written. Fortunately (or unfortunately depending on your perspective), MySQL does not provide as extensive support for regualr expressions as some other programming languages. This is good in that it makes the subject area easier to learn, but may be frustrating if you are already proficient with the rich regular expression syntax available in other languages.

Regular expressions are essentially a set of instructions using a predefined syntax for matching text in a variety of flexible ways. For example, you might want to extract all the occurrences of a particular word sequence from a block of text. You might also want to perform a search for a particular piece of text and replace it with some alternate text.

Regular Expression Character Matching

In order to introduce the REGEXP operator, we will begin by looking at a use of regular expressions that could similarly be used with the LIKE operator. As in the previous chapter we need to retrieve rows from a table taking into consideration the difference in spelling of the color gray (grey). To perform this retrieval we will use the regex dot character matching (.). Rather like the LIKE underscore wildcard, this character indicates that any character in this position in the text will be considered a match. For example:

mysql> SELECT * FROM product WHERE prod_name REGEXP 'Gr.y Computer Case';
+-----------+--------------------+-------------------+
| prod_code | prod_name          | prod_desc         |
+-----------+--------------------+-------------------+
|        11 | Grey Computer Case | ATX PC CASE       |
|        12 | Gray Computer Case | ATX PC CASE (USA) |
+-----------+--------------------+-------------------+
2 rows in set (0.05 sec)

So far we haven't done anything that could not have been achieved using wildcards. Regular expressions, however, do not stop here.


Matching from a Group of Characters

One problem with the approach outlined above is that any letter between the 'Gr' and the 'y' would have registered as a match. In reality we are only interested in words that contain either an 'a' or an 'e' in that location. Fortunately, regular expressions allow us to specify a group of acceptable character matches for any character position. The syntax for this requires that the characters be places in square brackets at the desired location in the match text. For example:

mysql> SELECT * FROM product WHERE prod_name REGEXP 'Gr[ae]y Computer Case';
+-----------+--------------------+-------------------+
| prod_code | prod_name          | prod_desc         |
+-----------+--------------------+-------------------+
|        11 | Grey Computer Case | ATX PC CASE       |
|        12 | Gray Computer Case | ATX PC CASE (USA) |
+-----------+--------------------+-------------------+
2 rows in set (0.00 sec)

Use of this syntax ensures that only the words 'Grey' and 'Gray will match the search criteria. There is no limit to the number of characters that can be grouped in the brackets when using this filtering technique.

Matching from a Range of Characters

The character group matching syntax can be extended to cover range of characters. For example, instead of declaring a regular expression to cover the letters between A and F as [ABCDEF] we could simply specify a range of characters using the '-' character between the upper and lower ranges [A-F]. We could, for example, list a product based on certian model numbers which begin with numbers ranging from 1 to 6:

mysql> SELECT * FROM product WHERE prod_name REGEXP 'CD-RW Model [1-6]543';
+-----------+------------------+-----------+
| prod_code | prod_name        | prod_desc |
+-----------+------------------+-----------+
|         1 | CD-RW Model 4543 | CD Writer |
|        14 | CD-RW Model 5543 | CD Writer |
|        15 | CD-RW Model 6543 | CD Writer |
|        16 | CD-RW Model 2543 | CD Writer |
+-----------+------------------+-----------+
4 rows in set (0.00 sec)

Handling Special Characters

As you have seen, regular expressions assign special meaning to particular characters. For example the dot (.) and square brackets ([]) all have special meaning. Those who studying critical thinking at college will already be questioning what to do if the character sequence that is the subject of a search contains one or more of these characters. Obviously if you are are looking for the following text that looks like a regular expression, the text for which you want to search is, itself, going to be viewed as regular expression syntax.

To address this issue, a concept known as escaping is used. In SQL, escaping involves preceding any characters that may be mis-interpreted as a regular expression special character with double back slashes (\\). For example, suppose we have a row in our product table which reads as follows:

+-----------+--------------------+-----------+
| prod_code | prod_name          | prod_desc |
+-----------+--------------------+-----------+
|        17 | CD-RW Model [7543] | CD Writer |
+-----------+--------------------+-----------+

If we were to search for this without regard to the fact that the prod_name value contains regular expression special characters we will not get what we wanted:

mysql> SELECT * FROM product WHERE prod_name REGEXP 'CD-RW Model [7543]';
+-----------+------------------+-----------+
| prod_code | prod_name        | prod_desc |
+-----------+------------------+-----------+
|         1 | CD-RW Model 4543 | CD Writer |
|        14 | CD-RW Model 5543 | CD Writer |
+-----------+------------------+-----------+
2 rows in set (0.00 sec)

The cause of the problem is that the regular expression has been interpreted as a search for any entries that read 'CD-RW Model' followed by either a 7, 5, 4 or 3. Clearly, what we wanted was the actual text [7543]. If instead we escape the brackets with the \\ escape sequence we get the result we want:

mysql> SELECT * FROM product WHERE prod_name REGEXP 'CD-RW Model \\[7543\\]';
+-----------+--------------------+-----------+
| prod_code | prod_name          | prod_desc |
+-----------+--------------------+-----------+
|        17 | CD-RW Model [7543] | CD Writer |
+-----------+--------------------+-----------+
1 row in set (0.00 sec)

Regular Expressions and Whitespace Characters

Regular expression syntax also provides a mechanism to reference whitespace characters such as tabs, carriage returns and line feeds. These are referenced in a regular expression using metacharacters. These metacharacters are outlined in the following table:

MetacharacterDescription
\\nNew line (line feed)
\\fForm feed
\\tTab
\\rCarriage return
\\vVertical tab


Matching by Character Type

Another useful regular expression trick is to match characters by type or class. For example we might need to specify that a character must be a letter, a number or a alphanumeric. This involves the use of some special class definitions outlines in the following table. Some examples of these special classes in action follow the table:

Class KeywordDescription of Matches
[[:alnum:]]Alphanumeric - any number or letter. Equivalent to [a-z], [A-Z] and [0-9]
[[:alpha:]]Alpha - any letter. Equivalent to [a-z] and [A-Z]
[[:blank:]]Space or Tab. Equivalent to [\\t] and [ ]
[[:cntrl:]]ASCII Control Character
[[:digit:]]Numeric. Equivalent to [0-9]
[[:graph:]]Any character with the exception of space
[[:lower:]]Lower case letters. Equivalent to [a-z]
[[:print:]]Any printable character
[[:punct:]]Characters that are neither control characters, nor alphanumeric (i.e punctuation characters)
[[:space:]]Any whitespace character (tab, new line, form feed, space etc)
[[:upper:]]Upper case letters. Equivalent to [A-Z]
[[:xdigit:]]Any hexadecimal digit. Equivalent to [A-F], [a-f] and [0-9]

Let's now look at some examples. Suppose in our product database we have two products with similar names, the 'One&One VoIP Headset' and the "One2One USB Hub'. In order to retrieve the 'One2One' product we would use the [:digit:] character class:

mysql> SELECT * FROM product WHERE prod_name REGEXP 'One[[:digit:]]One';
+-----------+-----------------+----------------+
| prod_code | prod_name       | prod_desc      |
+-----------+-----------------+----------------+
|        19 | One2One USB Hub | 4 Port USB Hub |
+-----------+-----------------+----------------+
1 row in set (0.00 sec)

Similarly, to extract the 'One&One' product we would use the [:punct:] class:

mysql> SELECT * FROM product WHERE prod_name REGEXP 'One[[:punct:]]One';
+-----------+----------------------+----------------+
| prod_code | prod_name            | prod_desc      |
+-----------+----------------------+----------------+
|        18 | One&One VoIP Headset | Stereo Headset |
+-----------+----------------------+----------------+
1 row in set (0.02 sec)

Regular Expression Repetition Metacharacters

In addition to allowing searches on single instances, regular expressions can also be written to look for repetition in text. this is achieved using a set of repetition metacharacters:

MetacharacterDescription
*Any number of matches
+One or more matches
{n}n number of matches
{n,}Not less than n number of matches
{n1,n2}A range of matches between n1 and n2
?Optional single character match (character my be present or not to qualify for a match)

As always, example do a much better job of demonstrating a concept than data in a table. Let's begin by searching for all 4 digit number sequences in our prod_name column:

mysql> SELECT * FROM product WHERE prod_name REGEXP '[[:digit:]]{3}';
+-----------+---------------------+-----------------+
| prod_code | prod_name           | prod_desc       |
+-----------+---------------------+-----------------+
|         1 | CD-RW Model 4543    | CD Writer       |
|         2 | EasyTech Mouse 7632 | Cordless Mouse  |
|         3 | WildTech 250Gb 1700 | SATA Disk Drive |
|         7 | Dell XPS 400        | Desktop PC      |
|        14 | CD-RW Model 5543    | CD Writer       |
|        15 | CD-RW Model 6543    | CD Writer       |
|        16 | CD-RW Model 2543    | CD Writer       |
|        17 | CD-RW Model [7543]  | CD Writer       |
+-----------+---------------------+-----------------+
8 rows in set (0.00 sec)

In the above example we have indicated that we are looking for any sequence of 3 digits by using the digit:{3} regular expression. In this case we have picked up entries with both 3 and 4 digits in a sequence. Suppose, instead we wanted only 4 digit sequences:

mysql> SELECT * FROM product WHERE prod_name REGEXP '[[:digit:]]{4,}';
+-----------+---------------------+-----------------+
| prod_code | prod_name           | prod_desc       |
+-----------+---------------------+-----------------+
|         1 | CD-RW Model 4543    | CD Writer       |
|         2 | EasyTech Mouse 7632 | Cordless Mouse  |
|         3 | WildTech 250Gb 1700 | SATA Disk Drive |
|        14 | CD-RW Model 5543    | CD Writer       |
|        15 | CD-RW Model 6543    | CD Writer       |
|        16 | CD-RW Model 2543    | CD Writer       |
|        17 | CD-RW Model [7543]  | CD Writer       |
+-----------+---------------------+-----------------+
7 rows in set (0.00 sec)

Here we see that our Dell XPS 400 is no longer listed because it has only 3 digits.

The '?' metacharacter is particularly useful when we need to allow for plural words. For example, we may want to list any product descriptions where the word Drive or Drives is used. To achieve this we simply follow the 's' with a '?', thereby making the trailing 's' optional for a match:

mysql> SELECT * FROM product WHERE prod_desc REGEXP 'Drives?';
+-----------+------------------------+--------------------+
| prod_code | prod_name              | prod_desc          |
+-----------+------------------------+--------------------+
|         3 | WildTech 250Gb 1700    | SATA Disk Drive    |
|        20 | MasterSlave Multi-pack | 5 SATA Disk Drives |
+-----------+------------------------+--------------------+
2 rows in set (0.00 sec)

Matching by Text Position

The final area of regular expressions to cover in this chapter involves matching based on the location of text in a string. For example, we may want to find a particular match that requires that a word appears at the beginning or end of a piece of text. Once again, this requires the use of some special metacharacters:

MetacharacterDescription
^Beginning of text
$End of text
[[:<:]]Start of word
[[:<:]]End of word

For example, to search for text that begins with a digit:

mysql> SELECT prod_desc FROM product WHERE prod_desc REGEXP '^digit:'; +--------------------+ | prod_desc | +--------------------+ | 4 Port USB Hub | | 5 SATA Disk Drives | +--------------------+ 2 rows in set (0.00 sec)