MySQL Regular Expression Searches

From Techotopia
Revision as of 18:26, 16 October 2007 by Neil (Talk | contribs) (Matching from a Range of Characters)

Jump to: navigation, search

In previous chapters we have looked at some basic data retrieval filtering techniques using comparison operators and wildcards. Whilst this approaches to data filtering are fine for basic searching, more advanced searching quickly becomes cumbersome using these techniques.

In this chapter we will look at yet another filtering and searching technique supported by MySQL called regular expressions.


Contents


What are Regular Expressions?

The term regular expressions is one of those technical terms where the words do very little to explain what the technology does. regular expressions are a feature common to many programming languages and are a topic on which entire can, and indeed, have been written. Fortunately (or unfortunately depending on your perspective), MySQL does not provide as extensive support for regualr expressions as some other programming languages. This is good in that it makes the subject area easier to learn, but may be frustrating if you are already proficient with the rich regular expression syntax available in other languages.

Regular expressions are essentially a set of instructions using a predefined syntax for matching text in a variety of flexible ways. For example, you might want to extract all the occurrences of a particular word sequence from a block of text. You might also want to perform a search for a particular piece of text and replace it with some alternate text.

Regular Expression Character Matching

In order to introduce the REGEXP operator, we will begin by looking at a use of regular expressions that could similarly be used with the LIKE operator. As in the previous chapter we need to retrieve rows from a table taking into consideration the difference in spelling of the color gray (grey). To perform this retrieval we will use the regex dot character matching (.). Rather like the LIKE underscore wildcard, this character indicates that any character in this position in the text will be considered a match. For example:

mysql> SELECT * FROM product WHERE prod_name REGEXP 'Gr.y Computer Case';
+-----------+--------------------+-------------------+
| prod_code | prod_name          | prod_desc         |
+-----------+--------------------+-------------------+
|        11 | Grey Computer Case | ATX PC CASE       |
|        12 | Gray Computer Case | ATX PC CASE (USA) |
+-----------+--------------------+-------------------+
2 rows in set (0.05 sec)

So far we haven't done anything that could not have been achieved using wildcards. Regular expressions, however, do not stop here.


Matching from a Group of Characters

One problem with the approach outlined above is that any letter between the 'Gr' and the 'y' would have registered as a match. In reality we are only interested in words that contain either an 'a' or an 'e' in that location. Fortunately, regular expressions allow us to specify a group of acceptable character matches for any character position. The syntax for this requires that the characters be places in square brackets at the desired location in the match text. For example:

mysql> SELECT * FROM product WHERE prod_name REGEXP 'Gr[ae]y Computer Case';
+-----------+--------------------+-------------------+
| prod_code | prod_name          | prod_desc         |
+-----------+--------------------+-------------------+
|        11 | Grey Computer Case | ATX PC CASE       |
|        12 | Gray Computer Case | ATX PC CASE (USA) |
+-----------+--------------------+-------------------+
2 rows in set (0.00 sec)

Use of this syntax ensures that only the words 'Grey' and 'Gray will match the search criteria. There is no limit to the number of characters that can be grouped in the brackets when using this filtering technique.

Matching from a Range of Characters

The character group matching syntax can be extended to cover range of characters. For example, instead of declaring a regular expression to cover the letters between A and F as [ABCDEF] we could simply specify a range of characters using the '-' character between the upper and lower ranges [A-F]. We could, for example, list a product based on certian model numbers which begin with numbers ranging from 1 to 6:

mysql> SELECT * FROM product WHERE prod_name REGEXP 'CD-RW Model [1-6]543';
+-----------+------------------+-----------+
| prod_code | prod_name        | prod_desc |
+-----------+------------------+-----------+
|         1 | CD-RW Model 4543 | CD Writer |
|        14 | CD-RW Model 5543 | CD Writer |
|        15 | CD-RW Model 6543 | CD Writer |
|        16 | CD-RW Model 2543 | CD Writer |
+-----------+------------------+-----------+
4 rows in set (0.00 sec)

Handling Special Characters