Regular Expressions

A regular expression is a pattern that describes a set of strings. They are constructed analagously to arithmetic expressions, by using various operators and constructs to make large expressions from smaller ones.

There's actually quite a few different flavors of regular expressions, or regex for short. In practise, the main features and feel between the different implementations is pretty standard. So if you know one, you know almost all. But it does cause enough confusion that most people play regex by feel...

If we restrict ourselves to GNU regex, that is, regex used by the grep family, sed, gawk, vim and others, there are still two flavors: basic (which is mostly classic regex from before Linux) and extended, which for the most part is implemented by all relatively new software including GNU stuff and scripting languages like Perl and Tcl.

Fundamental Rule

Most letters, numbers or characters match themselves.

A Matches A.

MiSsIsSiPpI-5== Matches MiSsIsSiPpI-5==.

The Dot

The dot represents any single character.

Cat. Matches Cat followed by any single letter, character or number, like Cats, Cat1 or Cat`.

Anchors

The ^ (when not used within square braces) matches the beginning of a line.

^Hello Matches any line that begins with the word Hello.

The $ matches the end of the line.

Goodbye$ Matches any line that ends with the word Goodbye.

Repetition

The * is used everywhere. It makes the preceding item match zero or more times. Take note!

^AB*$ Matches an A on a line by itself. It also matches a line beginning with an A followed by any number of B's.

The + matches the preceding item one or more times.

^AB+$ Matches an A followed by one or more B's.

Curly Braces

The {m,n} notation means at least m and no more than n of of the immediately preceeding character or group.

x{1,5} A string of x's: at least one, and up to 5.

The {n,} regular expression means the preceding item is matched n times or more

x{5,} A string of 5 or more x's

Finally, {m} matches the preceding item exactly m times.

x{10} A string of exactly 10 x's

Square Braces

The [ ] grouping contains a list of matchable values.

[a5Y] Matches an a, 5 or Y

The - character is a shorthand whose value is all the values between and including the characters immediately before and after it.

[0-9]{3} A three digit number.

When a ^ is next to the left brace, then anything is matched except the contents of the braces. Ranges are allowed.

[^a-z24680] Matches a single character which isn't a lowercase letter or even number.

The Escape Character

The backslash is an escape character. It robs the following character of any special meaning that character may have.

^\... A line beginning with a period followed by any two characters.

Tagged Expressions

Any string which is matched between $ and $ is available as \1 in your regular expressions.

$.$\1 Matches any double letter, like aa, bb, etc.

^$.$.*\1$ Matches lines which end with the same letter they started with.

You can tag more than one string. The tags are refered to by \1, \2, \3, etc.

$.$$.$\2\1 Matches all 4 letter palindromes like deed, noon and peep.

Palindromes

You can construct regex that match palindromes. Try this:

grep -e '^$.$$.$\2\1$' /usr/share/dict/words

This will print out all 4 letter palindromes listed in /usr/share/dict/words. Can you see why ^$.$$.$\2\1$ represents a palindrome? We can find 6 letter palindromes too:

grep -e '^$.$$.$$.$\3\2\1$' /usr/share/dict/words

We need a new regex for each palindrome of a different length. Can we construct a regex that will match a palindrom of arbitrary letter length? The answer turns out to be no.

A DFA is much like a Turing machine. The machine has a one-way read-only tape, which contains the word to be recognized. It has a read head, a finite number of states, and a transition function. At any given step, the read head reads a letter, and the machine switches states according to the transition function (given current state, and the letter). If, after reading the word, the DFA ends up in one of the pre-determined "accept" states, then the DFA accepts the word. Otherwise, it rejects the word.

There is a duality between regex and Deterministic Finite Automata (DFA). Any language (i.e. collection of words) which can be recognized by one can be recognized by the other.

One can think of the states roughly as some sort of memory for the DFA.

Now suppose that a DFA has N states. Then after some thinking, you should be able to convince yourself that it would not be able to recognize all palindromes of length 2N+1, because the DFA does not have enough "memory" to handle all possible cases.