Irregular Language and Regular Expressions

The other day a friend tweeted about “a regular expression to tell if a number was prime.” I followed the link with my eyebrows raised, because I was fairly sure that was beyond the capabilities of a regular expression.

As I suspected, it is. The post she linked to described a lovely piece of programming, but not a regular expression. Being an inveterate blue-pencil wielder, I pointed this out, and she (not incorrectly) said I was being fussy.

She meant that I was insisting on the formal definition of “regular expression,” which is much stricter than the common informal usage. However, in my own defense, I’ll point out that the programming trick in question wasn’t (just) a regular expression, even by the more common definition.

By now, you probably realize two things. First, this post involves the kind of nitpicking that will aggravate anyone who doesn’t like being corrected when they use “they” as a singular pronoun. Second, we’ll be discussing regular expressions, which to the untrained eye look like line noise from the old dialup modem days. If your eyes cross when reading  s/^[!\]]+/\[/, you may want to move on.

What is a “regular expression”?

The post she linked to described a one-line Perl program to recognize prime numbers. Perl, probably my favorite procedural computer language, has a very powerful pattern-matching language. This pattern-matching language uses what most people call “regular expressions” to describe the pattern to be matched. They are essentially computer programs written in a very concise and difficult-to-read language, in which most commands consist of single characters.

But “regular expressions” originally had nothing to do with pattern matching. They come from a field of mathematics called “formal languages,” which is extremely important in computer science. While Perl’s pattern-matching language is based on the concepts of regular expressions, and shares some syntax with them, it does not use actual regular expressions. Larry Wall, the inventor of Perl, calls them “regexes” or “regexen” for this reason.

Formal languages

The precisely named field of formal languages is a way to reason mathematically about languages, at least, those that can be precisely specified. (In other words, programming languages, and not human languages; the implementations of almost every computer language in existence rely on formal language theory.) The field strictly formalizes concepts like “language,” “grammar” and “word.” A formal “grammar” is basically a set of symbols and a set of rules that produce “words” from those symbols. The “language” is the set of all “words” produced by the “grammar.”

In the case of English, the symbols would be words (not letters), the rules would be English grammar, and the productions would be sentences or parts of sentences. English is far too complicated to describe with a formal grammar, but you can take a stab at it with rules like “a sentence is a subject, a verb, and an object,” along with rules defining “subject,” “verb,” and “object.” If you remember the lost art of diagramming sentences, you have the idea.

You can Google like crazy to pull up Wikipedia and other pages on these topics, but the only link I’ll give you is to György E. Révész’s Introduction to Formal Languages, a nice cheap Dover book that’s a crystal-clear explanation of the subject. It’s full of scary notation, but if you review your basic set notation and operations, and read the first chapter, it’ll make sense.

“Regular languages” are a particular type of formal language, the simplest in Chomsky’s hierarchy of formal languages. (Yes, that Chomsky.) He organized languages according to the types of rules that can produce them, or the types of (abstract) machines that can recognize valid statements in the language.

A regular expression is a way to specify a particular regular language; it’s a recipe for a machine that can recognize the language. Regular languages, being very simple, can be recognized by the most basic of these abstract machines, known as a “finite state machine.” These machines have no memory, only knowledge of their current state. Therefore, they cannot count and cannot back up, and are limited in the kind of strings they can match. They can’t recognize whether a string has properly balanced parentheses, or an equal number of Xs and Ys.

To balance parentheses, you need a more complex machine that has a limited form of memory: a stack which can store things, but in which only the top item can be examined at any given time. That machine, called a “pushdown automaton,” can recognize what are called “context-free” languages, and cannot be described with a regular expression. It can tell you that it found a closing parenthesis for every opening parenthesis it found, but it can’t tell you how many there were. And it still can’t find strings with equal numbers of Xs and Ys, because, unlike parentheses, the Xs and Ys don’t have to come in any particular order. With only one stack to hold things, it can’t keep track.

More fundamentally, it can’t really count. It can find a specific number of Xs, just by matching them one at a time, and it can even find an arbitrary number of Xs, by matching them over and over. (A finite-state machine can do both of these things as well.) But neither machine can truly count. For instance, they can’t find strings with the same number of Xs at the beginning and end, unless Xs are not allowed anywhere else in the string.

Tasks like that require a “linear-bounded automaton,” which has true memory, but the memory is limited to the size of the input string. So it can count, and can recognize what are called content-sensitive languages, where the rules must always produce longer strings from shorter strings.

Remove that limitation on the memory, and you have the most complex type of machine, a Turing machine. (Alan Turing devised it in order to complete the upending of mathematics begun by Kurt Gödel, in an era where a “computer” was a person who did arithmetic with a calculating machine). A Turing Machine, or an “unbounded linear automaton,” can recognize any recognizable formal language. It has been proven that there is no more powerful machine; any algorithm that can be performed, can be performed by a Turing machine.

Recognizing primes

So, it should be fairly clear that there is no possible way to write a regular expression that will recognize prime numbers. In fact, even Perl’s regexes, as powerful they are, cannot recognize a prime number. The “regex to recognize prime numbers” is actually a combination of a sophisticated and clever regex, with a Perl syntactic trick. Here’s the command:

perl -wle 'print "Prime" if (1 x shift) !~ /^1?$|^(11+?)\1+$/'

Let’s break it down:

  • perl -wle
    Calls perl with options that will allow it to process the rest of the command properly.
  • print "Prime" if
    Peculiar Perl syntax that in most languages would be written if (condition) then print "Prime"
  • (1 x shift) !~ /^1?$|^(11+?)\1+$/
    The condition for the “if” statement. The !~ operator means “doesn’t match.” To its left is a string to test, and to its right is a regex. It returns true if the string fails to match the regex. This is where the magic happens.

The regex (/^1?$|^(11+?)\1+$/), recognizes nothing but 1s. So how can it tell if “9” is prime? Because half the magic is not a regex, but a Perl syntactic trick. (1 x shift) uses Perl’s x operator, a string multiplier. In Perl, print = x 80 will print a sequence of 80 equal signs. And “shift” is a way to get the first argument passed to the command. So when you type the above command, followed by “9“, Perl evaluates “1 x 9” and produces 111111111. It’s just a shorthand way of writing a loop that prints a single character each time around.

This trick turns the number into the only thing the regex can work with, a string of 1‘s. The other half of the magic is how this regex recognizes it’s got a prime number’s worth of 1s.

To do so, it uses two Perl pattern-matching features that are impossible for true regular expressions: counting and backtracking. Read Neil Kandalgaonkar’s blog for the details, but basically, the regex matches a group of two 1s, then tries to keep matching groups of two 1s until the end of the string. If it fails, it backtracks and tries again with groups of three 1s. And so on.

So in this case, it finds two 1‘s, then tries to find them again and again, which it does, three more times. But when it’s done, it has a 1 left over, so the string didn’t match. The pattern-matcher starts over. It matches three 1‘s. Then it finds three more, and three more again, and it’s done.

Basically, the regex simply tries every possible divisor, starting with two, until it reaches a divisor the size of the original number (which fails because it wants at least two of those groups). As Neil points out, this is horribly inefficient at multiple levels, but it does work.

However, it is by no means a “regular expression that finds prime numbers.” Not only is this clever regex not a true regular expression, it also requires a Perl trick entirely unrelated to regexes in order to do its job.

This entry was posted in Parsing and Other Data Tricks, Programming and tagged , , , . Bookmark the permalink.

Leave a Reply