An Introduction to Regular Expressions

Regular expressions are one techie level up from your traditional tech tip, but they're well worth the time to learn, even if you only learn the basics. A regular expression is very much like a math formula, and you use one when you want to find (and replace) pieces of text by describing a pattern instead of knowing the exact text in advance.

Conceptual Examples

For example, let's say you have a file that contains a bunch of phone numbers. And let's say those phone numbers are all written out as "8005551212", but you want them to look like "(800) 555-1212". Using a text editor that supports a regular expression Find & Replace, you could easily reformat all of those phone numbers to include parentheses and a dash, without going row by row to manually change them all. Since you know that each phone number is a string of 10 continuous digits, you can tell your text editor to find all instances of 10 digits in a row, and to insert a '(' before the first digit, a ') ' after the third, and a '-' after the sixth.
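
As a rough illustration of the phone number idea, here is a minimal Python sketch. The function name is made up for this example, and it assumes plain 10-digit numbers with no existing punctuation. It relies on parentheses to save parts of the match (explained under Back References further down), and note that Python writes the saved pieces as \1, \2, \3 in the replacement, where many text editors use $1, $2, $3.

```python
import re

# Hypothetical helper name; assumes plain 10-digit numbers with no punctuation.
def format_phone_numbers(text):
    # \b keeps the match from starting or ending inside a longer run of digits.
    # Each pair of parentheses saves one part of the number for reuse.
    pattern = r"\b(\d{3})(\d{3})(\d{4})\b"
    # In Python's replacement syntax, \1, \2, \3 insert the saved parts.
    return re.sub(pattern, r"(\1) \2-\3", text)

print(format_phone_numbers("Call 8005551212 or 2125550100."))
# Call (800) 555-1212 or (212) 555-0100.
```

An 11-digit string is left alone, because the pattern only matches runs of exactly 10 digits.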

A second example is as follows. You have a database full of vfx shot names and shot durations. You also have a sequence full of vfx shots you need to turn over, and every one of them needs a title added to it stating the shot name and how many frames it is. You can export the information you need from your database, but only as a comma-separated values file (.csv). This would give you results such as:

VFX_001,25
VFX_002,38
VFX_003,119
VFX_004,350
VFX_005,8

To create all those titles, you know that you can use the Autotitler function in Avid Marquee, but the text format it requires is different from CSV, resembling something like this:

VFX_001
25

VFX_002
38

VFX_003
119

VFX_004
350

VFX_005
8

So to reformat your CSV file into the format that Marquee requires, you can tell your text editor to replace every comma in your file with a line break, and to turn every pre-existing line break into a double line break. You can do this with one regular expression that both replaces the comma and adds a second line break, but I sometimes like to break it up into separate steps to keep things simple.

To demonstrate how to do this find/replace, I'm going to double the line break before I replace the comma. This way I can be sure that I don't add more line breaks than I need. So the first find and replace would look like this:

Find: \n
Replace: \n\n

In regular expressions, "\n" is the notation you can use for line breaks (sometimes "\r" is used either instead of or in conjunction with "\n", typically in files that come from Windows or very old Macs, but you can google the difference on your own). So what this find/replace does is search for a line break and replace it with two. Then, you can probably guess what to do with the commas:

Find: ,
Replace: \n

This will give you the format you need for the Avid Autotitler.
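
If you'd rather script it than use a text editor, the same two steps can be sketched in Python. This is only an illustration: the function name is hypothetical, and it assumes the file uses plain "\n" line breaks.

```python
import re

# Hypothetical helper name; assumes the CSV text uses "\n" line breaks.
def csv_to_autotitler(text):
    # Step 1: double every existing line break.
    doubled = re.sub(r"\n", "\n\n", text)
    # Step 2: turn each comma into a single line break.
    return re.sub(r",", "\n", doubled)

csv_text = "VFX_001,25\nVFX_002,38\nVFX_003,119"
print(csv_to_autotitler(csv_text))
```

Doing the steps in the opposite order would also double the line breaks that replaced the commas, which is why the line breaks are doubled first.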

Lastly, you can also use Regular Expressions in many file renaming utilities (A Better Finder Rename is one I use), so if you need to rename a bunch of files in order to conform to a certain pattern, regular expressions can help. One instance where you might use this would be to conform a bunch of irregularly named files in order to put them in sequence for import into an Avid bin.

Regular Expression "Variables"

What the example above is intended to demonstrate is the concept of searching for a pattern of text, rather than knowing what text you're searching for in advance. And in order to search for patterns, you must be able to use placeholders to represent certain characters or groups of characters.

This Regular Expression Reference lists the different placeholders you can use when searching text. The ones you'll use most often are:

  • \d : Finds any single numerical character (i.e. 0-9)
  • \w: Finds any single word character, meaning a letter, a digit, or an underscore, but not a space. To find a whole word, combine it with a quantifier from the next list, as in \w+
  • \s: Finds any single whitespace character, including a space, tab, or line break
  • \t: Finds any tab character
  • [ and ] : If you wish to limit the characters you're searching for, put those characters inside of [ and ]. So for example, [A-Za-z5-8] would find any single character from A-Z regardless of uppercase or lowercase, as well as any digit from 5 to 8
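
To see what each of these placeholders picks out, here is a small Python sketch run against a made-up sample string (the string and variable names are assumptions for illustration only).

```python
import re

# Made-up sample text containing letters, digits, an underscore, and a tab.
sample = "Shot VFX_007\tlasts 58 frames"

digits = re.findall(r"\d", sample)    # every single digit
words = re.findall(r"\w+", sample)    # runs of letters, digits, and underscores
spaces = re.findall(r"\s", sample)    # each space and the tab
tabs = re.findall(r"\t", sample)      # only the tab
limited = re.findall(r"[A-Za-z5-8]", "a1B6_9")  # letters plus the digits 5-8

print(digits)   # ['0', '0', '7', '5', '8']
print(words)    # ['Shot', 'VFX_007', 'lasts', '58', 'frames']
print(limited)  # ['a', 'B', '6']
```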

You will often need to specify how many characters you're searching for, in which case you'll need these basic quantifiers, each of which applies to whatever comes immediately before it:

  • ? : A question mark after a character or character class denotes that you are looking for 0 or 1 instance of that character
  • * : An asterisk denotes you are looking for 0 or more of that character
  • + : A plus sign denotes you are looking for 1 or more of that character
  • { and } : These braces allow you to say exactly how many characters you want to match. For example, "\d{2}" tells the program you're searching for a string of exactly two digits. "\d{2,8}" tells the program you're searching for between 2 and 8 digits, and "\d{2,}" specifies that you're searching for at least 2 digits.
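
The following Python sketch shows each of these counting characters in action. The sample strings are invented for illustration.

```python
import re

# ? : the "u" may appear 0 or 1 times.
optional = re.findall(r"colou?r", "color colour colouur")
# * : the "b" may appear 0 or more times.
star = re.findall(r"ab*", "a ab abbb")
# + : the "b" must appear at least once.
plus = re.findall(r"ab+", "a ab abbb")

numbers = "7 42 123456789"
exactly_two = re.findall(r"\d{2}", numbers)     # pairs of digits
two_to_eight = re.findall(r"\d{2,8}", numbers)  # runs of 2 to 8 digits
two_or_more = re.findall(r"\d{2,}", numbers)    # runs of 2 or more digits

print(optional)      # ['color', 'colour']
print(star)          # ['a', 'ab', 'abbb']
print(plus)          # ['ab', 'abbb']
print(exactly_two)   # ['42', '12', '34', '56', '78']
print(two_to_eight)  # ['42', '12345678']
print(two_or_more)   # ['42', '123456789']
```

Notice that \d{2} happily matches pairs inside a longer run of digits, which is worth keeping in mind when your file contains numbers of different lengths.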

And lastly, you've seen the backslash (\) used a lot here, and that's worth explaining. In regular expressions, the backslash functions as what's called an escape character. The rules of regular expressions are a bit complex, and many characters you may want to search for have special meanings, like the fact that an asterisk (*) tells the program to match 0 or more of the preceding character. If you want to search for a literal asterisk, you need to escape it by putting a backslash before it, like so: \* . The backslash works in both directions: it either tells the program to ignore the special meaning that a particular character has (as in \*), or it gives an ordinary letter a special meaning so that you can match a character that is not easily typed (like \t, which represents a tab character).
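
Here is a short Python sketch of the difference escaping makes, using an invented sample string. The last two lines use re.escape, a standard Python function that adds the backslashes for you when the text you're searching for should be taken literally.

```python
import re

# Invented sample text for illustration.
text = "1.5 1x5 2*3"

any_char = re.findall(r"1.5", text)       # unescaped dot matches any character
literal_dot = re.findall(r"1\.5", text)   # escaped dot matches only "."
literal_star = re.findall(r"2\*3", text)  # escaped asterisk matches only "*"

print(any_char)      # ['1.5', '1x5']
print(literal_dot)   # ['1.5']
print(literal_star)  # ['2*3']

# re.escape builds the escaped pattern from literal text.
escaped = re.escape("2*3")
print(re.findall(escaped, text))  # ['2*3']
```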

Back References

The last concept I want to explain can be tricky to get your head around while you're still digesting everything else, but it's a very useful thing to know, and is called a back reference. Let's take the timecode example below. In this situation, you have a bunch of timecodes without colons (:) separating the hours, minutes, seconds, and frames (e.g. 01020304). You want to insert the colons, but you need a way to tell the program not to throw out the digits that make up the timecodes when replacing the timecode text. So to do that, you have to save those digits during the Find part of the process for use during the Replace part. You do this by enclosing the part of the pattern you want to save in parentheses, like so: (\d{2})

Then, in your Replace expression, you can tell the program to insert the text it's saved by including $1, $2, $3, and $4 (some programs use \1, \2, \3, and \4 instead). The first set of parentheses in your Find expression is referenced by $1, the second by $2, and so on. And when replacing the timecodes, if I put a set of parentheses around every 2 digits, that will allow me to then insert colons between those pairs of digits, thus giving me properly formatted timecode in the form of 01:02:03:04.
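
To make the saving step concrete, this Python sketch shows what each set of parentheses holds for a single timecode. The variable names are just for illustration, and Python's replacement syntax uses \1 through \4 rather than $1 through $4.

```python
import re

# Four sets of parentheses, each saving one pair of digits.
pattern = r"(\d{2})(\d{2})(\d{2})(\d{2})"

match = re.fullmatch(pattern, "01020304")
hours, minutes, seconds, frames = match.groups()
print(hours, minutes, seconds, frames)  # 01 02 03 04

# Reinsert the saved pairs with colons between them.
formatted = re.sub(pattern, r"\1:\2:\3:\4", "01020304")
print(formatted)  # 01:02:03:04
```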

Examples

The easiest way, I think, to grasp how you use all these placeholders is to show you some examples, some of which come from this Regular Expressions site.

Email Address:

This will match most email addresses,

\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}\b

and is broken down like this:

  1. \b matches a word boundary, meaning the position between a word character and a non-word character (most likely a space) or the start of a line. It matches a position rather than an actual character.
  2. [A-Za-z0-9._%+-]+ matches any alphanumeric character, regardless of case, as well as the punctuation also enclosed within the brackets. The + sign at the end states that you are looking for 1 or more characters that match this pattern, since most email addresses are more than one character long.
  3. @ simply matches the @ sign in an email address
  4. [A-Za-z0-9.-]+ will match the server name in your email address (i.e. it will match the "gmail" in "me@gmail.com")
  5. \. will match the dot between your server name and your top-level domain (i.e. it will match the "." in "me@gmail.com")
  6. [A-Za-z]{2,4} will match the com, org, net, info, or whatever you happen to have, by matching 2-4 alphabetical characters. A top-level domain longer than 4 letters will not match, which is one reason this pattern matches most, rather than all, email addresses.
  7. \b again matches a word boundary, presumably just before a space, punctuation mark, or line break
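
As a quick check of the breakdown above, this Python sketch runs the same pattern against some invented text (the addresses are placeholders, not real ones).

```python
import re

# The email pattern from this article, unchanged.
email_pattern = r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}\b"

# Placeholder addresses for illustration.
text = "Write to me@gmail.com or editor@example.org today."
found = re.findall(email_pattern, text)
print(found)  # ['me@gmail.com', 'editor@example.org']

# A top-level domain longer than 4 letters is not matched by {2,4}.
missed = re.findall(email_pattern, "a@b.museum")
print(missed)  # []
```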

Timecode:

This will match timecode, which I've used in the past to reformat a subtitle file from an Excel-exported CSV into a DVD Studio Pro formatted .stl file. Below is my source file, which as you see is missing the ":" in all of the timecodes. The .stl file requirements ask that there be a space on either side of the commas separating the TC and subtitle text, which I did in Excel by concatenating several cells into one column with the appropriate comma spacing.

Source CSV File:
01052021 , 01052328 , …but now I wonder if it is just the fear talking.
01052419 , 01052800 , I'd like to say I'm the son of a famous person.
01052812 , 01053106 , Or at least someone who is politically affiliated.
01053128 , 01053406 , But that is not the truth.
01053427 , 01053619 , So the only reason I can  think of…
01053800 , 01053922 , …is money.
01055425 , 01055914 , -Be reasonable. | -Don't you understand we make the rules here?
01060003 , 01060209 , We will give you exactly what you want.
01060210 , 01060517 , -Make sure of it. | -It has been 19 days!
01060518 , 01060625 , That doesn't matter anymore.
01060626 , 01060815 , It does matter.
01061309 , 01061603 , I already gave you till noon.
01062605 , 01062800 , Don't do this.
01063622 , 01063729 , Wait.

To find and replace the timecodes, I would use these patterns:

Find: (\d{2})(\d{2})(\d{2})(\d{2})
Replace: $1:$2:$3:$4

Breakdown:

  1. (\d{2}) searches for strings of 2 digits, and the fact that the \d{2} is within parentheses means that the program will save the two digits it finds so that I can reinsert them when replacing the text. Since I know my timecode is 8 digits long, I put four of these statements in a row so that I can keep the hours, minutes, seconds, and frames separate.
  2. $1:$2:$3:$4 replaces every 8-digit timecode string the program finds with the first two digits it saved from the parentheses, followed by a colon, followed by the second pair of digits, then a colon, etc. This is a back reference, as mentioned above.
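
For anyone who prefers a script to a text editor, the same find and replace can be sketched in Python on two lines from the source file above. The function name is hypothetical, and Python writes the back references as \1 through \4 instead of $1 through $4.

```python
import re

# Hypothetical helper name; applies the article's Find/Replace to a block of text.
def add_timecode_colons(text):
    # Every run of 8 digits is split into four saved pairs.
    find = r"(\d{2})(\d{2})(\d{2})(\d{2})"
    # The saved pairs are reinserted with colons between them.
    return re.sub(find, r"\1:\2:\3:\4", text)

source = (
    "01052021 , 01052328 , …but now I wonder if it is just the fear talking.\n"
    "01060210 , 01060517 , -Make sure of it. | -It has been 19 days!"
)
print(add_timecode_colons(source))
```

The "19" in the second subtitle is left untouched because it is only 2 digits long. If your subtitle text could contain a number of 8 or more digits, the pattern would need to be tightened so it only touches the timecode columns.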

Conclusion

I'll add to this article as new examples and uses arise, but hopefully this and Google will get you started on figuring out all the different ways you can use Regular Expressions. If you're confused about when to use them, just stop yourself when you find that you're in a position of having to make a bunch of tedious edits to a text file. It may be that you can save yourself a lot of time and typing by using a Regular Expression Find/Replace.

Additional References:

PHP: Regular Expression Details