(a) Regex CRASH Course! (Pt. 1)

Chris Corbyn
Breakbeat Nuttzer
Posts: 13098
Joined: Wed Mar 24, 2004 7:57 am
Location: Melbourne, Australia

Post by Chris Corbyn »

Firstly, can I just say that when researching regex (or Regular Expressions) you will notice a lot of references to Perl. Perl was one of the first languages to make heavy use of regex, after grep (the Unix tool), so you'll find the most complete documentation in the Perl docs. Secondly, regular-expressions.info is a great resource for beginners.

Hold on tight, this is going to be a fast-paced but to-the-point tutorial ;-)

Ready? Ok let's do it....

Code: Select all

 
/[\w\s]+\d{1,3}\t\W/
 
This is what makes developers cry. Look at that mess! What's all this \d, \s \w etc etc???

Let's start with metacharacters. Those \d, \s etc. are what we refer to as "metacharacters". Metacharacters are characters which represent a particular group of real characters (with some exceptions - see below).

The metacharacters and what they stand for:

Code: Select all

 
Character         Matching
 
. (dot)           ANY single character at all, except a newline by default
\w                Any single alphanumeric character (a-z, A-Z, 0-9) or an underscore
\d                Any single digit (0-9)
\s                Any single whitespace character
 
<<Uppercase negates the metacharacter>>
 
\W                Any single character that is NOT alphanumeric or an underscore
\S                Any single non-whitespace character
\D                Any single non-digit
 
<<Something else to note>>
[x-y]             Any single character in the range x to y (e.g. [A-Z])
[abc123]          Any single character from a, b, c, 1, 2 or 3
[a-z125-9]        Any single character from a to z, 1, 2 or 5 to 9
 
<<Negate these with caret "^" at the VERY start of the bracket>>
[^abc0-9]         Any single character EXCEPT a, b, c and 0 to 9
 
Regex are case sensitive unless you specify otherwise. See further down for more info.
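
If you want to poke at the character classes above before going any further, something like this rough sketch should do. The sample strings are made up purely for illustration, and it uses preg_match(), which is explained further down; for now all you need to know is that it gives back 1 for a match and 0 for no match.

```php
<?php
// Made-up sample inputs, chosen only to illustrate each class.
$tests = array(
    '/\d/'        => 'abc5',   // contains a digit
    '/\D/'        => '12345',  // nothing but digits, so no non-digit to find
    '/[A-Z]/'     => 'hello',  // case sensitive: no uppercase letter here
    '/[^abc0-9]/' => 'abc123', // every character is in the excluded set
    '/\s/'        => 'a b',    // contains a space
);

$results = array();
foreach ($tests as $pattern => $subject) {
    // preg_match() returns 1 on a match and 0 otherwise
    $results[$pattern] = preg_match($pattern, $subject);
    $verdict = $results[$pattern] ? 'match' : 'no match';
    echo $pattern . ' against "' . $subject . '": ' . $verdict . "\n";
}
```

Only the first and last patterns should report a match; the middle three fail for the reasons given in the comments.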

You'll see some other metacharacters which don't actually match anything that's really there. They match invisible boundaries so we call them "zero-width assertions".

Code: Select all

 
Assertion        Matching
 
^ (caret)        The start of the string
$                The end of the string
\b               A word boundary (the point between a \w character and a \W character, or between a \w character and the start or end of the string)
 
There are others but you don't ever use them really.... read the Perl documentation if you want to know more.
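
A quick sketch of the three assertions in action. The strings are just examples I've made up, and preg_match() (covered properly later) returns 1 for a match and 0 otherwise.

```php
<?php
// ^ only matches at the very start of the string
$starts_yes = preg_match('/^cat/', 'cat flap');  // "cat" is at the start
$starts_no  = preg_match('/^cat/', 'the cat');   // "cat" is not at the start

// $ only matches at the very end of the string
$ends_yes = preg_match('/cat$/', 'the cat');     // "cat" is at the end

// \b insists on a word boundary either side of "cat"
$word_yes = preg_match('/\bcat\b/', 'the cat sat');  // "cat" is a whole word
$word_no  = preg_match('/\bcat\b/', 'concatenate');  // "cat" is buried in a word

echo "$starts_yes $starts_no $ends_yes $word_yes $word_no\n";
```

Notice that none of ^, $ or \b uses up a character of the string; they only pin down where the rest of the pattern is allowed to match.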

Next, we can specify how many times a character should occur. We could do this to match a string of four digits:

Code: Select all

/\d\d\d\d/
Or we could write this:

Code: Select all

/\d{4}/
Let's cover the "quantifiers". The quantifier follows the character (or group) it applies to.

Code: Select all

 
Quantifier         Meaning
 
+                  One or more times
*                  Zero or more times
?                  Zero or one time
{n,m}              Between n and m times (inclusive)
{n,}               n or more times
{n}                Exactly n times
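
Here's a rough sketch that runs each quantifier against a couple of made-up strings. The patterns are wrapped in ^ and $ so the whole string has to fit, and the third column is simply what I'd expect preg_match() to return (1 for a match, 0 for none).

```php
<?php
// Each row: pattern, subject, and the result I'd expect.
// The subjects are invented samples, nothing more.
$cases = array(
    array('/^ab+c$/',    'abbbc',  1), // + needs at least one "b"
    array('/^ab+c$/',    'ac',     0), // ...so zero "b"s fails
    array('/^ab*c$/',    'ac',     1), // * is happy with zero "b"s
    array('/^colou?r$/', 'color',  1), // ? makes the "u" optional
    array('/^colou?r$/', 'colour', 1),
    array('/^\d{1,3}$/', '1234',   0), // four digits is one too many
    array('/^\d{2,}$/',  '1234',   1), // two or more digits
    array('/^\d{4}$/',   '1234',   1), // exactly four digits
);

$all_as_expected = true;
foreach ($cases as $case) {
    list($pattern, $subject, $expected) = $case;
    $got = preg_match($pattern, $subject);
    if ($got !== $expected) {
        $all_as_expected = false;
    }
    echo $pattern . ' on "' . $subject . '" => ' . $got . "\n";
}
```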
 
One last thing before we build our first regex. A regex needs to be delimited if you're using Perl-style regular expressions (the preg_...() functions), which I strongly advise you do (Note: ereg_...() is not Perl style).

To delimit a regex we start and end it with the EXACT same character. You can use most non-alphanumeric characters, but the two most common are:

Code: Select all

 
/pattern/
#pattern#
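
Why bother with a second delimiter? Mostly to save on escaping. In this small sketch (example.com is only a placeholder address) both patterns do the same job, but the "#" version doesn't need a backslash in front of every slash.

```php
<?php
$url = "http://example.com/path"; // placeholder URL for illustration

// With "/" as the delimiter, every literal slash has to be escaped
$with_slashes = preg_match('/^http:\/\/example\.com\//', $url);

// With "#" as the delimiter, the slashes can be written as they are
$with_hashes = preg_match('#^http://example\.com/#', $url);

echo $with_slashes . " " . $with_hashes . "\n";
```

The rule of thumb I'd suggest: pick a delimiter that doesn't appear in the pattern itself.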
 
Let's look at a regular expression before we move on further. We'll use preg_match() to execute the regex here (I'll explain after).

Code: Select all

 
$string = "Hello, I'm d11wtq and I'm 22 years old!";
if (preg_match("/\w+\W I'm \w\d{2}wtq and I'm \d+ years old\W/", $string)) {
    echo "d11wtq is 22";
} else {
    echo "d11wtq didn't tell me his age";
}
 
I'll explain what it does.
"\w+" matches an alphanumeric or underscore character one or more times
Hello
"\W" matches any single non-alphanumeric character
Hello,
" I'm " is just plain old string
Hello, I'm
"\w" is any single alphanumeric character
Hello, I'm d
"\d{2}" is two digits
Hello, I'm d11
"wtq and I'm " is just plain old string again
Hello, I'm d11wtq and I'm
"\d+" is one or more digits
Hello, I'm d11wtq and I'm 22
" years old" is plain old string
Hello, I'm d11wtq and I'm 22 years old
"\W" is any single non-alphanumeric character
Hello, I'm d11wtq and I'm 22 years old!

If you understand that then let's move onto some "modifiers". If not, then read it again, and if you still don't get it, read it again.....

Note: When starting out in regex don't try and jump in with both feet. Match a tiny part of the string, then test it. Then add some more to your regex to match more of the string and test again. Repeat until the regex works.
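
As a sketch of that "little by little" approach, here is the regex from the example above built up in four steps. The third argument to preg_match() (explained later on) just lets us see what each step actually matched.

```php
<?php
$string = "Hello, I'm d11wtq and I'm 22 years old!";

// Each step extends the previous pattern by a small piece
$steps = array(
    "/\w+/",
    "/\w+\W I'm/",
    "/\w+\W I'm \w\d{2}wtq/",
    "/\w+\W I'm \w\d{2}wtq and I'm \d+ years old\W/",
);

$matched = array();
foreach ($steps as $i => $pattern) {
    if (preg_match($pattern, $string, $m)) {
        $matched[$i] = $m[0]; // index 0 holds the text the whole pattern matched
        echo "Step " . ($i + 1) . " matched: " . $m[0] . "\n";
    } else {
        $matched[$i] = null;
        echo "Step " . ($i + 1) . " failed - fix this before adding more\n";
    }
}
```

If a step fails, you know the problem is in the piece you added last, which is a lot easier than staring at the full pattern.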

Regex modifiers:

Code: Select all

/^pattern$/mis
"mis" here are all modifiers. They tell the regex how to behave.

Code: Select all

 
Modifier         Effect
 
i                Case insensitive
s                Dot-all mode (the dot "." also matches newlines). To ignore whitespace inside the pattern itself, use "x" instead
g                Global search (not valid in PHP [use preg_match_all()] but handy if you're using JS or Perl). Tells the regex to keep looking after it's matched once
m                Multi-line mode (^ and $ now match start and end of LINE not start and end of STRING)
 
Again, there are others but you don't really use them.

Modifiers go on the right hand side of the closing delimiter.

Quick example:

Code: Select all

 
$string = "Hello World!";
if (preg_match('/^[a-z]/i', $string)) {
    echo "Starts with a letter";
} else {
    echo "Doesn't start with a letter";
}
 
"^" means match the very start of the string (not a character itself)
"[a-z]" means match a lowercase a to z
Nothing matched - BUT
The "i" modifier makes the regex case insensitive - SO
H is all that is matched but this means it returns true anyway.
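
To see what the modifiers change, here's a small before-and-after sketch on made-up strings. One assumption to flag: in the PCRE functions, "s" makes the dot match newlines and "m" makes ^ and $ work per line.

```php
<?php
$text = "first line\nsecond line"; // two lines, separated by a newline

// "i": case insensitive
$no_i   = preg_match('/^hello/', "Hello World!");   // fails: "H" is uppercase
$with_i = preg_match('/^hello/i', "Hello World!");  // matches

// "m": ^ matches at the start of every line, not just the string
$no_m   = preg_match('/^second/', $text);   // fails: "second" isn't at the string start
$with_m = preg_match('/^second/m', $text);  // matches at the start of line two

// "s": the dot is allowed to match a newline
$no_s   = preg_match('/first.+second/', $text);   // fails: the dot stops at "\n"
$with_s = preg_match('/first.+second/s', $text);  // matches across the newline

echo "$no_i$with_i $no_m$with_m $no_s$with_s\n";
```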

There are some things you should remember when working with regular expressions.
1. Escape special characters with a backslash when you want to match them literally
2. Remember to use quantifiers to match multiple times
3. Remember to match a dot "." you need to escape it "\." because dot "." is a metacharacter itself
4. Regex are case sensitive by default
5. "*" and "+" are what we call "greedy" (Read the follow up to this tutorial to learn more)
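
Point 3 catches a lot of people out, so here's a quick sketch with a made-up filename. It also shows preg_quote(), a built-in PHP function that adds the backslashes for you when the text you want to match comes from a variable.

```php
<?php
// An unescaped dot matches ANY character, so "reportXtxt" slips through
$unescaped = preg_match('/^report.txt$/', 'reportXtxt');

// Escaped, the dot only matches a real dot
$escaped = preg_match('/^report\.txt$/', 'reportXtxt');

// preg_quote() escapes regex special characters in plain text;
// the second argument is the delimiter we intend to use
$quoted = preg_quote('report.txt', '/');
$safe   = preg_match('/^' . $quoted . '$/', 'report.txt');

echo $unescaped . " " . $escaped . " " . $quoted . " " . $safe . "\n";
```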


Next... Parentheses have more than one use in regex. They:
a) Group characters together
b) Extract the characters they surround into memory (to match a parenthesis itself you must escape it "\(" )

Something useful:

Code: Select all

 
//Check string represents a URL
$string = "http://www.foo.bar/";
if (preg_match("#^\w+://(www\.)?\w+\.\w+#i", $string)) {
    echo "String is a URL";
} else {
    echo "String isn't a URL";
}
 
This matches the "http://www.foo.bar" part of the URL above so it returns true. I'll let you break it down yourself and see how it works (remember the parentheses "(....)" group the characters together ).

A vertical bar character "|" is used to mean OR.

Code: Select all

 
$string = "abcdefg123456";
//abcdefgh23456   OR   abcdefg123456
if (preg_match("/abcdefg(h|1)23456/", $string)) {
    //True
} else {
    //False
}
 
Ok we've nearly covered all the "basics" now. One last thing to cover in the scope of the crash course is extracting parts of the string into memory (then I'll finish up by briefly overviewing the PHP functions).

Sometimes you'll need to match part(s) of a string and extract them to use elsewhere. You do this using parentheses. Indexing starts at 1 and goes up by one for each opening parenthesis, counting from left to right. With nested parentheses the order follows this pattern:

Code: Select all

 
( 1 ( 2 ) ( 3 ( 4 ) ) ( 5 ( 6 ( 7 ) ( 8 ) ) ) ) ( 9 )
 
Essentially, you go deeper into the nest before moving further to the right.

The best way to refer to an extracted part of a string is by the dollar sign "$" followed by the index of the part you extracted. (e.g. "$4" ).
However, that said, PHP handles things slightly differently with the preg_match() function. Indexing starts at zero (the entire matched text) and then from 1 as expected for the extracted parts. preg_match() also requires a third parameter to do this so that it can dump "$1", "$2", "$3" etc. into an array.

Code: Select all

 
$string = "There's a number in here 123456 somewhere but I don't know what it is!";
preg_match("/[a-z\s]+(\d+)[a-z\s]+/i", $string, $matches); //s a number in here 123456 somewhere but I don
echo "The number in the string is " . $matches[1]; //The number in the string is 123456
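
To tie this back to the nesting order described earlier, here's a sketch with nested parentheses on a made-up date string. The groups are numbered by their opening parenthesis, left to right.

```php
<?php
$string = "2005-11-28"; // sample date, just for illustration

// Numbering by opening parenthesis:
// ( 1 ( 2 ) - ( 3 ( 4 ) - ( 5 ) ) )
preg_match('/((\d{4})-((\d{2})-(\d{2})))/', $string, $matches);

echo $matches[0] . "\n"; // the entire matched text
echo $matches[1] . "\n"; // group 1: the whole date
echo $matches[2] . "\n"; // group 2: the year
echo $matches[3] . "\n"; // group 3: month and day together
echo $matches[4] . "\n"; // group 4: the month
echo $matches[5] . "\n"; // group 5: the day
```

Group 3 opens before groups 4 and 5, so it gets the lower number even though it closes last.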
 
PHP functions overview:

preg_match() - I guess I have that one covered. Tests if the pattern is matched in the string. Returns 1 if matched, 0 if not (and FALSE if an error occurred). If the optional third parameter is given, the function extracts the parentheses-enclosed parts of the pattern into the given array.

preg_match_all() - Same as preg_match() except that the regex doesn't stop when a match is found... it continues to find as many matches as exist in the string, and returns how many it found. The extracted array is a multi-dimensional array where all the full matches are placed in $array[0], all occurrences of $1 in $array[1], all occurrences of $2 in $array[2] etc...
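
A quick sketch of how that array comes out, using a made-up string of name=number pairs and the default ordering:

```php
<?php
$string = "apple=3, pear=12, fig=7"; // invented sample data

// Group 1 captures the name, group 2 captures the number
$count = preg_match_all('/(\w+)=(\d+)/', $string, $array);

echo "Found " . $count . " matches\n";
print_r($array[0]); // every full match:   apple=3, pear=12, fig=7
print_r($array[1]); // every $1 (names):   apple, pear, fig
print_r($array[2]); // every $2 (numbers): 3, 12, 7
```

Note that the numbers come back as strings, so cast them if you need to do arithmetic.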

preg_replace() - Like str_replace() except it takes regex patterns as arguments:

Code: Select all

 
$string = "This is foo and that is bar";
$new_string = preg_replace('/f(\w+)/', 'g$1', $string); //This is goo and that is bar (single quotes so PHP leaves $1 alone)
 
preg_split() - Like explode() except it takes a regex pattern as the point at which to split the string:

Code: Select all

 
$string = "lots of *@><& symbols &^% in this £! string";
$parts = preg_split('/[^\s\w]+/', $string);
print_r($parts);
/*
 
  Array (
      [0] => lots of 
      [1] =>  symbols 
      [2] =>  in this 
      [3] =>  string
  )
 
*/
 
ereg() - Like preg_match() without the advantages of Perl-style patterns and slightly slower (use preg_match() instead; the ereg functions were removed in PHP 7.0).

ereg_replace() - Like preg_replace() without the advantages of Perl-style patterns and slightly slower (use preg_replace() instead).

I guess that covers all the basics of using regex, but believe me there's a lot more than this to learn once you have this under your belt.

[I'll follow this crash course up with an advanced regex tutorial given some time to write it]

Good luck and happy regex'ing! :D
Last edited by Chris Corbyn on Mon Nov 28, 2005 7:02 am, edited 7 times in total.
Burrito
Spockulator
Posts: 4714
Joined: Wed Feb 04, 2004 8:15 pm
Location: Eden, Utah

Post by Burrito »

awesome! thanks d11, this will help out a lot of peeps...myself included.

Burr
John Cartwright
Site Admin
Posts: 11470
Joined: Tue Dec 23, 2003 2:10 am
Location: Toronto
Contact:

Post by John Cartwright »

Ditto. Thanks.
Skara
Forum Regular
Posts: 703
Joined: Sat Mar 12, 2005 7:13 pm
Location: US

Post by Skara »

excellent work. The one thing I'd add is the bit about using /(?=meh)/. It's not used too much, but still nice to know.
onion2k
Jedi Mod
Posts: 5263
Joined: Tue Dec 21, 2004 5:03 pm
Location: usrlab.com

Post by onion2k »

Awesome stuff.
Chris Corbyn
Breakbeat Nuttzer
Posts: 13098
Joined: Wed Mar 24, 2004 7:57 am
Location: Melbourne, Australia

Post by Chris Corbyn »

Skara wrote:excellent work. The one thing I'd add is the bit about using /(?=meh)/. It's not used too much, but still nice to know.
Advanced tutorial will follow (time permitting). This was intended merely to allow people to get a hold of the basics ;-)
thedamo
Forum Newbie
Posts: 9
Joined: Fri Jul 15, 2005 7:23 am
Location: Sydney Australia

Great tutorial

Post by thedamo »

That was the best regular expression tutorial I have seen. I wish I had seen this a long time ago.

Have you posted that advanced tutorial yet? If so, where can I find it?

Keep up the good work :)
Chris Corbyn
Breakbeat Nuttzer
Posts: 13098
Joined: Wed Mar 24, 2004 7:57 am
Location: Melbourne, Australia

Re: Great tutorial

Post by Chris Corbyn »

thedamo wrote:That was the best regular expression tutorial I have seen. I wish I had seen this a long time ago.

Have you posted that advanced tutorial yet? If so, where can I find it?

Keep up the good work :)
Haven't really had time to knock one up (and yeah ok.... I forgot :P).

Hey, it's Saturday, I guess that's something I could do. I have a half-written regex quiz too (40 questions so far) ;)
neophyte
DevNet Resident
Posts: 1537
Joined: Tue Jan 20, 2004 4:58 pm
Location: Minnesota

Post by neophyte »

Nice work. Regex is a weakness of mine -- tutorial is appreciated.

Thanks
anjanesh
DevNet Resident
Posts: 1679
Joined: Sat Dec 06, 2003 9:52 pm
Location: Mumbai, India

Post by anjanesh »

Excellent Tutorial.
It'd be cool if someone posted a tutorial for RegEx in Apache's mod_rewrite, or just the differences from normal RegEx (like mod_rewrite doesn't support ?).
feyd
Neighborhood Spidermoddy
Posts: 31559
Joined: Mon Mar 29, 2004 3:24 pm
Location: Bothell, Washington, USA

Post by feyd »

mod_rewrite is posix regex, last I checked.
Chris Corbyn
Breakbeat Nuttzer
Posts: 13098
Joined: Wed Mar 24, 2004 7:57 am
Location: Melbourne, Australia

Post by Chris Corbyn »

For those who've asked for the followup. I've knocked one up like I said. I forced myself into it tonight since I would never have gotten around to it otherwise :D

viewtopic.php?t=40169
foobar
Forum Regular
Posts: 613
Joined: Wed Sep 28, 2005 10:08 am

Post by foobar »

Bah! Real men use the PHP Manual as reference and nothing else. :wink:

Nice tutorial, pretty noob friendly compared to abovementioned manual entry... solid work.
n00b Saibot
DevNet Resident
Posts: 1452
Joined: Fri Dec 24, 2004 2:59 am
Location: Lucknow, UP, India
Contact:

Post by n00b Saibot »

foobar wrote:Bah! Real men use the PHP Manual as reference and nothing else. :wink:
I second that :P
abalfazl
Forum Commoner
Posts: 71
Joined: Mon Sep 05, 2005 10:05 pm

The best

Post by abalfazl »

Hello friends

In fact, that was the best regular expression tutorial I have seen.

Thank you very much for that

GOOD LUCK!