Working with Regular Expressions - Search Engine Optimization with PHP

1. Open the .htaccess file you created earlier in your seophp folder, and add the following rewriting rule to it:

RewriteEngine On

# Translate my-super-product.html to /product.php?product_id=123
RewriteRule ^my-super-product\.html$ /product.php?product_id=123

# Rewrite numeric URLs
RewriteRule ^Products/P([0-9]*)\.html$ /product.php?product_id=$1 [L]

Original URL          Rewritten URL
Products/P1.html      product.php?product_id=1
Products/P2.html      product.php?product_id=2
Products/P3.html      product.php?product_id=3
Products/P4.html      product.php?product_id=4
…                     …

2. Switch back to your browser again, and this time load http://seophp.example.com/Products/P1.html. If everything works as it should, you will get the output that’s shown in Figure 3-7.

Figure 3-7

3. You can check that the rule really works, even for IDs formed of more digits. Loading http://seophp.example.com/Products/P123456.html would give you the output shown in Figure 3-8.

Figure 3-8

Congratulations! The exercise was quite short, but you’ve written your first regular expression! Take a closer look at your new rewrite rule:

RewriteRule ^Products/P([0-9]*)\.html$ /product.php?product_id=$1 [L]

If this is your first exposure to regular expressions, it must look scary! Just take a deep breath and read on: we promise, it’s not as complicated as it looks.

Appendix A contains a gentler introduction to regular expressions.

As you learned in the previous exercise, a basic RewriteRule takes two arguments. In this example it also received a special flag, [L], as a third argument. The meaning of these arguments is discussed next.

The first argument of RewriteRule is a regular expression that describes the matching URLs you want to rewrite. The second argument specifies the destination (rewritten) URL. So, in geek-speak, the RewriteRule line basically says: rewrite any URL that matches the ^Products/P([0-9]*)\.html$ pattern to /product.php?product_id=$1. In English, the same line can be roughly read as: “delegate any request to a URL that looks like Products/Pn.html to /product.php?product_id=n”.
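To see the same mapping from the PHP side, here is a minimal sketch that mimics what the rule does, using preg_match with the same pattern. The function name rewriteProductUrl is hypothetical, and this is only an illustration: mod_rewrite performs the real rewriting inside Apache, not in PHP. The # characters around the pattern are PHP regular expression delimiters.

```php
<?php
// Hypothetical helper that imitates the RewriteRule above in plain PHP.
// It returns the rewritten URL, or null when the path does not match.
function rewriteProductUrl($path)
{
    // Same pattern as the rule; the # characters are PHP regex delimiters.
    if (preg_match('#^Products/P([0-9]*)\.html$#', $path, $matches)) {
        // $matches[1] plays the role of $1 in the RewriteRule.
        return '/product.php?product_id=' . $matches[1];
    }
    return null;
}

var_dump(rewriteProductUrl('Products/P1.html'));
var_dump(rewriteProductUrl('Products/P123456.html'));
var_dump(rewriteProductUrl('Products/index.html'));
```

The first two calls return a rewritten URL; the third returns null because index.html does not fit the Pn.html shape.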

Let’s analyze in detail the regular expression that describes the matching URLs that need to be rewritten.

That regular expression is:

^Products/P([0-9]*)\.html$

Most characters, including alphanumeric characters, are read literally and simply match themselves.

Remember the first RewriteRule you wrote in this chapter to match my-super-product.html: it was made up mostly of such “normal” characters.

What makes regular expressions so powerful (and sometimes complicated) are the special characters (or metacharacters), such as ^, ., or *, which have special meanings. Table 3-3 describes the metacharacters you’ll meet.

Remember that this theory applies to regular expressions in general. URL rewriting is just one of the many areas where regular expressions are used.

Table 3-3

Metacharacter    Description

^    Matches the beginning of the line. In our case, it will always match the beginning of the URL. The domain name isn’t considered part of the URL, as far as RewriteRule is concerned. It is useful to think of ^ as “anchoring” the characters that follow to the beginning of the string, that is, asserting that they are the first part.

.    Matches any single character.

*    Specifies that the preceding character or expression can be repeated zero or more times, that is, from not at all to any number of times.

+    Specifies that the preceding character or expression can be repeated one or more times. In other words, the preceding character or expression must match at least once.

?    Specifies that the preceding character or expression can be repeated zero or one time. In other words, the preceding character or expression is optional.

{m,n}    Specifies that the preceding character or expression can be repeated between m and n times; m and n are integers, and m must not be greater than n.

Using Table 3-3 as reference, analyze the expression ^Products/P([0-9]*)\.html$. The expression starts with the ^ character, matching the beginning of the requested URL (remember, this doesn’t include the domain name). The characters Products/P assert that the next characters in the URL string match those characters.

Let’s recap: the expression ^Products/P will match any URL that starts with Products/P.

The next characters, ([0-9]*), are the crux of this process. The [0-9] bit matches any character between 0 and 9 (that is, any digit), and the * that follows indicates that the pattern can repeat, so you can have an entire number rather than just a digit. The enclosing parentheses around [0-9]* indicate that the regular expression engine should store the matching string (which will be a number) inside a variable called $1. (You’ll need this variable to compose the rewritten URL.)

Finally, you have \.html$, which means that the string should end in .html. The \ is the escaping character, which indicates that the . should be taken as a literal dot, not as “any character” (which is the significance of the . metacharacter). The $ matches the end of the string.

The second argument of RewriteRule, /product.php?product_id=$1, plugs the variables that you stored into your script URL mapping, and indicates to the web server that requests by a browser for URLs matching that previous pattern should be delegated to that particular script with those numbers substituted for $1.
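The capture-and-substitute step can be tried out in PHP with preg_replace, which accepts $1 in its replacement string much like RewriteRule does. This is a sketch for experimentation only, and the sample paths are made up. It also shows a consequence of using *: because zero repetitions are allowed, Products/P.html matches too, with an empty $1. Replacing * with + would require at least one digit.

```php
<?php
$pattern = '#^Products/P([0-9]*)\.html$#';
$target  = '/product.php?product_id=$1';

// Sample request paths (made up for this illustration).
$paths = array('Products/P42.html', 'Products/P.html', 'Products/P42.htm');

foreach ($paths as $path) {
    // preg_replace returns the subject unchanged when nothing matches.
    $result = preg_replace($pattern, $target, $path);
    echo $path, ' => ', $result, "\n";
}

// With + instead of *, at least one digit is required.
var_dump(preg_match('#^Products/P([0-9]+)\.html$#', 'Products/P.html'));
```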

Table 3-3 (continued)

Metacharacter    Description

( )    The parentheses are used to define a captured expression. The string matching the expression between parentheses can then be read as a variable. The parentheses can also be used to group the contents therein, as in mathematics, and operators such as *, +, or ? can then be applied to the resulting expression.

[ ]    Used to define a character class. For example, [abc] will match any of the characters a, b, c. The - character can be used to define a range of characters. For example, [a-z] matches any lowercase letter. If - is meant to be interpreted literally, it should be the last character before ]. Many metacharacters lose their special function when enclosed between [ and ], and are interpreted literally.

[^ ]    Similar to [ ], except it matches everything except the mentioned character class. For example, [^a-c] matches all characters except a, b, and c.

$    Matches the end of the line. In our case, it will always match the end of the URL. It is useful to think of it as “anchoring” the previous characters to the end of the string, that is, asserting that they are the last part.

\    The backslash is used to escape the character that follows. It is used to escape metacharacters when you need them to be taken for their literal value, rather than their special meaning. For example, \. will match a dot, rather than “any character” (the typical meaning of the dot in a regular expression). The backslash can also escape itself, so if you want to match C:\Windows, you’ll need to refer to it as C:\\Windows.
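A quick way to get a feel for the metacharacters in Table 3-3 is to try them with preg_match. The following sketch uses made-up sample strings, and the helper name matches is hypothetical.

```php
<?php
// Hypothetical helper: true when $pattern matches $subject.
function matches($pattern, $subject)
{
    return preg_match($pattern, $subject) === 1;
}

var_dump(matches('#^ab*c$#', 'ac'));        // * allows zero b's
var_dump(matches('#^ab+c$#', 'ac'));        // + needs at least one b
var_dump(matches('#^colou?r$#', 'color'));  // ? makes the u optional
var_dump(matches('#^[0-9]{2,4}$#', '123')); // two to four digits
var_dump(matches('#^[^a-c]$#', 'd'));       // any character except a, b, c
var_dump(matches('#^a.c$#', 'abc'));        // . matches any single character
var_dump(matches('#^a\.c$#', 'abc'));       // \. matches only a literal dot

// The regex needs \\ to match one backslash; inside a single-quoted
// PHP string that is written as four backslashes.
var_dump(matches('#^C:\\\\Windows$#', 'C:\Windows'));
```

Only the + test and the \. test fail to match; the others succeed.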

The second argument of RewriteRule isn’t written using the regular expressions language. Indeed, it doesn’t need to be, because it’s not meant to match anything. Instead, it simply supplies the form of the rewritten URL. The only part with a special significance here is the $1 variable, whose value is extracted from the expression between parentheses of the first argument of RewriteRule.

As you can see, this rule does indeed rewrite any request for a URL of the form Products/Pn.html to product.php?product_id=n, which can be executed by the product.php script you wrote earlier.

At the end of a rewrite rule you can also add special flags for the new URL by appending one or more flag arguments. These arguments are specific to the RewriteRule command, and not to regular expressions in general. Table 3-4 lists the possible RewriteRule arguments. These rewrite flags must always be placed in square brackets at the end of an individual rule.

Table 3-4

RewriteRule Option   Significance   Description
R                    Redirect       Sends an HTTP redirect
F                    Forbidden      Forbids access to the URL
G                    Gone           Marks the URL as gone
P                    Proxy          Passes the URL to mod_proxy
L                    Last           Stops processing further rules
N                    Next           Starts processing again from the first rule, but using the current rewritten URL
C                    Chain          Links the current rule with the following one
T                    Type           Forces the mentioned MIME type
NS                   Nosubreq       The rule applies only if no internal sub-request is performed
NC                   Nocase         URL matching is case-insensitive
QSA                  Qsappend       Appends a query string part to the new URL instead of replacing it
PT                   Passthrough    Passes the rewritten URL to another Apache module for further processing
S                    Skip           Skips the next rule
E                    Env            Sets an environment variable

Some of these flags are used in later chapters as well.

RewriteRule commands are processed in sequential order as they are written in the configuration file. If you want to make sure that a rule is the last one processed in case a match is found for it, you need to use the [L] flag as listed in the preceding table:

RewriteRule ^Products/P([0-9]*)\.html$ /product.php?product_id=$1 [L]

This is particularly useful if you have a long list of RewriteRule commands, because using [L] improves performance and prevents mod_rewrite from processing all the RewriteRule commands that follow once a match is found. This is usually what you want regardless.
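The effect of [L] can be pictured with a deliberately simplified PHP model. This is not how mod_rewrite is implemented; the applyRules function, the rule array layout, and the catch-all rule are assumptions made for this sketch. Rules are tried in order, each rule that matches rewrites the current URL, and a matching rule marked as last stops the loop.

```php
<?php
// Simplified, hypothetical model of sequential rule processing.
// Each rule is array(pattern, replacement, isLast).
function applyRules(array $rules, $url)
{
    foreach ($rules as $rule) {
        list($pattern, $replacement, $isLast) = $rule;
        if (preg_match($pattern, $url)) {
            $url = preg_replace($pattern, $replacement, $url);
            if ($isLast) {
                break; // like [L]: skip all remaining rules
            }
        }
    }
    return $url;
}

$rules = array(
    array('#^Products/P([0-9]*)\.html$#', '/product.php?product_id=$1', true),
    // A catch-all rule that would otherwise also rewrite the result above.
    array('#^(.*)$#', 'catch-all.php?url=$1', false),
);

echo applyRules($rules, 'Products/P7.html'), "\n";
echo applyRules($rules, 'about.html'), "\n";
```

In this model the product URL is rewritten once and left alone, while about.html falls through to the catch-all rule. Setting the first rule's last marker to false would let the catch-all rewrite the product URL a second time.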

URL Rewriting and PHP

Regular expressions are also supported by PHP. Whenever you need to do string manipulation that is too hard to implement using the typical PHP string functions (http://www.php.net/strings), the regular expression functions can come in handy, as soon as you get accustomed to working with regular expressions.

PHP has many regular expression functions, the most common of which are listed in Table 3-5 for your convenience. The examples in this book use only preg_match and preg_replace, which are the most frequently used, but it’s good to know there are others. For more information, visit http://www.php.net/pcre.

Table 3-5

PHP Function    Description

preg_grep    Receives as parameters a regular expression and an array of input strings. The function returns an array containing the elements of the input that match the pattern. More details at http://www.php.net/preg_grep.

preg_match    Receives as parameters a regular expression and a string. If a match is found, the function returns 1. Otherwise, it returns 0. The function doesn’t search for more matches: after one match is found, the execution stops. More details at http://www.php.net/preg_match.

One detail worth keeping in mind is that the regular PHP string manipulation functions execute much faster than regular expressions, so you should use regular expressions only when necessary. For example, if you simply want to check whether one string literal is included in another, using strpos() or strstr() would be much more efficient than using preg_match().
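As a small illustration of that advice, the sketch below checks for a fixed substring with strpos() and reserves preg_match() for a case that actually needs a pattern; it ends with a short preg_grep call. The sample strings are made up. Note the strict !== false comparison: strpos() returns 0 when the substring is found at the very start, and 0 is loosely equal to false.

```php
<?php
$url = 'Products/P15.html';

// Fixed substring: a plain string function is enough.
$hasProducts = (strpos($url, 'Products/') !== false);

// Variable part (any run of digits): this is a job for a regular expression.
$found = preg_match('#P([0-9]+)\.html$#', $url, $matches);

var_dump($hasProducts);
var_dump($found);
var_dump($matches[1]); // the captured digits, as a string

// preg_grep keeps only the array elements that match the pattern
// (the original array keys are preserved).
$files = array('P1.html', 'index.html', 'P22.html');
print_r(preg_grep('#^P[0-9]+\.html$#', $files));
```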

Unlike with mod_rewrite, when you supply a regular expression to a PHP function you need to enclose the regular expression definition in a delimiter character. The delimiter can be any character that is not alphanumeric, a backslash, or whitespace, but if the same character needs to appear in the regular expression itself, it must be escaped using the backslash (\) character. The examples here use # as the delimiter. So if your expression needs to express a literal #, it should be written as \#.
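For instance, with # as the delimiter, the slashes in a URL need no escaping, while a literal # does. The sample strings below are made up for this sketch.

```php
<?php
// With # as the delimiter, / needs no escaping.
$withHash = preg_match('#^Products/P[0-9]+\.html$#', 'Products/P5.html');

// The same pattern with / as the delimiter: every literal / must be escaped.
$withSlash = preg_match('/^Products\/P[0-9]+\.html$/', 'Products/P5.html');

// A literal # inside a #-delimited pattern must be written as \#.
$literalHash = preg_match('#^page\.html\#top$#', 'page.html#top');

var_dump($withHash, $withSlash, $literalHash);
```

All three calls report a match; choosing # simply keeps URL patterns easier to read.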
