RegEx

The abbreviations regex and regexp stand for regular expression. Regular expressions are used in theoretical computer science, programming, software development, word processing and search engine optimization. They describe strings and sets of strings in a general, formal way so that these can be searched for, replaced, manipulated or further processed in documents, source code or databases.

Example: A regex-enabled text editor is supposed to display all the links contained in an HTML file. If the expression <a href="[^"]*"[^>]*> is entered into the search function of the editor, it finds all links that follow the usual format for HTML links. The expression <a href=".*?".*?>, which uses lazy quantifiers, performs the same task.
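The same search can be tried outside an editor. The following is a minimal sketch using Python's re module; the HTML snippet and the variable names are assumptions made for illustration only.

```python
import re

# Hypothetical HTML snippet, used only for illustration.
html = (
    '<a href="https://example.com/">Home</a> '
    '<a href="/about" title="About">About</a>'
)

# The two expressions from the example above.
strict = re.compile(r'<a href="[^"]*"[^>]*>')
lazy = re.compile(r'<a href=".*?".*?>')

strict_hits = strict.findall(html)
lazy_hits = lazy.findall(html)

print(strict_hits)
print(lazy_hits)
```

Both expressions only find opening tags in which href is the first attribute. In Python, the dot does not match a line break by default, so the second expression misses tags that span several lines unless the re.DOTALL flag is set.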

General information

The logician and mathematician Stephen Kleene is considered to be the founder of regular expressions. In 1956 he used a notation for regular sets in a paper on the representation of events in neural networks and finite automata. This and other works are now among the foundations of theoretical computer science. Regular expressions are used today in various fields to simplify operations that would otherwise be very labor-intensive and time-consuming.

Depending on the implementation, regex can be used in a range of programming languages, environments and text editors: in Perl, PHP, .NET or JavaScript as a language feature or library,[1] and in EditPad, Emacs and Notepad++ as a search-and-replace function. Google Analytics also uses regular expressions to filter traffic sources, define segments and separate detailed report data from other data.

Functionality

The uses of regex are extremely diverse. Which regular expressions are possible depends on the notation, and different programming languages and tools use different notations. Common notations include shell patterns, BRE (Basic Regular Expressions) and ERE (Extended Regular Expressions). The differences arise partly because individual characters, and especially metacharacters (control characters), are treated differently from one notation to another.

Generally, a distinction is made between (terminal) characters and metacharacters. The characters come from the character set (the alphabet), which contains, for example, digits, letters and commas. The metacharacters specify operations such as alternation with |, grouping with (), character sets with [] and repetition with *, + and ?. A ^ at the start of a character set negates it. Metacharacters are instructions for the processing software. Regular characters can appear before or after metacharacters, and the meaning of the expression changes accordingly. Most implementations rely on a regex engine that parses and interprets the regular expression and checks the input for matches.
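The operations named above can be illustrated with a short sketch in Python's re module. The sample strings and variable names are assumptions chosen for illustration; other regex engines may differ in detail.

```python
import re

# Alternation: either "cat" or "dog".
alternation = re.findall(r"cat|dog", "cat, dog, bird")

# Grouping: the quantifier + applies to the whole group "ab".
grouped_ok = re.fullmatch(r"(ab)+", "ababab") is not None
grouped_bad = re.fullmatch(r"(ab)+", "aba") is not None

# Repetition: * (zero or more), + (one or more), ? (zero or one).
sample = "ac abc abbbc"
star = re.findall(r"ab*c", sample)
plus = re.findall(r"ab+c", sample)
optional = re.findall(r"ab?c", sample)

# Negated character set: everything except vowels and spaces.
negated = re.findall(r"[^aeiou ]+", "regular set")

print(alternation, grouped_ok, grouped_bad)
print(star, plus, optional)
print(negated)
```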

• Regular characters: All digits from 0 to 9, all the letters of an alphabet and some special characters (commas, dashes, semicolons). Important: The alphabet depends on the character set used (for example, Unicode or ASCII).
• Character classes: \d, for example, matches a digit from 0 to 9, while \t matches a tab character. Other options are \s for whitespace and, in some implementations, \l for lowercase and \u for uppercase letters.
• Metacharacters:
 [] () {} | ? + - * ^ $ \
A preceding backslash escapes a metacharacter, so that it is matched as a literal character.
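A minimal sketch of character classes and escaping, again in Python's re module. The sample text is an assumption for illustration. Python supports \d, \s and \t but not \l or \u as character classes, so those two are left out here.

```python
import re

# Hypothetical sample text; "\t" in the Python string is a real tab character.
text = "Order 66 costs $4.50\tplus tax (approx.)"

digits = re.findall(r"\d", text)   # every single digit
tabs = re.findall(r"\t", text)     # every tab character

# $ and . are metacharacters, so they are escaped to match them literally.
price = re.findall(r"\$\d+\.\d+", text)

# An unescaped dot matches any character, an escaped dot only a dot.
unescaped = re.findall(r"4.5", "4.5 4x5 415")
escaped = re.findall(r"4\.5", "4.5 4x5 415")

# re.escape() adds the backslashes for a literal search string.
literal_found = re.search(re.escape("(approx.)"), text) is not None

print(digits, tabs, price)
print(unescaped, escaped, literal_found)
```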

Practical relevance

The following methods may be implemented with regular expressions:

• Pattern matching: Using a string-matching algorithm, texts can be checked for occurrences of a pattern. The regular expression stands for a set of strings, and occurrences of these strings are located in the text. The regular expression specifies the pattern; the engine checks the pattern against a resource (for example, an HTML document or a text). Where supported, a replacement rule can be specified to change the matched strings directly. Quantifiers can be used to narrow down the results. Examples: checking an entered email address for formal correctness, or searching for top-level domains in a list of URLs.
• Globbing: Placeholders (wildcards) are added to file names in order to select, for example, all files with a particular name or format. The wildcard “sample.*” would find all files in a file management system that start with “sample” but have different file formats such as .txt or .doc. The asterisk stands for the varying file extensions. Globbing can also be abused in denial-of-service attacks, in which servers are intentionally overloaded.[2]
• Truncation: In database searches, search terms are often abbreviated or truncated using wildcards. The term sample* would find all terms that begin with sample and continue with any other characters, such as sample matching, sample testing or sample example. Truncation enlarges the search space. Example: In a library search, all entries that contain a specific search term can be found.
• Stemming: In stemming, different morphological variants of a word are traced back to the word stem. Declensions and conjugations of words can thus be reduced to their linguistic stem or root word. This method is used in information retrieval (for example, by search engines) and in theoretical computer science. Example: Google probably uses a similar process in the context of organic search.[3]
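The pattern matching, globbing and truncation examples from the list above can be sketched in Python as follows. The email and top-level-domain patterns are deliberately simplified assumptions, not complete validators, and all function names, file names and sample values are hypothetical.

```python
import re
from fnmatch import fnmatchcase

# Simplified email check (assumption: covers only common address shapes).
EMAIL = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$")

def looks_like_email(value):
    """Return True if value has the rough shape of an email address."""
    return EMAIL.match(value) is not None

# Captures the last label of the host name in an http(s) URL.
TLD = re.compile(r"^https?://[^/]+\.([a-z]{2,})(?:[/:?#]|$)", re.IGNORECASE)

def top_level_domains(urls):
    """Collect the top-level domain of every URL that matches TLD."""
    found = []
    for url in urls:
        match = TLD.match(url)
        if match:
            found.append(match.group(1).lower())
    return found

# Globbing: the shell-style wildcard "sample.*" applied to file names.
files = ["sample.txt", "sample.doc", "example.txt", "sample"]
globbed = [name for name in files if fnmatchcase(name, "sample.*")]

# Truncation: "sample*" written as a regex over whole words.
truncated = re.findall(r"\bsample\w*", "sample samples sampled example")

print(looks_like_email("jane.doe@example.com"))
print(top_level_domains(["https://www.example.com/path", "http://example.org"]))
print(globbed)
print(truncated)
```

Note that the glob “sample.*” and the regex sample.* mean different things: in a glob the dot is literal and the asterisk stands for any characters, whereas in a regex the dot matches any character and the asterisk repeats it.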

Importance for search engine optimization

Regex can be extremely useful for some tasks in search engine optimization.[4] Tracking and analysis tools such as Google Analytics support regex.[5]

In Google Analytics, regular expressions can be used to set filters for IP addresses. Individual filters can be defined in the profile settings to exclude the IP addresses of one or more visitors, so that traffic from a range of IP addresses is not included in the reports. This is useful for excluding irrelevant visits, such as those of your own employees, from visitor statistics.
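What such an IP filter pattern could look like is sketched below in Python. The address range 203.0.113.0 to 203.0.113.15 is a hypothetical office range taken from a documentation address block, and the function and variable names are assumptions. The sketch only illustrates the pattern itself, not the filter settings of Google Analytics.

```python
import re

# Hypothetical internal range: 203.0.113.0 to 203.0.113.15.
# The dots are escaped; the last octet is either one digit or 10 to 15.
INTERNAL = re.compile(r"^203\.0\.113\.([0-9]|1[0-5])$")

def is_internal(ip):
    """Return True if ip lies in the assumed internal range."""
    return INTERNAL.match(ip) is not None

visits = [
    "203.0.113.7",
    "203.0.113.15",
    "203.0.113.16",
    "198.51.100.7",
    "203.0.113.70",
]

# Keep only visits that do not come from the internal range.
external = [ip for ip in visits if not is_internal(ip)]
print(external)
```

The anchors ^ and $ matter here: without them, 203.0.113.70 would also count as internal, because its beginning matches 203.0.113.7.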

Different segments can also be processed in Google Analytics using regex. For example, searches that contain a brand name can be excluded. For this purpose, a segment would be defined that includes only organic traffic and excludes the brand name, which has been defined in advance using regex: “[Ss]ample company” covers spellings with an uppercase and a lowercase first letter. Different types of keywords can likewise be excluded to find out how much traffic is generated with two or three specific keywords. The same applies to traffic from other sources such as newsletters, emails and link partnerships with external websites.
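The brand exclusion can be sketched in Python as follows. The pattern “[Ss]ample company” and the keyword list are assumptions for illustration; the sketch shows the matching logic only, not how a segment is configured in Google Analytics.

```python
import re

# Character set: matches "Sample company" and "sample company" only.
BRAND = re.compile(r"[Ss]ample company")

# Alternative: ignore case for the whole expression.
BRAND_ANY_CASE = re.compile(r"sample company", re.IGNORECASE)

# Hypothetical organic search keywords.
keywords = [
    "Sample company login",
    "sample company prices",
    "regex tutorial",
    "SAMPLE COMPANY jobs",
]

non_brand = [k for k in keywords if not BRAND.search(k)]
non_brand_any_case = [k for k in keywords if not BRAND_ANY_CASE.search(k)]

print(non_brand)
print(non_brand_any_case)
```

The character set only varies the first letter, so a keyword written entirely in capitals slips through the first filter. A case-insensitive match catches it, where the tool in use offers that option.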

A similar tactic can be useful for monitoring social media channels. In this case, a source would be defined by listing the possible sources in a regular expression, for example “facebook|twitter|youtube|LinkedIn”. Google Analytics is not the only tool that offers options which can be exploited with regular expressions.[6] Weblogs and server environments can interpret and process regex as well. Websites can thus be redirected and labeled as canonical by means of patterns that are described with regex.[7]
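Both ideas, matching a list of sources by alternation and redirecting URLs by pattern, can be sketched in Python. The source names, the URL scheme /blog/<year>/<slug> and the function redirect_target are hypothetical; real server environments use their own configuration syntax for such rules.

```python
import re

# Alternation over the possible social sources, ignoring case.
SOCIAL = re.compile(r"facebook|twitter|youtube|linkedin", re.IGNORECASE)

# Hypothetical traffic sources.
sources = ["m.facebook.com", "google", "LinkedIn", "newsletter", "youtube.com"]
social = [s for s in sources if SOCIAL.search(s)]

# Assumed old URL scheme: /blog/<4-digit year>/<slug>, optional trailing slash.
OLD_PATH = re.compile(r"^/blog/(\d{4})/([a-z0-9-]+)/?$")

def redirect_target(path):
    """Return the new path for an old blog URL, or None if no rule applies."""
    match = OLD_PATH.match(path)
    if match is None:
        return None
    return "/articles/" + match.group(2)

print(social)
print(redirect_target("/blog/2015/regex-basics/"))
print(redirect_target("/about"))
```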

References

1. Tools. regular-expressions.info. Accessed on 09/11/2015
2. Globbing definition. searchsecurity.techtarget.com. Accessed on 09/11/2015