Lesson 8	Regular expressions
Objective	Regular expressions define patterns

Use Regular Expressions to define Patterns

Regular expressions (regex) are a powerful tool for defining and matching patterns in text. They consist of a sequence of characters that specify a search pattern, which can be used for tasks like validation, searching, and replacing text. This lesson explains how to use regex to define patterns, with examples and common components.
Basics of Regular Expressions: A regular expression is written as a string of characters, where some characters have special meanings (metacharacters), while others are literal. You can use them in programming languages like Python, JavaScript, or tools like grep or text editors.
Key Components of Regex Patterns

Literal Characters: Match exactly what’s written.
- Example: cat matches "cat" in "The cat is here."
Metacharacters: Special characters with specific meanings.
- . : Matches any single character except newline.
- * : Matches 0 or more occurrences of the preceding character or group.
- + : Matches 1 or more occurrences.
- ? : Matches 0 or 1 occurrence.
- ^ : Matches the start of a string.
- $ : Matches the end of a string.
- | : Acts as an OR operator.
- [] : Defines a character class (e.g., [a-z] matches any lowercase letter).
- () : Groups parts of the pattern.
Quantifiers: Specify how many times a character or group should appear.
- {n} : Exactly n occurrences.
- {n,} : At least n occurrences.
- {n,m} : Between n and m occurrences.
Escape Sequences: Use \ to escape metacharacters or denote special sequences.
- \d : Matches any digit (0-9).
- \w : Matches any word character (a-z, A-Z, 0-9, _).
- \s : Matches any whitespace (space, tab, etc.).
- \b : Matches a word boundary.
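
As a minimal sketch of the components listed above, the following Python snippet runs a few of them against short sample strings. The helper name find_all and the sample texts are assumptions made for illustration; they are not part of the lesson.

```python
import re

def find_all(pattern, text):
    """Return every non-overlapping match of pattern in text as a list of strings."""
    return [m.group(0) for m in re.finditer(pattern, text)]

# (pattern, sample text) pairs; the sample texts are invented for this sketch.
samples = [
    (r"c.t", "cat cot cut ct"),        # . stands for any single character
    (r"ab*c", "ac abc abbbc"),         # * allows zero or more of the preceding item
    (r"ab+c", "ac abc abbbc"),         # + requires at least one
    (r"\d{2,3}", "7 42 123 12345"),    # {n,m} allows between n and m repetitions
    (r"\bcat\b", "cat concat cats"),   # \b limits the match to a whole word
    (r"gr(a|e)y", "gray grey griy"),   # | picks between alternatives inside ()
]

for pattern, text in samples:
    print(pattern, "->", find_all(pattern, text))
```

Note that \d{2,3} finds "123" and then "45" inside "12345": without anchors or word boundaries, a pattern may match part of a longer run.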

How to Define Patterns

To define a pattern, combine these elements based on what you want to match. Here’s a step-by-step guide with examples:

Simple Literal Match
- Pattern: hello
- Matches: "hello" in "hello world"
- Use case: Find exact words or phrases.
Match Variations with Character Classes
- Pattern: [A-Za-z]+
- Matches: Any sequence of one or more letters (e.g., "Apple", "cat").
- Explanation: [A-Za-z] matches any letter, + means "one or more."
Match Specific Formats (e.g., Phone Number)
- Pattern: \d{3}-\d{3}-\d{4}
- Matches: "123-456-7890"
- Explanation: \d{3} matches three digits, - is literal, repeated for the format.
Match Optional Elements
- Pattern: colou?r
- Matches: "color" or "colour"
- Explanation: ? makes the "u" optional.
Match Start or End of String
- Pattern: ^start|end$
- Matches: "start" at the beginning or "end" at the end of a string.
- Example: Matches "start here" or "this will end".
Complex Patterns (e.g., Email Validation)
- Pattern: ^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$
- Breakdown:
  - ^ : Start of string.
  - [a-zA-Z0-9._%+-]+ : Username (letters, numbers, some symbols).
  - @ : Literal "@".
  - [a-zA-Z0-9.-]+ : Domain name.
  - \. : Literal dot.
  - [a-zA-Z]{2,} : Top-level domain (e.g., "com", "org").
  - $ : End of string.
- Matches: "user@example.com"
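
The simpler patterns from the steps above can be tried out in Python roughly as follows. This is a sketch only: the sample strings and variable names are assumptions chosen for illustration.

```python
import re

# Sample strings are invented for this sketch.
words = re.findall(r"[A-Za-z]+", "Apple pie, 2 cats")
phone = re.search(r"\d{3}-\d{3}-\d{4}", "Call 123-456-7890 today")
spellings = re.findall(r"colou?r", "color and colour")

print(words)            # ['Apple', 'pie', 'cats']
print(phone.group(0))   # 123-456-7890
print(spellings)        # ['color', 'colour']

# ^start|end$ reads as (^start) OR (end$), so either alternative is enough.
edge = re.compile(r"^start|end$")
edge_results = [bool(edge.search(s))
                for s in ["start here", "this will end", "the end is near"]]
print(edge_results)     # [True, True, False]
```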

Practical Examples in Python

Here’s how you’d use regex in Python with the re module:

Example 1: Find all words

```python
import re

text = "The quick brown fox jumps over 42 lazy dogs."
pattern = r"\w+"
matches = re.findall(pattern, text)
print(matches)  # ['The', 'quick', 'brown', 'fox', 'jumps', 'over', '42', 'lazy', 'dogs']
```

Example 2: Validate an email

```python
import re

email = "user@example.com"
pattern = r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$"
if re.match(pattern, email):
    print("Valid email")
else:
    print("Invalid email")
```
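
One detail worth knowing: in Python, $ also matches just before a newline at the very end of the string, so re.match with ^...$ accepts an address followed by "\n". A possible variant using re.fullmatch avoids this. The function name is_valid_email and the sample addresses are assumptions, and the pattern remains a simple format check rather than a complete email validator.

```python
import re

# Same character classes as the lesson's pattern; fullmatch makes ^ and $ unnecessary.
EMAIL_PATTERN = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")

def is_valid_email(address):
    """Return True only if the whole string fits the pattern."""
    return EMAIL_PATTERN.fullmatch(address) is not None

for candidate in ["user@example.com", "user@example", "@example.com", "user@example.com\n"]:
    print(repr(candidate), is_valid_email(candidate))
```

Only the first candidate is accepted; the last one is rejected because of its trailing newline.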

Tips for Defining Patterns

Start Simple: Begin with literal matches, then add complexity.
Test Your Patterns: Use tools like regex101.com to experiment and debug.
Be Specific: Avoid overly broad patterns (e.g., .*) unless necessary.
Anchor When Needed: Use ^ and $ to ensure exact matches.
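
To illustrate the last two tips, here is a small sketch comparing a broad pattern with a more specific one, and an unanchored pattern with an anchored one. The sample strings are assumptions made for this example.

```python
import re

markup = "<b>bold</b> and <i>italic</i>"

# .* is greedy: it runs from the first "<" to the last ">".
broad = re.findall(r"<.*>", markup)
# [^>]+ stops at the first ">", so each tag is matched separately.
specific = re.findall(r"<[^>]+>", markup)

print(broad)     # ['<b>bold</b> and <i>italic</i>']
print(specific)  # ['<b>', '</b>', '<i>', '</i>']

# Without anchors, four digits are found inside a longer string.
unanchored = bool(re.search(r"\d{4}", "id-123456"))
anchored = bool(re.search(r"^\d{4}$", "id-123456"))
exact = bool(re.search(r"^\d{4}$", "2024"))
print(unanchored, anchored, exact)  # True False True
```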

Command Parameters

As you use UNIX commands, you will frequently want to refer to a set of files rather than to a single file. To do this, you use a wildcard character. For example, using the asterisk wildcard character, the expression *doc can be used in a command to include all filenames ending with “doc.” Wildcards are a close relative of the UNIX feature called regular expressions: the shell expands wildcards into filenames, while commands such as grep use regular expressions to match text. Both allow complex patterns to be expressed using special characters. These patterns are used in many commands to refer to filenames, text within files, or other pieces of data. By learning to use them, you can concisely express exactly which file or data you want to work on. The following five basic codes are used in these patterns:

* An asterisk represents zero or more characters in a filename pattern; it does not matter which characters. In a regular expression applied to text (for example, with grep), it means zero or more repetitions of the preceding character.
. A period represents exactly one character in a regular expression. This can be any character. (In filename patterns, the question mark plays this role.)
[ ] Characters within brackets are treated as a list; any single character within the brackets can be used to match the string that is being searched.
^ A caret ties a pattern to the beginning of a line (used for text within files, not for filenames).
$ A dollar sign ties a pattern to the end of a line (used for text within files, not for filenames).
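
The asterisk behaves differently in shell-style filename patterns and in regular expressions applied to text. The sketch below uses Python's fnmatch module as a stand-in for shell wildcard matching and the re module for regular expressions; the filenames are invented for illustration.

```python
import fnmatch
import re

names = ["notes.doc", "report.doc", "doc.txt", "archive.tgz"]

# Shell-style wildcard: * stands for any run of characters.
glob_hits = [n for n in names if fnmatch.fnmatchcase(n, "*doc")]

# A regular expression with the same intent: "doc" at the end of the name.
regex_hits = [n for n in names if re.search(r"doc$", n)]

print(glob_hits)   # ['notes.doc', 'report.doc']
print(regex_hits)  # ['notes.doc', 'report.doc']

# In a regex, * repeats the preceding item, so a leading * has nothing to repeat.
try:
    re.compile("*doc")
    leading_star_ok = True
except re.error:
    leading_star_ok = False
print(leading_star_ok)  # False
```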

Special Characters
The examples below should help you to use each of these special characters.
Suppose you want to refer to all filenames that end with the file extension tgz. The following pattern will list them:
```
*tgz
```
A period is helpful when you know how many characters are part of the pattern you are searching for, but you do not know which ones will be used. For example, you want to search for the word “complimentary,” but you think it may be spelled as complementary or even complementery in some places. This regular expression will find all three words: compl.ment.ry
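
A quick way to check this expression is to run it over a sample line in Python. The sample text is an assumption; it includes a variant with a missing letter and one with unexpected characters to show what the period does and does not accept.

```python
import re

text = "complimentary complementary complementery complementry compl3ment_ry"
hits = re.findall(r"compl.ment.ry", text)
print(hits)
# ['complimentary', 'complementary', 'complementery', 'compl3ment_ry']
```

The period accepts any character, including digits and underscores, but it always requires exactly one, so "complementry" is not matched.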
Suppose you need to match a filename that starts with the word Chapter, but may begin with an upper or lowercase “C.” The following regular expression will match filenames beginning with both Chapter and chapter:
```
[cC]hapter*
```
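
The bracket list can be tried out the same way. This sketch again assumes Python's fnmatch module as a stand-in for shell filename matching, with invented filenames.

```python
import fnmatch

names = ["Chapter1.txt", "chapter2.txt", "chapters.doc", "Appendix.txt", "MyChapter.txt"]

# [cC] accepts either case for the first letter; * accepts whatever follows.
matches = [n for n in names if fnmatch.fnmatchcase(n, "[cC]hapter*")]
print(matches)  # ['Chapter1.txt', 'chapter2.txt', 'chapters.doc']
```

"MyChapter.txt" is not included because a filename pattern has to match the whole name from its first character.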
Or suppose you needed to search a file for all the lines that begin with the word “Thus.” The following pattern would do the trick (with uppercase and lowercase included for good measure): ^[Tt]hus
A dollar sign ties a pattern to the end of a line of text. For example, if you are searching a list of records with a state code at the end of each line, you could use this regular expression to search for lines ending with the Vermont state code, VT:
```
VT$
```
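
The two line anchors can be sketched in Python as follows. The records are invented sample data, and re.MULTILINE is used so that ^ and $ apply to each line rather than to the whole string.

```python
import re

records = """Thus the first record ends here, VT
thus a second one, NH
Burlington is in VT
VT appears first here, NY"""

# ^ ties the pattern to the start of each line.
starts_with_thus = re.findall(r"^[Tt]hus.*$", records, flags=re.MULTILINE)

# $ ties the pattern to the end of each line.
ends_with_vt = [line for line in records.splitlines() if re.search(r"VT$", line)]

print(starts_with_thus)
print(ends_with_vt)
```

The last record contains VT but does not end with it, so VT$ skips it.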

Escaping a Special Character

Often you will want to “escape” a special character within a regular expression to indicate that it should be interpreted literally and not used as a special pattern indicator. For example, to search for the character “*”, you must include \* in your search expression. The backslash indicates that the following character, the asterisk, should be interpreted literally, and not as the wildcard character that represents zero or more characters.

Regular expressions can become quite complex and even difficult to interpret at first. But they are used in many situations, from simple ls and grep commands to scripting with awk, Perl, and even C programming. The following descriptions state what several sample regular expressions match within the context of the given command.

Common Regular Expressions

- Matches the word VISA in any combination of upper and lowercase letters
- Matches both principle and principal (to catch misspelled words) as well as other words starting with princip and two more letters
- Matches a dollar sign followed by one digit
- Matches any word starting with Fig or fig
- Matches any word that starts with The and is the first word on a line
- Matches any word with an “a” before a “c” (and zero or more letters between them)
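
The expressions behind these descriptions are not shown here. As a hedged sketch, the Python patterns below are one possible fit for each description, in the same order. They are hypothetical reconstructions, not the original expressions, and the sample texts are invented.

```python
import re

# Hypothetical patterns: each is one possible fit for a description above,
# not the expression from the original diagram. Sample texts are invented.
candidates = [
    (re.compile(r"visa", re.IGNORECASE), "Pay by Visa or VISA"),
    (re.compile(r"princip.."), "principle principal principia"),
    (re.compile(r"\$\d"), "cost: $5 or $12"),  # \$ is an escaped, literal dollar sign
    (re.compile(r"\b[Ff]ig\w*"), "See Fig 2, figure 3, and config"),
    (re.compile(r"^The\w*", re.MULTILINE), "Then we stop.\nOver There\nThe end"),
    (re.compile(r"\b\w*a\w*c\w*"), "abc attic cab back"),
]

results = [pattern.findall(text) for pattern, text in candidates]
for (pattern, _), found in zip(candidates, results):
    print(pattern.pattern, "->", found)
```

The dollar-sign pattern also shows escaping at work: without the backslash, $ would be read as the end-of-line anchor instead of a literal character.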

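By default, the Bash shell does not support +([0-9])-type patterns. They are extended filename patterns, available only when the extglob shell option is turned on (shopt -s extglob).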

In the next lesson, we wrap up this module.
