Corning Community College

UNIX/Linux Fundamentals


Lab #8: Pattern Matching with Regular Expressions




Objective

To become familiar with Regular Expressions and their applications.

Reading

There are many documents available on Regular Expressions. Check out some of the following:

In "Harley Hahn's Guide to UNIX and Linux", please read Chapter 20, "Regular Expressions", on pages 497-519.

For anyone with access to the book "UNIX in a Nutshell, 3rd Edition", Chapter 6 in the book deals with Regular Expressions on pages 295-301.

Background

Back in Lab #4, you were introduced to file wildcards: *, ?, and [ ]. As you should have experienced, this useful functionality of the UNIX shell allows for some fairly precise filename searching.

Now, with UNIX being as flexible as it is, a similar "wildcard" idea can also be applied to text processing, although some symbols (such as *) behave differently than they do in the shell. That's right! You can search through streams of text for occurrences of letters, specific words, or specific places in the text.

These Regular Expressions are as follows:

Reg. Exp.   Description
.           Match any single character
*           Match 0 or more of the preceding character or expression
^           Beginning of line or string
$           End of line or string
[ ]         Character class - match any one of the enclosed characters
[^ ]        Negated character class - match any one character that is not enclosed
\<          Beginning of word
\>          End of word

NOTE: With character classes, you can specify ranges, such as the uppercase alphabet: [A-Z]

Sets of ranges are simply written one after another, with no commas between them. For example, both the lowercase alphabet and any numeric digit: [a-z0-9]

The first five (., *, ^, $, and [ ]) are considered the basic Regular Expressions. All programs that support Regular Expressions will support these basic types; the word anchors \< and \> are extensions that some tools do not provide.
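As a quick way to try the five basic expressions, the sketch below runs grep against a small, made-up word list (the file contents and variable names are assumptions for illustration only, not part of the lab). LC_ALL=C is set so that ranges such as [a-z] follow plain ASCII order.

```shell
# Use the C locale so ranges like [a-z] follow ASCII order.
export LC_ALL=C

# Hypothetical sample data, created only for this illustration.
sample=$(mktemp)
printf '%s\n' alpha Bravo charlie7 DELTA 42 cat cot ct > "$sample"

# . matches exactly one character: c, any character, t.
dot=$(grep '^c.t$' "$sample")

# * matches zero or more of the preceding character: c, any number of o, t.
star=$(grep '^co*t$' "$sample")

# A range inside a character class: lines containing a digit.
digits=$(grep '[0-9]' "$sample")

# Two classes in a row: an uppercase letter followed by a lowercase letter.
mixed=$(grep '^[A-Z][a-z]' "$sample")

# Concatenated ranges, anchored at both ends: only lowercase letters and digits.
lower_alnum=$(grep '^[a-z0-9]*$' "$sample")

printf '%s\n' "$dot" "$star" "$digits" "$mixed" "$lower_alnum"
rm -f "$sample"
```

Note how the ^ and $ anchors force the pattern to account for the whole line; without them, grep reports any line that contains a match anywhere.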

If you include the ^ as the first character inside the character class, it will change the function to exclude the enclosed characters, for example:

[^abcd]

This is a character class that matches any single character other than a, b, c, or d.
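One subtlety worth seeing in action: a negated class still has to match a character, so [^abcd] means "some character that is not a, b, c, or d", not "a line without those letters". The following is a minimal sketch on a made-up word list (the data and variable names are hypothetical):

```shell
export LC_ALL=C

# Hypothetical sample data, created only for this illustration.
words=$(mktemp)
printf '%s\n' abba cab bead dad echo > "$words"

# Lines containing at least one character other than a, b, c, or d.
has_other=$(grep '[^abcd]' "$words")

# Lines whose FIRST character is not a, b, c, or d.
starts_other=$(grep '^[^abcd]' "$words")

# Lines made up only of a, b, c, and d (no negation needed).
only_abcd=$(grep '^[abcd]*$' "$words")

printf '%s\n' "$has_other" "$starts_other" "$only_abcd"
rm -f "$words"
```

Here "bead" is reported by the first search because of its e, even though it also contains b, a, and d.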


Procedure

The grep(1) utility is extremely useful for searching text with Regular Expressions. We will be calling upon this tool quite often, so let us take a look at it:

1. Using the grep(1) utility on the /etc/passwd file, perform the following searches:
a. Perform the following search:
$ grep 'System' /etc/passwd
b. What does this search do?

As you can see, grep(1) can be used to search for literal text strings, but it can also be used to search based upon a pattern:

2. Do the following search:
a. Perform the following search:
$ grep '^[b-d][aeiou].*' /etc/passwd
b. What does this search do?
c. How is it more powerful than just a literal string?
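To compare a literal search with a pattern search without touching the real /etc/passwd, here is a small sketch that uses a made-up file in the same colon-separated layout. The account names, the file, and the patterns are all assumptions chosen for illustration.

```shell
export LC_ALL=C

# Hypothetical passwd-style data; these accounts are invented.
fake=$(mktemp)
printf '%s\n' \
  'root:x:0:0:root:/root:/bin/bash' \
  'daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin' \
  'alice:x:1000:1000:Alice Example:/home/alice:/bin/bash' \
  'bob:x:1001:1001:Bob Example:/home/bob:/bin/zsh' > "$fake"

# Literal search: lines containing the exact text "Example".
literal=$(grep 'Example' "$fake" | cut -d: -f1)

# Pattern search: lines ending in /bin/ followed by anything, then "sh".
shells=$(grep '/bin/.*sh$' "$fake" | cut -d: -f1)

# Anchored at both ends: login name starts with a, b, or c AND shell is bash.
abc_bash=$(grep '^[a-c].*:/bin/bash$' "$fake" | cut -d: -f1)

printf '%s\n' "$literal" "$shells" "$abc_bash"
rm -f "$fake"
```

The cut command is only there to trim each matching line down to the login name, which makes the results easier to compare.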

And of course, the more practice you have, the better off you are.

3. Perform the following searches (still using /etc/passwd), and indicate what you did:
a. Search for all the lines starting with any of your initials (first or last). Be sure to include the command used and the matching lines.
b. Search for all the lines starting with r, followed by any lowercase vowel, and ending with an h. How did you do it? What were your results?

*Note: Be sure to put quotes around the regular expression arguments to grep. Quoting keeps the shell from treating regexp characters as wildcards. Single quotes are preferable, as they make the shell pass the string to grep literally.
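The effect of quoting can be seen with echo, which simply prints whatever arguments the shell hands to it. This sketch works in a scratch directory with made-up file names (all of them assumptions for illustration):

```shell
# Work in a throwaway directory containing hypothetical files.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch ab abb
printf '%s\n' a ab abc > words.txt

# Unquoted: the shell expands ab* to matching file names before the command runs.
unquoted=$(echo ab*)

# Quoted: the pattern reaches the command untouched.
quoted=$(echo 'ab*')

# With quotes, grep sees the regexp "a followed by zero or more b".
hits=$(grep -c 'ab*' words.txt)

printf '%s\n' "$unquoted" "$quoted" "$hits"

# Clean up.
cd / && rm -rf "$workdir"
```

Had the grep pattern been left unquoted here, grep would have received ab as its pattern and abb as a file to search, which is almost certainly not what was intended.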

The less pager can also be used with Regular Expressions. The on-line manual pages are set up to use the default pager, which should be less in most cases.

To search in less (and therefore the manual pages), use the forward slash /, followed by your search pattern.

Finally, we take a look again at the vi editor, which has some very powerful functionality when dealing with Regular Expressions. The substitute function, an ex command, can be quite useful.

The basic syntax is as follows:

:[address]s[/pattern/replacement/][options][count]
where:

address selects the lines to operate on: % for the entire file, or any other valid addressing scheme used in vi
pattern is the text you are searching for (which can include regular expressions)
replacement is the text that will replace the text found by pattern
options can be any of c (confirm each substitution), g (global: replace every match on a line, not just the first), or p (print the last line changed)
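Because vi is interactive, the sketch below uses sed instead, a different tool whose s command takes the same /pattern/replacement/ form, so substitutions can be tried from the shell. The sample text and variable names are made up for illustration.

```shell
# Hypothetical sample text, created only for this illustration.
page=$(mktemp)
printf '%s\n' 'The colour of the colour wheel' \
              'Config lives in /usr/local/etc' > "$page"

# Without g, only the first match on each line is replaced.
first_only=$(sed 's/colour/color/' "$page")

# With g, every match on the line is replaced.
all=$(sed 's/colour/color/g' "$page")

# A literal / in the pattern or replacement is escaped as \/ so that it
# is not mistaken for the delimiter.
moved=$(sed 's/\/usr\/local/\/opt/' "$page")

printf '%s\n' "$first_only" "$all" "$moved"
rm -f "$page"
```

Inside vi, the second substitution would be typed as :%s/colour/color/g, where the % address applies it to every line of the file.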

In the regex/ subdirectory of the UNIX Public Directory you will find a file called regex.html, which is a copy of lab #0, with some changes. Looking through this file, you will see several HTML tags. Changing every one of these tags could mean a great many edits, so why worry about doing it by hand? Let Regular Expressions help!

4. Do the following (be sure to show the substitution command used):
a. Oops! I made a typo! All the <center> tags are spelled British style as <centre>. Go ahead and correct this for all occurrences in the entire file.
b. The closing center tags are currently </CENTRE>, so go change them to </center>. Be sure to properly handle the /.
c. This file uses the old <b>-style boldness tags. We want to be fairly modern and use <strong> instead. So go ahead and get that all set.
d. Go ahead and make the appropriate changes to all the </b> tags to their corresponding </strong> counterparts.
e. Attach the updated file to an e-mail with your lab submission.

Imagine you had a massive file in need of changes. Would you want to spend hours doing it all by hand? Or would you rather construct a simple RegEx pattern and have the computer do the work for you? THAT is the power of Regular Expressions.

5. Time for some more fun:
a. Change into the /usr/share/dict directory and locate the 'words' file.
b. Do you see it?
c. View this file... how does the file appear to be made up?
d. How many words are in this file?
e. Be sure to show me how you counted.

Using this dictionary, I'd like for you to perform some searches, aided by Regular Expressions you construct. Be sure to show your pattern, as well as provide a count of how many words match your pattern.
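One way to get both the matching words and their count is grep's -c option, which prints the number of matching lines instead of the lines themselves. The sketch below uses a tiny made-up word list rather than the real dictionary, and a pattern that is not one of the lab criteria; the file and variable names are assumptions.

```shell
export LC_ALL=C

# Hypothetical word list, one word per line like the dictionary file.
list=$(mktemp)
printf '%s\n' sing singing string Spring sting song swings > "$list"

# Pattern: starts with s, ends with ing, anything (or nothing) in between.
pattern='^s.*ing$'

# Show the matching words.
matches=$(grep "$pattern" "$list")

# Count the matching lines with -c.
count=$(grep -c "$pattern" "$list")

printf '%s\n' "$matches" "$count"
rm -f "$list"
```

Since the file holds one word per line, anchoring the pattern with ^ and $ makes it describe the entire word rather than just part of it.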

6. Construct a RegEx according to the following criteria:
a. All words exactly 5 characters in length.
b. All words starting with any of your initials (note: not ALL your initials).
c. All words starting with your first initial, having your middle initial occur somewhere after the first, and ending with your last initial.
d. All words that start and end with lowercase vowels.
e. All words that start with any of your initials, immediately followed by any lowercase vowel, and ending with the letters e, s, or t.
f. All words that do not start with any of your initials.
g. All words at least 3 characters in length that do not start with 'th'.
h. All 3 letter words that end in e.
i. All words that contain the string bob but do not end with the letter b.
j. Only the words that start with the string "blue".
k. All the words that contain no vowels.
l. All the words that do not begin with a vowel, that can have anything for the second character, only a, b, c, or d for the third character, and end with a vowel.

It is important to understand the nature of RegEx and the patterns they create. We will be using this knowledge when we wish to perform advanced searches, and in shell scripting. So be sure to ask any questions if you don't understand something.


Conclusions

All questions in this assignment require an action or response. Please organize your answers into an easily readable format and be prepared to submit the final results to your instructor.

Your assignment is expected to be performed and submitted in a clear and organized fashion; messy or unorganized assignments may have points deducted. Be sure to adhere to the submission policy.

When complete, electronically submit your assignment using the "Assignment Submitter", located here:

http://lab46.corning-cc.edu/haas/spring2009/unix/submit/submit.php?lab8

As always, the class mailing list is available for assistance, but not answers.

Last Updated: October 27, 2008