To become familiar with the concepts of text filtering and some of the UNIX utilities that are useful in this process.
Corning Community College UNIX/Linux Fundamentals
In "Harley Hahn's Guide to UNIX and Linux", please read:
Filtering is a big deal in many areas that deal with information processing. Say you've got a database of produce for a grocery store, and you want to view JUST the information regarding the banana shipments. Instead of sorting through the entire database and picking out the data you want manually, why not put all the data through a filter and simply view the pertinent data?
A filter, as defined on http://dictionary.reference.com/ is as follows:
A program or routine that blocks access to data that meet a particular criterion [for example]: a Web filter that screens out vulgar sites.
UNIX provides some utilities that allow you to accomplish impressive amounts of filtering. When coupled with Regular Expressions, you can combine the power of pattern matching with your filter, adding considerable flexibility to your arsenal of tricks.
The next step is to apply shell scripts, which allow you to write "programs" that can take advantage of all the utilities and features available on a system. We will be looking at shell scripts another week. First we must get the foundations in place so we can better appreciate shell scripts.
So, some basics of filtering:
In order to do any sort of filtering, we need to know what we want to filter. Makes sense.
Before we employ filtering, we must have some clear idea about what we would like to filter, and how to safely maintain the data we wish to let through. (Filtering is no good if the data you are after gets damaged in the process).
The UNIX cat(1) utility is a general all-purpose tool that can be used to display the contents of text files. cat(1) also provides a number of other features that can be handy for debugging problems you may encounter with text files. (The -n and -e arguments can be particularly useful.)
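As a quick illustration of those two options, the sketch below feeds cat(1) a small made-up file (not one of the course files). On common implementations such as GNU and BSD cat, -n numbers each output line and -e marks each line ending with a "$", which makes trailing whitespace and stray carriage returns visible.

```shell
# Hypothetical sample text, made up for this illustration.
sample=$(mktemp)
trap 'rm -f "$sample"' EXIT
printf 'first line\nsecond line \n' > "$sample"

# -n prefixes every line with its line number.
cat -n "$sample"

# -e ends every line with "$", exposing the trailing space on line 2.
cat -e "$sample"
```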
Let's play with a sample database. In the filters/ subdirectory of the UNIX Public Directory you will find a file called "sample.db". Let's try some stuff out:
Display the contents:
This is the simplest form of filtering possible-- none at all. All the data in the text file is passed to STDOUT.
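A minimal sketch of that pass-through follows. Because sample.db and its exact layout are not reproduced here, the sketch builds a hypothetical stand-in file; the names, the columns, and the ":" separator are all assumptions made for illustration.

```shell
# Hypothetical stand-in for sample.db; the real file's layout may differ.
db=$(mktemp)
trap 'rm -f "$db"' EXIT
cat > "$db" <<'EOF'
Name:Class:Major:Favorite Candy
Alice:Freshman:Biology:Lollipops
Bob:Sophomore:Sociology:Gummy Bears
Carol:Freshman:Psychology:Lollipops
Dave:Junior:Biology:Chocolate
EOF

# No filtering at all: every line goes straight to STDOUT.
cat "$db"
```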
Even at this stage we can do some useful things with the data. For example, if we wanted to find out how many lines were in the database:
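One way to do that is to pipe the output of cat(1) into wc(1) with the -l option. The sketch below runs against a hypothetical stand-in file, since the contents of the real sample.db are an unknown here.

```shell
# Hypothetical stand-in for sample.db; the real file's layout may differ.
db=$(mktemp)
trap 'rm -f "$db"' EXIT
cat > "$db" <<'EOF'
Name:Class:Major:Favorite Candy
Alice:Freshman:Biology:Lollipops
Bob:Sophomore:Sociology:Gummy Bears
Carol:Freshman:Psychology:Lollipops
Dave:Junior:Biology:Chocolate
EOF

# wc -l counts the lines arriving on STDIN (the header line is counted too).
cat "$db" | wc -l
```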
The database will display to STDOUT in its entirety. You will notice the database is set up as follows:
To be effective in filtering text, we must be aware of the structure of that text. The more we know about how the data is structured, the better we can design a solution to the particular problem.
Ok, so let us filter some of this information:
Find all the students who are in Biology:
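The original command line is not shown here; grep(1) is a likely candidate, since it prints only the lines that match a pattern. A sketch against a hypothetical stand-in database (its names and layout are assumptions):

```shell
# Hypothetical stand-in for sample.db; the real file's layout may differ.
db=$(mktemp)
trap 'rm -f "$db"' EXIT
cat > "$db" <<'EOF'
Name:Class:Major:Favorite Candy
Alice:Freshman:Biology:Lollipops
Bob:Sophomore:Sociology:Gummy Bears
Carol:Freshman:Psychology:Lollipops
Dave:Junior:Biology:Chocolate
EOF

# Only the lines containing "Biology" make it through the filter.
grep Biology "$db"
```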
We can do more complicated searches too:
Find all the students who are in Biology AND like Lollipops:
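An AND condition can be sketched by chaining two filters with a pipe, so that the second grep(1) only sees the lines the first one let through. The stand-in database below is hypothetical, as before.

```shell
# Hypothetical stand-in for sample.db; the real file's layout may differ.
db=$(mktemp)
trap 'rm -f "$db"' EXIT
cat > "$db" <<'EOF'
Name:Class:Major:Favorite Candy
Alice:Freshman:Biology:Lollipops
Bob:Sophomore:Sociology:Gummy Bears
Carol:Freshman:Psychology:Lollipops
Dave:Junior:Biology:Chocolate
EOF

# First keep the Biology lines, then keep only those that mention Lollipops.
grep Biology "$db" | grep Lollipops
```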
Making sense? Let's have some fun:
Find all the students who are Freshmen.
Same as above, but in alphabetical order.
Any duplicate entries? Remove any duplicates.
Using the wc(1) utility, how many matches did you get?
Be sure to give me the command-line incantations you came up with, and any observations you made.
So we've done some simple searches on our database. We've filtered the output to get desired values. But we don't have to stop there. Not only can we filter the text, we can manipulate it to our liking.
The cut(1) utility lets us literally cut columns from the output.
It relies on a thing called a field-separator, which will be used as a logical separator of the data.
Using the "-d" argument to cut, we can specify the field separator in our data. The "-f" option selects which fields to display, counted according to the established field separator.
So, looking at the following text:
hello there:this:is:a:bunch of:text.
Looking at this example, we can see that ":" would make for an excellent field separator.
With ":" as the field separator, the structure of the above text can be represented as follows:
+-------------+------+----+---+----------+-------+
| hello there | this | is | a | bunch of | text. |
+-------------+------+----+---+----------+-------+
|     f1      |  f2  | f3 |f4 |    f5    |  f6   |
+-------------+------+----+---+----------+-------+
We can test these properties out by using cut(1) on the command-line:
Where # is a specific field, a list of fields, or a range of fields (e.g. -f2 or -f2,4 or -f1-3).
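As a sketch of how those three forms behave, the commands below run cut(1) on the sample text from above. The variable name text is just a convenience for this illustration.

```shell
text='hello there:this:is:a:bunch of:text.'

# A single field.
echo "$text" | cut -d":" -f2      # this

# A list of fields.
echo "$text" | cut -d":" -f2,4    # this:a

# A range of fields.
echo "$text" | cut -d":" -f1-3    # hello there:this:is
```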
What would the following command-line display:
If you wanted to get "hello there text." to display to the screen, what would you have to do?
Did your general attempt work? Is there extra information?
If you found that extra information showed up when you tried that last part, a closer look will show why:
Whenever you tell cut(1) to display more than one field, it joins them with the field separator in its output, even when those fields were not next to one another in the original text.
So how do you keep this functionality while still getting the exact data you seek? Well, nobody said we could only apply one filter to text.
The Stream Editor - sed
Remember back when we played with vi? Remember that useful search and replace command-- :%s/regex/replacement/g
That was quite useful. And luckily, we've got that same ability on the command line. Introducing "sed", the stream editor.
sed provides some of the features we've come to enjoy in vi, and is for all intents and purposes a non-interactive editor. Its most useful trait is that it can edit data streams (that is, text arriving on STDIN, including the STDOUT of the other commands in our command lines).
Perhaps the most immediately useful command found in sed will be its search and replace, which is pretty much just like the vi variant:
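The general form is sed -e 's/regex/replacement/g'. The sketch below tries it on a made-up line of text fed in through a pipe; the sentence and the words being swapped are only for illustration.

```shell
# General form: sed -e 's/regex/replacement/g'
# The trailing "g" replaces every match on a line, not just the first.
echo "banana bread and banana pudding" | sed -e 's/banana/apple/g'
```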
However, if you look closely, you will see that we did not include any sort of file to operate on. While we can, one of the other common uses of sed is to pop it in a command-line with everything else, stuck together with the uber-powerful pipe (|).
For example, to solve the above problem with the field separator:
We used sed to replace any occurrence of the ":" with a single space.
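The original command line is not shown here. One pipeline consistent with that description, assuming fields 1 and 6 of the earlier sample text were the ones selected, would look like this:

```shell
# cut keeps fields 1 and 6 (joined by ":"), then sed turns each ":" into a space.
echo "hello there:this:is:a:bunch of:text." | cut -d":" -f1,6 | sed -e 's/:/ /g'
```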
Does the above command-line fix the problem from #2c?
If you wanted to change all "t"'s to uppercase "T"'s in addition to that, what would you do? (i.e. do two search-and-replace operations)
If you wanted to replace all the "."'s in the text with *'s, how would you do it? (Hint: be careful of characters with special meaning; you'll need to escape them.)
From head(1) to tail(1)
Two other utilities you may want to become acquainted with are the head(1) and tail(1) utilities.
head(1) will allow you to print a specified number of lines from 1 to n. So if you needed to print, say, the first 12 lines of a file, head(1) will be a good bet.
For example, to display the first 4 lines of our sample database:
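A sketch using head(1) with the -n option follows. It runs against a hypothetical stand-in for sample.db, whose names and layout are assumptions.

```shell
# Hypothetical stand-in for sample.db; the real file's layout may differ.
db=$(mktemp)
trap 'rm -f "$db"' EXIT
cat > "$db" <<'EOF'
Name:Class:Major:Favorite Candy
Alice:Freshman:Biology:Lollipops
Bob:Sophomore:Sociology:Gummy Bears
Carol:Freshman:Psychology:Lollipops
Dave:Junior:Biology:Chocolate
EOF

# Print lines 1 through 4 and stop.
head -n 4 "$db"
```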
And, of course, head(1) can be added onto an existing command line using the pipe. This example shows the first two results of all the *ology majors:
See where we're going with this? We can use these utilities to put together massively powerful command-line incantations to create all sorts of interesting filters.
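One way this might look, assuming grep(1) does the matching and using a hypothetical stand-in for sample.db:

```shell
# Hypothetical stand-in for sample.db; the real file's layout may differ.
db=$(mktemp)
trap 'rm -f "$db"' EXIT
cat > "$db" <<'EOF'
Name:Class:Major:Favorite Candy
Alice:Freshman:Biology:Lollipops
Bob:Sophomore:Sociology:Gummy Bears
Carol:Freshman:Psychology:Lollipops
Dave:Junior:Biology:Chocolate
EOF

# grep keeps every line containing "ology"; head keeps the first two of those.
grep ology "$db" | head -n 2
```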
tail(1) works from the opposite end, starting at the end of the file and working backwards towards the beginning. So if you wanted to display the last 8 lines of a file, for example, tail(1) is the tool to reach for. tail(1) also has the nifty ability to continually monitor a file and update its output should the source file change (the -f option). This is useful for monitoring log files that are continually updated.
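A short sketch of tail(1) on a generated 20-line file follows. The file and its contents are made up for this illustration.

```shell
# Build a hypothetical 20-line file: "line 1" through "line 20".
log=$(mktemp)
trap 'rm -f "$log"' EXIT
awk 'BEGIN { for (i = 1; i <= 20; i++) print "line " i }' > "$log"

# Show the last 8 lines (line 13 through line 20).
tail -n 8 "$log"

# To keep watching a file as it grows, use: tail -f "$log"
# (not run here, because it waits until interrupted with Ctrl-C).
```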
Translating characters with tr
This is another useful tool to be familiar with. With tr, you can translate any character, or set of characters, into another, one character at a time. The nice thing is that you can quickly use it to do end-of-line character translations, useful in converting text files between DOS, UNIX, and Mac formats.
ASCII file line endings
An important thing to be aware of is how the various systems terminate their lines. Check the following table:
System   Line Ending Character(s)
---------------------------------------------
DOS      Carriage Return, Line Feed (CRLF)
Mac      Carriage Return (CR)
UNIX     Line Feed (LF)
So what does this mean to you? Well, if you have a file that was formatted with Mac-style line endings, and you're trying to read that file on a UNIX system, you may notice that everything appears as a single line at the top of the screen. This is because the Mac uses just a Carriage Return to terminate its lines, and UNIX uses just a Line Feed, so the two are incompatible as far as standard text display is concerned.
For example, let's say we have a UNIX file we wish to convert to DOS format. We would need to convert every terminating Line Feed to a Carriage Return & Line Feed combination (and take note that the Carriage Return needs to come first and then the Line Feed). We would do something that looks like this:
To interpret this:
\n is the special escape sequence that we're all familiar with. In C, you can use it to issue an end-of-line character. So in UNIX, this represents a Line Feed (LF).
\r is the special escape sequence that corresponds to a Carriage Return (CR).
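The first argument is the set of characters to translate. The second is the set to translate them into. Be aware that tr works one character at a time: each character in the first set maps to the single character in the same position of the second set, so tr on its own cannot turn one LF into the two-character CRLF pair. Adding a character to every line takes a tool such as sed.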
Then, using UNIX I/O redirection operations, file.unix is redirected as input to tr, and file.dos is created and will contain the output.
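The sketch below checks this behaviour on a tiny made-up file. It shows what a one-for-one tr translation produces, then one possible way to get true CRLF endings by having sed append a CR to every line. The file names file.cr and the temporary directory are assumptions for this illustration; file.unix and file.dos follow the names used above.

```shell
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT

# A small UNIX-format file: two lines, each ending in LF (8 bytes total).
printf 'one\ntwo\n' > "$work/file.unix"

# tr maps each character to exactly one character, so this swaps LF for CR.
tr '\n' '\r' < "$work/file.unix" > "$work/file.cr"

# To end each line with CR followed by LF, append a CR to every line instead.
cr=$(printf '\r')
sed "s/\$/$cr/" "$work/file.unix" > "$work/file.dos"

# Compare sizes: file.dos is one byte per line larger than file.unix.
wc -c "$work/file.unix" "$work/file.cr" "$work/file.dos"
```

Comparing byte counts with wc -c is a quick sanity check: a translation that only swaps characters leaves the size unchanged, while a real UNIX-to-DOS conversion grows the file by one byte per line.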
In the filters/ subdirectory of the UNIX Public Directory you will find some text files in DOS, Mac, and UNIX format.
Convert file.mac to UNIX format. Show me how you did this, as well as any interesting messages you find inside.
Convert readme.unix to DOS format. Same deal as above.
Convert dos.txt to Mac format.
So, we've gone through lots of fun utilities, and we can hopefully do lots of neat things to text files to write home about. Time to play.
Looking back on our database (sample.db in the filters/ subdirectory of the UNIX Public Directory), let's do some more operations on it:
How many unique students are there in the database? (HINT: sort them in alphabetical order, and make sure there are no duplicates. Also, make sure you don't count the title banner as a "student".)
How many unique majors are there in the database?
Display all the unique "favorite candies":
The usefulness of filters should be apparent: we can control what information we wish to see, eliminating all the undesired information. Why should we do the work of finding the information we need, when we can let the computer do it for us?
Show me the first 22 lines of this file.
Show me the last 4 lines of this file.
Show me lines 32-48 of this file. (HINT: the last 17 lines of the first 48)
Of the last 12 lines in this file, show me the first 4 (lines n-11 through n-8).
Being familiar with the commands and utilities available to you on the system greatly increases your ability to construct effective filters, and ultimately solve problems in a more efficient and creative manner.
All questions in this assignment require an action or response. Please organize your answers into an easily readable format and be prepared to submit the final results to your instructor.
Your assignment is expected to be performed and submitted in a clear and organized fashion; messy or unorganized assignments may have points deducted. Be sure to adhere to the submission policy.
When complete, electronically submit your assignment using the "Assignment Submitter", located here:
As always, the class mailing list is available for assistance, but not answers.