New York Linux Scene Journal: Look Who's Gawk-ing

Look Who's Gawk-ing!!!
By Marcus Conti
A Brief Introduction to the GNU awk Utility

Gawk is the GNU version of awk, an interpreted text-manipulation language. The name awk comes from the initials of the last names of its original authors: Alfred Aho, Peter Weinberger, and Brian Kernighan. The syntax of awk is similar in many ways to C syntax, which is not surprising since Kernighan is also co-author, with Dennis Ritchie, of the book The C Programming Language. Some claim that awk inspired, in part, the development of Perl. People familiar with C, Perl, and/or shell scripting should easily find a way to sink their teeth into awk or its GNU implementation, gawk.

This article deals with gawk, although the concepts and examples should work with other implementations of awk. It assumes that you are using a Free Software operating system (GNU/Linux or one of the BSDs, among others) or some other UNIX-like OS. gawk can be described more fully as a pattern-matching program for processing text files whose data is arranged in fields separated by a delimiter. An example of such a file is a text file that lists people's names, phone numbers, and email addresses separated by tabs. Another example is a file some_file resulting from the shell command ls -l > some_file. You can use gawk to select lines from the file by matching values in the fields against patterns you define. The patterns can be regular expressions, relational expressions, or simple pattern-matching expressions. This ability is similar to what the grep program offers. You can also use gawk to print selected fields from the matched lines. Together, these two capabilities let you use delimited text files as a no-frills database for situations in which a real database is not warranted.

Now let's get into some gawk examples. We will be reading and analyzing data from the file that lists users on your system, which is usually called passwd and located in the /etc directory. Here is a simple example that lists the first field of every line in your /etc/passwd file. Type the following at a command prompt:

gawk -F ":" '{print $1}' /etc/passwd

Try it. This will show you all the users on the system. The result should look something like this:

root
daemon
bin
sys
...

Now let's go through the example step by step. The gawk portion invokes the interpreter, and the -F ":" tells gawk that a colon (":") is the character used between the fields in the file, called a field separator or delimiter. The field separator could also be specified as -F: or as -F :. These versions without quotes are equivalent to the version used in the previous example. Also note that the default field separator is any "white space", i.e. one or more tab or space characters, so if no separator is specified, gawk will split lines on spaces and/or tabs if any are present. The single quotes ( ' ' ) around the script keep the shell from interpreting characters such as $ before gawk sees them; inside them, the action that gawk runs is enclosed in curly braces, as in: '{script...}'. In the example, the script does not do any pattern-matching and therefore every line is processed. The print $1 tells gawk to print the first field of each line. Finally, /etc/passwd is the file which gawk will read as data. Note that if one wanted to print the second and third fields as well, the script would look like this:

gawk -F: '{print $1; print $2; print $3}' /etc/passwd

Notice the semicolons ( ; ) between print commands. These are used to separate commands in gawk. If you tried the last command, you probably noticed that each field is printed on a new line. If you prefer all the fields on one line separated by a space, use this instead:

{print $1, $2, $3}

Or if you prefer them separated by another character, maybe a comma, try this script instead:

{print $1 "," $2 "," $3}

The double quotes tell gawk that you want the characters inside to be literally printed to the output. The characters within quotes are thus called string literals or just literals. You can see how longer scripts can become cumbersome on the command line. This is why gawk allows you to call scripts from files. Open a new file in your favorite editor and enter the following text:

BEGIN { FS=":" }
{print $1, $2, $3}

The BEGIN { FS=":" } line tells gawk that the field separator is a colon. This is equivalent to using -F: on the command line. The rest of the script is contained in a set of curly braces with commands separated by semicolons. Now save the file and execute it using the following command, replacing "your_script" with the name of your script:

gawk -f your_script /etc/passwd
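To try the script-file approach without depending on what your own /etc/passwd contains, here is a minimal sketch. The sample records and the temporary file names are made up for illustration, and the commands call plain awk so that they run under any POSIX awk; substitute gawk if you have it installed.

```shell
# Hypothetical sample data in /etc/passwd format (made up for this sketch).
sample=$(mktemp)
cat > "$sample" <<'EOF'
root:x:0:0:root:/root:/bin/bash
daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin
alice:x:1000:1000:Alice:/home/alice:/bin/sh
EOF

# Put the two-line script in its own file.
script=$(mktemp)
cat > "$script" <<'EOF'
BEGIN { FS = ":" }
{ print $1, $2, $3 }
EOF

# -f reads the program from a file instead of the command line.
spaced=$(awk -f "$script" "$sample")
echo "$spaced"

# The same fields joined by a literal comma, given inline for comparison.
commas=$(awk -F: '{ print $1 "," $2 "," $3 }' "$sample")
echo "$commas"

rm -f "$sample" "$script"
```

The first command prints the three fields separated by spaces, the second separated by commas, one input line per output line.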

Let's build a script that will use some of gawk's pattern-matching capabilities. First, a brief explanation of simple pattern-matching expressions. In pattern matching, ~ (the tilde character) means "matches", as in:

$1 ~ "root"{print}

The above line will print any lines in which the first field or part of the first field contains the string of characters in between the quotation marks, in this case 'root'. Note that if the field was 'treeroot' or 'roots', the line would be printed as well. To specify that the field should match exactly, use the special characters ^ and $ to specify the beginning and end of the field, like this:

$1 ~ "^root$"{print}

There is also a way to specify that you want lines that don't match a certain string of characters: use !~ instead of ~ . Where quotes are used in the above examples, slashes ( / ) can be used to specify a regular expression pattern match, but regular expressions are not covered in any detail here.
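The difference between these forms is easiest to see side by side. The following sketch feeds four made-up names to each pattern; the names and shell variable names are only for illustration, and plain awk is used so the commands run with any POSIX awk as well as gawk.

```shell
# Made-up first fields, one per line, standing in for user names.
names=$(printf '%s\n' root treeroot roots alice)

# ~ "root" matches anywhere in the field.
loose=$(printf '%s\n' "$names" | awk '$1 ~ "root" { print }')

# ^ and $ anchor the match to the whole field.
exact=$(printf '%s\n' "$names" | awk '$1 ~ "^root$" { print }')

# !~ selects the lines that do not match.
others=$(printf '%s\n' "$names" | awk '$1 !~ "root" { print }')

# Slashes express the same exact match as a regular expression.
slashed=$(printf '%s\n' "$names" | awk '$1 ~ /^root$/ { print }')

echo "loose:"   $loose
echo "exact:"   $exact
echo "others:"  $others
echo "slashed:" $slashed
```

The unanchored pattern picks up root, treeroot, and roots; both anchored forms pick up only root; and !~ leaves alice.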

So, let's build a script that will print the user names of all users on your system that use a particular shell. In the /etc/passwd file there is a field that specifies the user's default shell. In most cases it is the seventh field, but you may have to look at the file to find out which field it is on your system. For most users it will look something like /bin/sh or /bin/csh . Let's see which users have bash (Bourne Again Shell) as the default shell. Try the following script:

# Script to see who uses bash
BEGIN{ FS=":" }
$7 ~ "/bash$"{print $1}

The first line is a comment and ignored by gawk. Comments start with the pound sign ( # ), and everything on the rest of the line is disregarded. Another thing to note is that we used the dollar sign ( $ ) in the search pattern again so that we are certain that the pattern matches against the end of the field. Otherwise, if a system kept shells in a directory that began with 'bash', it would be printed out, though that is not our intention. This way, it does not matter in which directory shell programs are located; we are only concerned with what the shell is called.
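To see what the trailing dollar sign buys you, the sketch below runs the pattern with and without it against a few invented records. The user carol and her /bashlike directory are hypothetical, chosen only to trigger the false match; plain awk is used so the commands also work where gawk is not installed.

```shell
# Hypothetical passwd-style records; carol's shell lives in a directory
# whose name merely begins with "bash".
sample=$(mktemp)
cat > "$sample" <<'EOF'
root:x:0:0:root:/root:/bin/bash
alice:x:1000:1000:Alice:/home/alice:/bin/sh
bob:x:1001:1001:Bob:/home/bob:/usr/local/bin/bash
carol:x:1002:1002:Carol:/home/carol:/bashlike/zsh
EOF

# With $, the field has to end in /bash.
anchored=$(awk -F: '$7 ~ "/bash$" { print $1 }' "$sample")

# Without $, /bash may appear anywhere in the field.
unanchored=$(awk -F: '$7 ~ "/bash" { print $1 }' "$sample")

echo "anchored:"   $anchored
echo "unanchored:" $unanchored

rm -f "$sample"
```

The anchored pattern reports root and bob regardless of which directory holds bash, while the unanchored one also reports carol, who actually uses zsh.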

On many systems where bash is installed, the shell name 'sh' is just another name for bash (on others, sh points to a different shell, so check your own system before trusting the numbers). Users listed with sh are not counted by the preceding script, so let's rewrite the script to account for both names. And while we are at it, let's provide a count of the users that use bash, and counts of how many specify each of the names. For this task we will need to use flow control statements ( if, else, for, etc.), which are very similar to those in Perl and C. We will also need to group statements together using curly braces ( { } ), just as in Perl and C. Here is the script, with brief comments included:

#(1) use BEGIN to set the field separator and set
#    the total number of bash users to zero
BEGIN{
    FS=":";        #(2) semicolon separates statements
    TOTAL = 0      # return or linebreak separates statements also
    NUM_BASH = 0   # either works, it is up to you
}

#(3) start the main processing loop that is called for each line of input
#(4) first check if the user's shell is some form of bash
($7 ~ "/bash$") || ($7 ~ "/sh$"){
    #(5) if it is bash, add 1 to the total
    TOTAL = TOTAL + 1
    #(6) the above line could be written as TOTAL++ (like C and Perl)

    #(7) see if the default shell is named bash
    if ($7 ~ "/bash$"){
        #(8) add 1 to the number who call it bash
        NUM_BASH++
    }
    #(9) see if the shell is named sh
    else if ($7 ~ "/sh$"){
        #(10) add 1 to the number who know it by sh
        NUM_SH++
    }

    #(11) print the user name since they use bash (by some name)
    print $1
}

#(12) use END to print final results at the end of processing
END{
    #(13) print a few *'s to end the user list
    print "********"
    #(14) print the total and explain what it means
    print TOTAL " users have bash as the default shell."
    print NUM_SH " call it \"sh\". " NUM_BASH " call it \"bash\"."
}

Run it and note the output. Now for the explanation. The first part is similar to the previous example, but then we see the first variable assignment. Near comment (2), TOTAL is set to zero. In gawk, variables are not "typed" as they are in many other languages, so you simply write the name of the variable, an equal sign ( = ), and the value it will hold. Note that gawk does not require variable declaration; you can just start using a variable and gawk will give it a default value.

Next the main loop begins, looking at each line in turn. The line after comment (4) checks whether the seventh field ends in '/bash' or '/sh'. See how the first half of the line looks like the pattern-matching in the previous example. Notice that the two checks are each grouped in parentheses and separated by two pipes, || . Just like in C and Perl, || means 'or' in gawk. So if a line matches either of the two expressions, the commands between the curly braces ({}) will be executed.

The first statement adds 1 to the total. Next we check whether the line matched 'bash' or 'sh' and add 1 to the appropriate tally. Notice that the variable NUM_SH is never declared; it is first used when it is incremented (has 1 added to it). gawk treats it as zero automatically; if it were being used as a string, it would start out as an empty string. So as long as you are satisfied with the auto-initialized values, you can just start using variables without declaration. One caveat: if no line ever matched 'sh', NUM_SH would still be unset when the END section prints it, and it would appear as an empty string rather than 0. Initializing it in BEGIN, as is done for NUM_BASH, avoids that.
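The auto-initialization rules can be checked in isolation. The sketch below uses three throwaway variable names (count, label, and never_set, all invented for this illustration) inside a BEGIN section, so no input file is needed; it calls plain awk but should behave the same under gawk.

```shell
result=$(awk 'BEGIN {
    count++                    # never assigned before: starts at 0, now 1
    label = label "abc"        # never assigned before: starts empty, now "abc"
    print count, label
    print "[" never_set "]"    # used as a string, an unset variable is empty
    print never_set + 0        # adding 0 forces the numeric value
}')
echo "$result"
```

The second line of output shows why a counter that is printed but may never have been incremented is better set to 0 up front, or printed with + 0.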

After all the right counters are incremented, the user name (field 1) is printed for each qualifying line. Finally, the END section is used to report the totals. The END section is useful for this job because it is executed once, after all lines in the data file have been inspected and processed appropriately.
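The order in which the three kinds of sections run can be seen in a stripped-down sketch. The four shell paths are invented input, hits is a hypothetical counter name, and NR is a built-in awk variable not covered elsewhere in this article that holds the number of lines read so far. Plain awk is used here; gawk should give the same result.

```shell
summary=$(printf '%s\n' /bin/bash /bin/sh /usr/bin/zsh /bin/bash |
  awk '
    BEGIN { print "start" }                          # once, before any input
    ($1 ~ "/bash$") || ($1 ~ "/sh$") { hits++ }      # for each matching line
    END   { print hits " of " NR " lines matched" }  # once, after the last line
  ')
echo "$summary"
```

The zsh line matches neither half of the || test, so it is read and counted by NR but never reaches the braces that increment hits.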

Note that gawk takes input from standard input (STDIN) if no data file is specified. So it can be fed on the command line using pipes and input/output redirection (e.g. | , < , and > ), making it a powerful tool for the command line. gawk is especially useful when combined with other tools, in particular sed.
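As a rough illustration of reading from a pipe, the sketch below stands in for the output of ls -l with two made-up lines, so the result does not depend on your own directory. No -F option is given, so the fields are split on runs of spaces. Plain awk is used; gawk can be swapped in directly.

```shell
# Two invented lines in the shape of "ls -l" output.
listing='-rw-r--r-- 1 alice staff  120 Jan  1 10:00 notes.txt
-rw-r--r-- 1 bob   staff 2048 Jan  2 11:30 report.txt'

# No file argument, so awk reads standard input from the pipe.
sizes=$(printf '%s\n' "$listing" | awk '{ print $9, $5 }')
echo "$sizes"

# sed rewrites the text first, then awk picks fields from the result.
owners=$(printf '%s\n' "$listing" | sed 's/staff/users/' | awk '{ print $3 ":" $4 }')
echo "$owners"
```

In real use you would replace the printf with the command whose output you want to slice, for example ls -l itself.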

Hopefully, you now have enough information to make use of gawk or even add it to your *nix utility belt. Happy gawk-ing!