Quick guide to awk

From CUC3

Introduction

awk is useful for combining regular expressions with editing and other functions so that you can easily modify text files with conditional flow control.

The general syntax for invoking awk is:

awk -f <script-file> <target-file>   - where script-file contains awk instructions
awk 'pattern {commands}' <target-file>    - for command line scripting
e.g. awk '/alias/' ~/.bashrc   - prints lines containing alias in your .bashrc file
e.g. awk '/export PATH/ { print $2 }' ~/.bashrc   - prints the second field (e.g. PATH=...) of the lines containing export PATH in your .bashrc file

Taking the command-line example, pattern is an optional pattern (typically a regular expression or some kind of conditional) and commands is an optional list of awk commands to be executed on each line that satisfies the pattern. If no pattern is provided, the commands act upon every line in the target file. If no commands are provided, awk prints every line that satisfies the pattern (like grep).
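The three combinations described above can be sketched as follows. The sample input is made up for illustration (it is not taken from a real .bashrc), and the results are stored in shell variables only so that they are easy to inspect.

```shell
# Hypothetical sample input; any text file would do
input='alias ll="ls -l"
export EDITOR=vim
alias gs="git status"'

# Pattern only: print the matching lines (like grep)
pattern_only=$(printf '%s\n' "$input" | awk '/alias/')

# Commands only: run the commands on every line (here, print the first field)
action_only=$(printf '%s\n' "$input" | awk '{ print $1 }')

# Pattern and commands: print the second field of matching lines only
both=$(printf '%s\n' "$input" | awk '/export/ { print $2 }')

printf '%s\n' "$pattern_only" "$action_only" "$both"
```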

awk splits its input into records, delimited by a record separator (newline by default), and splits each record into fields, delimited by a field separator (any run of spaces or tabs by default). Thus by default it separates input into lines and separates lines into words. The field separator can be reassigned with the -F command-line option, and both separators can be reassigned by altering the RS or FS built-in variables (see info awk for more details).
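As a rough illustration of reassigning the separators, the sketch below uses the -F option and the -v option (which assigns a variable before the program starts). The colon-separated line is a made-up example in the style of /etc/passwd.

```shell
# Made-up colon-separated line in the style of /etc/passwd
line='alice:x:1001:1001:Alice Example:/home/alice:/bin/bash'

# -F sets the field separator on the command line
with_flag=$(printf '%s\n' "$line" | awk -F: '{ print $1 }')

# Assigning FS with -v has the same effect
with_fs=$(printf '%s\n' "$line" | awk -v FS=: '{ print $6 }')

# RS changes what counts as a record; here each comma-separated item is one record
records=$(printf 'a,b,c' | awk -v RS=, '{ print NR ": " $0 }')

printf '%s\n' "$with_flag" "$with_fs" "$records"
```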

awk stores the fields of the current record in the variables $<n>, where n is the field number (starting from 1), so $1 is the first field, $2 is the second, etc. The whole record is available as $0.
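A small sketch of field access on a single line. $0 (the whole record) and $NF (the last field, using the NF variable listed under the built-in variables) are standard awk; the input line is only sample data.

```shell
# Sample line with five whitespace-separated fields
line='ATOM   49   CE1   PHE    4'

first=$(printf '%s\n' "$line" | awk '{ print $1 }')    # first field
third=$(printf '%s\n' "$line" | awk '{ print $3 }')    # third field
last=$(printf '%s\n' "$line" | awk '{ print $NF }')    # last field, whatever its number
whole=$(printf '%s\n' "$line" | awk '{ print $0 }')    # the unmodified record

printf '%s\n' "$first" "$third" "$last" "$whole"
```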

e.g. Given a PDB file with lines such as:

...
ATOM   49   CE1   PHE    4     ...
ATOM   50   HE1   PHE    4     ...
ATOM   51   CZ    PHE    4     ...
ATOM   52   HZ    PHE    4     ...
...

If you want the atom numbers of atoms of a certain type, you would search conditionally on the third field, $3, and print the second field, $2. The type must be quoted, otherwise awk treats HE1 as the name of an (empty) variable and nothing matches:

awk '$3 == "HE1" {print $2}' <protein>.pdb

Conditions can also be combined with && (and) and || (or). For example, for HE1 ($3) in residue number 4 ($5), we would use:

awk '$3 == "HE1" && $5 == 4 {print $2}' <protein>.pdb
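To try these conditions without a real PDB file, the sketch below writes the sample lines to a temporary file. The columns after the residue number are left out, and the last line (an HE1 atom in residue 9) is invented so that the two commands give different results. The string HE1 is quoted so that awk compares against the literal text rather than a variable.

```shell
# Temporary stand-in for <protein>.pdb
pdb=$(mktemp)
cat > "$pdb" <<'EOF'
ATOM   49   CE1   PHE    4
ATOM   50   HE1   PHE    4
ATOM   51   CZ    PHE    4
ATOM   52   HZ    PHE    4
ATOM   70   HE1   PHE    9
EOF

# Atom numbers of every HE1 atom
by_type=$(awk '$3 == "HE1" {print $2}' "$pdb")

# Atom numbers of HE1 atoms in residue 4 only
by_both=$(awk '$3 == "HE1" && $5 == 4 {print $2}' "$pdb")

printf '%s\n' "$by_type" "$by_both"
rm -f "$pdb"
```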

There are many more commands in awk, all of which are listed in info awk, along with useful built-in variables (like the field separator).

Useful built-in variables

NF       - number of fields in the current input record
NR       - current record number (all files)
FNR      - current record number (in the current file)
FILENAME - current file name
RS       - input record separator (default newline)
FS       - input field separator (default a single space, which splits on any run of spaces or tabs)
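The difference between NR and FNR only shows up with more than one input file. A minimal sketch, using two throwaway temporary files with made-up contents:

```shell
# Two small temporary input files
f1=$(mktemp)
f2=$(mktemp)
printf 'one two\nthree\n' > "$f1"
printf 'four five six\n' > "$f2"

# NR keeps counting across files, FNR restarts in each file, NF is per record
counts=$(awk '{ print NR, FNR, NF }' "$f1" "$f2")

# FILENAME is the file currently being read; print it once per file
names=$(awk 'FNR == 1 { print FILENAME }' "$f1" "$f2")

printf '%s\n' "$counts" "$names"
rm -f "$f1" "$f2"
```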

Useful functions

gsub(r, s, [t])    - replaces every match of pattern r in string t with s and returns the number of substitutions (if t is not specified, the whole of the current record is used)
sub(r, s, [t])     - same as gsub, but replaces only the first match
length(s)          - returns the length of string s
index(s, t)        - returns the position of string t in string s (starting at 1), or 0 if string t is not found
print expr, ...    - prints the expressions specified, separated by the OFS (output field separator, a space by default); strictly a statement rather than a function
tolower(s)         - converts string s to lowercase
toupper(s)         - converts string s to uppercase
strtonum(s)        - converts string s to a number (assumes octal if it begins with 0, assumes hexadecimal if it begins with 0x or 0X); a gawk extension, so not available in every awk
substr(s, i, [n])  - returns the substring of string s beginning at position i which is at most n characters long (if n isn't specified, continues until the end of s)

There are also numerous mathematical and I/O functions which are documented in info awk.
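A short sketch exercising several of the string functions listed above on one made-up input line. strtonum is left out because it is specific to gawk.

```shell
out=$(printf 'Hello World\n' | awk '{
    print length($0)           # number of characters in the record
    print index($0, "World")   # position of the first match, counted from 1
    print substr($0, 1, 5)     # five characters starting at position 1
    print toupper($0)
    print tolower($0)
    n = gsub(/o/, "0")         # no third argument, so the record itself is changed
    print n, $0                # substitution count, then the changed record
    sub(/l/, "L")              # replaces the first match only
    print $0
}')

printf '%s\n' "$out"
```

Because gsub and sub modify their target in place, the last two print statements show the record after each substitution.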

BEGIN and END

There are two special patterns, BEGIN and END, which aren't compared against the input. The commands for BEGIN are executed once before any input is read, and the commands for END are executed once after all of the input has been processed. Simply use them in place of the pattern and their commands will run at the appropriate time.
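A typical use is to set something up in BEGIN, accumulate over the input, and report in END. The sketch below sums a column of made-up numbers; the variable name sum is arbitrary.

```shell
total=$(printf '3\n4\n5\n' | awk '
    BEGIN { sum = 0; print "summing" }                    # runs once, before any input is read
    { sum += $1 }                                         # runs for every record
    END { print "total:", sum, "from", NR, "records" }    # runs once, after the last record
')

printf '%s\n' "$total"
```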

info awk

awk has plenty more features than are covered in this brief introduction. All of them are described in the info file (run info awk), and many can also be found by Googling.