awk

AWK is a programming language for text processing, commonly used for data extraction and reporting. The input is split into records (by default, one line is one record), and each record is split into fields (by default, fields are separated by whitespace). Using AWK to parse HTML is generally a bad idea; using it to generate HTML, however, is common.

(mawk, an alternative AWK implementation, may be faster than the system's default awk.)

Basic Usage

awk '{print $3}' dukeofyork.txt # prints the third field of each line
# $0 is the whole line. $1 is the first field, and so on.
awk '{print $2, $1}' names.txt # comma: inserts the output field separator (OFS, a space by default) between fields
awk '{print $2 ", " $1}' names.txt # insert a comma between fields by concatenation
awk '{print NF, $0}' dukeofyork.txt # NF: number of fields
awk '/up/{print NF, $0}' dukeofyork.txt # /up/ is a regex
awk 'NF==6{print}' dukeofyork.txt # print with no arguments is the same as print $0: it prints the whole line
awk -f swap.awk names.txt # read the awk program from swap.awk; inside the file, the program is not wrapped in single quotes
awk -v hi=HELLO '{print $1, hi}' names.txt # -v: set an awk variable before the program runs

General Structure

awk 'pattern{actions}' input_filename.txt

Perform the actions on the lines that match the pattern. If the actions are omitted, the default is to print the whole line; if the pattern is omitted, the actions run on every line. Multiple actions are separated by ; or by newlines.
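
As a small illustration of the pattern/action structure, the sketch below runs three programs over the same inline input: one with only a pattern, one with only an action, and one with a pattern and two actions. The sample lines and the helper name sample are made up for this example.

```shell
# Made-up sample input: one record per line.
sample() { printf 'up the hill\ndown again\nhalf way up\n'; }

# Pattern only: the default action prints every matching line.
sample | awk '/up/'

# Action only: the default pattern matches every line.
sample | awk '{ print $1 }'

# Pattern plus two actions separated by ";".
sample | awk '/up/ { n++; print n ": " $0 }'
```

The first program prints the two lines that contain "up", the second prints the first word of every line, and the third numbers the matching lines.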

Input

Use AWK in shell script
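
A minimal sketch of calling awk from a shell script, assuming the goal is to pass a shell value into the awk program. The function name second_field_for and the input data are hypothetical.

```shell
# Hypothetical helper: print the second field of the lines whose first
# field equals the name given as the first shell argument.
second_field_for() {
  # -v copies the shell value into the awk variable "user", so the awk
  # program itself can stay inside single quotes.
  awk -v user="$1" '$1 == user { print $2 }'
}

# Made-up input, fed through a pipe.
printf 'alice 10\nbob 20\nalice 30\n' | second_field_for alice
```

Passing the value with -v avoids splicing shell variables into the program text, which tends to cause quoting problems.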

Output

Records and Fields

Field Separator

Specify field separator:

Method 1: -F
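
A short sketch of -F, assuming colon-separated input in the style of /etc/passwd. The data and the function name name_and_id are made up.

```shell
# Hypothetical helper: print the first and third colon-separated fields.
name_and_id() {
  awk -F: '{ print $1, $3 }'
}

printf 'root:x:0\ndaemon:x:1\n' | name_and_id
```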

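Method 2: assign the FS variable. Note that awk splits a record into fields before it runs the actions, so an FS assigned inside a regular action only takes effect from the next line. To have the new FS apply to the first line, assign it in a BEGIN pattern, which runs before any input is read. Similarly, the END pattern runs after all lines of the input files have been read.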
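
The difference between assigning FS in a regular action and in BEGIN can be sketched as follows. The comma-separated input and the helper names csv and second_fields are made up for the illustration.

```shell
# Made-up input with comma-separated fields.
csv() { printf 'a,b\nc,d\n'; }

# FS assigned inside a regular action: the first line has already been
# split on whitespace by the time the action runs.
csv | awk '{ FS = ","; print "[" $2 "]" }'

# FS assigned in BEGIN: every line, including the first, is split on ",".
# END runs once after the last line.
second_fields() {
  awk 'BEGIN { FS = "," } { print "[" $2 "]" } END { print NR " lines" }'
}
csv | second_fields
```

In the first program the second field of the first line comes out empty, because that line was split before FS changed.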

Record Separator (RS)
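
A minimal sketch of changing RS, assuming records separated by ";" instead of newlines. The input and the function name first_of_each are made up.

```shell
# Hypothetical helper: treat ";" as the record separator and print the
# record number followed by the first field of each record.
first_of_each() {
  awk 'BEGIN { RS = ";" } { print NR ": " $1 }'
}

printf 'a 1;b 2;c 3' | first_of_each
```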

OFS and ORS

  • OFS: output field separator. Default is a single space.

  • ORS: output record separator. Default is a newline.
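
A small sketch of OFS and ORS in action, using made-up input and a hypothetical function name join_records. The assignment $1 = $1 is a common idiom that makes awk rebuild $0 with the current OFS.

```shell
# Hypothetical helper: join fields with "-" and records with "|".
join_records() {
  awk 'BEGIN { OFS = "-"; ORS = "|" }
       { $1 = $1; print }       # rebuild $0 with OFS, then print it with ORS
       END { printf "\n" }'     # finish the output with a real newline
}

printf 'a b\nc d\n' | join_records
```

This prints a-b|c-d| on a single line.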

Variables and Operators

  • NF: number of fields on the current line.

  • NR: current line number. Note that if there is more than one input file, NR keeps counting across files, so it includes the lines read from previous files.

  • FILENAME: current filename.

  • FNR: line number within the current file (it restarts at 1 for each input file).

  • $0: the whole line.

  • $n: the nth field. E.g. $1 is the first field. n can be any expression with a numeric value; it does not have to be a number literal. For example:

    • $NF: the last field

    • $(NF-1): the penultimate field.

    • $NF-1: parsed as ($NF)-1, the last field minus one. If the last field is not a number, $NF evaluates to 0, so -1 is printed.

    • $($1): if $1 is a number (e.g. 3), this is equivalent to $3.
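
The variables above can be tried out with a sketch like the following. The input lines, the file names one.txt and two.txt, and the helper names sample and count_demo are all made up; mktemp -d is assumed to be available.

```shell
# Made-up input; the second line starts with a number on purpose.
sample() { printf 'alpha beta 7\n3 x y z\n'; }

# Line number, field count, last field, penultimate field, last field minus 1.
sample | awk '{ print NR, NF, $NF, $(NF-1), $NF-1 }'

# On line 2, $1 is 3, so $($1) is the same as $3.
sample | awk 'NR == 2 { print $($1) }'

# NR keeps counting across files, FNR restarts for each file.
count_demo() {
  dir=$(mktemp -d)
  printf 'a\nb\n' > "$dir/one.txt"
  printf 'c\n' > "$dir/two.txt"
  (cd "$dir" && awk '{ print FILENAME, NR, FNR }' one.txt two.txt)
  rm -r "$dir"
}
count_demo
```

The first program prints "1 3 7 beta 6" and "2 4 z y -1", the second prints "y", and the last line printed by count_demo is "two.txt 3 1".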

Assignment

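Assigning to $n changes $0 (the line is rebuilt with OFS between the fields) and, if n is greater than NF, also NF. Assigning to $0 re-splits the line, which changes $n and NF. The changes are in memory only; AWK never modifies the input file.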
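
A short sketch of the assignment rules, with made-up input and hypothetical function names (set_second, set_fifth, reset_record).

```shell
# Assigning to a field rebuilds $0; NF stays at 3 here.
set_second() { awk '{ $2 = "X"; print; print NF }'; }

# Assigning past the last field creates empty fields in between and raises NF.
set_fifth() { awk '{ $5 = "E"; print; print NF }'; }

# Assigning to $0 re-splits the record into fields.
reset_record() { awk '{ $0 = "one two"; print $2; print NF }'; }

printf 'a b c\n' | set_second
printf 'a b c\n' | set_fifth
printf 'a b c\n' | reset_record
```

In the set_fifth output, the empty fourth field shows up as two consecutive separators before E.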

User Defined Variables

Names are case sensitive. Variables are treated as numbers or strings depending on context. All variables are global, except the parameters of user-defined functions, which are local to the function.

  • Convert a number to a string: concatenate it with ""

  • Convert a string to a number: add 0 (e.g. s+0)

Operators

  • 2 ^ 3 is 8

  • a++ and ++a are available

  • Strings can be compared (e.g. <=) as well.

  • Concatenation: there is no operator; write the operands next to each other (e.g. a b)
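
The operators and conversions listed above can be checked with a BEGIN-only program, which needs no input. This is just an illustrative sketch; the function name operators_demo is hypothetical.

```shell
operators_demo() {
  awk 'BEGIN {
    print 2 ^ 3              # exponentiation: 8
    n = 5
    print n++                # post-increment: prints 5, n becomes 6
    print ++n                # pre-increment: n becomes 7, prints 7
    print ("abc" < "abd")    # string comparison: 1 means true
    print "10" + 0 + 1       # "10" is converted to a number: 11
    print 3 "" 4             # concatenation by adjacency: 34
  }'
}
operators_demo
```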

Regex
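
Besides a bare /regex/ pattern, which is tested against the whole line, awk has the ~ and !~ operators for testing a single field or variable. A minimal sketch with made-up input and a hypothetical helper name sample:

```shell
# Made-up input: a name and a second field that may or may not be numeric.
sample() { printf 'cat 12\ndog x9\ncow 7\n'; }

# ~ : keep the lines whose second field consists of digits only.
sample | awk '$2 ~ /^[0-9]+$/ { print $1 }'

# !~ : keep the lines whose first field does not start with "c".
sample | awk '$1 !~ /^c/ { print $1 }'
```

The first program prints cat and cow; the second prints dog.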

Control Structure

if/else

0 and the empty string are false. All other values are true.
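
A minimal if/else sketch, assuming one number per input line. The function name classify is hypothetical.

```shell
# Hypothetical helper: label each number as positive, negative or zero.
classify() {
  awk '{
    if ($1 > 0) {
      print $1, "positive"
    } else if ($1 < 0) {
      print $1, "negative"
    } else {
      print $1, "zero"
    }
  }'
}

printf '5\n-2\n0\n' | classify
```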

Arrays and For Loop

Arrays have only one dimension, e.g. a[1]=$1. Use a C-style for loop to iterate over numeric indexes.
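
As an illustration, the sketch below copies the fields of each line into an array and walks it backwards with a C-style loop. The function name reverse_fields and the input are made up.

```shell
# Hypothetical helper: print the fields of each line in reverse order.
reverse_fields() {
  awk '{
    for (i = 1; i <= NF; i++) a[i] = $i     # copy the fields into an array
    line = a[NF]
    for (i = NF - 1; i >= 1; i--) line = line " " a[i]
    print line
  }'
}

printf 'one two three\n' | reverse_fields
```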

Associative arrays

Keys can be any string. Use a for-in loop to iterate; the iteration order is unspecified.

We can use associative arrays to simulate multi-dimensional arrays, e.g. a["1:2"] = 5.
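
A small sketch of both ideas: counting occurrences with an associative array, and building a "row:col" key to simulate two dimensions. The function names count_words and grid_demo are hypothetical, and the output is piped through sort because for-in does not guarantee an order.

```shell
# Hypothetical helper: count how often each first field occurs.
count_words() {
  awk '{ count[$1]++ } END { for (w in count) print w, count[w] }' | sort
}
printf 'b\na\nb\n' | count_words

# Simulated two-dimensional array: the key is built from row and column.
grid_demo() {
  awk 'BEGIN { a["1:2"] = 5; row = 1; col = 2; print a[row ":" col] }'
}
grid_demo
```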

Printf()

Unlike print, printf does not use OFS or ORS; \t, \n and any other separators must be written explicitly in the format string.

  • %30s, %30d: minimum width is 30, right justified.

  • %-30s: left justified.

  • %6.2f: at least 6 characters in total: 2 after the decimal point, 1 for the decimal point itself, and 3 before it (padded with spaces on the left).

  • %06.2f: the same, but padded on the left with 0.
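
The format specifiers above can be compared side by side with a sketch like this one. A width of 5 is used instead of 30 to keep the output short, the square brackets only make the padding visible, and the function name printf_demo is hypothetical.

```shell
printf_demo() {
  awk 'BEGIN {
    printf "[%5s]\n", "ab"         # right justified in 5 columns
    printf "[%-5s]\n", "ab"        # left justified in 5 columns
    printf "[%5d]\n", 42           # integers are padded the same way
    printf "[%6.2f]\n", 3.14159    # 6 columns, 2 digits after the point
    printf "[%06.2f]\n", 3.14159   # same width, padded with zeros
  }'
}
printf_demo
```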

Strings

AWK counts characters in strings from 1. Common methods:

  • length([string]): if no string is provided, $0 is used.

  • index(string, target): returns the position of target in string, or 0 if not found.

  • match(string, regex): similar to index, but also sets RSTART to the start position and RLENGTH to the length of the match (note that the regex is greedy: the longest match at that position is used)

  • substr(string, start [, length]): returns the substring of at most length characters beginning at start; if length is omitted, it runs to the end of the string.

  • sub(regex, newval [, string]): $0 is used if no string is provided. Replaces the first match with newval.

  • gsub(regex, newval [, string]): Replaces all matches with newval.

  • split(string, array [, regex]): splits string into pieces, and stores them in array using regex as the separator. Returns the number of pieces it found. FS is used if regex is omitted.
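
A quick tour of these functions on a made-up string, run in a BEGIN block so that no input is needed. The function name strings_demo is hypothetical.

```shell
strings_demo() {
  awk 'BEGIN {
    s = "hello world"
    print length(s)                  # 11
    print index(s, "o")              # 5: positions start at 1
    match(s, /l+/)
    print RSTART, RLENGTH            # 3 2: the longest run of "l" is used
    print substr(s, 7)               # world
    print substr(s, 1, 5)            # hello
    t = s; sub(/o/, "0", t); print t             # first match only
    t = s; n = gsub(/o/, "0", t); print n, t     # all matches, n is the count
    n = split("a:b:c", parts, ":")
    print n, parts[1], parts[3]      # 3 a c
  }'
}
strings_demo
```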

Misc

Common Math Functions

  • int(x): int(3.6) == 3; int(-3.6) == -3

  • rand(): a random number in [0,1). To simulate a 6-sided die: int(rand()*6) +1

  • srand([x]): x is the optional seed; the default is the current date and time. Without a call to srand(), rand() typically starts from the same seed on every run of awk, so each run produces the same sequence. To get different numbers from the first call, call srand() first.

  • sqrt(x)

  • sin(x)

  • cos(x)

  • log(x): natural log of x

  • exp(x): e to the x
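
A brief sketch of the math functions, including the die roll described above. The function names math_demo and roll_die are hypothetical, and the seed 42 is an arbitrary choice made only so that a run can be repeated.

```shell
math_demo() {
  awk 'BEGIN {
    print int(3.6), int(-3.6)        # truncation toward zero: 3 -3
    print sqrt(16), exp(0), log(1)   # 4 1 0
  }'
}
math_demo

# Hypothetical helper: roll a 6-sided die with the seed given as argument.
roll_die() {
  awk -v seed="$1" 'BEGIN { srand(seed); print int(rand() * 6) + 1 }'
}
roll_die 42
```

For rolls that differ from run to run, call srand() with no argument instead of passing a fixed seed.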

AWK and Excel

AWK can parse a CSV file exported from Excel by using a comma as the field separator (awk -F,).

If a field contains a comma, there is no easy way for AWK to handle it. In that case, go back to Excel, export the file as TSV, and use awk -F'\t'. Because it is hard to embed a tab in an Excel field, AWK can then split the file reliably.
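
A minimal sketch of the TSV approach, assuming a tab-separated export whose first column contains commas. The data and the helper names tsv and swap_columns are made up.

```shell
# Made-up tab-separated export; the first field contains a comma.
tsv() { printf 'Doe, Jane\t42\nRoe, Rick\t17\n'; }

# Hypothetical helper: split on tabs only and print the columns swapped.
swap_columns() {
  awk -F'\t' '{ print $2 ": " $1 }'
}

tsv | swap_columns
```

Because the separator is a tab, the comma inside the name stays part of the first field.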
