awk
AWK is a programming language for text processing, commonly used for data extraction and reporting. The input is split into records (by default, one line is one record), and each record is split into fields (by default, the field separator is whitespace). Using AWK to parse HTML is generally a bad idea. However, it is common to use AWK to generate HTML.
(mawk, another AWK implementation, might be faster than your system's default awk)
Basic Usages
General Structure
awk 'pattern{actions}' input_filename.txt
Performs the actions on the lines that match the pattern. Default action: print the whole line. Default pattern: all lines. Multiple actions are separated by ;.
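A minimal sketch of the pattern/action structure. The three-line sample data is hypothetical and is piped in with printf instead of being read from a file:

```shell
# Hypothetical sample data: a name and a score, separated by a space.
data='alice 90
bob 72
carol 85'

# Pattern only: the default action prints each line whose second field exceeds 80.
high=$(printf '%s\n' "$data" | awk '$2 > 80')

# Action only: the default pattern matches every line; two statements separated by ;
names=$(printf '%s\n' "$data" | awk '{ n = $1; print n }')

echo "$high"
echo "$names"
```

The first command keeps the alice and carol lines; the second prints only the names.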
Input
Use AWK in shell script
Output
Records and Fields
Field Separator
Specify field separator:
Method 1: -F
Method 2: assign the FS variable. Note that awk first splits a record into fields, then runs the actions. To make a new FS apply to the first line, assign it in the BEGIN pattern (which runs before any lines are read). Similarly, there is an END pattern (which runs after all lines of the input files).
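Both methods can be sketched as follows, assuming hypothetical colon-separated input. The third command shows why BEGIN matters: on a POSIX-conforming awk, an FS assigned in a main action only takes effect from the next record.

```shell
# Hypothetical colon-separated records.
data='root:x:0
daemon:x:1'

# Method 1: set the separator on the command line with -F.
m1=$(printf '%s\n' "$data" | awk -F: '{ print $1 }')

# Method 2: assign FS in BEGIN so that it applies from the first record.
m2=$(printf '%s\n' "$data" | awk 'BEGIN { FS = ":" } { print $1 }')

# Too late for record 1: it was already split on whitespace before FS changed,
# so $1 of the first record is expected to be the whole line.
late=$(printf '%s\n' "$data" | awk '{ FS = ":" } { print $1 }')

# END runs once, after the last record has been read.
count=$(printf '%s\n' "$data" | awk 'END { print NR }')

echo "$m1"; echo "$m2"; echo "$late"; echo "$count"
```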
Record Separator (RS)
OFS and ORS
OFS: output field separator. Default is single space.
ORS: output record separator. Default is newline.
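A short sketch of the three separators (RS, OFS, ORS) in action. The inputs are made up for illustration:

```shell
# OFS is placed between the comma-separated arguments of print.
ofs=$(printf 'a b c\n' | awk 'BEGIN { OFS = "-" } { print $1, $2, $3 }')

# ORS is written after every print instead of the default newline.
ors=$(printf 'a\nb\nc\n' | awk 'BEGIN { ORS = ";" } { print $0 }')

# RS changes what counts as a record: here records end at a comma.
rs=$(printf 'x,y,z' | awk 'BEGIN { RS = "," } { print NR ": " $0 }')

echo "$ofs"
echo "$ors"
echo "$rs"
```

Note that OFS only appears when print is given several arguments separated by commas; print $1 $2 concatenates the fields with nothing in between.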
Variables and Operators
NF: number of fields on the current line.
NR: current line number. Note that if there is more than one input file, awk reads them one after another, so NR includes the lines read from previous files.
FILENAME: current filename.
FNR: line number within the current file.
$0: the whole line.
$n: the nth field. E.g. $1 is the first field. n can be any expression with a numeric value; it does not have to be a number literal. For example:
$NF: the last field
$(NF-1): the penultimate field.
$NF-1: parsed as ($NF)-1, i.e. the last field minus 1. If the last field is not a number, $NF is treated as 0, so -1 will be printed.
$($1): if $1 is a number (e.g. 3), this is equivalent to $3.
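The field expressions above can be tried with a few one-liners. The sample records are hypothetical; FILENAME and FNR are left out because they need real input files:

```shell
# Two hypothetical records with different field counts.
data='3 b c d
x y'

# NR is the record number, NF the field count, $NF the last field.
info=$(printf '%s\n' "$data" | awk '{ print NR, NF, $NF }')

# $(NF-1) is the penultimate field; $NF-1 is parsed as ($NF)-1.
pen=$(printf 'a b c\n' | awk '{ print $(NF-1), $NF-1 }')

# $($1): the value of the first field (3) selects which field to print.
ind=$(printf '3 b c d\n' | awk '{ print $($1) }')

echo "$info"
echo "$pen"
echo "$ind"
```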
Assignment
Assigning to $n changes $0 (and NF, if n is greater than NF). Assigning to $0 changes $n and NF. Changes are in memory only; AWK never changes the original file.
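A small sketch of how assignments to fields and to $0 affect each other, using made-up input lines:

```shell
# Assigning to a field rebuilds $0, joining the fields with OFS (one space by default).
a=$(printf 'a   b   c\n' | awk '{ $2 = "X"; print $0 }')

# Assigning past the last field extends the record and raises NF.
b=$(printf 'a b\n' | awk '{ $4 = "d"; print NF; print $0 }')

# Assigning to $0 splits the new text into fields again.
c=$(printf 'a b\n' | awk '{ $0 = "p q r"; print NF, $2 }')

echo "$a"
echo "$b"
echo "$c"
```

In the first command the runs of spaces in the input collapse to single spaces, because $0 is rebuilt from the fields.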
User Defined Variables
Names are case sensitive. Variables are treated as numbers or strings depending on context. Except for the parameters of user-defined functions, all variables are global.
Convert number to string: concatenate with ""
Convert string to number: +0
Operators
2 ^ 3 is 8
a++ and ++a are available
Strings can be compared (e.g. <=) as well.
Concatenation: no operator
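The operators and the two conversion idioms can be checked in a single BEGIN block, which needs no input. The variable names are arbitrary:

```shell
ops=$(awk 'BEGIN {
    print 2 ^ 3                 # exponentiation
    a = 5
    b = a++                     # post-increment: b gets the old value
    print b, a
    print ++a                   # pre-increment: yields the new value
    print ("abc" <= "abd")      # string comparison, 1 means true
    s = "foo" "bar"             # concatenation: just place values side by side
    print s
    print (10 < 9)              # compared as numbers
    print ((10 "") < (9 ""))    # compared as strings after appending ""
    print "3.50" + 0            # string converted to a number with +0
}')

echo "$ops"
```

The last three lines show why context matters: 10 is not less than 9 as a number, but the string "10" sorts before the string "9".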
Regex
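As a rough illustration of regex matching, a bare /regex/ pattern tests the whole line, while ~ and !~ test a single field or expression. The sample data is hypothetical:

```shell
# Hypothetical sample data: a fruit name and a quantity.
data='apple 10
banana 7
cherry 12'

# /regex/ as a pattern is tested against the whole record ($0).
starts=$(printf '%s\n' "$data" | awk '/^b/')

# ~ matches one field against a regex; !~ is the negation.
two=$(printf '%s\n' "$data" | awk '$2 ~ /^1[0-9]$/ { print $1 }')
without=$(printf '%s\n' "$data" | awk '$1 !~ /an/ { print $1 }')

echo "$starts"
echo "$two"
echo "$without"
```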
Control Structure
if/else
0 and empty string are falsey. Other values are truthy.
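A minimal sketch of if/else and truthiness. The four input lines (a number, zero, an empty line, and a word) are chosen only to exercise each case:

```shell
# Line 3 is intentionally empty.
data='5
0

abc'

res=$(printf '%s\n' "$data" | awk '{
    if ($1) {
        print NR ": truthy"
    } else {
        print NR ": falsey"
    }
}')

echo "$res"
```

Lines 1 and 4 are reported as truthy; line 2 (zero) and line 3 (empty) are reported as falsey.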
Arrays and For Loop
Arrays: only one dimensional, e.g. a[1]=$1. Use a C-style for loop to iterate.
Associative arrays
Keys can be any string. Use a for-in loop to iterate.
We can use associative arrays to simulate multi-dimensional arrays, e.g. a["1:2"] = 5.
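The three array uses described above can be sketched as follows. The array names (arr, seen, grid) are arbitrary, and the output of the for-in loop is piped through sort because its iteration order is not defined:

```shell
# Integer-indexed array filled from the fields, walked backwards with a C-style loop.
rev=$(printf 'a b c\n' | awk '{
    for (i = 1; i <= NF; i++) arr[i] = $i
    for (i = NF; i >= 1; i--) printf "%s%s", arr[i], (i > 1 ? " " : "\n")
}')

# Associative array keyed by the first field, walked with for-in.
counts=$(printf 'x\ny\nx\n' | awk '
    { seen[$1]++ }
    END { for (k in seen) print k, seen[k] }
' | sort)

# Simulated two-dimensional array: the key is built from both indices.
cell=$(awk 'BEGIN { grid["1:2"] = 5; r = 1; c = 2; print grid[r ":" c] }')

echo "$rev"
echo "$counts"
echo "$cell"
```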
Printf()
Unlike print, printf does not use OFS or ORS. You need to specify \t, \n explicitly.
%30s %30d: minimum width is 30, right justified.
%-30s: left justified.
%6.2f: total 6 characters. 2 after decimal point, 3 before decimal point, and 1 for the decimal point itself
%06.2f: left padding with 0.
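The format specifiers can be tried in a BEGIN block. This sketch uses a width of 5 instead of 30 so the padding is easy to see, and brackets to mark the edges of each field:

```shell
out=$(awk 'BEGIN {
    printf "[%5s]\n", "ab"         # right justified in 5 columns
    printf "[%-5s]\n", "ab"        # left justified
    printf "[%6.2f]\n", 3.14159    # width 6, 2 digits after the point
    printf "[%06.2f]\n", 3.14159   # same width, padded with zeros
    printf "%s\t%d\n", "n", 42     # tab and newline written explicitly
}')

echo "$out"
```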
Strings
AWK counts characters in strings from 1. Common functions:
length([string]): if no string is provided, $0 is used.
index(string, target): returns the position of target in string, or 0 if not found.
match(string, regex): similar to index, but also sets RSTART to the start position and RLENGTH to the length of the match (note that the regex is greedy: the leftmost, longest match is used).
substr(string, start [, length]): returns the substring from start to the end of the string if length is omitted.
sub(regex, newval [, string]): replaces the first match with newval. $0 is used if no string is provided.
gsub(regex, newval [, string]): replaces all matches with newval.
split(string, array [, regex]): splits string into pieces and stores them in array, using regex as the separator. Returns the number of pieces it found. FS is used if regex is omitted.
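A quick tour of the string functions listed above, run in a BEGIN block on made-up strings. The variable names (s, t, m, n, parts) are arbitrary:

```shell
out=$(awk 'BEGIN {
    s = "hello world"
    print length(s)
    print index(s, "o")              # position of the first "o"
    print index(s, "z")              # not found
    m = match("foooo", /o+/)         # the longest run of "o" is matched
    print m, RSTART, RLENGTH
    print substr(s, 7)               # from position 7 to the end
    print substr(s, 1, 5)            # 5 characters starting at position 1
    t = s; sub(/o/, "0", t); print t     # first match only
    t = s; gsub(/o/, "0", t); print t    # every match
    n = split("a,b,c", parts, ",")
    print n, parts[1], parts[3]
}')

echo "$out"
```

sub and gsub modify their third argument in place, which is why the sketch copies s into t before each call.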
Misc
Common Math Functions
int(x): truncates toward zero, e.g. int(3.6) == 3; int(-3.6) == -3
rand(): a random number in [0,1). To simulate a 6-sided die: int(rand()*6) + 1
srand([x]): x is the optional seed. Default is the current date and time. In many implementations, rand() starts from the same seed on every run of awk, so it returns the same sequence each time. To get different numbers on each run, call srand() before the first rand().
sqrt(x)
sin(x)
cos(x)
log(x): natural log of x
exp(x): e to the power of x
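A brief sketch of the math functions. Because the die roll is random, the example only checks that it falls in the range 1 to 6 rather than printing a specific value:

```shell
out=$(awk 'BEGIN {
    print int(3.6), int(-3.6)        # truncation toward zero
    print sqrt(16)
    print exp(0), log(1)
    print sin(0), cos(0)
    srand()                          # seed from the clock so that runs differ
    roll = int(rand() * 6) + 1       # simulated 6-sided die
    print (roll >= 1 && roll <= 6)   # 1 means the roll is in range
}')

echo "$out"
```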
AWK and Excel
Use AWK to parse the csv generated by Excel:
If a field contains a comma, there is no easy way for AWK to handle it. Go back to Excel, export the data as a TSV file, and use awk -F'\t'. It is hard to embed a tab in a field in Excel, so AWK can parse such a file properly.
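A minimal sketch of both cases, assuming hypothetical exported data piped in with printf. The first command handles a simple CSV in which no field contains a comma; the second shows that a comma inside a field is harmless once tab is the separator:

```shell
# Simple CSV: skip the header row and sum the second column.
csv=$(printf 'name,qty\napple,3\npear,5\n' |
    awk -F, 'NR > 1 { sum += $2 } END { print sum }')

# TSV: the first field contains a comma, but fields are split on tabs only.
tsv=$(printf 'Smith, John\t42\n' | awk -F'\t' '{ print $1 }')

echo "$csv"
echo "$tsv"
```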
AWK tutorials