Awk Programming Language




  paradigm: scripting language, procedural, event-driven
  year: 1977, last revised 1985; current POSIX edition is IEEE Std 1003.1-2004
  designer: Alfred '''A'''ho, Peter '''W'''einberger, and Brian '''K'''ernighan
  typing: none; can handle strings, integers and floating-point numbers; regular expressions
  implementations: awk, GNU Awk, mawk, nawk, MKS AWK, Thompson AWK (compiler), Awka (compiler)
  dialects: ''old awk'' oawk 1977, ''new awk'' nawk 1985, GNU Awk
  influenced by: C, SNOBOL 4, Bourne shell
  influenced: Perl, Korn shell (''ksh93'', ''dtksh'', ''tksh''), JavaScript


AWK is a general-purpose computer language designed for processing text-based data, either in files or in data streams. The name AWK is derived from the surnames of its authors: Alfred '''A'''ho, Peter '''W'''einberger, and Brian '''K'''ernighan. It is commonly pronounced "awk" rather than spelled out as separate letters. When written in all lowercase letters, ''awk'' refers to the Unix or Plan 9 program that runs programs written in the AWK programming language.

AWK is an example of a programming language that makes extensive use of the string datatype, associative arrays (that is, arrays indexed by key strings), and regular expressions. The power, terseness, and limitations of AWK programs and sed scripts inspired Larry Wall to write Perl. Because of their dense notation, all these languages are often used for writing one-liner programs.

AWK is one of the early tools to appear in Version 7 Unix and gained popularity as a way to add computational features to a Unix pipeline.
A version of the AWK language is a standard feature of nearly every modern Unix-like operating system. AWK is mentioned in the Single UNIX Specification as one of the mandatory utilities of a Unix operating system. Besides the Bourne shell, AWK is the only other scripting language available in a standard Unix environment. Implementations of AWK are also available for almost all other operating systems.


STRUCTURE OF AWK PROGRAMS

Generally speaking, two pieces of data are given to AWK: a command file and a primary input file. A command file (which can be an actual file, or can be included in the command-line invocation of awk) contains a series of commands which tell AWK how to process the input file. The primary input file is typically text that is formatted in some way; it can be an actual file, or it can be read by awk from the standard input. A typical AWK program consists of a series of lines, each of the form

/''pattern''/ { ''action'' }

where ''pattern'' is a regular expression and ''action'' is a command. Most implementations of AWK use Extended Regular Expressions by default. AWK looks through the input file; when it finds a line that matches ''pattern'', it executes the command(s) specified in ''action''. Alternate line forms include:

; BEGIN { ''action'' }
: Executes ''action'' commands at the beginning of the script execution, i.e. before any of the lines are processed.
; END { ''action'' }
: Similar to the previous form, but executes ''action'' ''after'' the end of input.
; /''pattern''/
: Prints any lines matching ''pattern''.
; { ''action'' }
: Executes ''action'' for each line in the input.

Each of these forms can be included multiple times in the command file. Lines in the command file are executed in order, so if there are two "BEGIN" statements, the first is executed, then the second, and then the rest of the lines. BEGIN and END statements do ''not'' have to be located before and after (respectively) the other lines in the command file.
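The forms above can be combined in one program. The following is a minimal sketch, assuming a POSIX-compatible awk is on the PATH; the sample input and the shell variable name result are made up for illustration. The END rule is deliberately written first to show that its position in the program does not matter.

```shell
# Sample input is supplied on standard input via printf.
result=$(printf 'apple\nbanana\ncherry\n' | awk '
END     { print "end: " NR " lines read" }   # runs after the last line
BEGIN   { print "first BEGIN" }              # runs before any input is read
BEGIN   { print "second BEGIN" }             # BEGIN rules run in program order
/an/    { print "match: " $0 }               # runs only for matching lines
')
printf '%s\n' "$result"
```

This prints the two BEGIN lines, then "match: banana", then "end: 3 lines read".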

AWK was created as a broad-based alternative to writing text-parsing algorithms by hand in C.


AWK COMMANDS

AWK commands are the statements that are substituted for ''action'' in the examples above. AWK commands can include function calls, variable assignments, calculations, or any combination thereof. AWK contains built-in support for many functions; many more are provided by the various flavors of AWK. Some flavors also support the inclusion of dynamically linked libraries, which can provide still more functions.

For brevity, the enclosing curly braces ( ''{ }'' ) will be omitted from these examples.


The ''print'' command

The ''print'' command is used to output text. The simplest form of this command is

print

This displays the contents of the current line. In AWK, lines are broken down into ''fields'', and these can be displayed separately:

; print $1
: Displays the first field of the current line
; print $1, $3
: Displays the first and third fields of the current line, separated by a predefined string called the output field separator (OFS) whose default value is a single space character

Although these fields (''$1'', ''$3'', and so on) may bear resemblance to variables (the $ symbol indicates variables in Perl), they actually refer to the fields of the current line. A special case, ''$0'', refers to the entire line. In fact, the commands "print" and "print $0" are identical in functionality.
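As an illustration of fields and the output field separator, here is a short sketch assuming a POSIX-compatible awk; the sample line and the shell variable names are invented for the example.

```shell
line='alpha beta gamma'

# Comma-separated arguments are joined with OFS (a single space by default).
default_out=$(printf '%s\n' "$line" | awk '{ print $1, $3 }')

# Setting OFS in a BEGIN rule changes the separator inserted by the comma.
custom_out=$(printf '%s\n' "$line" | awk 'BEGIN { OFS = "-" } { print $1, $3 }')

# A bare print behaves like print $0: the whole line, unchanged.
whole_out=$(printf '%s\n' "$line" | awk '{ print }')

printf '%s\n' "$default_out" "$custom_out" "$whole_out"
```

The three results are "alpha gamma", "alpha-gamma", and "alpha beta gamma".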

The ''print'' command can also display the results of calculations and/or function calls:

print 3+2
print foobar(3)
print foobar(variable)
print sin(3-2)

Output may be sent to a file:

print "expression" > "file name"
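A small sketch of redirection, assuming a POSIX-compatible awk and the common mktemp utility; the scratch file and the variable names are hypothetical. Within a single awk run, the first ''print >'' to a file truncates it and later prints to the same file append.

```shell
# Hypothetical scratch file; mktemp picks a unique name.
out_file=$(mktemp)

# The file name is passed in with -v so the awk program itself stays generic.
printf '2 3\n10 20\n' | awk -v out="$out_file" '{ print $1 + $2 > out }'

saved=$(cat "$out_file")    # both sums end up in the file, one per line
printf '%s\n' "$saved"
rm -f "$out_file"
```

The file receives the lines 5 and 30.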


Variables, et cetera

The operators ''+ - * /'' are addition, subtraction, multiplication, and division, respectively. For string concatenation, simply place two variables (or string constants) next to each other, optionally with a space in between. String constants are delimited by double quotes. Statements need not end with semicolons. Finally, comments can be added to programs with ''#''; everything from the ''#'' to the end of the line is ignored.
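The points above can be tried out with a sketch like the following, assuming a POSIX-compatible awk; all variable names are illustrative.

```shell
calc=$(awk 'BEGIN {
    a = 7; b = 2
    print a + b, a - b, a * b, a / b    # the four arithmetic operators
    first = "foo"; second = "bar"
    joined = first second "!"           # adjacency concatenates strings
    print joined
}')
printf '%s\n' "$calc"
```

This prints "9 5 14 3.5" followed by "foobar!".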



User-defined functions

In a format similar to C, function definitions consist of the keyword function, the function name, argument names and the function body. Here is an example function:

function add_three(number,    temp) {
    temp = number + 3
    return temp
}

This function can be invoked as follows:

print add_three(36) # prints 39

Functions can have variables that are in the local scope. The names of these are added to the end of the argument list, though values for these should be omitted when calling the function. It is convention to add some whitespace in the argument list before the local variables, in order to indicate where the parameters end and the local variables begin.
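The following sketch shows the effect of the convention, assuming a POSIX-compatible awk. The function name sum_to is hypothetical; the point is that the global variable named total is not disturbed by the local one.

```shell
scope_out=$(awk '
# "total" and "i" after the extra spaces are locals, not real parameters.
function sum_to(n,    total, i) {
    for (i = 1; i <= n; i++)
        total += i
    return total
}
BEGIN {
    total = 100              # global variable with the same name
    print sum_to(4)          # the local total starts out empty: 1+2+3+4
    print total              # the global is untouched
}')
printf '%s\n' "$scope_out"
```

The output is 10 followed by 100.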


SAMPLE APPLICATIONS


Hello World

Here is the ubiquitous "Hello, world" program written in AWK:

BEGIN { print "Hello, world!"; exit }


Print lines longer than 80 characters

Print all lines longer than 80 characters. Note that the default action is to print the current line.

length > 80
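To make the shorthand concrete, this sketch (assuming a POSIX-compatible awk; the test lines are generated for the example) runs the terse form next to a fully spelled-out equivalent and collects both results.

```shell
# Build one line of 81 characters and one short line as test input.
long_line=$(awk 'BEGIN { for (i = 0; i < 81; i++) s = s "x"; print s }')
short_line='a short line'

# Terse form: a pattern with no action prints the matching line.
terse=$(printf '%s\n%s\n' "$short_line" "$long_line" | awk 'length > 80')

# Spelled-out form: explicit argument to length and explicit print.
spelled_out=$(printf '%s\n%s\n' "$short_line" "$long_line" |
    awk 'length($0) > 80 { print $0 }')

printf '%s\n' "$terse"
```

Both forms select only the 81-character line.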


Print a count of words

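Count words in the input, and print the number of lines, words, and characters (like ''wc''). Each line contributes ''length + 1'' characters because AWK strips the trailing newline that ''wc'' counts.

{ w += NF; c += length + 1 }
END { print NR, w, c }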
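One way to check such a counter is to compare it with ''wc'' on the same input. The sketch below assumes a POSIX-compatible awk, the standard wc utility, and plain ASCII sample text (so characters and bytes coincide). Note that it adds 1 per line for the newline, which awk removes before length sees the record.

```shell
sample='the quick brown fox
jumps over
the lazy dog'

awk_counts=$(printf '%s\n' "$sample" | awk '
    { w += NF; c += length($0) + 1 }   # +1 for the newline stripped from each line
END { print NR, w, c }')

# wc pads its columns, so normalise the spacing before comparing.
wc_counts=$(printf '%s\n' "$sample" | wc | awk '{ print $1, $2, $3 }')

printf 'awk: %s\nwc:  %s\n' "$awk_counts" "$wc_counts"
```

Both report 3 lines, 9 words, and 44 characters for this sample.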


Sum first column

Sum first column of input

{ s += $1 }
END { print s }
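A quick usage sketch, assuming a POSIX-compatible awk and made-up input. A non-numeric first field, such as a header, is treated as 0, so it does not disturb the total.

```shell
# "amount" converts to 0; only the first field of each line is summed.
sum_out=$(printf 'amount\n10\n2.5\n30 ignored\n' |
    awk '{ s += $1 } END { print s }')
printf '%s\n' "$sum_out"
```

The result is 42.5.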


Calculate word frequencies

Word frequency (uses associative arrays)

BEGIN { FS = "[^a-zA-Z]+" }    # fields are runs of letters

{ for (i = 1; i <= NF; i++)
    if ($i != "")              # skip empty fields at the edges of a line
      words[tolower($i)]++
}

END { for (i in words)
    print i, words[i]
}
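A runnable sketch of a word-frequency count on a tiny made-up input, assuming a POSIX-compatible awk. The field separator (runs of non-letters) and the lower-casing of words are assumptions made for this example. Because the order of a ''for (... in ...)'' loop is unspecified, the output is piped through sort to make it predictable.

```shell
freq=$(printf 'The cat saw the dog.\nThe dog ran.\n' | awk '
BEGIN { FS = "[^a-zA-Z]+" }             # assumed separator: runs of non-letters
{
    for (i = 1; i <= NF; i++)
        if ($i != "")                   # skip empty fields at line edges
            words[tolower($i)]++
}
END {
    for (w in words)
        print w, words[w]
}' | LC_ALL=C sort)
printf '%s\n' "$freq"
```

For this input the counts are cat 1, dog 2, ran 1, saw 1, the 3.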


SELF-CONTAINED AWK SCRIPTS

As with many other programming languages, a self-contained AWK script can be constructed using the so-called "shebang" syntax.

For example, a UNIX command called hello.awk that prints the string "Hello, world!" may be built by creating a file named hello.awk containing the following lines and marking it executable:

#!/usr/bin/awk -f
BEGIN { print "Hello, world!"; exit }
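The steps can be sketched as a shell session. This is only an illustration under a few assumptions: the script is written to a throwaway directory created with mktemp, that directory permits executing files, and the interpreter path is taken from ''command -v awk'' rather than hard-coded, since the location of awk varies between systems.

```shell
# Work in a throwaway directory so nothing in the current directory is touched.
work_dir=$(mktemp -d)
script="$work_dir/hello.awk"

# Assumption: use whichever awk is first on the PATH for the #! line.
awk_path=$(command -v awk)

printf '#!%s -f\nBEGIN { print "Hello, world!"; exit }\n' "$awk_path" > "$script"
chmod +x "$script"          # mark the file executable

hello_out=$("$script")      # the kernel hands the script to awk via the #! line
printf '%s\n' "$hello_out"
rm -r "$work_dir"
```

Running the script directly prints "Hello, world!".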


AWK VERSIONS AND IMPLEMENTATIONS

AWK was originally written in 1977 and distributed with Version 7 Unix.

In 1985 its authors started expanding the language, most significantly by adding user-defined functions. The language is described in the book ''The AWK Programming Language'', published in 1988, and its implementation was made available in releases of UNIX System V. To avoid confusion with the incompatible older version, this version was sometimes known as "new awk" or ''nawk''. This implementation was released under a free software license in 1996 and is still maintained by Brian Kernighan (see external links below).

GNU awk, or ''gawk'', is another free software implementation. It was written before the original implementation became freely available, and is still widely used. Almost every Linux distribution comes with a recent version of ''gawk'', and ''gawk'' is widely recognized as the de facto standard implementation in the Linux world.

''xgawk'' is a SourceForge project based on ''gawk''. It extends ''gawk'' with dynamically loadable libraries.

''mawk'' is a very fast AWK implementation by Mike Brennan based on a bytecode interpreter.

Downloads and further information about these versions are available from the sites listed below.

Thompson AWK, or TAWK, is an AWK compiler for DOS and Windows, previously sold by Thompson Automation Software (which has ceased its activities).


CRITICISM


Three kinds of criticism can be distinguished:
#Language specific
##AWK cannot handle NUL characters in the input and output streams. The GNU implementation has resolved this problem, but AWK in general has not.
##AWK looks like C, but it isn't C. Local variables (inside functions) are only emulated. Multidimensional arrays are only emulated. The ''switch'' statement is missing (only available in GNU implementation).
##Checking strings for numeric contents is difficult.
#Implementation specific
##Sun's implementation is notoriously different from standards and other implementations. Solaris comes with three different implementations of AWK: ''oawk'' (aka old awk), ''nawk'' (aka new awk), ''pawk'' (aka POSIX awk) which all behave differently in some situations.
##Implementations targeting MS-DOS and MS Windows are notorious for several problems:
###Memory allocation is sometimes still limited to 640 KB.
###The most common problem with AWK on Microsoft platforms is the quoting problem: when using AWK source code on the command line, users often overlook the details of quoting. While the ' and " characters have a well-defined meaning in Unix environments, their semantics in the Microsoft command-line interpreter are different.
###Text lines are terminated with a carriage return and an additional line feed character. Platforms like Cygwin have solutions for this.

#Myths and Misconceptions
## Line length is limited to 4096 or 8192 characters. This myth originates from early versions.
## AWK has no functions or subroutines. This is only true for the ''oawk'' implementation from 1977.
## AWK has no hashes. AWK predates the widespread use of the term ''hash''. AWK and its predecessor SNOBOL were among the first languages to offer hashes, but under the ''associative array'' moniker.
## ''i18n'' (internationalization) breaks old scripts. With the introduction of Unicode and international locales, the order of characters depends on the locale; therefore regular expressions like ''[a-z]'' change their semantics in different locales. In some European locales, the decimal point is replaced by a comma when printing numbers, so the output of all scripts printing numbers depends on the locale. Reading back this printed data may truncate numbers without notice.
## AWK has been ''replaced'' by other scripting languages like Perl. AWK will not disappear or be ''replaced''. Since the introduction of Unix standards, AWK has always been mentioned in documents like the current Single UNIX Specification as one of the mandatory utilities of a Unix operating system.
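Two of the language-specific points above, numeric checks and emulated multidimensional arrays, can be illustrated with a short sketch, assuming a POSIX-compatible awk. The helper is_number is hypothetical and deliberately narrow: it accepts only plain decimal numbers, not exponents or hexadecimal forms.

```shell
crit_out=$(awk '
# Hypothetical helper: accepts only plain decimal numbers such as 42, -3 or 2.5.
function is_number(s) {
    return s ~ /^[-+]?([0-9]+\.?[0-9]*|\.[0-9]+)$/
}
BEGIN {
    print is_number("42"), is_number("-2.5"), is_number("12abc"), is_number("")

    # grid[1, 2] is really grid["1" SUBSEP "2"]: one string key, not two indices.
    grid[1, 2] = "x"
    key = 1 SUBSEP 2
    has_key = (key in grid)
    has_pair = ((1, 2) in grid)
    print has_key, has_pair
}')
printf '%s\n' "$crit_out"
```

The first line of output is "1 1 0 0" and the second is "1 1", showing that the two-index form and the explicitly joined key refer to the same element.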

The POSIX standard for AWK requires that "The value of an integer constant beginning with 0 shall be taken in decimal rather than octal". (In C and other languages derived from it, integer literals beginning with 0 are read as octal.)

This is the behaviour of most versions of AWK, the GNU version, gawk, being the exception.

In gawk:
$ gawk 'BEGIN { print "06612"; print 6612; print 06612;}'
06612
6612
3466
$

In other AWKs (pre-1987 awk, new awk, Brian Kernighan's One True Awk, mawk) the POSIX behaviour is standard. For example:

$ mawk 'BEGIN { print "06612"; print 6612; print 06612;}'
06612
6612
6612
$ nawk 'BEGIN { print "06612"; print 6612; print 06612;}'
06612
6612
6612

The POSIX behaviour is available with gawk by setting the environment variable POSIXLY_CORRECT, or by calling gawk with the --posix option:

$ gawk --posix 'BEGIN { print "06612"; print 6612; print 06612;}'
06612
6612
6612


DIGRESSION

  • The bird emblematic of AWK (shown, among other places, on the cover of ''The AWK Programming Language'') is the auk.



EXTERNAL LINKS