Archive for the ‘awk’ Category
Concatenate many text files with awk
Here is code to concatenate a lot of text files into one single file. The ls and cat commands come from my Cygwin installation, apparently. In this case the files to be concatenated are all of the files in the directory starting with ‘str’ (i.e., “ls str*”).
ls str* | gawk '{print "cat "$1;}' | sh > result.txt
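Since awk splits each line on whitespace, $1 is only the whole filename when the name has no spaces in it. Below is a minimal sketch of a more defensive variant, assuming plain awk is on the path; gen_cat_commands is a hypothetical helper name, not something from the original one-liner. It quotes each name and lets you preview the generated commands before handing them to sh. It still does not cope with names that contain double quotes.

```shell
# Hypothetical helper: turn a list of filenames (one per line on stdin)
# into "cat" commands, wrapping each name in double quotes so that
# embedded spaces survive the trip back through the shell.
gen_cat_commands() {
    awk '{ print "cat \"" $0 "\"" }'
}

# Preview the generated commands without running them:
printf '%s\n' 'str one.txt' 'str2.txt' | gen_cat_commands
```

Once the preview looks right, the same pipeline can be finished with "| sh > result.txt" as in the original.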
Piping from the shell to awk to the shell
I wanted lots of commands that look like this, namely, one for each .htm file in the directory:
tidy --asxhtml --add-xml-decl yes --output-xml yes 1ch1.htm > xml/1ch1.xml
So we take an ls listing, pipe it to awk, use awk to generate the commands, and then pipe it back to the shell for execution. This is stretching the advantage of the “one-line awk program”… oh well.
ls *.htm | gawk '{ xfn = $1; sub(/\.htm$/,".xml",xfn); print "tidy --asxhtml --add-xml-decl yes --output-xml yes "$1" > xml/"xfn; }' | sh
One awkward thing (ha ha), or at least unexpected thing, is that the sub function takes the “haystack” (third argument) as if by reference. It modifies the variable itself, and the return value is the number of replacements made. So I had to create a new variable equal to $1, and then run the substitution on that variable.
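The behaviour described above can be checked with a small sketch. It assumes plain awk; rename_demo is a hypothetical name, and anchoring the pattern to the end of the filename is my choice rather than part of the original command.

```shell
# sub() edits its third argument in place and returns the number of
# substitutions it made (0 or 1), not the new string.
rename_demo() {
    awk '{
        xfn = $1                        # work on a copy so $1 stays intact
        n = sub(/\.htm$/, ".xml", xfn)  # n is the count, xfn is modified
        print n, $1, xfn
    }'
}

echo '1ch1.htm' | rename_demo    # prints: 1 1ch1.htm 1ch1.xml
```

In gawk specifically, gensub() returns the modified string and leaves its target untouched, which avoids the temporary variable at the cost of portability.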
Piping awk
Nice one-liner to get a list of filenames, one per line:
ls *.tif | gawk '{print $1;}'
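As written, the command above emits one filename per output line. Here is a minimal sketch of joining the names onto one physical line instead, assuming plain awk; join_on_one_line is a hypothetical name. It uses $0 rather than $1 so that a name containing spaces is kept whole.

```shell
# Print every input line on one output line, separated by single spaces.
# sep starts out empty, so no separator is printed before the first name.
join_on_one_line() {
    awk '{ printf "%s%s", sep, $0; sep = " " } END { printf "\n" }'
}

printf '%s\n' a.tif b.tif c.tif | join_on_one_line    # prints: a.tif b.tif c.tif
```

In practice the input would come from "ls *.tif" rather than printf.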
Getting started with awk
awk was once a popular programming language, intended for parsing batches of records. Good docs. I’ve used it a bit, and I like what I see. Here is an example CSV file.
Name,Number,Moral Status
Jean Valjean,24601,Redeemed
James Bond,007,Ambiguous
And some awk code to convert it to XML:
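# csv2xml.awk: convert a CSV file with a header row to XML
BEGIN {
    print "<?xml version=\"1.0\" encoding=\"utf-8\"?>";
    print "<records>";
}
# First record: remember the column names
FNR < 2 { for (i = 1; i <= NF; i++) { label[i] = $i; } }
# Every other record: one <record> with a child element per field
FNR > 1 {
    print "<record>";
    for (i = 1; i <= NF; i++) {
        print "<" label[i] ">" $i "</" label[i] ">";
    }
    print "</record>";
}
END { print "</records>"; }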
Which is invoked with:
gawk -F, -f csv2xml.awk example.csv > example.xml
The -F, parameter tells awk to split each record into fields on commas. Other usage should be clear.
As one might expect, the BEGIN and END blocks are executed at the beginning and end of program execution. FNR<2 is fun: it’s executed for records whose position in the current file is less than 2 (i.e., the first record). I’m using it here to loop over the fields (NF is the number of fields) and store each one in the label array. The first, second, and third fields are referred to with $1, $2, $3, and as seen above, the field number can be specified dynamically, as in $i.
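One detail worth seeing in action: FNR counts records within the current file, while NR counts across all input. The sketch below, assuming plain awk and two throwaway files, prints both counters; show_counters and the file names are hypothetical.

```shell
# Print the file name, the overall record number (NR) and the
# per-file record number (FNR) for every input line.
show_counters() {
    awk '{ print FILENAME, NR, FNR }' "$@"
}

tmp=$(mktemp -d)
printf 'h\nx\n' > "$tmp/a.csv"
printf 'h\ny\n' > "$tmp/b.csv"
(cd "$tmp" && show_counters a.csv b.csv)
rm -r "$tmp"
# prints:
# a.csv 1 1
# a.csv 2 2
# b.csv 3 1
# b.csv 4 2
```

So a pattern like FNR<2 fires once per input file, which means the header row of each CSV passed on the command line is treated as labels rather than data.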
Then for all other rows (FNR>1), we print a <record> wrapper, and then elements with the “label” as the name of the child elements. Double-quotes are escaped as usual.
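This produces the following (note that the space in <Moral Status> makes it an illegal XML element name, so a header like that needs cleaning up before the output is well-formed):
<?xml version="1.0" encoding="utf-8"?>
<records>
<record>
<Name>Jean Valjean</Name>
<Number>24601</Number>
<Moral Status>Redeemed</Moral Status>
</record>
<record>
<Name>James Bond</Name>
<Number>007</Number>
<Moral Status>Ambiguous</Moral Status>
</record>
</records>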
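Here is a minimal sketch of one way to sanitize the header labels, so that a label such as Moral Status becomes a legal element name. It is a variation on the script above, not a replacement for a real CSV or XML library: csv_to_xml is a hypothetical name, swapping in underscores is an assumption on my part, and it does not handle quoted CSV fields or labels that begin with a digit. It also escapes & and < in the field values.

```shell
# Hypothetical wrapper around a variant of csv2xml.awk.
csv_to_xml() {
    awk -F, '
    BEGIN {
        print "<?xml version=\"1.0\" encoding=\"utf-8\"?>"
        print "<records>"
    }
    FNR == 1 {
        # Header row: store labels, replacing anything that is not a
        # letter, digit or underscore (assumption: underscore is acceptable).
        for (i = 1; i <= NF; i++) {
            label[i] = $i
            gsub(/[^A-Za-z0-9_]/, "_", label[i])
        }
        next
    }
    {
        print "<record>"
        for (i = 1; i <= NF; i++) {
            value = $i
            gsub(/&/, "\\&amp;", value)   # escape ampersands first
            gsub(/</, "\\&lt;", value)    # then opening angle brackets
            print "<" label[i] ">" value "</" label[i] ">"
        }
        print "</record>"
    }
    END { print "</records>" }
    ' "$@"
}

printf 'Name,Number,Moral Status\nJean Valjean,24601,Redeemed\n' | csv_to_xml
```

With the example data, the third element comes out as <Moral_Status>Redeemed</Moral_Status>. Passing a filename, as in "csv_to_xml example.csv > example.xml", works the same way as the original invocation.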