No Disconnect

Things I’d rather not forget

Archive for the ‘awk’ Category

Concatenate many text files with awk

Here is code to concatenate a lot of text files into one single file. The ls and cat commands come from my Cygwin installation, apparently. In this case the files to be concatenated are all of the files in the directory starting with ‘str’ (i.e., “ls str*”).

ls str* | gawk '{print "cat "$1;}' | sh > result.txt
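For what it is worth, awk can also read several files itself, so the same concatenation works without generating cat commands. This is only a minimal sketch, assuming a POSIX awk is on the path (gawk behaves the same here); the function name concat_with_awk is my own invention, not anything standard:

```shell
# Hypothetical helper: concatenate the files named as arguments.
concat_with_awk() {
  # awk reads each file argument in turn; { print } echoes every line unchanged.
  awk '{ print }' "$@"
}
```

Called as concat_with_awk str* > result.txt, the shell expands the pattern itself, so file names containing spaces survive. The generated cat commands above would split such names at the space.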

Written by nodisconnect

April 20, 2010 at 8:24 am

Posted in awk

Piping from the shell to awk to the shell

I wanted lots of commands that look like this, namely, one for each .htm file in the directory:

tidy --asxhtml --add-xml-decl yes --output-xml yes 1ch1.htm >  xml/1ch1.xml

So we take an ls listing, pipe it to awk, use awk to generate the commands, and then pipe it back to the shell for execution. This is stretching the advantage of the “one-line awk program”… oh well.

ls *.htm | gawk '{ xfn = $1; sub(/\.htm$/,".xml",xfn); print "tidy --asxhtml --add-xml-decl yes --output-xml yes "$1" > xml/"xfn; }' | sh

One awkward thing (ha ha), or at least unexpected thing, is that the sub function takes the “haystack” (third argument) as if by reference. It modifies the variable itself, and the return value is the number of replacements made. So I had to create a new variable equal to $1, and then run the substitution on that variable.
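A quick way to see that behaviour in isolation is the sketch below. It assumes a POSIX awk on the path (gawk works the same way for this); the function name demo_sub and the variable names are made up for the illustration:

```shell
# Hypothetical demo: sub() edits its third argument and returns a count.
demo_sub() {
  echo "1ch1.htm" | awk '{
    name = $1                          # work on a copy so $1 stays untouched
    n = sub(/\.htm$/, ".xml", name)    # n is the number of replacements made
    print $1, name, n
  }'
}
demo_sub   # prints: 1ch1.htm 1ch1.xml 1
```

The original field, the modified copy, and the return value come out side by side, which makes it obvious that the useful result is in the variable and not in what sub returns.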

Written by nodisconnect

March 21, 2009 at 1:26 am

Posted in awk

Piping awk

Nice way to get a list of filenames on a single line:

ls *.tif | gawk '{printf "%s ", $1;} END {print "";}'

Written by nodisconnect

March 16, 2009 at 11:13 am

Posted in awk

Getting started with awk

awk was once a popular programming language, intended for parsing batches of records. Good docs. I’ve used it a bit, and I like what I see. Here is an example CSV file.

Name,Number,Moral Status
Jean Valjean,24601,Redeemed
James Bond,007,Ambiguous

And some awk code to convert it to XML:

BEGIN{ print "<?xml version=\"1.0\" encoding=\"utf-8\"?>"; print "<records>"; }
FNR<2{ for(i=1; i<=NF; i++) { label[i]=$i; } }
FNR>1{	print "<record>";
	for(i=1; i<=NF; i++) {
		print "<"label[i]">"$i"</"label[i]">";
	}
	print "</record>";
}
END{ print "</records>"; }

Which is invoked with:

gawk -F, -f csv2xml.awk example.csv > example.xml

The -F, option tells awk to split each record into fields at commas, and -f names the file holding the program. Other usage should be clear.

As one might expect, the BEGIN and END blocks are executed at the beginning and end of program execution. FNR<2 is fun: it’s executed for the records whose position in the file is less than 2 (i.e., the first record). I’m using it here to loop through the fields (NF is the number of fields) and assign elements to the label array. First, second, and third fields are referred to with $1, $2, $3, and as seen above, the field number can also be given dynamically, as in $i.
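To watch FNR, NF and $i at work on their own, here is a small sketch. It assumes a POSIX awk on the path (gawk is fine), and the function name show_fields and the two-line sample input are mine, chosen only for the demonstration:

```shell
# Hypothetical demo: print record number, field number and field value.
show_fields() {
  printf 'Name,Number\nJames Bond,007\n' | awk -F, '{
    # NF is the field count of the current record; $i picks field number i.
    for (i = 1; i <= NF; i++)
      print FNR ":" i ":" $i
  }'
}
show_fields
# prints:
# 1:1:Name
# 1:2:Number
# 2:1:James Bond
# 2:2:007
```

Note that "James Bond" stays in one field: with -F, only the comma splits fields, so the space inside the name is left alone.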

Then for all other rows (FNR>1), we print a <record> wrapper, and then elements with the “label” as the name of the child elements. Double-quotes are escaped as usual.

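This produces the output below. Note that <Moral Status> is not a well-formed XML element name, since names cannot contain spaces, so a strict XML parser will reject it: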

<?xml version="1.0" encoding="utf-8"?>
<records>
<record>
<Name>Jean Valjean</Name>
<Number>24601</Number>
<Moral Status>Redeemed</Moral Status>
</record>
<record>
<Name>James Bond</Name>
<Number>007</Number>
<Moral Status>Ambiguous</Moral Status>
</record>
</records>
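One way around headers that contain spaces is to clean up the labels as they are read. The sketch below is a variant of the script above, not a replacement for it: the function name csv_to_xml and the choice of an underscore as the stand-in character are my assumptions, and it assumes a POSIX awk on the path. It still does not escape characters such as & or < in the data, and it does not handle headers that start with a digit.

```shell
# Hypothetical variant of csv2xml.awk that sanitizes the element names.
csv_to_xml() {
  awk -F, '
    BEGIN { print "<?xml version=\"1.0\" encoding=\"utf-8\"?>"; print "<records>" }
    FNR == 1 {
      for (i = 1; i <= NF; i++) {
        label[i] = $i
        # Assumption: replace anything outside letters, digits and _ with _
        gsub(/[^A-Za-z0-9_]/, "_", label[i])
      }
      next   # the header row produces no record of its own
    }
    {
      print "<record>"
      for (i = 1; i <= NF; i++)
        print "<" label[i] ">" $i "</" label[i] ">"
      print "</record>"
    }
    END { print "</records>" }
  ' "$@"
}
```

Run as csv_to_xml example.csv > example.xml, the third column comes out as <Moral_Status>Redeemed</Moral_Status>. Here gsub is applied to label[i] directly, which works for the same reason described for sub in the earlier post: the third argument is modified in place.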

Written by nodisconnect

March 13, 2009 at 6:10 pm

Posted in awk, Programming
