q absolves data sins and makes CSV queries easy

The cryptically-named q (it also bills itself as being able to “Run SQL directly on CSV files | Text as Data”) is very nifty indeed. It allows you to run SQL queries on delimited text files. It seems to support the full SQLite SQL dialect, too.

I used to frequently query the IESO‘s Hourly Wind Generator Output report (which now hides behind a JS link to obscure the source URL, http://www.ieso.ca//imoweb/pubs/marketReports/download/HourlyWindFarmGen_20160122.csv).  Now that the file has nearly 10 years of hourly data and many (but not all) wind projects, it may have outlived its usefulness. But it does allow me to show off some mad SQLite skills …

The first problem is that the file uses nasty date formats. Today would be 23-Jan-16 in the report’s Date field, which is filled with the ugh. You can fix that, though, with a fragment of SQL modified from here:

printf("%4d-%02d-%02d", substr(Date, 8,2)+2000, (instr("---JanFebMarAprMayJunJulAugSepOctNovDec", substr(Date, 4,3))-1)/3, substr(Date, 1, 2)) as isodate

The above data definition sets the isodate column to be in the familiar and useful YYYY-MM-DD ISO format.

A related example would be to query the whole CSV file for monthly mean generation from Kingsbridge and K2 Wind projects (they’re next to one another) for months after K2’s commissioning in March 2015. Here’s what I did in q:

q -T -O -H -d, 'select printf("%4d-%02d", substr(Date, 8,2)+2000, (instr("---JanFebMarAprMayJunJulAugSepOctNovDec", substr(Date, 4,3))-1)/3) as isomonth, avg(KINGSBRIDGE) as kavg, avg(K2WIND) as k2avg from Downloads/HourlyWindFarmGen_20160122.csv where isomonth>"2015-03" group by isomonth'

which gave the results:

isomonth    kavg    k2avg
2015-04    12.7277777778    37.4569444444
2015-05    8.94623655914    67.6747311828
2015-06    6.05833333333    66.6847222222
2015-07    3.96370967742    45.372311828
2015-08    6.34811827957    67.436827957
2015-09    7.29027777778    79.7194444444
2015-10    14.5658602151    128.037634409
2015-11    15.9944444444    130.729166667
2015-12    17.6075268817    152.422043011
2016-01    19.6408730159    163.013888889

Neat! (or at least, I think so.)

how does he do that?

Someone asked how the automatic podcast works. It’s a bit complex, and they probably will be sorry they asked.

I have all my music saved as MP3s on a server running Firefly Media Server. It stores all its information about tracks in a SQLite database, so I can very easily grab a random selection of tracks.

Since I know the name of the track and the artist from the Firefly database, I have a selection of script lines that I can feed to flite, a very simple speech synthesizer. Each of these spoken lines is stored as as wav file, and then each candidate MP3 is converted to wav, and the whole mess is joined together using SoX. SoX also created the nifty (well, I think so) intro and outro sweeps.

The huge wav file of the whole show is converted to MP3 using LAME and uploaded to my webhost with scp. All of this process is done by one Perl script – it also creates the web page, the RSS feed, and even logs the tracks on Last.fm.

Couldn’t be simpler.

ffms2m3u.sh – create playlist for all files on a Firefly Media Server

ffms2m3u – create playlist for all files on a Firefly Media Server.

If you run Firefly Media Server, you can run this script to create an M3U of all the tracks on your server. You can play this in most audio players; VLC likes it, as does iTunes (though big playlists take an age to load). Rhythmbox and the default Ubuntu Movie Player won’t touch my playlist of over 17,000 tracks.

To configure the script you need to edit three lines:


# where the Firefly database lives
DATABASE="/usr/var/cache/mt-daapd/songs3.db"
# server domain name or IP address
SERVER="server.example.com"
# Port to talk to server - don't leave blank
PORT="3689"

If you’re running Ubuntu, you’ll probably only need to change the SERVER line. It spits the M3U playlist, ordered by album, to stdout. Note that in the default Ubuntu install, regular users can’t read the database file.

If you’re running this from a cron job, it’s probably a good idea to fill in the real paths for sqlite3 and awk.

Rise Up Singing! in freedb

It took me a while, but I finally put all the track information for Sing Out!‘s Rise Up Singing teaching CDs (also on the artists’ website) on freedb. I was given the data just over a year ago by Mark D. Moss, the editor of Sing Out! magazine.
The discs are:

Perhaps what took longest was working out a UTF-8 safe processing workflow, from converting the original Excel table to e-mailing the entries to the freedb server. Let’s just say that OpenOffice, sqlite, and Perl were very helpful here.

overfed

We had a power cut last night, and my Gregarius aggregator on the basement server really didn’t appreciate it. I think it was doing something to the sqlite database that holds the feeds when the power went out, so I lost all my configs and had to trudge through hundreds of old items.

It could be worse; you could be stabbed! (as Mark Taylor always used to say).

the commitments

When I was testing BlackBerry typed-alike words (dactonyms?) I found that sqlite was averaging about 1 insert per second. This is by no means good.

It turns out that, under Perl, sqlite auto-commits after every write. This slows things down terribly. Here’s how to fix this:

When opening the database handle, turn AutoCommit off:

my $dbh =
DBI->connect( “dbi:SQLite:bberry2.sqlite”, “”, “”, { AutoCommit => 0 } )
or die “$!”;

Then, only commit occasionally — say every thousand writes:

while ( … ) {

…$id++;
$dbh->commit unless ( $id % 1000 );

}
$dbh->commit;

It works out about 1000 times quicker this way.

best beat neat nest

Beware, nerdiness follows: I generally like my BlackBerry 7130e, but its multiple letters per key can sometimes give the wrong result. Using word frequency lists from the British National Corpus, sqlite, and way too much programming time, I determined that the key sequence with the most possible word results (81?2) produces best, beat, neat or nest. The device itself suggests also brat and bray, so I should try a longer word list — in my copious free time, of course.
The longest (common words in the corpus) that have the same key sequence are employers and employees, which might briefly cause hilarity in an HR or legal context.