can we…?

This is a mini celebratory post to say that I’ve fixed the database encoding problems on this blog. It looks like I will have to go through the posts manually to correct the errors still, but at least I can enter, store and display UTF-8 characters as expected.

“? µ ° × — – ½ ¾ £ é?êè”, he said with some relief.

Postmortem: For reasons I cannot explain or remember, the database on this blog flipped to an archaic character set: latin1, aka ISO/IEC 8859-1. A partial fix was effected by downloading the entire site’s database backup, and changing all the following references in the SQL:

CHARSET=latin1 → CHARSET=utf8mb4
COLLATE=latin1_german2_ci → COLLATE=utf8mb4_general_ci
COLLATE utf8mb4_general_ci → COLLATE utf8mb4_general_ci
latin1_general_ci → utf8mb4_general_ci
COLLATE latin1_german2_ci → COLLATE utf8mb4_general_ci
CHARACTER SET latin1 → CHARACTER SET utf8mb4

For additional annoyance, the entire SQL dump was too big to load back into phpmyadmin, so I had to split it by table. Thank goodness for awk!

#!/usr/bin/awk -f

BEGIN {
    outfile = "nothing.sql";
}

/^# Table: / {
    # very special comment in WP backup that introduces a new table
    # last field is table_name,
    # which we use to create table_name.sql
    t = $NF
    gsub(/`/, "", t);
    outfile = t ".sql";
}

{
    print > outfile;
}

The data still appears to be confused. For example, in the post Compose yourself, Raspberry Pi!, what should appear as “That little key marked “Compose”” appears as “That little key marked â€œComposeâ€Â”. This isn’t a straight conversion of one character set to another. It appears to have been double-encoded, and wrongly too.

Still, at least I can now write again and have whatever new things I make turn up the way I like. Editing 20 years of blog posts awaits … zzz

Leave a comment