A terrible guide to singing with DECtalk

It’s now possible to build and run the DECtalk text to speech system on Linux. It even builds under emscripten, enabling DECtalk for Web in your browser. You too can annoy everyone within earshot making it prattle on about John Madden.

But DECTalk can sing! Because it’s been around so long, there are huge archives of songs in DECtalk format out there. The largest archive is at THE FLAME OF HOPE website, under the Dectalk section.

Building DECtalk songs isn’t easy, especially for a musical numpty like me. You need a decent grasp of music notation, phonemic/phonetic markup and patience with DECtalk’s weird and ancient text formats.

DECtalk phonemes

While DECtalk can accept text and turn it into a fair approximation of spoken English, for singing you have to use phonemes. Let’s say we have a solfège-ish major scale:

do re mi fa sol la ti do

If we’re all fancy-like and know our International Phonetic Alphabet (IPA), that would translate to:

/doʊ ɹeɪ miː faː soʊ laː tiː doʊ/

or if your fonts aren’t up to IPA:

/doʊ ɹeɪ miː faː soʊ laː tiː doʊ/
(maybe)

/do? ?e? mi? f?? so? l?? ti? do?/

(sorry about this, my website hates Unicode still)

DECtalk uses a variant on the ARPABET convention to represent IPA symbols as ASCII text. The initial consonant sounds remain as you might expect: D, R, M, F, S, L and T. The vowel sounds, however, are much more complex. This will give us a DECtalk-speakable phrase:

[dow rey miy faa sow laa tiy dow].

Note the opening and closing brackets and the full stop at the end. The brackets introduce phonemes, and the full stop tells DECtalk that the text is at an end. Play it in the DECtalk for Web window and be unimpressed: while the pitch changes are non-existent, the sounds are about right.

For more information about DECtalk phonemes, please see Phonemic Symbols Listed By Language and chapter 7 of DECtalk DTC03 Text-to-Speech System Owner’s Manual.

If you want to have a rough idea of what the phonemes in a phrase might be, you can use DECtalk’s :log phonemes option. You might still have to massage the input and output a bit, like using sed to remove language codes:

say -l us -pre '[:log phonemes on]' -post '[:log phonemes off]' -a "doe ray me fah so lah tea doe" | sed 's/us_//g;'
d ' ow  r ' ey  m iy  f ' aa) s ow  ll' aa  t ' iy  d ' ow.

Music notation

To me — a not very musical person — staff notation looks like it was designed by a maniac. A more impractical system to indicate arrangement of notes and their durations I don’t think I could come up with: and yet we’re stuck with it.

DECtalk uses a series of numbered pitches plus durations in milliseconds for its singing mode. The notes (1–37) correspond to C2 to C5. If you’re familiar with MIDI note numbers, DECtalk’s 1–37 correspond to MIDI note numbers 36–72. This is how DECtalk’s pitch numbers would look as major scales on the treble clef:

a treble clef showing quarter notes from C2 to C5 in the scale of C Major
The entire singing range of DECtalk as a C Major scale, from note 1 (C2, 65.4 Hz) to note 37 (C5, 523.4 Hz)

I’m not sure browsers can play MIDI any more, but here you go (doremi-abc.mid):

and since I had to learn abc notation to make these noises, here is the source:

%abc-2.1
X:1
T:Do Re Mi
C:Trad.
M:4/4
L:1/4
Q:1/4=120
K:C
%1
C,, D,, E,, F,,| G,, A,, B,, C,| D, E, F, G,| A, B, C D| E F G A| B c z2 |]
w:do re mi fa sol la ti do re mi fa sol la ti do re mi fa sol la ti do

Each element of a DECtalk song takes the following form:

phoneme <duration, pitch number>

The older DTC-03 manual hints that it takes around 100 ms for DECtalk to hit pitch, so for each ½ second utterance (or quarter note at 120 bpm, ish), I split it up as:

  • 100 ms of the initial consonant;
  • 337 ms of the vowel sound;
  • 63 ms of pause (which has the phoneme code “_”). Pauses don’t need pitch numbers, unless you want them to preempt DECtalk’s pitch-change algorithm.

So the three lowest notes in the major scale would sing as:

[d<100,1>ow<337,1>_<63>
 r<100,3>ey<337,3>_<63>
 m<100,5>iy<337,5>_<63>].

I’ve split them into line for ease of reading, but DECtalk adds extra pauses if you include spaces, so don’t.

The full three octave major scale looks like this:

[d<100,1>ow<337,1>_<63>r<100,3>ey<337,3>_<63>m<100,5>iy<337,5>_<63>f<100,6>aa<337,6>_<63>s<100,8>ow<337,8>_<63>l<100,10>aa<337,10>_<63>t<100,12>iy<337,12>_<63>d<100,13>ow<337,13>_<63>r<100,15>ey<337,15>_<63>m<100,17>iy<337,17>_<63>f<100,18>aa<337,18>_<63>s<100,20>ow<337,20>_<63>l<100,22>aa<337,22>_<63>t<100,24>iy<337,24>_<63>d<100,25>ow<337,25>_<63>r<100,27>ey<337,27>_<63>m<100,29>iy<337,29>_<63>f<100,30>aa<337,30>_<63>s<100,32>ow<337,32>_<63>l<100,34>aa<337,34>_<63>t<100,36>iy<337,36>_<63>d<100,37>ow<337,37>_<63>].

You can paste that into the DECtalk browser window, or run the following from the command line on Linux:

say -pre '[:PHONE ON]' -a '[d<100,1>ow<337,1>_<63>r<100,3>ey<337,3>_<63>m<100,5>iy<337,5>_<63>f<100,6>aa<337,6>_<63>s<100,8>ow<337,8>_<63>l<100,10>aa<337,10>_<63>t<100,12>iy<337,12>_<63>d<100,13>ow<337,13>_<63>r<100,15>ey<337,15>_<63>m<100,17>iy<337,17>_<63>f<100,18>aa<337,18>_<63>s<100,20>ow<337,20>_<63>l<100,22>aa<337,22>_<63>t<100,24>iy<337,24>_<63>d<100,25>ow<337,25>_<63>r<100,27>ey<337,27>_<63>m<100,29>iy<337,29>_<63>f<100,30>aa<337,30>_<63>s<100,32>ow<337,32>_<63>l<100,34>aa<337,34>_<63>t<100,36>iy<337,36>_<63>d<100,37>ow<337,37>_<63>].'

It sounds like this:

Singing a scale is hardly singing a tune, but hey, you were warned that this was a terrible guide at the outset. I hope I’ve given you a start on which you can build your own songs.

(One detail I haven’t tried yet: the older DTC-03 manual hints that singing notes can take Hz values instead of pitch numbers, and apparently loses the vibrato effect. It’s not that hard to convert from a note/octave to a frequency. Whether this still works, I don’t know.)

This post from Patrick Perdue suggested to me I had to dig into the Hz value substitution because the results are so gloriously awful. Of course, I had to write a Perl regex to make the conversions from DECtalk 1–37 sung notes to frequencies from 65–523 Hz:

perl -pwle 's|(?<=,)(\d+)(?=>)|sprintf("%.0f", 440*2**(($1-34)/12))|eg;'

(as one does). So the sung scale ends up as this non-vibrato text:

say -pre '[:PHONE ON]' -a '[d<100,65>ow<337,65>_<63>r<100,73>ey<337,73>_<63>m<100,82>iy<337,82>_<63>f<100,87>aa<337,87>_<63>s<100,98>ow<337,98>_<63>l<100,110>aa<337,110>_<63>t<100,123>iy<337,123>_<63>d<100,131>ow<337,131>_<63>r<100,147>ey<337,147>_<63>m<100,165>iy<337,165>_<63>f<100,175>aa<337,175>_<63>s<100,196>ow<337,196>_<63>l<100,220>aa<337,220>_<63>t<100,247>iy<337,247>_<63>d<100,262>ow<337,262>_<63>r<100,294>ey<337,294>_<63>m<100,330>iy<337,330>_<63>f<100,349>aa<337,349>_<63>s<100,392>ow<337,392>_<63>l<100,440>aa<337,440>_<63>t<100,494>iy<337,494>_<63>d<100,523>ow<337,523>_<63>].'

That doesn’t sound as wondrously terrible as it should, most probably as they are very small differences between each sung word. So how about we try something better? Like the refrain from The Turtles’ Happy Together, as posted on TheFlameOfHope:

say -pre '[:PHONE ON]' -a '[:nv] [:dv gn 73] [AY<400,29> KAE<200,24> N<100> T<100> SIY<400,21> MIY<400,17> LAH<200,15> VAH<125,19> N<75> NOW<400,22> BAH<200,26> DXIY<200,27> BAH<300,26> T<100> YU<600,24> FOR<200,21> AO<300,24> LX<100> MAY<400,26> LAY<900,27> F<300> _<400> WEH<300,29> N<100> YXOR<400,24> NIR<400,21> MIY<400,17> BEY<200,15> BIY<200,19> DHAX<400,22> SKAY<125,26> Z<75> WIH<125,27> LX<75> BIY<400,26> BLUW<600,24> FOR<200,21> AO<300,24> LX<100> MAY<400,26> LAY<900,27> F<300> _<300> ].'

“Refrain” is a good word, as it’s exactly what I should have done, rather than commit a terribleness on the audio by de-vibratoing it:

say -pre '[:PHONE ON]' -a '[:nv] [:dv gn 73] [AY<400,330> KAE<200,247> N<100> T<100> SIY<400,208> MIY<400,165> LAH<200,147> VAH<125,185> N<75> NOW<400,220> BAH<200,277> DXIY<200,294> BAH<300,277> T<100> YU<600,247> FOR<200,208> AO<300,247> LX<100> MAY<400,277> LAY<900,294> F<300> _<400> WEH<300,330> N<100> YXOR<400,247> NIR<400,208> MIY<400,165> BEY<200,147> BIY<200,185> DHAX<400,220> SKAY<125,277> Z<75> WIH<125,294> LX<75> BIY<400,277> BLUW<600,247> FOR<200,208> AO<300,247> LX<100> MAY<400,277> LAY<900,294> F<300> _<300> ].'

Oh dear. You can’t unhear that, can you?

[tɒk bɒks] — a tiny hardware speech synthesizer/TTS

[tÉ’k bÉ’ks]: case
[tÉ’k bÉ’ks]: case
[tÉ’k bÉ’ks]: inside
[tÉ’k bÉ’ks]: inside. The observant amongst you will notice that the speech board is 1/10″ further in than it should be for ideal alignment with the USB serial adapter.
Back in the 1980s, the now-defunct Digital Equipment Corporation (“DEC”) sold a hardware speech synthesizer based on Dennis Klatt’s research at MIT.  These DECTalk boxes were compact and robust, and — despite not having the greatest speech quality — gave valuable speech, telephone and reading accessibility to many people. Stephen Hawking’s distinctive voice is from a pre-DEC version of the MIT hardware.

DEC is long gone, and the licensing of DECTalk has wandered off into mostly software. Much to the annoyance of those in earshot, I’ve always enjoyed dabbling in speech synthesis. DECTalk hardware remains expensive, partly because of demand from electronic music producers (its vocoder-like burr is on countless tracks), but also because there are still many people who rely on it for daily life. I couldn’t justify buying a real DECTalk, but I found this: the Parallax Emic 2 Text-to-Speech Module.  For about $80, this stamp-sized board brings a hardware DECTalk implementation to embedded projects.

The Emic 2 is really marketed to microcontroller hobbyists: Make Your Arduino Speak! sorta thing. But I wanted to make a DECTalk-ish hardware box, with serial input, a speaker, and switchable headphone/line jack.  [tɒk bɒks] (a fair approximation of how I pronounce “Talk Box”) is the result.

Hardware

  • Parallax Emic 2 Text-to-Speech Module
  • OSEPP FTDI USB-Serial Breakout — there are many USB-Serial boards that would do this, but two points in this one’s favour are: i) it has header pins for breadboard use, and ii) I had a spare one.
  • Small 8Ω speaker element — the one I used is most likely a headphone element, bought from Active Surplus (RIP). This should be as small as you can get away with (and still hear) as the USB-Serial connection isn’t designed to supply audio power.
  • Header pins and sockets
  • Toggle switch
  • Small project box with perfboard
  • Jumper wires and solder

Connections

Emic 2             Serial
======             ======
 GND                GND
 5V                 Vcc
 SOUT               RXD
 SIN                TXD

Emic 2             Speaker
======             =======
 SP-                -
 SP+  (via switch)  +

Using it

You’ll need some kind of serial terminal connection. In a pinch, you can use the serial monitor that is in the Arduino development environment. Either way, identify your serial port (/dev/ttyUSBN, COMN:, or /dev/tty-usbserialNNNN) and find a way to send 9600 baud, 8N1 characters to it. Hit Return, and you should be greeted by the Emic 2’s : prompt (or a ?, followed by :). Whether you get the prompt or not depends on whether local echo is set or not. Either way, try sending this line:

SAll watched over by machines of loving grace.

You should hear a voice say the title of Richard Brautigan’s lovely poem All Watched Over by Machines of Loving Grace (caution: video link contains nekkid hippies). You should get the : prompt back once the the speech has stopped. And that’s all there is to it: send an S, followed by up to 1023 bytes of (basically ASCII) text, followed by a newline, and it will be spoken. There’s more detail, of course, in the Emic 2 documentation and the Emic 2 Epson/Fonix DECTalk 501 User’s Guide for changing voices, etc. Yes, you can make it sing. No, you probably shouldn’t, though.

Notes

  1. The Emic 2 has no serial flow control, so you have to wait until the module stops speaking (or you send it the stop command) before you can send more. The easiest way is to poll the serial port and see if there’s the : prompt waiting. Until you see the prompt, any text you send it may be lost.
  2. The Emic 2 is an embedded device; Unicode is a bit of a stretch. It’s supposed to accept ISO Latin-1 8-bit characters (handy for Spanish mode), though.
  3. Starting every speech line with S may make this board incompatible with assistive technology software such as the JAWS screen reader. I don’t think that this was the goal for Emic 2’s designers (Grand Idea Studio), however.
  4. The output from the audio jack has a fair bit of noise on it, and you need to set the volume quite low to avoid hiss and hum. Your experience may be different, as I may have accidentally made a ground loop. There is a faintly  audible click at the start and end of the text, too.
  5. The Emic 2 uses DECTalk v5 commands and phonemes. Many DECTalk resources on the web (like these songs) use v4 or older, which are subtly incompatible. I haven’t found a reliable conversion protocol yet.

To end, here’s the Emic 2’s “Dennis” voice reading all of Brautigan’s All Watched Over By Machines of Loving Grace:


(plain link: molg-dennis-140wpm-16khz.mp3)

(even plainer link if you can’t decode MP2 files: molg-dennis-140wpm.mp3)

(recorded and edited for length with Audacity. No hippies — nekkid, or otherwise — were harmed in the making of this recording.)