Of all the niche blog entries I’ve written, this must be the nichest. I don’t even like the topic I’m writing about. But I’ve worked it out, and there seems to be a shortage of documented solutions.
For the both of you that generate Rich Text Format (RTF) documents by hand, you might be wondering how RTF converts ‘💩’ (that’s code point U+1F4A9) to the seemingly nonsensical \u-10179?\u-9047?. It seems that RTF imposes two encoding limitations on characters: firstly, everything must be in 7-bit ASCII for easy transmission, and secondly, it uses the somewhat old-fashioned UTF-16 representation for non-ASCII characters. Each 16-bit UTF-16 unit is then written out as a signed decimal number, which is why both values come out negative, and the trailing ‘?’ is the fallback character shown by readers that don’t understand \u.
UTF-16 grew out of an early standard, UCS-2, that was all like “Hey, there will never be a Unicode code point above 65535, so we can hard code the characters in two bytes … oh shiiii…”. So not merely does it have to escape emoji code points down to two bytes using a very dank scheme indeed, it then has to further escape everything to ASCII. That’s how your single emoji becomes 17 bytes in an RTF document.
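The arithmetic behind that escape can be walked through one step at a time. This is only an illustrative sketch (the variable names are made up for the example), using 💩 as the test case: subtract 0x10000, split the remaining 20 bits into two 10-bit halves, add the surrogate bases, then wrap each half into a signed 16-bit number.

```perl
#!/usr/bin/env perl
# Worked example: U+1F4A9 to its RTF escape, one step at a time.
use strict;
use warnings;

my $cp     = 0x1F4A9;                       # the emoji's code point
my $offset = $cp - 0x10000;                 # 0xF4A9: 20 bits left to encode
my $high   = 0xD800 + ( $offset >> 10 );    # top 10 bits    -> 0xD83D (55357)
my $low    = 0xDC00 + ( $offset & 0x3FF );  # bottom 10 bits -> 0xDCA9 (56489)

# \u wants a signed 16-bit decimal; surrogates are always above 32767,
# so both halves wrap around to negative numbers.
my ( $s_high, $s_low ) = ( $high - 0x10000, $low - 0x10000 );

my $rtf = sprintf '\\u%d?\\u%d?', $s_high, $s_low;
printf "U+%X -> %04X %04X -> %s (%d bytes)\n",
    $cp, $high, $low, $rtf, length $rtf;
```

Run as is, this should print `U+1F4A9 -> D83D DCA9 -> \u-10179?\u-9047? (17 bytes)`.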
So here’s a tiny subroutine to do the conversion. I wrote it in Perl, but it doesn’t do anything Perl-specific:
#!/usr/bin/env -S perl -CAS
# emoji2rtf - 2017 - scruss
# See UTF-16 decoder for the dank details
# <https://en.wikipedia.org/wiki/UTF-16>
# run with 'perl -CAS ...' or set PERL_UNICODE to 'AS' for UTF-8 argv
# doesn't work from Windows cmd prompt because Windows ¯\_(ツ)_/¯
# https://scruss.com/blog/2017/03/12/in-the-unlikely-event-you-need-to-represent-emoji-in-rtf-using-perl/
use v5.20;
use strict;
use warnings qw( FATAL utf8 );
use utf8;
use open qw( :encoding(UTF-8) :std );
sub emoji2rtf($);
my $c = substr( $ARGV[0], 0, 1 );
say join( "\t⇒ ", $c, sprintf( "U+%X", ord($c) ), emoji2rtf($c) );
exit;
sub emoji2rtf($) {
    my $n = ord( substr( shift, 0, 1 ) );
    die "emoji2rtf: code must be >= 65536; saw $n\n" if ( $n < 0x10000 );
    # split into UTF-16 surrogates, then wrap each to a signed 16-bit value
    return sprintf( "\\u%d?\\u%d?",
        0xd800 + ( ( $n - 0x10000 ) & 0xffc00 ) / 0x400 - 0x10000,
        0xdC00 + ( ( $n - 0x10000 ) & 0x3ff ) - 0x10000 );
}
This takes any emoji fed to it as a command line argument and spits out the RTF code:
📓 ⇒ U+1F4D3 ⇒ \u-10179?\u-9005?
💽 ⇒ U+1F4BD ⇒ \u-10179?\u-9027?
🗽 ⇒ U+1F5FD ⇒ \u-10179?\u-8707?
😱 ⇒ U+1F631 ⇒ \u-10179?\u-8655?
🙌 ⇒ U+1F64C ⇒ \u-10179?\u-8628?
🙟 ⇒ U+1F65F ⇒ \u-10179?\u-8609?
🙯 ⇒ U+1F66F ⇒ \u-10179?\u-8593?
🚥 ⇒ U+1F6A5 ⇒ \u-10179?\u-8539?
🚵 ⇒ U+1F6B5 ⇒ \u-10179?\u-8523?
🛅 ⇒ U+1F6C5 ⇒ \u-10179?\u-8507?
💨 ⇒ U+1F4A8 ⇒ \u-10179?\u-9048?
💩 ⇒ U+1F4A9 ⇒ \u-10179?\u-9047?
💪 ⇒ U+1F4AA ⇒ \u-10179?\u-9046?
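One way to sanity-check output like this is to run it backwards. The following is a minimal sketch of a decoder; `rtf2codepoint` is a hypothetical helper invented for this example, not part of the script above, and it only understands a single pair of `\uN?` escapes.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# rtf2codepoint: hypothetical helper, not part of emoji2rtf.
# Takes a string like '\u-10179?\u-9047?' and returns the code point.
sub rtf2codepoint {
    my ($rtf) = @_;
    my ( $high, $low ) = $rtf =~ /\A\\u(-?\d+)\?\\u(-?\d+)\?\z/
        or die "rtf2codepoint: expected two \\uN? escapes\n";

    # Undo the signed 16-bit wrap-around
    for ( $high, $low ) { $_ += 0x10000 if $_ < 0 }

    die "rtf2codepoint: not a surrogate pair\n"
        unless $high >= 0xD800 && $high <= 0xDBFF
            && $low >= 0xDC00  && $low <= 0xDFFF;

    # Reassemble the two 10-bit halves and add back the 0x10000 offset
    return 0x10000 + ( ( $high - 0xD800 ) << 10 ) + ( $low - 0xDC00 );
}

printf "U+%X\n", rtf2codepoint('\u-10179?\u-9047?');
```

Fed the escape for 💩, this should print `U+1F4A9`, matching the list above.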
Just to show that this encoding scheme really is correct, I made a tiny test RTF file unicode-emoji.rtf that looked like this in Google Docs on my desktop:
It looks a bit better on my phone, but there are still a couple of glyphs that won’t render:
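A test file along those lines can be generated rather than typed by hand. The contents of the original unicode-emoji.rtf aren’t reproduced here, so everything in this sketch is an assumption: the `rtf_escape` helper is hypothetical, and the header is just a minimal one with a single font (whether that font actually has the glyphs is up to the viewer). It extends the same surrogate arithmetic to whole strings, passing ASCII through and writing other BMP characters as a single `\uN?`.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# rtf_escape: hypothetical helper that escapes a whole string for RTF.
# Newlines are not handled; a real document would turn them into \par.
sub rtf_escape {
    my ($text) = @_;
    my $out = '';
    for my $ch ( split //, $text ) {
        my $n = ord $ch;
        if ( $n < 0x80 ) {
            # ASCII passes through, except RTF's own special characters
            $out .= $ch =~ /[\\{}]/ ? "\\$ch" : $ch;
        }
        elsif ( $n < 0x10000 ) {
            # one UTF-16 unit, written as a signed 16-bit decimal
            $out .= sprintf '\\u%d?', $n > 32767 ? $n - 65536 : $n;
        }
        else {
            # surrogate pair, same arithmetic as emoji2rtf
            my $v = $n - 0x10000;
            $out .= sprintf '\\u%d?\\u%d?',
                0xD800 + ( $v >> 10 ) - 0x10000,
                0xDC00 + ( $v & 0x3FF ) - 0x10000;
        }
    }
    return $out;
}

my $body = rtf_escape("Test: \x{1F4A9} and \x{263A}");

# Assumed minimal header; \uc1 says one fallback character follows each \u
my $doc = "{\\rtf1\\ansi\\deff0\\uc1{\\fonttbl{\\f0 Arial;}}\\f0 $body\\par}\n";
print $doc;
```

The output is pure ASCII, so it can be redirected straight into a .rtf file without worrying about the terminal’s encoding.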
Update, 2020-07: something has changed in the Unicode handling, so I’ve modified the code to expect arguments and stdio in UTF-8. Thanks to Piyush Jain for noticing this little piece of bitrot.
Further update: Windows command prompt does bad things to arguments in Unicode, so this script won’t work there. Strawberry Perl gives me:
perl -CAS .\emoji2rtf.pl ☺
emoji2rtf: code must be >= 65536; saw 63
I have no interest in finding out why.
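For what it’s worth, 63 is the code for ‘?’, which hints that the character had already been swapped for a question mark before Perl ever saw it. A possible way around that, offered as an untested-on-Windows assumption rather than a fix, is to pass the code point as plain ASCII instead of the emoji itself. `cp2rtf` below is a hypothetical variant written for this example.

```perl
#!/usr/bin/env perl
# cp2rtf - hypothetical variant: takes 'U+1F4A9' or '1F4A9', not the emoji
use strict;
use warnings;

sub cp2rtf {
    my ($arg) = @_;

    # Accept an optional U+ or 0x prefix, then five or six hex digits
    my ($hex) = $arg =~ /\A(?:U\+|0x)?([0-9A-F]{5,6})\z/i
        or die "cp2rtf: expected a code point like U+1F4A9\n";
    my $n = hex $hex;
    die "cp2rtf: code must be in 0x10000..0x10FFFF\n"
        if $n < 0x10000 || $n > 0x10FFFF;

    # Same surrogate arithmetic as emoji2rtf
    my $v = $n - 0x10000;
    return sprintf '\\u%d?\\u%d?',
        0xD800 + ( $v >> 10 ) - 0x10000,
        0xDC00 + ( $v & 0x3FF ) - 0x10000;
}

print cp2rtf($_), "\n" for @ARGV;
```

Called as `perl cp2rtf.pl U+1F4A9`, it should print `\u-10179?\u-9047?`, and since the argument is ASCII there is nothing for the command prompt to mangle.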