In the unlikely event you need to represent Emoji in RTF using Perl …

Of all the niche blog entries I’ve written, this must be the nichest. I don’t even like the topic I’m writing about. But I’ve worked it out, and there seems to be a shortage of documented solutions.

For the both of you that generate Rich Text Format (RTF) documents by hand, you might be wondering how RTF converts ‘💩’ (that’s code point U+1F4A9) to the seemingly nonsensical \u-10179?\u-9047?. It seems that RTF imposes two encoding limitations on characters: firstly, everything must be in 7-bit ASCII for easy transmission, and secondly, it uses the somewhat old-fashioned UTF-16 representation for non-ASCII characters.

UTF-16 grew out of an early standard, UCS-2, that was all like “Hey, there will never be a Unicode cope point above 65536, so we can hard code the characters in two bytes … oh shiiii…”. So not merely does it have to escape emoji code points down to two bytes using a very dank scheme indeed, it then has to further escape everything to ASCII. That’s how your single enoji becomes 17 bytes in an RTF document.

So here’s a tiny subroutine to do the conversion. I wrote it in Perl, but it doesn’t do anything Perl-specific:

#!/usr/bin/env perl
# emoji2rtf - 2017 - scruss
# See UTF-16 decoder for the dank details
#  <https://en.wikipedia.org/wiki/UTF-16#U.2B10000_to_U.2B10FFFF>

use v5.20;
use strict;
use warnings;
use utf8;
sub emoji2rtf($);

my $c = substr( $ARGV[0], 0, 1 );
say join( "\t⇒ ", $c, sprintf( "U+%X", ord($c) ), emoji2rtf($c) );
exit;

sub emoji2rtf($) {
    my $n = ord( substr( shift, 0, 1 ) );
    die "emoji2rtf: code must be >= 65536\n" if ( $n < 0x10000 );
    return sprintf( "\\u%d?\\u%d?",
        0xd800 + ( ( $n - 0x10000 ) & 0xffc00 ) / 0x400 - 0x10000,
        0xdC00 + ( ( $n - 0x10000 ) & 0x3ff ) - 0x10000 );
}

This will take any emoji fed to it and spit out the RTF code:

📓	⇒ U+1F4D3	⇒ \u-10179?\u-9005?
💽	⇒ U+1F4BD	⇒ \u-10179?\u-9027?
🗽	⇒ U+1F5FD	⇒ \u-10179?\u-8707?
😱	⇒ U+1F631	⇒ \u-10179?\u-8655?
🙌	⇒ U+1F64C	⇒ \u-10179?\u-8628?
🙟	⇒ U+1F65F	⇒ \u-10179?\u-8609?
🙯	⇒ U+1F66F	⇒ \u-10179?\u-8593?
🚥	⇒ U+1F6A5	⇒ \u-10179?\u-8539?
🚵	⇒ U+1F6B5	⇒ \u-10179?\u-8523?
🛅	⇒ U+1F6C5	⇒ \u-10179?\u-8507?
💨	⇒ U+1F4A8	⇒ \u-10179?\u-9048?
💩	⇒ U+1F4A9	⇒ \u-10179?\u-9047?
💪	⇒ U+1F4AA	⇒ \u-10179?\u-9046?

Just to show that this encoding scheme really is correct, I made a tiny test RTF file unicode-emoji.rtf that looked like this in Google Docs on my desktop:

It looks a bit better on my phone, but there are still a couple of glyphs that won’t render:

The Gadsden Flag Nouveau

The Gadsden Flag Nouveau

Instagram filter used: Normal

View in Instagram ⇒

I made this after being inspired by this MetaFilter comment. There’s a PDF linked from the image below:

gadsden_nouveau

(Not Just) Firefox’s “Pile of Poo” Easter Egg: 💩

For a reason best known to the Unicode consortium, there is now the symbol U+1F4A9 “Pile of Poo”: 💩. If you happen to create a web page with this delightful character in the title, Firefox does something special:

Yep, that’s a smiley face poo, a bit like Mr Hankey. Oh dear.

Actually, it seems it might be an OS X Emoji thing, because Safari renders it in the title like that, and in the text as (enlarged to show texture):

iOS has it covered too:

Blackberry’s browser just shows a small black square. Android, rather sensibly, shows an empty square. It must be an Apple thing.

“Thanks” go to tchrist‘s comment in unicode – Why does modern Perl avoid UTF-8 by default? for alerting me to this character, and letting us know about the Symbola font that supports it. Yeah, cheers Tom …