In the unlikely event you need to represent Emoji in RTF using Perl …

Of all the niche blog entries I’ve written, this must be the nichest. I don’t even like the topic I’m writing about. But I’ve worked it out, and there seems to be a shortage of documented solutions.

For the both of you that generate Rich Text Format (RTF) documents by hand, you might be wondering how RTF converts ‘💩’ (that’s code point U+1F4A9) to the seemingly nonsensical \u-10179?\u-9047?. It seems that RTF imposes two encoding limitations on characters: firstly, everything must be in 7-bit ASCII for easy transmission, and secondly, it uses the somewhat old-fashioned UTF-16 representation for non-ASCII characters.

UTF-16 grew out of an early standard, UCS-2, that was all like “Hey, there will never be a Unicode cope point above 65536, so we can hard code the characters in two bytes … oh shiiii…”. So not merely does it have to escape emoji code points down to two bytes using a very dank scheme indeed, it then has to further escape everything to ASCII. That’s how your single enoji becomes 17 bytes in an RTF document.

So here’s a tiny subroutine to do the conversion. I wrote it in Perl, but it doesn’t do anything Perl-specific:

#!/usr/bin/env perl
# emoji2rtf - 2017 - scruss
# See UTF-16 decoder for the dank details
#  <>

use v5.20;
use strict;
use warnings;
use utf8;
sub emoji2rtf($);

my $c = substr( $ARGV[0], 0, 1 );
say join( "\t⇒ ", $c, sprintf( "U+%X", ord($c) ), emoji2rtf($c) );

sub emoji2rtf($) {
    my $n = ord( substr( shift, 0, 1 ) );
    die "emoji2rtf: code must be >= 65536\n" if ( $n < 0x10000 );
    return sprintf( "\\u%d?\\u%d?",
        0xd800 + ( ( $n - 0x10000 ) & 0xffc00 ) / 0x400 - 0x10000,
        0xdC00 + ( ( $n - 0x10000 ) & 0x3ff ) - 0x10000 );

This will take any emoji fed to it and spit out the RTF code:

📓	⇒ U+1F4D3	⇒ \u-10179?\u-9005?
💽	⇒ U+1F4BD	⇒ \u-10179?\u-9027?
🗽	⇒ U+1F5FD	⇒ \u-10179?\u-8707?
😱	⇒ U+1F631	⇒ \u-10179?\u-8655?
🙌	⇒ U+1F64C	⇒ \u-10179?\u-8628?
🙟	⇒ U+1F65F	⇒ \u-10179?\u-8609?
🙯	⇒ U+1F66F	⇒ \u-10179?\u-8593?
🚥	⇒ U+1F6A5	⇒ \u-10179?\u-8539?
🚵	⇒ U+1F6B5	⇒ \u-10179?\u-8523?
🛅	⇒ U+1F6C5	⇒ \u-10179?\u-8507?
💨	⇒ U+1F4A8	⇒ \u-10179?\u-9048?
💩	⇒ U+1F4A9	⇒ \u-10179?\u-9047?
💪	⇒ U+1F4AA	⇒ \u-10179?\u-9046?

Just to show that this encoding scheme really is correct, I made a tiny test RTF file unicode-emoji.rtf that looked like this in Google Docs on my desktop:

It looks a bit better on my phone, but there are still a couple of glyphs that won’t render:

Leave a Reply

Your email address will not be published. Required fields are marked *