In the unlikely event you need to represent Emoji in RTF using Perl …

Of all the niche blog entries I’ve written, this must be the nichest. I don’t even like the topic I’m writing about. But I’ve worked it out, and there seems to be a shortage of documented solutions.

For the both of you that generate Rich Text Format (RTF) documents by hand, you might be wondering how RTF converts ‘💩’ (that’s code point U+1F4A9) to the seemingly nonsensical \u-10179?\u-9047?. It seems that RTF imposes two encoding limitations on characters: firstly, everything must be in 7-bit ASCII for easy transmission, and secondly, it uses the somewhat old-fashioned UTF-16 representation for non-ASCII characters, with each 16-bit value written as a signed decimal number (which is where the minus signs come from) and followed by a fallback character, here ‘?’, for readers that can’t handle \u.

UTF-16 grew out of an early standard, UCS-2, that was all like “Hey, there will never be a Unicode code point above 65535, so we can hard code the characters in two bytes … oh shiiii…”. So not merely does it have to escape each emoji code point down to a pair of two-byte values (a surrogate pair) using a very dank scheme indeed, it then has to further escape everything to ASCII. That’s how your single emoji becomes 17 bytes in an RTF document.
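To make the dank scheme a bit less mysterious, here’s a sketch that walks U+1F4A9 through it one step at a time. The variable names are mine, purely for illustration; the arithmetic is the standard UTF-16 surrogate calculation, followed by the wraparound to signed 16-bit numbers that RTF wants.

```perl
#!/usr/bin/env perl
# Worked example: U+1F4A9 to an RTF escape, one step at a time.
# Variable names are illustrative only.
use strict;
use warnings;

my $cp = 0x1F4A9;    # the code point we want to escape

# Step 1: subtract 0x10000, leaving a 20-bit number.
my $offset = $cp - 0x10000;

# Step 2: the top 10 bits go in the high surrogate,
# the bottom 10 bits in the low surrogate.
my $high = 0xD800 + ( $offset >> 10 );
my $low  = 0xDC00 + ( $offset & 0x3FF );

# Step 3: \u takes a signed 16-bit decimal, so anything
# above 32767 wraps round to a negative number.
my @signed = map { $_ > 0x7FFF ? $_ - 0x10000 : $_ } ( $high, $low );

# Step 4: write each value as \uN followed by the '?' fallback.
my $rtf = join '', map { "\\u$_?" } @signed;

printf "offset: 0x%X\n", $offset;                        # 0xF4A9
printf "pair:   0x%X 0x%X\n", $high, $low;               # 0xD83D 0xDCA9
printf "rtf:    %s (%d bytes)\n", $rtf, length $rtf;     # 17 bytes
```

The last line prints \u-10179?\u-9047? and confirms the 17-byte count.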

So here’s a tiny subroutine to do the conversion. I wrote it in Perl, but it doesn’t do anything Perl-specific:

#!/usr/bin/env -S perl -CAS
# emoji2rtf - 2017 - scruss
# See UTF-16 decoder for the dank details
#  <https://en.wikipedia.org/wiki/UTF-16>
# run with 'perl -CAS ...' or set PERL_UNICODE to 'AS' for UTF-8 argv
# doesn't work from Windows cmd prompt because Windows ¯\_(ツ)_/¯
# https://scruss.com/blog/2017/03/12/in-the-unlikely-event-you-need-to-represent-emoji-in-rtf-using-perl/

use v5.20;
use strict;
use warnings qw( FATAL utf8 );
use utf8;
use open qw( :encoding(UTF-8) :std );
sub emoji2rtf($);

my $c = substr( $ARGV[0], 0, 1 );
say join( "\t⇒ ", $c, sprintf( "U+%X", ord($c) ), emoji2rtf($c) );
exit;

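# Convert the first character of the argument to an RTF \u escape pair.
# RTF wants each UTF-16 surrogate as a signed 16-bit decimal, hence
# the trailing "- 0x10000" on both halves.
sub emoji2rtf($) {
    my $n = ord( substr( shift, 0, 1 ) );
    die "emoji2rtf: code must be >= 65536; saw $n\n" if ( $n < 0x10000 );
    return sprintf( "\\u%d?\\u%d?",
        0xd800 + ( ( $n - 0x10000 ) & 0xffc00 ) / 0x400 - 0x10000,
        0xdc00 + ( ( $n - 0x10000 ) & 0x3ff ) - 0x10000 );
}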

This will take any emoji fed to it as a command line argument and spit out the RTF code:

📓	⇒ U+1F4D3	⇒ \u-10179?\u-9005?
💽	⇒ U+1F4BD	⇒ \u-10179?\u-9027?
🗽	⇒ U+1F5FD	⇒ \u-10179?\u-8707?
😱	⇒ U+1F631	⇒ \u-10179?\u-8655?
🙌	⇒ U+1F64C	⇒ \u-10179?\u-8628?
🙟	⇒ U+1F65F	⇒ \u-10179?\u-8609?
🙯	⇒ U+1F66F	⇒ \u-10179?\u-8593?
🚥	⇒ U+1F6A5	⇒ \u-10179?\u-8539?
🚵	⇒ U+1F6B5	⇒ \u-10179?\u-8523?
🛅	⇒ U+1F6C5	⇒ \u-10179?\u-8507?
💨	⇒ U+1F4A8	⇒ \u-10179?\u-9048?
💩	⇒ U+1F4A9	⇒ \u-10179?\u-9047?
💪	⇒ U+1F4AA	⇒ \u-10179?\u-9046?
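One way to sanity-check the rows above is to run the arithmetic backwards. This is a rough sketch only: rtf2cp is a hypothetical helper, not part of emoji2rtf, and it assumes the input is exactly one \uN?\uN? pair with ‘?’ as the fallback character.

```perl
#!/usr/bin/env perl
# rtf2cp - hypothetical helper, not part of emoji2rtf.
# Turns one \uN?\uN? pair back into the code point it stands for.
use strict;
use warnings;

sub rtf2cp {
    my ($rtf) = @_;

    # Pull out the two signed decimal numbers.
    my ( $high, $low ) = $rtf =~ /^\\u(-?\d+)\?\\u(-?\d+)\?$/
        or die "rtf2cp: expected \\uN?\\uN?, saw $rtf\n";

    # Undo the signed 16-bit wraparound.
    ( $high, $low ) = map { $_ < 0 ? $_ + 0x10000 : $_ } ( $high, $low );

    # Make sure this really is a high surrogate followed by a low one.
    die "rtf2cp: not a surrogate pair\n"
        unless $high >= 0xD800 && $high <= 0xDBFF
        && $low >= 0xDC00 && $low <= 0xDFFF;

    # Glue the two 10-bit halves back together and restore the offset.
    return 0x10000 + ( ( $high - 0xD800 ) << 10 ) + ( $low - 0xDC00 );
}

printf "U+%X\n", rtf2cp('\u-10179?\u-9047?');    # prints U+1F4A9
```

A real RTF parser would also have to cope with single \u escapes for characters below U+10000, other fallback characters, and the \uc setting, none of which this attempts.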

Just to show that this encoding scheme really is correct, I made a tiny test RTF file unicode-emoji.rtf that looked like this in Google Docs on my desktop:

[screenshot: unicode-emoji.rtf in Google Docs on the desktop]

It looks a bit better on my phone, but there are still a couple of glyphs that won’t render:

[screenshot: unicode-emoji.rtf on the phone]
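The test file itself isn’t reproduced in this post. For anyone wanting to build something similar, here’s a hedged sketch of how a minimal one might be generated. Everything in it is my assumption rather than the contents of unicode-emoji.rtf: the str2rtf helper is hypothetical, the header is about the smallest one I’d expect most readers to accept, and I haven’t tried the result in Google Docs. It leans on the core Encode module to do the UTF-16 work for whole strings, so characters below U+10000 come out as a single \u escape.

```perl
#!/usr/bin/env perl
# mkrtf - hypothetical sketch; not the script that made unicode-emoji.rtf
use strict;
use warnings;
use utf8;
use Encode qw( encode );

# Escape a whole string for RTF: ASCII passes through (with \, { and }
# backslash-escaped); everything else becomes one \uN? per UTF-16 unit.
sub str2rtf {
    my ($s) = @_;
    my $out = '';
    for my $ch ( split //, $s ) {
        if ( ord($ch) < 0x80 ) {
            $ch =~ s/([\\{}])/\\$1/;
            $out .= $ch;
        }
        else {
            # 's>*' unpacks big-endian signed 16-bit values,
            # which is the form \u expects
            $out .= join '', map { "\\u$_?" }
                unpack( 's>*', encode( 'UTF-16BE', $ch ) );
        }
    }
    return $out;
}

my $body = str2rtf("RTF {poo}: 💩 ☺");

# \uc1 says each \u escape is followed by one fallback character
print "{\\rtf1\\ansi\\uc1 $body}\n";
```

The output is pure ASCII, so it can be redirected straight into a .rtf file without worrying about the terminal’s encoding.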


Update, 2020-07: something has changed in the Unicode handling, so I’ve modified the code to expect arguments and stdio in UTF-8. Thanks to Piyush Jain for noticing this little piece of bitrot.

Further update: Windows command prompt does bad things to arguments in Unicode, so this script won’t work there. Strawberry Perl gives me:

perl -CAS .\emoji2rtf.pl ☺
emoji2rtf: code must be >= 65536; saw 63

I have no interest in finding out why.

4 comments

  1. I’m not a star in Perl and I need to decode RTF to unicode… so the big question. How do I do this in C#?

    I need to convert this –> \u-10180 ?\u-8311 ? to this –> 🎉

  2. I’m writing an RTF parser in Python and this saved me! I could not figure out what the hell was going on when encoding chars above 0xffff. Thank you for deciding to write this very niche post!

  3. I’m immensely glad that someone finally found this useful.

    Sorry about the messed-up encoding. Somewhat ironic, no?
