In the unlikely event you need to represent Emoji in RTF using Perl …

Of all the niche blog entries I’ve written, this must be the nichest. I don’t even like the topic I’m writing about. But I’ve worked it out, and there seems to be a shortage of documented solutions.

For both of you who generate Rich Text Format (RTF) documents by hand: you might be wondering how RTF turns ‘💩’ (that’s code point U+1F4A9) into the seemingly nonsensical \u-10179?\u-9047?. It seems that RTF imposes two encoding limitations on characters: firstly, everything must be in 7-bit ASCII for easy transmission, and secondly, it uses the somewhat old-fashioned UTF-16 representation for non-ASCII characters. Each UTF-16 unit is written after \u as a signed 16-bit decimal, which is why the numbers come out negative, and the ? is a fallback character for readers that can’t handle \u.

UTF-16 grew out of an early standard, UCS-2, that was all like “Hey, there will never be a Unicode code point above 65535, so we can hard code the characters in two bytes … oh shiiii…”. So not only does it have to squeeze emoji code points into pairs of two-byte units using a very dank scheme indeed (surrogate pairs), it then has to further escape everything to ASCII. That’s how your single emoji becomes 17 bytes in an RTF document.
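The arithmetic can be traced step by step. What follows is only a sketch that walks U+1F4A9 through the surrogate split and the signed 16-bit wrap; the variable names are mine and purely illustrative.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Trace U+1F4A9 through each step of the RTF escaping.
my $cp = 0x1F4A9;

# 1. Subtract 0x10000 to get a 20-bit offset above the two-byte range.
my $offset = $cp - 0x10000;

# 2. Split the offset into high and low 10-bit halves and add the
#    UTF-16 surrogate bases 0xD800 and 0xDC00.
my $high = 0xD800 + ( $offset >> 10 );
my $low  = 0xDC00 + ( $offset & 0x3FF );

# 3. \u takes a signed 16-bit decimal, so anything above 32767
#    wraps round to a negative number.
my @signed = map { $_ > 32767 ? $_ - 65536 : $_ } ( $high, $low );

# 4. Each \uN is followed by a fallback character, '?' here.
my $rtf = join '', map { "\\u$_?" } @signed;

printf "offset  0x%X\n",           $offset;
printf "high    0x%X (%d)\n",      $high, $signed[0];
printf "low     0x%X (%d)\n",      $low,  $signed[1];
printf "rtf     %s (%d bytes)\n",  $rtf,  length $rtf;
```

Run as is, it reports an offset of 0xF4A9, surrogates 0xD83D and 0xDCA9, and the 17-byte string \u-10179?\u-9047?.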

So here’s a tiny subroutine to do the conversion. I wrote it in Perl, but it doesn’t do anything Perl-specific:

#!/usr/bin/env -S perl -CAS
# emoji2rtf - 2017 - scruss
# See UTF-16 decoder for the dank details
#  <https://en.wikipedia.org/wiki/UTF-16>
# run with 'perl -CAS ...' or set PERL_UNICODE to 'AS' for UTF-8 argv
# doesn't work from Windows cmd prompt because Windows ¯\_(ツ)_/¯
# https://scruss.com/blog/2017/03/12/in-the-unlikely-event-you-need-to-represent-emoji-in-rtf-using-perl/

use v5.20;
use strict;
use warnings qw( FATAL utf8 );
use utf8;
use open qw( :encoding(UTF-8) :std );
sub emoji2rtf($);

my $c = substr( $ARGV[0], 0, 1 );
say join( "\t⇒ ", $c, sprintf( "U+%X", ord($c) ), emoji2rtf($c) );
exit;

sub emoji2rtf($) {
    my $n = ord( substr( shift, 0, 1 ) );
    die "emoji2rtf: code must be >= 65536; saw $n\n" if ( $n < 0x10000 );
    return sprintf( "\\u%d?\\u%d?",
        0xd800 + ( ( $n - 0x10000 ) & 0xffc00 ) / 0x400 - 0x10000,
        0xdC00 + ( ( $n - 0x10000 ) & 0x3ff ) - 0x10000 );
}

This takes any emoji fed to it as a command line argument and spits out the RTF code:

📓	⇒ U+1F4D3	⇒ \u-10179?\u-9005?
💽	⇒ U+1F4BD	⇒ \u-10179?\u-9027?
🗽	⇒ U+1F5FD	⇒ \u-10179?\u-8707?
😱	⇒ U+1F631	⇒ \u-10179?\u-8655?
🙌	⇒ U+1F64C	⇒ \u-10179?\u-8628?
🙟	⇒ U+1F65F	⇒ \u-10179?\u-8609?
🙯	⇒ U+1F66F	⇒ \u-10179?\u-8593?
🚥	⇒ U+1F6A5	⇒ \u-10179?\u-8539?
🚵	⇒ U+1F6B5	⇒ \u-10179?\u-8523?
🛅	⇒ U+1F6C5	⇒ \u-10179?\u-8507?
💨	⇒ U+1F4A8	⇒ \u-10179?\u-9048?
💩	⇒ U+1F4A9	⇒ \u-10179?\u-9047?
💪	⇒ U+1F4AA	⇒ \u-10179?\u-9046?
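One way to sanity-check rows like these is to run the arithmetic backwards. The decoder below is a rough sketch, not part of emoji2rtf: rtf2codepoint is a made-up name, and it only understands a single surrogate pair written as \uN?\uN? (with an optional space before each ?), not RTF in general.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# rtf2codepoint: hypothetical helper, the reverse of emoji2rtf.
# Accepts exactly one surrogate pair in \uN?\uN? form.
sub rtf2codepoint {
    my ($rtf) = @_;
    my ( $high, $low ) = $rtf =~ /^\\u(-?\d+) ?\?\\u(-?\d+) ?\?$/
      or die "rtf2codepoint: expected \\uN?\\uN?, got '$rtf'\n";

    # Undo the signed 16-bit wrap-around.
    $high += 65536 if $high < 0;
    $low  += 65536 if $low < 0;

    # Refuse anything that is not a high surrogate then a low surrogate.
    die "rtf2codepoint: not a surrogate pair\n"
      unless $high >= 0xD800 && $high <= 0xDBFF
      && $low >= 0xDC00 && $low <= 0xDFFF;

    # Rebuild the 20-bit offset and add back 0x10000.
    return 0x10000 + ( ( $high - 0xD800 ) << 10 ) + ( $low - 0xDC00 );
}

# Three rows from the table above.
for my $rtf ( '\u-10179?\u-9005?', '\u-10179?\u-9047?', '\u-10179?\u-8507?' ) {
    printf "%s\t=> U+%X\n", $rtf, rtf2codepoint($rtf);
}
```

The three rows come back as U+1F4D3, U+1F4A9 and U+1F6C5, matching the table.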

Just to show that this encoding scheme really is correct, I made a tiny test RTF file, unicode-emoji.rtf, and opened it in Google Docs on my desktop.

It looks a bit better on my phone, but there are still a couple of glyphs that won’t render.
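For anyone wanting to repeat the experiment, a test file along these lines can be generated rather than typed. This is a guess at a workable minimal document, not the contents of the original unicode-emoji.rtf: rtf_escape and rtf_document are hypothetical helpers, the header and the Arial font entry are assumptions, and whether a given reader finds a glyph for each emoji is out of the script's hands.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# rtf_escape: hypothetical helper that escapes a whole string,
# not just a single emoji.
sub rtf_escape {
    my ($text) = @_;
    my $out = '';
    for my $n ( map { ord($_) } split //, $text ) {
        if ( $n == 0x5C || $n == 0x7B || $n == 0x7D ) {
            $out .= '\\' . chr($n);    # \, { and } are RTF syntax
        }
        elsif ( $n < 0x80 ) {
            $out .= chr($n);           # plain ASCII passes through
        }
        else {
            # One UTF-16 unit below 0x10000, a surrogate pair above.
            my @units =
              $n < 0x10000
              ? ($n)
              : ( 0xD800 + ( ( $n - 0x10000 ) >> 10 ),
                  0xDC00 + ( ( $n - 0x10000 ) & 0x3FF ) );

            # \u wants signed 16-bit decimals plus a fallback character.
            $out .= sprintf( '\\u%d?', $_ > 32767 ? $_ - 65536 : $_ )
              for @units;
        }
    }
    return $out;
}

# rtf_document: wrap escaped text in an assumed minimal RTF header.
sub rtf_document {
    my ($text) = @_;
    return '{\rtf1\ansi\deff0{\fonttbl{\f0 Arial;}}\f0 '
      . rtf_escape($text) . '\par}' . "\n";
}

# Built with chr() so this source file stays pure ASCII.
my $text = 'Emoji: ' . join ' ', map { chr($_) } 0x1F4A8, 0x1F4A9, 0x1F4AA;
print rtf_document($text);
```

Redirect the output to a file ending in .rtf and open it in whatever reader is under test. The output is pure ASCII, so no -CAS or encoding layers are needed to write it.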

Update, 2020-07: something has changed in the Unicode handling, so I’ve modified the code to expect arguments and stdio in UTF-8. Thanks to Piyush Jain for noticing this little piece of bitrot.

Further update: Windows command prompt does bad things to arguments in Unicode, so this script won’t work. Strawberry Perl gives me:

perl -CAS .\emoji2rtf.pl ☺
emoji2rtf: code must be >= 65536; saw 63

I have no interest in finding out why.
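A way round this, without explaining it, might be to keep the argument in plain ASCII and pass the code point as hex. The sketch below assumes the breakage is confined to non-ASCII arguments, and it has not been tried on Windows; codepoint2rtf is a made-up name for a variant of emoji2rtf.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# codepoint2rtf: hypothetical variant of emoji2rtf that takes the code
# point as ASCII text (U+1F4A9, 0x1F4A9 or 1F4A9) instead of the
# character itself, so the shell never handles a non-ASCII argument.
sub codepoint2rtf {
    my ($arg) = @_;
    my ($hex) = $arg =~ /^(?:U\+|0x)?([0-9A-F]{1,6})$/i
      or die "codepoint2rtf: can't parse '$arg' as a code point\n";
    my $n = hex $hex;
    die sprintf( "codepoint2rtf: code must be 0x10000..0x10FFFF; saw 0x%X\n", $n )
      if $n < 0x10000 || $n > 0x10FFFF;

    # Same surrogate arithmetic as emoji2rtf, using shifts.
    my $offset = $n - 0x10000;
    return sprintf( '\\u%d?\\u%d?',
        0xD800 + ( $offset >> 10 ) - 0x10000,
        0xDC00 + ( $offset & 0x3FF ) - 0x10000 );
}

# Convert every argument given on the command line.
print codepoint2rtf($_), "\n" for @ARGV;
```

Saved as, say, codepoint2rtf.pl, running `perl codepoint2rtf.pl U+1F4A9` prints \u-10179?\u-9047?.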

4 comments

Kees van Spelde says:

02018-08-03 at 13:00

I’m not a star in Perl and I need to decode RTF to unicode… so the big question. How do I do this in C#?

I need to convert this –> \u-10180 ?\u-8311 ? to this –> 🎉
scruss says:

02018-08-05 at 19:38

I have no idea.
Max K says:

02023-08-01 at 22:17

I’m writing an RTF parser in Python and this saved me! I could not figure out what the hell was going on when encoding chars above 0xffff. Thank you for deciding to write this very niche post!
scruss says:

02023-08-02 at 17:53

I’m immensely glad that someone finally found this useful.

Sorry about the messed-up encoding. Somewhat ironic, no?
