Tag: rtf

  • In the unlikely event you need to represent Emoji in RTF using Perl …

    Of all the niche blog entries I’ve written, this must be the nichest. I don’t even like the topic I’m writing about. But I’ve worked it out, and there seems to be a shortage of documented solutions.

    For the both of you that generate Rich Text Format (RTF) documents by hand, you might be wondering how RTF converts ‘💩’ (that’s code point U+1F4A9) to the seemingly nonsensical \u-10179?\u-9047?. It seems that RTF imposes two encoding limitations on characters: firstly, everything must be in 7-bit ASCII for easy transmission, and secondly, it uses the somewhat old-fashioned UTF-16 representation for non-ASCII characters.

    UTF-16 grew out of an early standard, UCS-2, that was all like “Hey, there will never be a Unicode code point above 65535, so we can hard code the characters in two bytes … oh shiiii…”. So not merely does it have to escape emoji code points down to a pair of two-byte surrogates using a very dank scheme indeed, it then has to further escape everything to ASCII: each surrogate is written as a signed 16-bit decimal (which is why the numbers come out negative), followed by a ‘?’ as the fallback for readers that don’t understand Unicode. That’s how your single emoji becomes 17 bytes in an RTF document.
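
    The arithmetic for 💩 can be sketched one step at a time. This is only an illustration of the scheme described above, not the script itself; the variable names are made up for the example:

    ```perl
    # Worked example: the surrogate-pair arithmetic for U+1F4A9.
    # (Illustrative sketch; variable names are hypothetical.)
    use strict;
    use warnings;

    my $cp     = 0x1F4A9;                       # the emoji's code point
    my $offset = $cp - 0x10000;                 # 20-bit value, 0x0F4A9
    my $high   = 0xD800 + ( $offset >> 10 );    # top 10 bits    -> 0xD83D
    my $low    = 0xDC00 + ( $offset & 0x3FF );  # bottom 10 bits -> 0xDCA9

    # RTF's \u takes a signed 16-bit decimal, so anything above
    # 32767 is written as (value - 65536).
    my @signed = map { $_ > 0x7FFF ? $_ - 0x10000 : $_ } ( $high, $low );

    # Each escape gets a trailing '?' as the ASCII fallback character.
    my $rtf = join '', map { "\\u$_?" } @signed;

    printf "U+%X -> %04X %04X -> %s (%d bytes)\n",
        $cp, $high, $low, $rtf, length $rtf;
    ```

    Run as is, this prints `U+1F4A9 -> D83D DCA9 -> \u-10179?\u-9047? (17 bytes)`.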

    So here’s a tiny subroutine to do the conversion. I wrote it in Perl, but it doesn’t do anything Perl-specific:

    #!/usr/bin/env -S perl -CAS
    # emoji2rtf - 2017 - scruss
    # See UTF-16 decoder for the dank details
    #  <https://en.wikipedia.org/wiki/UTF-16>
    # run with 'perl -CAS ...' or set PERL_UNICODE to 'AS' for UTF-8 argv
    # doesn't work from Windows cmd prompt because Windows ¯\_(ツ)_/¯
    # https://scruss.com/blog/2017/03/12/in-the-unlikely-event-you-need-to-represent-emoji-in-rtf-using-perl/
    
    use v5.20;
    use strict;
    use warnings qw( FATAL utf8 );
    use utf8;
    use open qw( :encoding(UTF-8) :std );
    sub emoji2rtf($);
    
    my $c = substr( $ARGV[0], 0, 1 );
    say join( "\t⇒ ", $c, sprintf( "U+%X", ord($c) ), emoji2rtf($c) );
    exit;
    
    sub emoji2rtf($) {
        my $n = ord( substr( shift, 0, 1 ) );
        die "emoji2rtf: code must be >= 65536; saw $n\n" if ( $n < 0x10000 );
        return sprintf( "\\u%d?\\u%d?",
            0xd800 + ( ( $n - 0x10000 ) & 0xffc00 ) / 0x400 - 0x10000,
            0xdC00 + ( ( $n - 0x10000 ) & 0x3ff ) - 0x10000 );
    }
    
    

    This takes any emoji fed to it as a command line argument and spits out the RTF code:

    📓	⇒ U+1F4D3	⇒ \u-10179?\u-9005?
    💽	⇒ U+1F4BD	⇒ \u-10179?\u-9027?
    🗽	⇒ U+1F5FD	⇒ \u-10179?\u-8707?
    😱	⇒ U+1F631	⇒ \u-10179?\u-8655?
    🙌	⇒ U+1F64C	⇒ \u-10179?\u-8628?
    🙟	⇒ U+1F65F	⇒ \u-10179?\u-8609?
    🙯	⇒ U+1F66F	⇒ \u-10179?\u-8593?
    🚥	⇒ U+1F6A5	⇒ \u-10179?\u-8539?
    🚵	⇒ U+1F6B5	⇒ \u-10179?\u-8523?
    🛅	⇒ U+1F6C5	⇒ \u-10179?\u-8507?
    💨	⇒ U+1F4A8	⇒ \u-10179?\u-9048?
    💩	⇒ U+1F4A9	⇒ \u-10179?\u-9047?
    💪	⇒ U+1F4AA	⇒ \u-10179?\u-9046?
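
    As a quick sanity check on output like this, the escapes can be run backwards. Here is a minimal sketch, assuming a hypothetical helper called rtf2cp (not part of emoji2rtf) that only accepts exactly one surrogate pair in the `\uN?\uM?` form shown above:

    ```perl
    # rtf2cp - hypothetical helper: turn a \uN?\uM? surrogate pair
    # back into a Unicode code point.
    use strict;
    use warnings;

    sub rtf2cp {
        my ($rtf) = @_;
        my ( $hi, $lo ) = $rtf =~ /^\\u(-?\d+)\?\\u(-?\d+)\?$/
            or die "rtf2cp: expected two \\uN? escapes\n";

        # undo the signed 16-bit representation
        for ( $hi, $lo ) { $_ += 0x10000 if $_ < 0 }

        die "rtf2cp: not a surrogate pair\n"
            unless $hi >= 0xD800 && $hi <= 0xDBFF
                && $lo >= 0xDC00 && $lo <= 0xDFFF;

        # reassemble the two 10-bit halves and add the 0x10000 offset back
        return 0x10000 + ( ( $hi - 0xD800 ) << 10 ) + ( $lo - 0xDC00 );
    }

    printf "%s -> U+%X\n", $_, rtf2cp($_)
        for '\u-10179?\u-9047?', '\u-10179?\u-8655?';
    ```

    The two sample escapes come back as U+1F4A9 and U+1F631, matching the table.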
    

    Just to show that this encoding scheme really is correct, I made a tiny test RTF file unicode-emoji.rtf that looked like this in Google Docs on my desktop:

    It looks a bit better on my phone, but there are still a couple of glyphs that won’t render:


    Update, 2020-07: something has changed in the Unicode handling, so I’ve modified the code to expect arguments and stdio in UTF-8. Thanks to Piyush Jain for noticing this little piece of bitrot.

    Further update: Windows command prompt does bad things to Unicode arguments, so this script won’t work there. Strawberry Perl gives me:

    perl -CAS .\emoji2rtf.pl ☺
    emoji2rtf: code must be >= 65536; saw 63

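    Code 63 is a plain ASCII ‘?’, so the character had already been replaced with a question mark by the time the script saw it. I have no interest in finding out why.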
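
    One way to sidestep the problem without diagnosing it might be to keep Unicode off the command line altogether and pass the code point as plain ASCII hex. A rough sketch, assuming a hypothetical cp2rtf.pl; the name, the `U+` argument format and the upper-bound check are assumptions for illustration rather than part of the original script, and it is untested on Windows:

    ```perl
    # cp2rtf - hypothetical variant of emoji2rtf that takes code points
    # as ASCII hex (1F4A9 or U+1F4A9) instead of the character itself.
    use strict;
    use warnings;

    sub cp2rtf {
        my ($n) = @_;
        die sprintf( "cp2rtf: code must be >= 65536; saw %d\n", $n )
            if $n < 0x10000;
        die sprintf( "cp2rtf: code must be <= 0x10FFFF; saw 0x%X\n", $n )
            if $n > 0x10FFFF;
        my $offset = $n - 0x10000;
        # same surrogate arithmetic as emoji2rtf, written with shifts
        return sprintf( "\\u%d?\\u%d?",
            0xD800 + ( $offset >> 10 ) - 0x10000,
            0xDC00 + ( $offset & 0x3FF ) - 0x10000 );
    }

    for my $arg (@ARGV) {
        ( my $hex = $arg ) =~ s/^U\+//i;    # allow an optional U+ prefix
        die "cp2rtf: '$arg' is not a hex code point\n"
            unless $hex =~ /^[0-9A-Fa-f]{1,6}$/;
        printf "U+%X\t%s\n", hex($hex), cp2rtf( hex $hex );
    }
    ```

    Called as `perl cp2rtf.pl U+1F4A9`, this should print the same `\u-10179?\u-9047?` as before, since nothing non-ASCII has to survive the trip through the shell.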