{"id":13831,"date":"2017-03-12T22:17:47","date_gmt":"2017-03-13T02:17:47","guid":{"rendered":"http:\/\/scruss.com\/blog\/?p=13831"},"modified":"2023-08-02T17:55:31","modified_gmt":"2023-08-02T21:55:31","slug":"in-the-unlikely-event-you-need-to-represent-emoji-in-rtf-using-perl","status":"publish","type":"post","link":"https:\/\/scruss.com\/blog\/2017\/03\/12\/in-the-unlikely-event-you-need-to-represent-emoji-in-rtf-using-perl\/","title":{"rendered":"In the unlikely event you need to represent Emoji in RTF using Perl \u2026"},"content":{"rendered":"\n<p>Of all the niche blog entries I&#8217;ve written, this must be the nichest. I don&#8217;t even <em>like<\/em> the topic I&#8217;m writing about. But I&#8217;ve worked it out, and there seems to be a shortage of documented solutions.<\/p>\n\n\n\n<p>For the both of you that generate <a href=\"https:\/\/en.wikipedia.org\/wiki\/Rich_Text_Format\">Rich Text Format<\/a> (RTF) documents by hand, you might be wondering how RTF converts \u2018\ud83d\udca9\u2019 (that&#8217;s code point U+1F4A9) to the seemingly nonsensical <strong>\\u-10179?\\u-9047?<\/strong>. It seems that RTF imposes two encoding limitations on characters: firstly, everything must be in 7-bit ASCII for easy transmission, and secondly, it uses the somewhat old-fashioned <a href=\"https:\/\/en.wikipedia.org\/wiki\/UTF-16\">UTF-16<\/a> representation for non-ASCII characters.<\/p>\n\n\n\n<p>UTF-16 grew out of an earlier standard, UCS-2, that was all like \u201c<em>Hey, there will never be a Unicode code point above 65535, so we can hard code the characters in two bytes \u2026 oh shiiii\u2026<\/em>\u201d. So not merely does it have to escape emoji code points down to a pair of two-byte values using <a href=\"https:\/\/en.wikipedia.org\/wiki\/UTF-16#Code_points_from_U+010000_to_U+10FFFF\">a very dank scheme indeed<\/a>, it then has to further escape everything to ASCII. 
That&#8217;s how your single emoji becomes 17 bytes in an RTF document.<\/p>\n\n\n\n<p>So here&#8217;s a tiny subroutine to do the conversion. I wrote it in Perl, but it doesn&#8217;t do anything Perl-specific:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: perl; title: ; notranslate\" title=\"\">\n#!\/usr\/bin\/env -S perl -CAS\n# emoji2rtf - 2017 - scruss\n# See UTF-16 decoder for the dank details\n#  &lt;https:\/\/en.wikipedia.org\/wiki\/UTF-16&gt;\n# run with &#039;perl -CAS ...&#039; or set PERL_UNICODE to &#039;AS&#039; for UTF-8 argv\n# doesn&#039;t work from Windows cmd prompt because Windows \u00af\\_(\u30c4)_\/\u00af\n# https:\/\/scruss.com\/blog\/2017\/03\/12\/in-the-unlikely-event-you-need-to-represent-emoji-in-rtf-using-perl\/\n\nuse v5.20;\nuse strict;\nuse warnings qw( FATAL utf8 );\nuse utf8;\nuse open qw( :encoding(UTF-8) :std );\nsub emoji2rtf($);\n\n# only the first character of the first argument is converted\nmy $c = substr( $ARGV&#x5B;0], 0, 1 );\nsay join( &quot;\\t\u21d2 &quot;, $c, sprintf( &quot;U+%X&quot;, ord($c) ), emoji2rtf($c) );\nexit;\n\nsub emoji2rtf($) {\n    my $n = ord( substr( shift, 0, 1 ) );\n    die &quot;emoji2rtf: code must be &gt;= 65536\\n&quot; if ( $n &lt; 0x10000 );\n    # split into a UTF-16 surrogate pair: top 10 bits, then bottom 10 bits.\n    # RTF \\uN takes a signed 16-bit decimal, hence the - 0x10000 on each half\n    return sprintf( &quot;\\\\u%d?\\\\u%d?&quot;,\n        0xd800 + ( ( $n - 0x10000 ) &amp; 0xffc00 ) \/ 0x400 - 0x10000,\n        0xdc00 + ( ( $n - 0x10000 ) &amp; 0x3ff ) - 0x10000 );\n}\n\n<\/pre><\/div>\n\n\n<p>This takes any emoji fed to it as a command line argument and spits out the RTF code:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">\ud83d\udcd3\t\u21d2 U+1F4D3\t\u21d2 \\u-10179?\\u-9005?\n\ud83d\udcbd\t\u21d2 U+1F4BD\t\u21d2 \\u-10179?\\u-9027?\n\ud83d\uddfd\t\u21d2 U+1F5FD\t\u21d2 \\u-10179?\\u-8707?\n\ud83d\ude31\t\u21d2 U+1F631\t\u21d2 
\\u-10179?\\u-8655?\n\ud83d\ude4c\t\u21d2 U+1F64C\t\u21d2 \\u-10179?\\u-8628?\n\ud83d\ude5f\t\u21d2 U+1F65F\t\u21d2 \\u-10179?\\u-8609?\n\ud83d\ude6f\t\u21d2 U+1F66F\t\u21d2 \\u-10179?\\u-8593?\n\ud83d\udea5\t\u21d2 U+1F6A5\t\u21d2 \\u-10179?\\u-8539?\n\ud83d\udeb5\t\u21d2 U+1F6B5\t\u21d2 \\u-10179?\\u-8523?\n\ud83d\udec5\t\u21d2 U+1F6C5\t\u21d2 \\u-10179?\\u-8507?\n\ud83d\udca8\t\u21d2 U+1F4A8\t\u21d2 \\u-10179?\\u-9048?\n\ud83d\udca9\t\u21d2 U+1F4A9\t\u21d2 \\u-10179?\\u-9047?\n\ud83d\udcaa\t\u21d2 U+1F4AA\t\u21d2 \\u-10179?\\u-9046?\n<\/pre>\n\n\n\n<p>Just to show that this encoding scheme really is correct, I made a tiny test RTF file <a href=\"http:\/\/scruss.com\/wordpress\/wp-content\/uploads\/2017\/03\/unicode-emoji.rtf\">unicode-emoji.rtf<\/a> that looked like this in Google Docs on my desktop:<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter\"><a href=\"http:\/\/scruss.com\/wordpress\/wp-content\/uploads\/2017\/03\/Screenshot-from-2017-03-12-22-10-07.png\"><img loading=\"lazy\" decoding=\"async\" width=\"350\" height=\"48\" src=\"http:\/\/scruss.com\/wordpress\/wp-content\/uploads\/2017\/03\/Screenshot-from-2017-03-12-22-10-07.png\" alt=\"\" class=\"wp-image-13834\" srcset=\"https:\/\/scruss.com\/wordpress\/wp-content\/uploads\/2017\/03\/Screenshot-from-2017-03-12-22-10-07.png 350w, https:\/\/scruss.com\/wordpress\/wp-content\/uploads\/2017\/03\/Screenshot-from-2017-03-12-22-10-07-160x22.png 160w, https:\/\/scruss.com\/wordpress\/wp-content\/uploads\/2017\/03\/Screenshot-from-2017-03-12-22-10-07-320x44.png 320w\" sizes=\"auto, (max-width: 350px) 100vw, 350px\" \/><\/a><\/figure>\n\n\n\n<p>It looks a bit better on my phone, but 
there are still a couple of glyphs that won&#8217;t render:<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter\"><a href=\"http:\/\/scruss.com\/wordpress\/wp-content\/uploads\/2017\/03\/2017-03-13-02.14.07.png\"><img loading=\"lazy\" decoding=\"async\" width=\"1599\" height=\"110\" src=\"http:\/\/scruss.com\/wordpress\/wp-content\/uploads\/2017\/03\/2017-03-13-02.14.07.png\" alt=\"\" class=\"wp-image-13835\" srcset=\"https:\/\/scruss.com\/wordpress\/wp-content\/uploads\/2017\/03\/2017-03-13-02.14.07.png 1599w, https:\/\/scruss.com\/wordpress\/wp-content\/uploads\/2017\/03\/2017-03-13-02.14.07-160x11.png 160w, https:\/\/scruss.com\/wordpress\/wp-content\/uploads\/2017\/03\/2017-03-13-02.14.07-320x22.png 320w, https:\/\/scruss.com\/wordpress\/wp-content\/uploads\/2017\/03\/2017-03-13-02.14.07-768x53.png 768w, https:\/\/scruss.com\/wordpress\/wp-content\/uploads\/2017\/03\/2017-03-13-02.14.07-1024x70.png 1024w, https:\/\/scruss.com\/wordpress\/wp-content\/uploads\/2017\/03\/2017-03-13-02.14.07-1200x83.png 1200w\" sizes=\"auto, (max-width: 1599px) 100vw, 1599px\" \/><\/a><\/figure>\n\n\n\n<p><br><strong>Update, 2020-07<\/strong>: something has changed in the Unicode handling, so I&#8217;ve modified the code to expect arguments and stdio in UTF-8. Thanks to Piyush Jain for noticing this little piece of bitrot.<\/p>\n\n\n\n<p><strong>Further update<\/strong>: Windows command prompt does bad things to arguments in Unicode, so this script won&#8217;t work. <a href=\"http:\/\/strawberryperl.com\/\">Strawberry Perl<\/a> gives me:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">perl -CAS .\\emoji2rtf.pl \u263a<br>emoji2rtf: code must be &gt;= 65536; saw 63<\/pre>\n\n\n\n<p>I have no interest in finding out why.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Of all the niche blog entries I&#8217;ve written, this must be the nichest. I don&#8217;t even like the topic I&#8217;m writing about. 
But I&#8217;ve worked it out, and there seems to be a shortage of documented solutions. For the both of you that generate Rich Text Format (RTF) documents by hand, you might be wondering [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":true,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2}},"categories":[7],"tags":[3046,2405,187,3045],"class_list":["post-13831","post","type-post","status-publish","format-standard","hentry","category-computers-suck","tag-dank","tag-emoji","tag-perl","tag-rtf"],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_shortlink":"https:\/\/wp.me\/pQNZZ-3B5","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/scruss.com\/blog\/wp-json\/wp\/v2\/posts\/13831","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/scruss.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/scruss.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/scruss.com\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/scruss.com\/blog\/wp-json\/wp\/v2\/comments?post=13831"}],"version-history":[{"count":7,"href":"https:\/\/scruss.com\/blog\/wp-json\/wp\/v2\/posts\/13831\/revisions"}],"predecessor-version":[{"id":17430,"href":"https:\/\/scruss.com\/blog\/wp-json\/wp\/v2\/posts\/13831\/revisions\/17430"}],"wp:attachment":[{"href":"https:\/\/scruss.com\/blog\/wp-json\/wp\/v2\/media?parent=13831"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/scruss.com\/blog\/wp-json\/wp\/v2\/categories?post=13831"},{"taxonomy":"post_tag","embeddable":true,"hr
ef":"https:\/\/scruss.com\/blog\/wp-json\/wp\/v2\/tags?post=13831"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}