#070: The Emoji Code
Text on computers seems like it should be straightforward. You need a way to identify each character: one for each letter, in upper and lower case, plus a few more for various symbols. Since computers only work with numbers, you assign each letter and symbol a number. And that is, in fact, how text worked for a long time. Computers in the US got by with a character set called ASCII, which only used the numbers 0 through 127. In Europe, things were a bit more complicated, but by using numbers up to 255, the various characters of European languages could be accommodated. This made it hard for computers in different countries to talk to each other (“Does the number 188 represent the character ‘¼’ or ‘ź’?”), but that was a problem for very few people.
To write Japanese or Chinese, though, you need far more than 256 characters. So these countries, too, came up with their own character sets, using two bytes per character, which gives you 65,536 numbers (0 through 65,535) to assign to characters. It also meant that exchanging text between Western and Eastern programs was a bit difficult, but again, most people never encountered this problem in the first place.
And then the internet came along, and people started sending all kinds of text. And sometimes, they saw weird characters in places where they clearly didn’t belong. The problem is that text in a computer doesn’t record which character set was used to write it. It is just a stream of numbers, and the character set is the key to getting the right character for each number. If you use the wrong key, you get gibberish.
Quite a few people foresaw this problem and set out to solve it. And while they were at it, they decided to solve it once and for all: any and all languages, past and present, should be storable in a way that allowed them to be read correctly later. The universal coded character set, or Unicode, was born.
Simply put, Unicode assigns a number (called a code point) to each and every character known to man. For example, code point U+004D¹ is “Latin Capital Letter M”. U+03C0 is “Greek Small Letter Pi”, or π. The idea was that every character, in every language, got a number assigned to it, and everyone could later read that number back as the correct character².
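You can poke at this mapping yourself. In Python, for instance, the built-in `ord` and `chr` functions convert between characters and code points, and the standard `unicodedata` module knows the official character names:

```python
import unicodedata

# A character is just a number (its code point)...
print(hex(ord("M")))          # 0x4d, i.e. U+004D

# ...and every code point has an official name.
print(unicodedata.name("M"))  # LATIN CAPITAL LETTER M
print(chr(0x03C0))            # π
print(unicodedata.name("π"))  # GREEK SMALL LETTER PI
```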
Pretty quickly, they ran into problems, though. For example, Chinese characters (“hanzi”) are used in other languages too. Japanese calls them “kanji”, and in Korean, they’re called “hanja”. So you can have a character that looks the same in all three of these languages but can mean completely different things³.
And then there’s Emoji. Emoji were introduced by Japanese phone carriers, and each carrier had its own set, incompatible with the others. Unicode’s attempt to unify them was a big sticking point for a long time, but eventually, Emoji got their code points assigned. U+1F4A9 is the ever-popular 💩, U+1F44B is 👋, and U+1F631 is 😱⁴.
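You can verify that an emoji really is just a code point like any other character (Python again):

```python
# Build the emoji from its code point...
print(chr(0x1F4A9))          # 💩

# ...and recover the code point from the emoji.
print(f"U+{ord('💩'):04X}")  # U+1F4A9
```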
The important point is that every Emoji you send someone is transmitted not as a picture, but as a code point. The vendor of the system that displays an emoji relies on Unicode’s description to decide what it should look like. This, too, can sometimes lead to problems. When it comes to the composition of a burger, it’s humorous. But other issues can easily become political: when Apple decided in 2016 to change the gun emoji from a revolver to a squirt gun⁵, you could send someone the image of a playful squirt gun on your side, while the other person might see an actual gun on their end.
So even a seemingly simple thing like mapping every character to a number can turn out to be surprisingly difficult. And Emoji are actually a living thing: when Tony Hawk complained that the skateboard emoji looked a bit too retro, he was invited to update it to a more modern design.
Other interesting links from around the web:
- Nine Things I Learned When I Became a Honeymoon Planner for Billionaires
- Are things getting worse – or does it just feel that way?
- A shadowy influence campaign on Facebook is targeting liberal activists
📖 Weekly Longread 📚
Inside the growth of the most controversial brand in the wellness industry: How Goop’s Haters Made Gwyneth Paltrow’s Company Worth $250 Million
🦄 Unicorn Chaser 🦄
A strategy guide for using a semi-pointless social network in all the wrong ways: How to beat LinkedIn: The Game
1. The “U+” indicates that it is a code point, and the numbers that follow are the code point itself, written in hexadecimal. ↩
2. It should be noted that Unicode doesn’t actually specify how these numbers should be stored in a computer. That’s the job of Unicode encodings, such as UTF-8 or UTF-16, which tell a computer how to convert a Unicode code point into actual bytes (and vice versa). ↩
3. This is a topic for a different newsletter, though. ↩
4. It’s actually a bit more complicated with certain emoji. For example, 🤦🏻‍♂️ is actually composed of five Unicode code points: U+1F926 (person facepalming), U+1F3FB (light skin tone), U+200D (a zero-width joiner, which tells the rendering system to combine the surrounding characters into one glyph), U+2642 (the male sign ♂), and U+FE0F (which tells the rendering system to use a fancy colourful emoji instead of a monochrome dull one). Simple, right? ↩
5. Microsoft actually used a sci-fi-looking ray gun for most of that time, until 2016, when they replaced it with an actual gun to ensure that everyone would see the same thing. ↩