Encoding and Character Sets
ඞᐛൠॐ✈Я( ͡° ͜ʖ ͡°)ੴ ็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็ ็
Hello all. Huuuuge change of pace with this post. We’re going to take a break from crazy, autistic reverse engineering and program injections to go over something a lot of developers are familiar with but don’t really know. At least, they think they know.
Today’s topic is encoding and you’ve probably heard of it mentioned once or twice maybe in class or in that Python tutorial. I can already hear you… “I know what ASCII is, Crawfish. You don’t have to explain it to me.” Well, you may know about ASCII, but you’re about 40 years late with that one. We are in the 21st century with crazy character sets that let us do ṯ̶̝̂̈́h̴͚͒ḭ̴͠ṅ̸͉̑͜g̴͈̼̍s̸͖̕ ̵̹̓ 🅻🅸🅺🅴 ⓣʰ𝐈𝐒. I hope that goes through the email.
You need to be aware of these crazy characters. Otherwise, you may be stumped when it comes to parsing and printing user input. Substack can handle it, so why can’t you?
It’s time for a history lesson.
Back when Dennis Ritchie was creating the greatest language of all time (totally not biased), the only characters that mattered were English characters. Go figure. Since computers can only read numbers, there needed to be a system that assigned numbers to letters. This technically already existed, but learning about EBCDIC in the modern age is akin to learning how to dial a rotary phone. ASCII was invented as a simple, 7-bit character encoding standard. Since computers at the time only really worked with 8-bit bytes, this was huge, as you had an entire extra bit to work with.
ASCII worked, and it was good… if you spoke English. Different countries used the extra 128 values (the ones with the top bit set) for their own languages, so if you sent an email from Russia to Spain, you might have had some trouble getting your message displayed correctly. Don’t even get me started on Asia.
Since ASCII clearly wasn’t going to work for international communication, the ANSI standard was created. ANSI isn’t technically a character set. It just extended ASCII to its full 8 bits. Microsoft got their ass in gear and created what are called code pages, so computers in different countries could each interpret the upper 128 values their own way.
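If you want to see the code-page problem for yourself, here’s a minimal Python sketch (cp1252 and cp1251 are just two example code pages, standing in for the Spanish and Russian machines):

raw = bytes([0xE9])            # a single byte with the top bit set
print(raw.decode("cp1252"))    # 'é' on a Western European code page
print(raw.decode("cp1251"))    # 'й' on a Cyrillic code page
# same byte, two different letters -- the receiver has to guess the right code page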
Remember Asia? Well, they didn’t go anywhere. Clearly, 8 bits are not going to cut it when it comes to languages with thousands of characters. So, why not use more bits? Then along came Unicode.
Unicode itself is not an encoding. Rather, it is a giant catalog that assigns every character a number, and the various encodings decide how those numbers become bytes. This is where we get into code points. Code points are the numerical representation of a character. This isn’t the same as ASCII, where 0x41 = A. The Unicode code point for A is U+0041. If you want to print the ASCII version of A, you ask an encoder to turn the text into ASCII bytes. The ASCII-compatible Unicode encoding is called UTF-8. Sound familiar?
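Here’s a quick Python sketch of the difference between a code point and its encoded bytes:

s = "A"
print(hex(ord(s)))           # 0x41 -- the code point, written U+0041 in Unicode notation
print(s.encode("ascii"))     # b'A' -- one byte, 0x41
print(s.encode("utf-8"))     # b'A' -- the same single byte, since UTF-8 is ASCII-compatible
print("é".encode("utf-8"))   # b'\xc3\xa9' -- outside ASCII, so UTF-8 needs two bytes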
Everyone immediately jumped on the idea of using 2 bytes per character. That’s great, let’s do a quick example.
In ASCII:
B  o  w  T  i  e  d
42 6f 77 54 69 65 64
In 2 bytes (big-endian):
B     o     w     T     i     e     d
00 42 00 6f 00 77 00 54 00 69 00 65 00 64
Looks easy enough, right? But what if we wanted to do it little-endian?
B     o     w     T     i     e     d
42 00 6f 00 77 00 54 00 69 00 65 00 64 00
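Here’s the same example as a Python sketch, using the BOM-free UTF-16 variants so the output lines up with the tables above:

s = "BowTied"
be = s.encode("utf-16-be")   # big-endian: high byte first -> 00 42 00 6f ...
le = s.encode("utf-16-le")   # little-endian: low byte first -> 42 00 6f 00 ...
print(" ".join(f"{b:02x}" for b in be))
print(" ".join(f"{b:02x}" for b in le))
# the plain "utf-16" codec would also prepend a byte order mark (ff fe or fe ff)
# so the reader knows which order was used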
In the early days of Unicode, many developers wanted the byte order to be determined by the endianness of the machine being used. Yet there were some who wanted everything to be the way it used to be. Obviously, these were Americans who would never use any characters outside of the classic ASCII standard.
Behold, UTF-8 was invented to save us lazy Americans. UTF-8 is magical enough to use a single byte if the code point is 127 or below. Otherwise, it will use more. You may have never realized this being a native English speaker, but the rest of the world has to dance around with multiple bytes if their string has a résumé or déjà vu in it.
UTF-8 uses 1 to 4 bytes to encode characters. 4 bytes! That makes, what, 4 billion possible characters? You would think so, right? Well, not really. Unicode restricts the maximum number of bits that UTF-8 can use to 21. So, technically, you’ve only got about 2 million, right? Okay, I’ll stop blue-balling you for now. This is also incorrect. The Unicode standard can officially handle 1,111,998 characters.
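A Python sketch of how many bytes UTF-8 actually spends per character (the emoji is just an arbitrary example of the 4-byte case):

for ch in ["A", "é", "€", "😀"]:
    b = ch.encode("utf-8")
    print(ch, f"U+{ord(ch):04X}", " ".join(f"{x:02x}" for x in b), f"-> {len(b)} byte(s)")
print(len("résumé"), "characters,", len("résumé".encode("utf-8")), "bytes")   # 6 characters, 8 bytes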
Introducing planes. No, not ✈, although that would make sense considering we’re talking about Unicode. A plane is the part of the code point above the low 16 bits, and there are 17 of them. The first one (the one that most of the characters you are reading live in) is called the Basic Multilingual Plane. The remaining 16 are known as Supplementary Planes. So let’s do the math: each plane covers the remaining 2 bytes, 0x0000 - 0xFFFF, which is 65,536 code points. Planes go from 0x0 - 0x10, which is 17. 17 * 65,536 = 1,114,112, which is a bit more than the maximum amount. Why? Surrogates.
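In code, the plane is just everything above the low 16 bits of the code point. A Python sketch (the airplane and the emoji are arbitrary picks):

for ch in ["A", "✈", "😀"]:
    cp = ord(ch)
    print(ch, f"U+{cp:04X}", "plane", cp >> 16)
# A and ✈ sit in plane 0, the BMP; the emoji sits in plane 1, a supplementary plane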
The surrogates are 0xD800 through 0xDBFF and 0xDC00 through 0xDFFF, 1,024 each, which adds up to 2,048. 1,114,112 - 2,048 = 1,112,064, almost to our magic number.
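Those two ranges are reserved so that UTF-16 can reach the supplementary planes: anything above U+FFFF gets split into a high/low surrogate pair, which is why a lone surrogate can never be a character of its own. A Python sketch:

emoji = "😀"                              # U+1F600, plane 1
print(emoji.encode("utf-16-be").hex())    # d83dde00 -- high surrogate D83D + low surrogate DE00
try:
    chr(0xD800).encode("utf-8")           # a lone surrogate is not a valid character
except UnicodeEncodeError as err:
    print(err)                            # "... surrogates not allowed"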
Lastly, there are 66 non-characters. They are 0xFDD0 through 0xFDEF (32 of them), plus the last 2 code points of each plane: 0xFFFE, 0xFFFF, 0x1FFFE, 0x1FFFF, 0x2FFFE, 0x2FFFF, …, 0x10FFFE, 0x10FFFF. So that’s 2 * 17 + 32 = 66, and 1,112,064 - 66 = 1,111,998, which finally gets us to our magic number.
Yes, I know, overly complicated. But now you finally understand that you don’t actually know what you think you know. Just because you see plain English doesn’t mean the text is ASCII.
That’s all for now, enjoy your weekend.
Go!
-BowTiedCrawfish