ASCII and UTF-16
As we all know, computers only talk in binary. They only understand 0 and 1. When we say “cat,” the computer understands it as “0110 0011 0110 0001 0111 0100”. But have you ever wondered how computers understand and interpret human languages like English, Spanish, or any other language, for that matter?
ASCII
To understand this mechanism, we need to know about ASCII, i.e., the American Standard Code for Information Interchange. It is one of the earliest character encoding standards, and it predates Unicode. In the simplest terms, a character encoding is an agreed-upon mapping between characters and numbers that everyone on the web follows; such standards maintain consistency during data interchange.
With the help of ASCII, computers translate textual and numerical information into binary format. In ASCII, every letter, number, and punctuation mark has a specific decimal number associated with it. This decimal number is then converted into its binary equivalent so that the computer can read it. Just as they encode characters, computers also decode them so that the information given back to us is in a human-readable format.
Now, let’s understand this with an example. Let’s take the word “cat” and determine how the character encoding is done using ASCII.
Note: You can check the binary values in this ASCII table.
/* First we will find the decimal numbers associated with each letter in the word “cat”.

c => 99
a => 97
t => 116

Now we shall find the binary equivalents of the numbers above.

c => 99  => 0110 0011
a => 97  => 0110 0001
t => 116 => 0111 0100

So the word “cat” can be written as 0110 0011 0110 0001 0111 0100. */
Please note that “cat” will be encoded differently from “CAT,” since the decimal numbers corresponding to lowercase letters and uppercase letters are different.
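If you want to reproduce this mapping yourself, here is a minimal JavaScript sketch (the variable names are just for illustration) that prints the decimal and binary ASCII value of each letter:

var word = "cat";

for (var i = 0; i < word.length; i++) {
  var code = word.charCodeAt(i);                   // decimal ASCII value
  var binary = code.toString(2).padStart(8, "0");  // 8-bit binary equivalent
  console.log(word[i] + " => " + code + " => " + binary);
}

// Output:
// c => 99 => 01100011
// a => 97 => 01100001
// t => 116 => 01110100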
What is UTF-16, and why do we need it?
ASCII is a 7-bit code, so its range is 0 to 127. Due to this limited size, ASCII is suitable for only a small character set. It primarily covers the English alphabet, control characters, some special characters, and punctuation symbols. All of this is sufficient for information exchange in the English language. Initially, ASCII was used for emails and URLs, but the need for more advanced character sets arose with time, as languages other than English were needed for digital communication.
Many languages have symbols and characters that were not included in ASCII. Languages like Japanese and Mandarin have entirely different character sets, and the way they are written is vastly different from any Latin-based language. Even Germanic languages have special characters missing from ASCII, for example, the German umlaut, the two dots above the u in “Tschüss”.
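A quick way to see this is a one-line JavaScript check (purely illustrative) showing that the umlaut falls outside the 0 to 127 range that ASCII can represent:

console.log("ü".charCodeAt(0)); // 252, which is greater than 127, so ASCII cannot encode it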
Therefore, it became necessary to support all these characters for efficient information interchange. This is when UTF came into the picture.
UTF is short for Unicode Transformation Format. If you have ever written any HTML, you must have come across meta tags containing something called UTF-8.
<meta charset="UTF-8">
This is nothing but the default character encoding used by HTML5. It covers virtually all the symbols and characters required for a web page to work properly.
Okay… but what is UTF-16?
UTF-16 is not an extension of UTF-8; it is another way of encoding the same Unicode character set, using one or two 16-bit units per character. As mentioned above, Asian languages usually have character sets dissimilar to any of the Germanic or Latin-origin languages, and UTF-16 is often more compact for such text. Additionally, UTF-16 also accommodates almost all the emojis, which it stores as pairs of 16-bit values known as surrogate pairs.
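To see the emoji case in action, here is a small sketch (assuming a Node.js or browser console) that builds the grinning-face emoji from its two 16-bit values:

// U+1F600 (grinning face) does not fit in a single 16-bit unit,
// so UTF-16 stores it as the surrogate pair D83D DE00.
var emoji = "\uD83D\uDE00";

console.log(emoji);          // 😀
console.log(emoji === "😀"); // true
console.log(emoji.length);   // 2, since JavaScript counts UTF-16 code units, not characters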
Now let’s see how UTF-16 works with a simple JavaScript program. In the program below, I have defined two strings. The first string is the word itself, and the second string consists of the hexadecimal UTF-16 values of the individual letters.
Note: “\u” acts as an escape sequence; the hexadecimal value of each character follows the “\u”.
var string1 = "Tschüss"; // German word for bye
var string2 = "\u0054\u0073\u0063\u0068\u00fc\u0073\u0073"; // Equivalent values of each letter derived from a UTF-16 chart

console.log(string1); // printing the first string
console.log(string2); // printing the second string
We should expect the same values in output.
Output:
:~$ node sample.js
Tschüss
Tschüss
We can also compare the two strings to check that UTF-16 is working as expected. Upon comparison, we shall get true as the output, as shown below.
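For instance, a minimal sketch of that comparison, reusing the two strings from the program above:

var string1 = "Tschüss";
var string2 = "\u0054\u0073\u0063\u0068\u00fc\u0073\u0073";

console.log(string1 === string2); // true, both are the same sequence of UTF-16 code units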
Note: If you look up any UTF-16 chart, you will find all the values needed.
Differences between UTF-8 and UTF-16
The critical difference between the two is the amount of space they occupy. UTF-8 uses 1 to 4 bytes per character, whereas UTF-16 uses either 2 or 4 bytes. The first 128 characters in UTF-8 are encoded exactly as in ASCII, using a single byte each.
Although UTF-16 is very functional for larger character sets, UTF-8 remains the most widely used Unicode encoding on the web worldwide. This is chiefly because UTF-8 is backward compatible with ASCII, whereas UTF-16 is not, which leads to compatibility issues.
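As a rough illustration, here is a sketch assuming Node.js, where Buffer.byteLength reports how many bytes a string occupies in a given encoding:

// "Tschüss" has 7 characters; only the "ü" needs more than one byte in UTF-8.
console.log(Buffer.byteLength("Tschüss", "utf8"));    // 8  (6 x 1 byte + 2 bytes for "ü")
console.log(Buffer.byteLength("Tschüss", "utf16le")); // 14 (7 x 2 bytes)

// Pure ASCII text keeps its size in UTF-8 but doubles in UTF-16.
console.log(Buffer.byteLength("cat", "utf8"));        // 3
console.log(Buffer.byteLength("cat", "utf16le"));     // 6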