Unicode – ByteGuide

In computing, Unicode is a universal character encoding standard used to represent the characters and symbols of the world's writing systems. Fundamentally, all data, whether a letter or a symbol, is stored inside a computer as numbers. Software such as the operating system decodes these numbers and renders them as the characters that appear on the screen.

Before Unicode came into use, a number of traditional encoding systems were common, particularly those defined by the ISO 8859 family of standards. These systems allowed only bilingual computer processing: each could identify Latin characters plus the characters of one local script. Moreover, no single encoding system had enough characters to cover all the letters, punctuation marks, and technical symbols commonly used even in one language, such as English. As a result, computers, particularly servers, had to support multiple encoding systems. This was a disadvantage, because data could be corrupted whenever it passed between different encodings or platforms. The encoding systems were also incompatible with one another: some used the same number for two different characters, while others used different numbers for the same character. To remove these disadvantages, the Unicode Consortium was incorporated in 1991 to design and develop a standard international character encoding system, called Unicode, that supports multilingual computer processing.
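The conflict between legacy encodings can be seen in a minimal Python sketch. It decodes one byte value under three ISO 8859 parts; the byte 0xE4 and the three parts are chosen here only for illustration and do not come from the article.

```python
# Illustrative sketch: one stored number, three different characters.
# The byte value 0xE4 and the chosen ISO 8859 parts are assumptions
# picked for demonstration.
raw = bytes([0xE4])

legacy_encodings = ["iso-8859-1", "iso-8859-5", "iso-8859-7"]

# Decode the same byte under each legacy encoding.
decoded = {name: raw.decode(name) for name in legacy_encodings}

for name, ch in decoded.items():
    # Print the resulting character and its Unicode code point.
    print(f"{name}: {ch!r} -> U+{ord(ch):04X}")
```

Each encoding maps the same number to a character from a different script (Latin, Cyrillic, and Greek respectively), which is how text could become corrupted when moved between systems. Unicode avoids this by assigning each of these characters its own distinct number.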

At present, the Unicode Standard defines well over 100,000 characters, together with character properties such as upper and lower case and related data for rendering, collation, decomposition, and bidirectional display order. In Unicode, every character is assigned a unique value called a code point. Code points are usually written as hexadecimal numbers with the prefix "U+". Unicode has a codespace, or range of possible values, of 1,114,112 code points, running from 0 to 10FFFF in hexadecimal. These code points are divided among 17 planes, each containing exactly 65,536 code points arranged as 256 rows of 256 code points. The first plane, known as the Basic Multilingual Plane or BMP, contains the characters that are most commonly used.
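The arithmetic behind planes and rows can be sketched in Python as follows. The function name describe_code_point and the name "cell" for a position within a row are hypothetical labels used only for this illustration.

```python
# Illustrative sketch of the codespace layout described above.
# describe_code_point and the "cell" label are hypothetical names.
MAX_CODE_POINT = 0x10FFFF
CODESPACE_SIZE = MAX_CODE_POINT + 1          # 1,114,112 code points
PLANE_SIZE = 256 * 256                       # 65,536 code points per plane
PLANE_COUNT = CODESPACE_SIZE // PLANE_SIZE   # 17 planes


def describe_code_point(ch):
    """Return the U+ notation, plane, row, and cell of one character."""
    cp = ord(ch)
    return {
        "notation": f"U+{cp:04X}",
        "plane": cp >> 16,         # which block of 65,536 code points
        "row": (cp >> 8) & 0xFF,   # which of the 256 rows in that plane
        "cell": cp & 0xFF,         # position within the row
    }


print(CODESPACE_SIZE, PLANE_COUNT)
print(describe_code_point("A"))            # a BMP character
print(describe_code_point("\U0001F600"))   # a character outside the BMP
```

A plane value of 0 means the character lies in the BMP; any other value means it lies in one of the 16 supplementary planes.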

There are three different encoding forms of Unicode that are commonly used. These include:

UTF-8: In this form, each character is encoded as a sequence of one to four bytes (8-bit code units). UTF-8 was designed to be backward compatible with ASCII, so ASCII characters keep their single-byte values, and it is widely used in email and on the internet.
UTF-16: This form uses 16-bit code units and is one of the most popular Unicode forms. Characters in the BMP are represented as a single 16-bit code unit, and all other characters as a pair of 16-bit code units called a surrogate pair, which balances economical storage with efficient processing.
UTF-32: In this form, every code point is encoded as a single 32-bit code unit. Unlike UTF-8 and UTF-16, UTF-32 is a fixed-width form, so every character occupies the same amount of space, at the cost of using more storage for most text.
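The difference between the three forms can be illustrated with a short Python sketch that encodes a few sample characters and counts the bytes each form needs. The helper names encoded_sizes and utf16_code_units are hypothetical, and the sample characters are chosen only for demonstration.

```python
# Illustrative sketch comparing the three encoding forms.
# encoded_sizes and utf16_code_units are hypothetical helper names.

def encoded_sizes(ch):
    """Return the number of bytes one character needs in each form."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        # The "-be" variants fix the byte order and omit the byte order mark.
        "utf-16": len(ch.encode("utf-16-be")),
        "utf-32": len(ch.encode("utf-32-be")),
    }


def utf16_code_units(ch):
    """Return the 16-bit code units used to encode one character in UTF-16."""
    data = ch.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


# Sample characters: A, e with acute accent, euro sign, and an emoji.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

for ch in samples:
    units = [f"{u:04X}" for u in utf16_code_units(ch)]
    print(f"U+{ord(ch):04X}", encoded_sizes(ch), "UTF-16 units:", units)
```

In this sketch the UTF-8 length grows from one to four bytes as the code point gets larger, UTF-16 needs a pair of code units only for the character outside the BMP, and UTF-32 always uses four bytes.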

Unquestionably, Unicode has simplified the entire process of character encoding, and it is currently used by almost every major hardware and software company, including Yahoo, Microsoft, Google, Sun Microsystems, Adobe Systems, Apple, HP, IBM, Oracle, and SAP. It is also supported by every major operating system and platform, including Windows, Macintosh, Linux/Unix, and Java, and is implemented in many recent technologies such as XML, the Microsoft .NET Framework, JavaScript, LDAP, and CORBA 3.0.
