Unicode
What is Unicode?
Unicode is a universal character encoding standard that is maintained by the Unicode Consortium, a standards organization founded in 1991 for the internationalization of software and services. Officially called the Unicode Standard, it provides the basis for "processing, storage and interchange of text data in any language in all modern software and information technology protocols." Unicode now supports all the world's writing systems, both modern and ancient, according to the consortium.
The most recent edition of the Unicode Standard is version 15.0.0, which includes encodings for 149,186 characters. The standard bases its encodings on language scripts -- the characters used for written language -- rather than on the languages themselves. In this way, multiple languages can use the same set of encodings, resulting in the need for fewer encodings. For example, the standard's Latin script encodings support English, German, Danish, Romanian, Bosnian, Portuguese, Albanian and hundreds of other languages.
Along with the standard's character encoding, the Unicode Consortium provides descriptions and other data about how the characters function, along with details about topics such as the following:
- Forming words and breaking lines.
- Sorting text in different languages.
- Formatting numbers, dates and times.
- Displaying languages written from right to left.
- Dealing with security concerns related to "look-alike" characters.
The Unicode Standard defines a consistent approach to encoding multilingual text that enables interoperability between disparate systems. In this way, Unicode enables the international exchange of information in a consistent and efficient manner, while providing a foundation for global software and services. Both HTML and XML have adopted Unicode as their default system of encoding, as have most modern operating systems, programming languages and internet protocols.
The rise of Unicode
Before Unicode was universally adopted, different computing environments relied on their own systems to encode text characters. Many of them started with the American Standard Code for Information Interchange (ASCII), either conforming to this standard or enhancing it to support additional characters. However, ASCII was generally limited to English or languages using a similar alphabet, leading to the development of other encoding systems, none of which could address the needs of all the world's languages.
Before long, there were hundreds of individual encoding standards in use, many of which conflicted with each other at the character level. For example, the same character might be encoded in several different ways. If a document was encoded according to one system, it might be difficult or impossible to read that document in an environment based on another system. These types of issues became increasingly pronounced as the internet continued to gain in popularity and more data was being exchanged across the globe.
The Unicode Standard started with the ASCII character set and steadily expanded to incorporate more characters and subsequently more languages. The standard assigns a name and a numeric value to each character. The numeric value is referred to as the character's code point and is expressed in a hexadecimal form that follows the U+ prefix. For example, the code point for the Latin capital letter A is U+0041.
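As a quick illustration, here is a minimal Python sketch (not part of the standard itself) that prints code points in the U+ notation described above. Python's built-in ord() and chr() functions convert between a character and its code point value.

```python
# Minimal sketch: inspecting Unicode code points in Python.
# ord() returns a character's code point as an integer; chr() reverses it.
for ch in ["A", "é", "€", "😀"]:
    print(f"{ch!r} -> U+{ord(ch):04X}")   # e.g. 'A' -> U+0041

print(chr(0x0041))  # prints "A", the Latin capital letter A from the example above
```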
Unicode treats alphabetic characters, ideographic characters and other types of symbols in the same way, helping to simplify the overall approach to character encoding. At the same time, it takes into account the character's case, directionality and alphabetic properties, which define its identity and behavior.
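For example, Python's standard unicodedata module exposes some of these per-character properties. The sketch below is purely illustrative: it reads the name, general category (which reflects case) and bidirectional class of a few characters.

```python
# Sketch: reading Unicode character properties with Python's unicodedata module.
import unicodedata

# "A" and "a" differ in case (category Lu vs. Ll); "7" is a decimal digit (Nd);
# the Hebrew letter alef has a right-to-left bidirectional class (R).
for ch in ["A", "a", "7", "א"]:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: "
          f"category={unicodedata.category(ch)}, "
          f"bidi={unicodedata.bidirectional(ch)}")
```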
The Unicode Standard represents characters in one of three encoding forms:
- 8-bit form (UTF-8). A variable-length form in which each character is encoded in 1 to 4 bytes. The leading bits of the first byte indicate how many bytes the character uses: the first byte of a 1-byte character starts with 0, of a 2-byte character with 110, of a 3-byte character with 1110 and of a 4-byte character with 11110, while every continuation byte starts with 10. UTF-8 preserves ASCII in its original single-byte form, providing a 1:1 mapping that eases migration from ASCII-based systems. It is also the most commonly used form across the web.
- 16-bit form (UTF-16). A variable-length form in which each character is 2 or 4 bytes. The original Unicode Standard was itself based on a 16-bit design. Characters that require 4 bytes, such as emoji and emoticons, are known as supplementary characters; UTF-16 encodes each one as a surrogate pair, two 16-bit units totaling 4 bytes. The variability between 2-byte and 4-byte characters can add implementation overhead.
- 32-bit form (UTF-32). A fixed-length form in which each character is 4 bytes. This form can reduce implementation overhead because it simplifies the programming model. On the other hand, it uses 4 bytes for every character, no matter how few are actually needed, increasing the demand on storage resources. (The sketch after this list compares how the same characters encode in all three forms.)
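The following Python sketch is an illustration only (it assumes Python 3.8 or later for bytes.hex() with a separator). It encodes a few characters in each form to show the byte counts described above; the explicit utf-16-be and utf-32-be codecs are used so that no byte order mark is added.

```python
# Sketch: the same characters encoded under UTF-8, UTF-16 and UTF-32.
# The -be codecs avoid prepending a byte order mark, so the lengths below
# reflect the character data alone.
for ch in ["A", "é", "€", "😀"]:
    utf8 = ch.encode("utf-8")
    utf16 = ch.encode("utf-16-be")
    utf32 = ch.encode("utf-32-be")
    print(f"U+{ord(ch):04X}: UTF-8={len(utf8)}  UTF-16={len(utf16)}  "
          f"UTF-32={len(utf32)} bytes")
    # The leading bits of the first UTF-8 byte encode the sequence length
    # (0xxxxxxx, 110xxxxx, 1110xxxx or 11110xxx).
    print("  UTF-8 bytes:", utf8.hex(" "))
```

Basic Latin letters take a single byte in UTF-8 but never fewer than 2 or 4 bytes in the other forms, which is one reason UTF-8 dominates on the web.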
The 16-bit and 32-bit forms both support little-endian byte serialization, in which the least significant byte is stored first, and big-endian byte serialization, in which the most significant byte is stored first. In addition, all three forms support the full range of Unicode code points, U+0000 through U+10FFFF, a total of 1,114,112 possible code points. However, the majority of common characters in the world's major languages are encoded in the first 65,536 code points, known as the Basic Multilingual Plane, or BMP.
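As a sketch of byte order and the BMP, the snippet below uses Python's utf-16-le, utf-16-be and utf-16 codecs; the plain utf-16 codec prepends a byte order mark (BOM) so a decoder can detect the serialization order.

```python
# Sketch: byte order, the byte order mark (BOM) and the BMP in Python.
euro = "€"      # U+20AC, inside the Basic Multilingual Plane
emoji = "😀"    # U+1F600, a supplementary character outside the BMP

print(euro.encode("utf-16-le").hex(" "))   # ac 20 -- least significant byte first
print(euro.encode("utf-16-be").hex(" "))   # 20 ac -- most significant byte first
print(euro.encode("utf-16").hex(" "))      # BOM first, e.g. ff fe ac 20 on a
                                           # little-endian machine

# A supplementary character becomes a surrogate pair in UTF-16 (4 bytes).
print(emoji.encode("utf-16-be").hex(" "))  # d8 3d de 00

def in_bmp(ch: str) -> bool:
    """True if the character's code point lies in U+0000..U+FFFF."""
    return ord(ch) <= 0xFFFF

print(in_bmp(euro), in_bmp(emoji))         # True False
```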