The Unicode® Standard: A Technical Introduction
The Unicode Standard is the universal character encoding
standard used for representation of text for computer processing.
Versions of the Unicode Standard are fully compatible and
synchronized with the corresponding versions of
International Standard ISO/IEC 10646. For example, Unicode 5.0
contains all the same characters and code points as ISO/IEC
10646:2003 plus amendments. The Unicode Standard provides
additional information about the characters and their use. Any
implementation that is conformant to Unicode is also conformant to
ISO/IEC 10646.
Unicode provides a consistent way of encoding multilingual
plain text and brings order to a chaotic state of affairs that has
made it difficult to exchange text files internationally. Computer
users who deal with multilingual text -- business people,
linguists, researchers, scientists, and others -- will find that
the Unicode Standard greatly simplifies their work. Mathematicians
and technicians, who regularly use mathematical symbols and other
technical characters, will also find the Unicode Standard
valuable.
The design of Unicode is based on the simplicity and
consistency of ASCII, but goes far beyond ASCII's limited ability
to encode only the Latin alphabet. The Unicode Standard provides
the capacity to encode all of the characters used for the written
languages of the world. To keep character coding simple and
efficient, the Unicode Standard assigns each character a unique
numeric value and name.
The Unicode Standard and ISO/IEC 10646 support three encoding
forms that use a common repertoire of characters. These encoding
forms allow for encoding more than a million characters. This is
sufficient for all known character encoding requirements,
including full coverage of all historic scripts of the world, as
well as common notational systems.
What Characters Does the Unicode Standard Include?
The Unicode Standard defines codes for characters used in all
the major languages written today. Scripts include the European
alphabetic scripts, Middle Eastern right-to-left scripts, and many
scripts of Asia.
The Unicode Standard further includes punctuation marks,
diacritics, mathematical symbols, technical symbols, arrows, and
dingbats. Diacritics are modifying character marks, such as the
tilde (~), that are used in conjunction with base characters to
represent accented letters (ñ, for example). In all, the Unicode
Standard, Version 5.0 provides codes for 99,089 characters from the
world's alphabets, ideograph sets, and symbol collections.
The majority of common-use characters fit into the first 64K
code points, an area of the codespace that is called the basic
multilingual plane, or BMP for short. There are several
thousand unused code points for future expansion in the BMP, plus
over 870,000 unused supplementary code points on the other planes.
More characters are under consideration for addition to future
versions of the standard.
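The relationship between a code point and its plane can be sketched in a few lines of Python. This is an illustration only: the names `PLANE_SIZE`, `MAX_CODE_POINT`, `plane_of`, and `is_bmp` are hypothetical helpers, not part of any Unicode API. It assumes the codespace runs from U+0000 to U+10FFFF, divided into planes of 64K code points each.

```python
# Sketch: locating a code point within the Unicode codespace.
# All names here are illustrative, not part of a standard API.
PLANE_SIZE = 0x10000       # 65,536 code points ("64K") per plane
MAX_CODE_POINT = 0x10FFFF  # last code point of the codespace


def plane_of(code_point: int) -> int:
    """Return the plane number (0 = BMP) that contains code_point."""
    if not 0 <= code_point <= MAX_CODE_POINT:
        raise ValueError(f"not a Unicode code point: {code_point:#x}")
    return code_point // PLANE_SIZE


def is_bmp(code_point: int) -> bool:
    """True when the code point lies on the basic multilingual plane."""
    return plane_of(code_point) == 0


if __name__ == "__main__":
    # A Latin letter, the euro sign, and a supplementary musical symbol.
    for ch in ("A", "\u20ac", "\U0001D11E"):
        cp = ord(ch)
        print(f"U+{cp:04X} plane={plane_of(cp)} bmp={is_bmp(cp)}")
```

Under these assumptions the codespace holds 17 planes, of which plane 0 is the BMP and planes 1 through 16 are the supplementary planes.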
The Unicode Standard also reserves code points for private use.
Vendors or end users can assign these internally for their own
characters and symbols, or use them with specialized fonts. There
are 6,400 private use code points on the BMP and another 131,068
supplementary private use code points, should 6,400 be
insufficient for particular applications.
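The private use counts quoted above can be checked with a short Python sketch. The range boundaries below are stated as assumptions taken from the published code charts (U+E000..U+F8FF on the BMP, and planes 15 and 16 minus their last two code points); `PRIVATE_USE_RANGES`, `range_size`, and `is_private_use` are hypothetical names used for illustration.

```python
import unicodedata

# Assumed private use ranges (inclusive), taken from the code charts.
PRIVATE_USE_RANGES = [
    (0xE000, 0xF8FF),      # BMP private use area
    (0xF0000, 0xFFFFD),    # supplementary private use, plane 15
    (0x100000, 0x10FFFD),  # supplementary private use, plane 16
]


def range_size(first: int, last: int) -> int:
    """Number of code points in an inclusive range."""
    return last - first + 1


def is_private_use(code_point: int) -> bool:
    """True when code_point falls inside one of the private use ranges."""
    return any(first <= code_point <= last
               for first, last in PRIVATE_USE_RANGES)


bmp_count = range_size(*PRIVATE_USE_RANGES[0])
supplementary_count = sum(range_size(*r) for r in PRIVATE_USE_RANGES[1:])

if __name__ == "__main__":
    print(bmp_count, supplementary_count)
    # Python's character database reports category "Co" for private use.
    print(unicodedata.category("\ue000"))
```

An application that interprets private use code points has to agree on their meaning out of band, because the character database deliberately says nothing beyond the general category.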
Character encoding standards define not only the identity of
each character and its numeric value, or code point, but also how
this value is represented in bits.
The Unicode Standard defines three encoding forms that allow
the same data to be transmitted in a byte-oriented, word-oriented, or
double-word-oriented format (8, 16, or 32 bits per code unit). All
three encoding forms encode the same common character
repertoire and can be efficiently transformed into one another
without loss of data. The Unicode Consortium fully endorses the
use of any of these encoding forms as a conformant way of
implementing the Unicode Standard.
UTF-8 is popular for HTML and similar protocols. UTF-8 is a way
of transforming all Unicode characters into a variable length
encoding of bytes. It has the advantages that the Unicode
characters corresponding to the familiar ASCII set have the same
byte values as ASCII, and that Unicode characters transformed into
UTF-8 can be used with much existing software without extensive
software rewrites.
UTF-16 is popular in many environments that need to balance
efficient access to characters with economical use of storage. It
is reasonably compact and all the heavily used characters fit into
a single 16-bit code unit, while all other characters are
accessible via pairs of 16-bit code units.
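The text above does not spell out how a pair of 16-bit code units is formed. The following Python sketch shows the usual surrogate-pair arithmetic for code points above U+FFFF, under the assumption that high surrogates occupy U+D800..U+DBFF and low surrogates U+DC00..U+DFFF. The function names are hypothetical.

```python
def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a supplementary code point into (high, low) UTF-16 code units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only supplementary code points need a pair")
    offset = code_point - 0x10000        # 20 significant bits remain
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Recombine a (high, low) pair into the original code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


if __name__ == "__main__":
    high, low = to_surrogate_pair(0x1D11E)
    print(f"U+1D11E -> {high:04X} {low:04X}")
    # Cross-check against Python's own big-endian UTF-16 encoder.
    print("\U0001D11E".encode("utf-16-be").hex())
```

In practice a library encoder does this work; the sketch is only meant to show why two code units are enough to reach every supplementary code point.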
UTF-32 is popular where memory space is no concern, but fixed
width, single code unit access to characters is desired. Each
Unicode character is encoded in a single 32-bit code unit
when using UTF-32.
All three encoding forms need at most 4 bytes (32 bits) of
data for each character.
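As a rough illustration of the three encoding forms, the Python sketch below counts the code units each form needs for the same text and confirms that conversion between forms is lossless. It uses Python's built-in codecs; the big-endian variants are chosen only so that no byte order mark is added to the output. `code_unit_counts` and `round_trips` are hypothetical helper names.

```python
def code_unit_counts(text: str) -> dict[str, int]:
    """Count the code units needed for text in each encoding form."""
    return {
        "UTF-8": len(text.encode("utf-8")),            # 8-bit code units
        "UTF-16": len(text.encode("utf-16-be")) // 2,  # 16-bit code units
        "UTF-32": len(text.encode("utf-32-be")) // 4,  # 32-bit code units
    }


def round_trips(text: str) -> bool:
    """True when text survives encoding and decoding in all three forms."""
    return all(text.encode(codec).decode(codec) == text
               for codec in ("utf-8", "utf-16-be", "utf-32-be"))


if __name__ == "__main__":
    # ASCII letter, Latin-1 letter, euro sign, supplementary character.
    for ch in ("A", "\u00f1", "\u20ac", "\U0001D11E"):
        print(f"U+{ord(ch):04X}", code_unit_counts(ch), round_trips(ch))
```

The single-character cases show the trade-off described above: UTF-8 varies from one to four bytes, UTF-16 uses one or two code units, and UTF-32 always uses one.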
Written languages are represented by textual elements that are
used to create words and sentences. These elements may be letters
such as "w" or "M"; characters such as those used in Japanese
Hiragana to represent syllables; or ideographs such as those used
in Chinese to represent full words or concepts.
The definition of text elements often changes depending
on the process handling the text. For example, in historic Spanish
language sorting, "ll" counts as a single text element. However,
when Spanish words are typed, "ll" is two separate text elements:
"l" and "l".
To avoid deciding what is and is not a text element in
different processes, the Unicode Standard defines code
elements (commonly called "characters"). A code element is
fundamental and useful for computer text processing. For the most
part, code elements correspond to the most commonly used text
elements. In the case of the Spanish "ll", the Unicode Standard
defines each "l" as a separate code element. The task of treating
the two "l" characters as a single unit for alphabetic sorting is
left to the software processing the text.
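A minimal sketch of how software might layer the "ll" text element on top of separately encoded characters is shown below. It is deliberately simplified: it assumes that "ll" sorts as its own letter after every plain "l", and it ignores other traditional Spanish rules (such as "ch" and accented letters). The function names are hypothetical; production code would normally rely on a collation library rather than a hand-written key.

```python
def text_elements(word: str) -> list[str]:
    """Split a word into sorting elements, treating "ll" as one unit."""
    lowered = word.lower()
    elements = []
    i = 0
    while i < len(lowered):
        if lowered.startswith("ll", i):
            elements.append("ll")   # two code elements, one text element
            i += 2
        else:
            elements.append(lowered[i])
            i += 1
    return elements


def traditional_key(word: str) -> list[tuple[str, int]]:
    """Sort key in which "ll" ranks after every plain "l"."""
    return [("l", 1) if element == "ll" else (element, 0)
            for element in text_elements(word)]


if __name__ == "__main__":
    words = ["luz", "llama", "lado"]
    print(sorted(words))                       # code point order
    print(sorted(words, key=traditional_key))  # "ll" as a single element
```

The stored text is the same in both cases; only the process (sorting) decides what counts as a text element.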
Computer text handling involves processing and encoding.
Consider, for example, a word processor user typing text at a
keyboard. The computer's system software receives a message that
the user pressed a key combination for "T", which it encodes as
U+0054. The word processor stores the number in memory, and also
passes it on to the display software responsible for putting the
character on the screen. The display software, which may be a
window manager or part of the word processor itself, uses the
number as an index to find an image of a "T", which it draws on
the monitor screen. The process continues as the user types in
more characters.
The Unicode Standard directly addresses only the encoding and
semantics of text. It addresses no other action performed on the
text. For example, the word processor may check the typist's input
as it is being entered, and display misspellings with a wavy
underline. Or it may insert line breaks when it counts a certain
number of characters entered since the last line break. An
important principle of the Unicode Standard is that it does not
specify how to carry out these processes as long as the character
encoding and decoding are performed properly.
The difference between identifying a code point and rendering
it on screen or paper is crucial to understanding the Unicode
Standard's role in text processing. The character identified by a
Unicode code point is an abstract entity, such as "LATIN CAPITAL
LETTER A" or "BENGALI DIGIT FIVE." The mark made on screen or paper
-- called a glyph -- is a visual representation of the
character.
The Unicode Standard does not define glyph images. The standard
defines how characters are interpreted, not how glyphs are
rendered. The software or hardware-rendering engine of a computer
is responsible for the appearance of the characters on the screen.
The Unicode Standard does not specify the size, shape, or style
of on-screen characters.
Text elements are encoded as sequences of one or more
characters. Certain of these sequences are called combining
character sequences, made up of a base letter and one or more
combining marks, which are rendered around the base letter (above
it, below it, etc.). For example, a sequence of "a" followed by a
combining circumflex "^" would be rendered as "â". For more
information on how sequences of characters are used to represent
text in different languages, see "Where is my
Character?", and for information on grapheme clusters (what
end-users think of as characters), see UAX #29, Text
Boundaries.
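The combining character sequence from the example can be inspected with Python's standard `unicodedata` module. The sketch below assumes the combining circumflex is U+0302 COMBINING CIRCUMFLEX ACCENT; `describe` is a hypothetical helper that lists each character with its name and canonical combining class (0 for base characters).

```python
import unicodedata

# "a" followed by U+0302 COMBINING CIRCUMFLEX ACCENT.
sequence = "a\u0302"


def describe(text: str) -> list[tuple[str, str, int]]:
    """List (code point, name, combining class) for each character."""
    return [(f"U+{ord(ch):04X}",
             unicodedata.name(ch),
             unicodedata.combining(ch))
            for ch in text]


if __name__ == "__main__":
    for entry in describe(sequence):
        print(entry)
    # One user-perceived character, but two encoded characters.
    print(len(sequence))
```

A string length measured in code points therefore need not match what an end user would count as characters.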
The Unicode Standard specifies the order of characters in a
combining character sequence. The base character comes first,
followed by one or more non-spacing marks. If there is more than
one non-spacing mark, the order in which the non-spacing marks are
stored isn't important if the marks don't interact
typographically. If they do interact, then their order is
important. The Unicode Standard specifies how successive
non-spacing characters are applied to a base character, and when
the order is significant.
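One way to observe when mark order matters is to compare sequences after canonical decomposition. The sketch below assumes that marks with different canonical combining classes (here a dot below, class 220, and a circumflex above, class 230) do not interact, while two marks of the same class do. `canonically_equal` is a hypothetical helper built on Python's `unicodedata.normalize`.

```python
import unicodedata


def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings after canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# A mark below and a mark above: they do not interact typographically.
below_then_above = "a\u0323\u0302"   # dot below, then circumflex
above_then_below = "a\u0302\u0323"   # circumflex, then dot below

# Two marks above: they stack, so their order is significant.
circumflex_then_diaeresis = "a\u0302\u0308"
diaeresis_then_circumflex = "a\u0308\u0302"

if __name__ == "__main__":
    print(canonically_equal(below_then_above, above_then_below))
    print(canonically_equal(circumflex_then_diaeresis,
                            diaeresis_then_circumflex))
```

Normalization reorders the non-interacting marks into a fixed order, so both spellings compare equal; it leaves the interacting marks alone, so those two spellings stay distinct.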
Certain sequences of characters can also be represented as a
single character, called a precomposed character (or
composite or decomposable character). For example,
the character "ü" can be encoded as the single code point U+00FC
"ü" or as the base character U+0075 "u" followed by the
non-spacing character U+0308 "¨". The Unicode Standard encodes
precomposed characters for compatibility with established
standards such as Latin 1, which includes many precomposed
characters such as "ü" and "ñ".
Precomposed characters may be decomposed for consistency or
analysis. For example, in alphabetizing (collating) a list of
names, the character "ü" may be decomposed into a "u" followed by
the non-spacing character "¨". Once the character has been
decomposed, it may be easier for the collation to work with the
character because it can be processed as a "u" with modifications.
This allows easier alphabetical sorting for languages where
character modifiers do not affect alphabetical order. The Unicode
Standard defines the decompositions for all precomposed
characters. It also defines normalization forms in UAX #15, Unicode
Normalization Forms, to provide for unique representations of
characters.
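The two representations of "ü" and the effect of normalization can be sketched with Python's `unicodedata` module. `strip_marks` is a hypothetical helper that illustrates the collation idea described above (decompose, then ignore the modifiers); it is a simplification and not a substitute for a real collation algorithm.

```python
import unicodedata

precomposed = "\u00fc"   # single code point for "ü"
decomposed = "u\u0308"   # "u" followed by COMBINING DIAERESIS


def strip_marks(text: str) -> str:
    """Decompose text, then drop combining marks for comparison purposes."""
    nfd = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in nfd if unicodedata.combining(ch) == 0)


if __name__ == "__main__":
    # The raw strings differ, but they normalize to the same form.
    print(precomposed == decomposed)
    print(unicodedata.normalize("NFD", precomposed) == decomposed)
    print(unicodedata.normalize("NFC", decomposed) == precomposed)
    # The character database records the decomposition mapping.
    print(unicodedata.decomposition(precomposed))
    print(strip_marks("M\u00fcller"))
```

Comparing text only after normalizing both sides to the same form avoids treating the two spellings as different strings.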
The Unicode Standard was created by a team of computer
professionals, linguists, and scholars to become a worldwide
character standard, one easily used for text encoding everywhere.
To that end, the Unicode Standard follows a set of fundamental
principles:
The character sets of many existing international, national and
corporate standards are incorporated within the Unicode Standard.
For example, its first 256 characters are taken from the widely
used Latin-1 character set.
Duplicate encoding of characters is avoided by unifying
characters within scripts across languages; characters that are
equivalent in form are given a single code.
Chinese/Japanese/Korean (CJK) consolidation is achieved by
assigning a single code for each ideograph that is common to more
than one of these languages. This is instead of providing a
separate code for the ideograph each time it appears in a
different language. (These three languages share many thousands of
identical characters because their ideograph sets evolved from the
same source.)
The Unicode Standard specifies an algorithm for the
presentation of text with bidirectional behavior, for example,
Arabic and English. Characters are stored in logical order. The
Unicode Standard includes characters to specify changes in
direction when scripts of different directionality are mixed. For
all scripts Unicode text is in logical order within the memory
representation, corresponding to the order in which text is typed
on the keyboard.
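The following Python sketch does not implement the bidirectional algorithm itself. It only shows the per-character directional classes that such an algorithm takes as input, as reported by `unicodedata.bidirectional`, for a string stored in logical order. The sample text and the helper name `bidi_classes` are illustrative assumptions.

```python
import unicodedata

# Logical (typing) order: Latin letters, Hebrew letters, then digits.
text = "abc \u05d0\u05d1\u05d2 123"


def bidi_classes(s: str) -> list[str]:
    """Return the bidirectional class of each character in s."""
    return [unicodedata.bidirectional(ch) for ch in s]


if __name__ == "__main__":
    print(bidi_classes(text))
    # Characters that exist to signal direction changes have their own classes.
    print(unicodedata.name("\u200f"), unicodedata.bidirectional("\u200f"))
    print(unicodedata.name("\u202b"), unicodedata.bidirectional("\u202b"))
```

A display layer uses these classes to reorder the text for presentation, while the stored string keeps its logical order.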
A single number is assigned to each code element defined by the
Unicode Standard. Each of these numbers is called a code
point and, when referred to in text, is listed in hexadecimal
form following the prefix "U+". For example, the code point U+0041
is the hexadecimal number 0041 (equal to the decimal number 65).
It represents the character "A" in the Unicode Standard.
Each character is also assigned a unique name that specifies it
and no other. For example, U+0041 is assigned the character name
"LATIN CAPITAL LETTER A." U+0A1B is assigned the character name
"GURMUKHI LETTER CHA." These Unicode names are identical to the
ISO/IEC 10646 names for the same characters.
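The code point notation and character names described here can be explored with Python's built-in `ord`, `chr`, and the `unicodedata` module. The helpers `u_plus` and `from_u_plus` are hypothetical names introduced only for this sketch.

```python
import unicodedata


def u_plus(ch: str) -> str:
    """Format a character's code point in U+ notation (at least 4 hex digits)."""
    return f"U+{ord(ch):04X}"


def from_u_plus(notation: str) -> str:
    """Turn a string such as "U+0041" back into the character it denotes."""
    if not notation.upper().startswith("U+"):
        raise ValueError(f"expected U+ notation, got {notation!r}")
    return chr(int(notation[2:], 16))


if __name__ == "__main__":
    for ch in ("A", "\u0a1b"):
        print(u_plus(ch), ord(ch), unicodedata.name(ch))
    # Names are unique, so a name can also be used to look a character up.
    print(u_plus(unicodedata.lookup("GURMUKHI LETTER CHA")))
```

Because each name identifies exactly one character, the mapping works in both directions.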
The Unicode Standard groups characters together by scripts in
code blocks. A script is any system of related characters.
The standard retains the order of characters in a source set where
possible. When the characters of a script are traditionally
arranged in a certain order -- alphabetic order, for example --
the Unicode Standard arranges them in its codespace using the
same order whenever possible. Code blocks vary greatly in size.
For example, the Cyrillic code block does not exceed 256 code
points, while the CJK code blocks contain many thousands of code
points.
Code elements are grouped logically throughout the range of
code points, called the codespace. The coding starts at
U+0000 with the standard ASCII characters, and continues with
Greek, Cyrillic, Hebrew, Arabic, Indic and other scripts, followed
by symbols and punctuation. The codespace continues with
Hiragana, Katakana, and Bopomofo. The unified Han ideographs are
followed by the complete set of modern Hangul. The range of
surrogate code points is reserved for use with UTF-16.
Towards the end of the BMP is a range of code points reserved for
private use, followed by a range of compatibility characters. The
compatibility characters are character variants that are encoded
only to enable transcoding to earlier standards and older
implementations that made use of them.
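The layout of the BMP described above can be approximated with a small lookup table. Python's standard library does not expose block names, so the table below is a hand-picked, partial list whose boundaries are assumptions taken from the published code charts; the "Surrogates" entry merges the surrogate blocks into one range for brevity. `BLOCKS` and `block_of` are hypothetical names.

```python
from bisect import bisect_right

# Partial, hand-picked table of BMP ranges (inclusive), sorted by start.
BLOCKS = [
    (0x0000, 0x007F, "Basic Latin (ASCII)"),
    (0x0370, 0x03FF, "Greek and Coptic"),
    (0x0400, 0x04FF, "Cyrillic"),
    (0x0590, 0x05FF, "Hebrew"),
    (0x0600, 0x06FF, "Arabic"),
    (0x3040, 0x309F, "Hiragana"),
    (0x30A0, 0x30FF, "Katakana"),
    (0x3100, 0x312F, "Bopomofo"),
    (0x4E00, 0x9FFF, "CJK Unified Ideographs"),
    (0xAC00, 0xD7AF, "Hangul Syllables"),
    (0xD800, 0xDFFF, "Surrogates"),
    (0xE000, 0xF8FF, "Private Use Area"),
    (0xF900, 0xFAFF, "CJK Compatibility Ideographs"),
]
_STARTS = [first for first, _, _ in BLOCKS]


def block_of(code_point: int):
    """Return the name of the listed range containing code_point, or None."""
    i = bisect_right(_STARTS, code_point) - 1
    if i >= 0:
        first, last, name = BLOCKS[i]
        if first <= code_point <= last:
            return name
    return None  # not covered by this partial table


if __name__ == "__main__":
    for ch in ("A", "\u044f", "\u3042", "\u4e2d", "\ud55c"):
        print(f"U+{ord(ch):04X}", block_of(ord(ch)))
```

A complete implementation would load the full block list from the Unicode Character Database rather than hard-coding a subset.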
A range of code points on the BMP and two very large ranges in
the supplementary planes are reserved as private use areas.
These code points have no universal meaning, and may be used for
characters specific to a program or by a group of users for their
own purposes. For example, a group of choreographers may design a
set of characters for dance notation and encode the characters
using code points in user space. A set of page-layout programs may
use the same code points as control codes to position text on the
page. The main point of user space is that the Unicode Standard
assigns no meaning to these code points, and reserves them as user
space, promising never to assign them meaning in the future.
The Unicode Standard specifies unambiguous requirements for
conformance in terms of the principles and encoding architecture
it embodies. A conforming implementation has the following
characteristics, as a minimum requirement:
- characters are from the common repertoire;
- characters are encoded according to one of the encoding
forms;
- characters are interpreted with Unicode semantics;
- unassigned codes are not used; and,
- unknown characters are not corrupted.
Implementations of the Unicode Standard are conformant as long
as they follow the rules for encoding characters into
sequences of bytes, words or double words that are in effect for
the chosen encoding form and otherwise interpret characters
according to the Unicode specification. The full conformance
requirements are available within The Unicode
Standard, Version 5.0, Addison-Wesley, 2007, taking into
consideration any later update versions.
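Two of the conformance characteristics listed above lend themselves to a short Python sketch: checking that data is well-formed in an encoding form, and leaving unknown characters uncorrupted. This is a simplified illustration, not a conformance test. It relies on Python's strict UTF-8 decoder and on the general category "Cn" that `unicodedata` reports for unassigned code points; all function names are hypothetical.

```python
import unicodedata


def is_well_formed_utf8(data: bytes) -> bool:
    """True when data decodes under Python's strict UTF-8 decoder."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def is_unassigned(ch: str) -> bool:
    """True when this Python's character database has no assignment for ch."""
    return unicodedata.category(ch) == "Cn"


def upper_known(text: str) -> str:
    """Uppercase known characters; pass unassigned code points through as-is."""
    return "".join(ch if is_unassigned(ch) else ch.upper() for ch in text)


if __name__ == "__main__":
    print(is_well_formed_utf8("\u00f1".encode("utf-8")))
    print(is_well_formed_utf8(b"\xc0\x80"))      # overlong form is rejected
    print(is_well_formed_utf8(b"\xed\xa0\x80"))  # encoded surrogate is rejected
    print(upper_known("a\u0378b") == "A\u0378B")
```

Passing unknown code points through untouched means that text written against a later version of the standard survives a trip through software built against an earlier one.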
The Unicode Standard has a lot of room to grow, and there are a
considerable number of scripts that will be encoded in upcoming
versions. This process is strictly additive: characters may be
added and new character properties may be defined, but no
characters will be removed or reinterpreted in incompatible ways.
These stability guarantees make it possible to encode data in
Unicode and expect that future implementations that conform to a
later version of the Unicode Standard will be able to interpret it
in the same way as implementations conforming to an earlier version
of the standard.
The Unicode Standard is very closely aligned with the
international standard ISO/IEC 10646 (also known as the Universal
Character Set, or UCS, for short). Close cooperation and formal
liaison between the committees has ensured that all additions to
either standard are coordinated and kept in sync, so that the two
standards maintain exactly the same character repertoire and
encoding.
Version 5.0 of the Unicode Standard is code-for-code identical
to ISO/IEC 10646:2003 plus amendments. This code-for-code identity
is true for all encoded characters in the two standards, including
the East Asian (Han) ideographic characters. Subsequent versions
of the Unicode Standard track additional parts and amendments to
ISO/IEC 10646.
The Unicode encoding forms correspond exactly to forms of use
and transformation formats also defined in ISO/IEC 10646. UTF-8
and UTF-16 are defined in Annexes to ISO/IEC 10646. UTF-32
corresponds to the four-octet form UCS-4 of ISO/IEC 10646.
Authoritative information can be found at "Latest Version of the
Unicode Standard," which points both to the most recent major
version, published as a book, and to the subsequent minor versions,
published on the web. The Unicode Standard, Version 5.0 may be
ordered from the Unicode Consortium by using the Book Order Form.
Updates and errata to the Unicode Standard are posted on this web
site.
This web site also contains additional technical material and
information on using the Unicode Standard.