UNICODE CHARACTER DATABASE

Revision	4.0.0
Authors	Mark Davis and Ken Whistler
Date	2003-04-18
This Version	http://www.unicode.org/Public/4.0-Update/UCD-4.0.0.html
Previous Version	http://www.unicode.org/Public/3.2-Update/UnicodeCharacterDatabase-3.2.0.html, http://www.unicode.org/Public/3.2-Update/DerivedProperties-3.2.0.html, http://www.unicode.org/Public/3.2-Update/UnicodeData-3.2.0.html, http://www.unicode.org/Public/3.2-Update/PropList-3.2.0.html
Latest Version	http://www.unicode.org/Public/UNIDATA/UCD.html

Summary

This document describes the format and content of the Unicode Character Database (UCD).

Status

This file and the files described herein are part of the Unicode Character Database and are governed by the UCD Terms of Use given below.

The References provide related information that is useful in understanding this document.

Warning: the information in this file does not completely describe the use and interpretation of Unicode character properties and behavior. It must be used in conjunction with the data in the other files in the Unicode Character Database, and relies on the notation and definitions supplied in The Unicode Standard. All chapter references are to Version 4.0.0 of the standard unless otherwise indicated.

Introduction
Conformance
UCD File Format
UCD Files
Properties
Property Values
Other UCD Files
Derived Extracted Properties
Property Invariants
References
Modification History

Introduction

The Unicode Character Database (UCD) is a set of files that define the Unicode character properties and internal mappings. This document describes the properties and files that are part of The Unicode Standard, Version 4.0 [U4.0]. The main changes in this version are:

The four documentation files (UnicodeCharacterDatabase.html, UnicodeData.html, DerivedProperties.html, and PropList.html) have been merged together.
There is an additional index by property instead of by file.
A number of additional properties have been added as a part of Unicode 4.0.

This documentation file does not link directly to other files in the UCD. This is because the files need to be exactly the same in the specific update directory (e.g. http://www.unicode.org/Public/4.0-Update/), and when copied to the "latest" directory (http://www.unicode.org/Public/UNIDATA/).

Conformance

For information on the meaning and application of the terms normative, informative, and provisional, see "Chapter 3, Conformance" in the Unicode Standard, Version 4.0.

UCD File Format

Files in the UCD use the following format, unless otherwise specified.

Each line of data consists of fields separated by semicolons. The fields are numbered starting with zero. Code points are expressed as hexadecimal numbers with four to six digits. They are written without "U+". Within a sequence of code points, spaces are used for separation. Leading and trailing spaces within a field are not significant.

The first field (0) of each line in the Unicode Character Database files represents a code point or range. The remaining fields (1..n) are properties associated with that code point.

A range of code points is specified by the form "X..Y". Each code point from X to Y has the associated properties. For example:

0000..007F; Basic Latin
0080..00FF; Latin-1 Supplement

1680      ; White_Space # Zs OGHAM SPACE MARK
2000..200A; White_Space # Zs [11] EN QUAD..HAIR SPACE

For backwards compatibility, in the file UnicodeData.txt a range is specified not by the form "X..Y", but by its start and end characters. In such cases, the names of characters in the range are algorithmically derivable. Surrogate code points and private use characters have no names. See U4.0 for more information.

Hash marks ("#") are used to indicate comments: all characters from the hash mark to the end of the line are comments, and are disregarded when parsing data. In many files, the comments on data lines use a common format.
```
00BC..00BE ; numeric # No [3] VULGAR FRACTION ONE QUARTER..VULGAR FRACTION THREE QUARTERS
```
The first part of the comment is generally the UCD general category. The symbol "L&" indicates characters of type Lu, Ll, or Lt. This is the same as the LC value in PropertyValueAliases.txt. The code point ranges are calculated so that they all have the same General Category (or LC). While this results in more ranges than are strictly necessary, it makes the contents of the ranges clearer. The second part of the comment (in square brackets) indicates the number of items in a range, if there is one. The third part is the name of the character in field zero: if it is a range, then the character names for the ends of the range are separated by "..".

However, the comments are purely informational, and may change format or be omitted in the future. They should not be parsed for content.

In the following table, NF* refers to one of NFD, NFC, NFKC, or NFKD.

The Unihan data format differs from the standard format, and is described in the header of the file. The header also describes which properties are informative, which are normative, and which are provisional.

In some cases, segments of the file are distinguished by a line starting with an "@" sign.

The files are either Latin-1 or UTF-8. Unless otherwise noted, non-ASCII characters only appear in comments.
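The general line format described in this section can be read with very little code. The following is a minimal sketch, not a reference parser: the function name `parse_ucd_line` is hypothetical, and it assumes a file in which field 0 is a single code point or an "X..Y" range. It does not handle the UnicodeData.txt range convention, the Unihan format, or lines starting with "@".

```python
def parse_ucd_line(line):
    """Parse one UCD data line.

    Returns (first, last, fields) with code points as integers, or None
    when the line is blank or contains only a comment.
    """
    # Everything from the hash mark to the end of the line is a comment.
    data = line.split("#", 1)[0].strip()
    if not data:
        return None
    # Fields are separated by semicolons; surrounding spaces are not significant.
    fields = [field.strip() for field in data.split(";")]
    # Field 0 is either a single code point or a range "X..Y", in hexadecimal.
    first, sep, last = fields[0].partition("..")
    start = int(first, 16)
    end = int(last, 16) if sep else start
    return start, end, fields[1:]


if __name__ == "__main__":
    sample = "2000..200A; White_Space # Zs [11] EN QUAD..HAIR SPACE"
    print(parse_ucd_line(sample))
```

Stripping the comment before splitting the fields keeps the ".." inside the comment from being mistaken for a range, and it follows the advice above that comments should not be parsed for content.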

UCD Files

The following table describes the format and meaning of each property data file in the UCD. The first column lists the files and the properties for which they contain data. The second column indicates the type of property value: String, Numeric, Enumeration (non-binary), Binary. The third column indicates the status (Normative vs. Informative), and the fourth column provides a description of the data.

The files with a small number of properties are listed first, followed by the files with a large number of properties: DerivedCoreProperties.txt, DerivedNormalizationProperties.txt, PropList.txt, and UnicodeData.txt. For UnicodeData.txt, the field numbers are supplied in the description. In a number of cases, fields in a data file only contribute to a UCD property; for example, the name field in UnicodeData.txt does not provide all the values for the Name property; Jamo.txt must be used as well.

None of these properties should be used without consulting the relevant discussions in the Unicode Standard.

Where a data file does not explicitly list property values for all code points, the code points are given default property values. These default property values are documented in the data files, with the exception of UnicodeData.txt. For that case the default property values are listed below in parentheses after the property name, with (=) indicating the code point itself. The default property values are also documented in any corresponding extracted data file.

ArabicShaping.txt

Joining_Type
Joining_Group

Basic Arabic and Syriac character shaping properties, such as initial, medial and final shapes. See Section 8.2.

BidiMirroring.txt

Bidi_Mirroring_Glyph

Properties for substituting characters in an implementation of bidirectional mirroring. See UAX #9. Do not confuse this with the Bidi_Mirrored property.

Blocks.txt

Block

List of block names, which are arbitrary names for ranges of code points. See Chapter 16.

CompositionExclusions.txt

Composition_Exclusion

Properties for normalization. See UAX #15. Unlike other files, CompositionExclusions simply lists the relevant code points.

CaseFolding.txt

Simple_Case_Folding
Case_Folding
Special_Case_Condition

Mapping from characters to their case-folded forms. This is an informative file containing normative derived properties.

Derived from UnicodeData.txt and SpecialCasing.txt. See UAX #21.

DerivedAge.txt

Age

N/I

This file shows when various code points were designated/assigned in successive versions of the Unicode standard.

EastAsianWidth.txt

East_Asian_Width

Properties for determining the choice of wide vs. narrow glyphs in East Asian contexts. Property values are described in UAX #11.

HangulSyllableType.txt

Hangul_Syllable_Type

The values L, V, T, LV, and LVT used in Chapter 3.

Jamo.txt

used in Name

The Hangul Syllable names are derived from the Jamo Short Names, as described in Chapter 3.

LineBreak.txt

Line_Break

N/I

Properties for line breaking. For more information, see UAX #14.

NormalizationCorrections.txt

used in Decomposition Mappings

NormalizationCorrections lists code point differences for Normalization Corrigenda. See UAX #15 for more information.

PropertyAliases.txt

n/a

N/I

Property names and abbreviations. These names can be used for XML formats of UCD data, for regular-expression property tests, and other programmatic textual descriptions of Unicode data.

PropertyValueAliases.txt

n/a

N/I

Property value names and abbreviations. These names can be used for XML formats of UCD data, for regular-expression property tests, and other programmatic textual descriptions of Unicode data.

Scripts.txt

Script

Default script values for use in regular expressions. For more information, see UTR #24.

SpecialCasing.txt

Uppercase_Mapping
Lowercase_Mapping
Titlecase_Mapping
Special_Case_Condition

Data for producing (in combination with UnicodeData.txt) the full case mappings.

Unihan.txt (for more information, see Unihan Properties)

Numeric_Type
Numeric_Value

The characters tagged with kPrimaryNumeric, kAccountingNumeric, and kOtherNumeric are given the Numeric_Type numeric, and the values indicated.

Most characters have these properties based on values from the UnicodeData.txt data file. See Numeric_Type.

Unicode_Radical_Stroke

The Unicode radical stroke count, based on the tag kRSUnicode.

DerivedCoreProperties.txt

Alphabetic

Characters with the Alphabetic property. For more information, see Chapter 4, Character Properties.

Generated from: Other_Alphabetic + Lu + Ll + Lt + Lm + Lo + Nl

Default_Ignorable_Code_Point

For programmatic determination of default-ignorable code points. New characters that should be ignored in processing (unless explicitly supported) will be assigned in these ranges, permitting programs to correctly handle the default behavior of such characters when not otherwise supported. For more information, see UAX #29: Text Boundaries.

Generated from: Other_Default_Ignorable_Code_Point + Cf + Cc + Cs - White_Space

Lowercase

Characters with the Lowercase property. For more information, see Chapter 4, Character Properties.

Generated from: Other_Lowercase + Ll

Grapheme_Base

For programmatic determination of grapheme cluster boundaries. For more information, see UAX #29: Text Boundaries.

Generated from: [0..10FFFF] - Cc - Cf - Cs - Co - Cn - Zl - Zp - Grapheme_Extend

Grapheme_Extend

For programmatic determination of grapheme cluster boundaries. For more information, see UAX #29: Text Boundaries.

Generated from: Other_Grapheme_Extend + Me + Mn

Note: depending on an application's interpretation of Co (private use), private use characters may be either in Grapheme_Base, or in Grapheme_Extend, or in neither.

ID_Start

Characters that can start an identifier.

Generated from: Lu + Ll + Lt + Lm + Lo + Nl + Other_ID_Start

ID_Continue

Characters that can continue an identifier. See Cf Note.

Generated from: ID_Start + Mn + Mc + Nd + Pc
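The "Generated from" formulas for ID_Start and ID_Continue are unions of General_Category values plus a contributory set. The sketch below illustrates that reading using Python's `unicodedata` module. The function names are hypothetical, the Other_ID_Start set must be supplied by the caller (for example, after reading PropList.txt), and Python ships data for a later Unicode version than 4.0.0, so individual results may differ from the 4.0.0 data files.

```python
import unicodedata

# General_Category values named in the two "Generated from" formulas.
ID_START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
ID_CONTINUE_EXTRA_CATEGORIES = {"Mn", "Mc", "Nd", "Pc"}


def is_id_start(ch, other_id_start=frozenset()):
    """Lu + Ll + Lt + Lm + Lo + Nl + Other_ID_Start (set supplied by the caller)."""
    return unicodedata.category(ch) in ID_START_CATEGORIES or ch in other_id_start


def is_id_continue(ch, other_id_start=frozenset()):
    """ID_Start + Mn + Mc + Nd + Pc."""
    return (is_id_start(ch, other_id_start)
            or unicodedata.category(ch) in ID_CONTINUE_EXTRA_CATEGORIES)


if __name__ == "__main__":
    for ch in "a1_":
        print(repr(ch), is_id_start(ch), is_id_continue(ch))
```

The other derived properties in this file (Alphabetic, Lowercase, Math, and so on) follow the same pattern of combining General_Category values with an Other_XXX set.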

Math

Characters with the Math property. For more information, see Chapter 4, Character Properties.

Generated from: Sm + Other_Math

Uppercase

Characters with the Uppercase property. For more information, see Chapter 4, Character Properties.

Generated from: Lu + Other_Uppercase

XID_Start

Same as ID_Start, except for modifications to allow closure under normalization forms NFKC and NFKD.

Generated from: ID_Start; see Closure Note

XID_Continue

Same as ID_Continue, except for modifications to allow closure under normalization forms NFKC and NFKD.

Generated from: ID_Continue; see Closure Note and Cf Note.

DerivedNormalizationProperties.txt

Full_Composition_Exclusion

Characters that are excluded from composition: those explicitly in CompositionExclusions.txt, plus:
(3) Singleton Decompositions
(4) Non-Starter Decompositions

Expands_On_NFC
Expands_On_NFD
Expands_On_NFKC
Expands_On_NFKD

Characters that expand to more than one character in the specified normalization form.

FC_NFKC_Closure

Characters that require extra mappings for closure under Case Folding plus Normalization Form KC. Characters marked with this property have a third field with the mapping in it. Generated with the following, where Fold is the default fold operation (not Turkic):

b = NFKC(Fold(a));
c = NFKC(Fold(b));
if (c != b) add mapping from a to c
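The three steps above translate almost directly into Python. This is only a sketch under stated assumptions: `str.casefold()` stands in for the default Fold operation, `unicodedata.normalize` supplies NFKC, the function name `fc_nfkc_closure` is hypothetical, and the interpreter's bundled Unicode data is newer than 4.0.0.

```python
import unicodedata


def fc_nfkc_closure(a):
    """Return the extra mapping c for character a, or None if none is needed."""
    # b = NFKC(Fold(a))
    b = unicodedata.normalize("NFKC", a.casefold())
    # c = NFKC(Fold(b))
    c = unicodedata.normalize("NFKC", b.casefold())
    # A mapping from a to c is recorded only when the second pass changes the result.
    return c if c != b else None


if __name__ == "__main__":
    result = fc_nfkc_closure("\u037A")  # GREEK YPOGEGRAMMENI
    print(None if result is None else [f"{ord(ch):04X}" for ch in result])
```

For U+037A, the first pass yields U+0020 U+0345 and the second pass folds U+0345 to U+03B9, so the function returns the two-character mapping U+0020 U+03B9.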

NFD_Quick_Check
NFKD_Quick_Check
NFC_Quick_Check
NFKC_Quick_Check

For property values, see Decompositions and Normalization.

PropList.txt

ASCII_Hex_Digit

ASCII characters commonly used for the representation of hexadecimal numbers.

Bidi_Control

Those format control characters which have specific functions in the Bidirectional Algorithm.

Dash

Those punctuation characters explicitly called out as dashes in the Unicode Standard, plus compatibility equivalents to those. Most of these have the Pd General Category, but some have the Sm General Category because of their use in mathematics.

Deprecated

For a machine-readable list of deprecated characters. No characters will ever be removed from the standard, but the usage of deprecated characters is strongly discouraged.

Diacritic

Characters that linguistically modify the meaning of another character to which they apply. Some diacritics are not combining characters, and some combining characters are not diacritics.

Extender

Characters whose principal function is to extend the value or shape of a preceding alphabetic character. Typical of these are length and iteration marks.

Grapheme_Link

Used in determining default grapheme cluster boundaries. For more information, see UAX #29: Text Boundaries.

Hex_Digit

Characters commonly used for the representation of hexadecimal numbers, plus their compatibility equivalents.

Hyphen (Stabilized as of 3.2)

Those dashes used to mark connections between pieces of words, plus the Katakana middle dot. The Katakana middle dot functions like a hyphen, but is shaped like a dot rather than a dash.

Ideographic

Characters considered to be CJKV (Chinese, Japanese, Korean, and Vietnamese) ideographs.

IDS_Binary_Operator

Used in Ideographic Description Sequences.

IDS_Trinary_Operator

Used in Ideographic Description Sequences.

Join_Control

Those format control characters which have specific functions for control of cursive joining and ligation.

Logical_Order_Exception

There are a small number of characters that do not use logical order. These characters require special handling in most processing.

Noncharacter_Code_Point

Code points that are explicitly defined as illegal for the encoding of characters.

Other_Alphabetic

Used in deriving the Alphabetic property.

Other_Default_Ignorable_Code_Point

Used in deriving the Default_Ignorable_Code_Point property.

Other_Grapheme_Extend

Used in deriving the Grapheme_Extend property.

Other_ID_Start

Used for backwards compatibility of ID_Start.

Other_Lowercase

Used in deriving the Lowercase property.

Other_Math

Used in deriving the Math property.

Other_Uppercase

Used in deriving the Uppercase property.

Quotation_Mark

Those punctuation characters that function as quotation marks.

Radical

Used in Ideographic Description Sequences.

Soft_Dotted

Characters with a "soft dot", like i or j. An accent placed on these characters causes the dot to disappear. An explicit dot above can be added where required, such as in Lithuanian.

Terminal_Punctuation

Those punctuation characters that generally mark the end of textual units.

Unified_Ideograph

Used in Ideographic Description Sequences.

White_Space

Those separator characters and control characters which should be treated by programming languages as "white space" for the purpose of parsing elements.

Note: ZERO WIDTH SPACE and ZERO WIDTH NO-BREAK SPACE are not included, since their functions are restricted to line-break control. Their names are unfortunately misleading in this respect.

Note: There are other senses of "whitespace" that encompass a different set of characters.

UnicodeData.txt

Name* (<reserved>)

(1) These names match exactly the names published in the code charts of the Unicode Standard. The Hangul Syllable names are omitted from this file; see Jamo.txt.

General_Category (Cn)

(2) This is a useful breakdown into various character types which can be used as a default categorization in implementations. For the property values, see General Category Values.

Canonical_Combining_Class (0)

(3) The classes used for the Canonical Ordering Algorithm in the Unicode Standard. For the property value names associated with different numeric values, see DerivedCombiningClass.txt and Canonical Combining Class Values.

Bidi_Class (L, AL, R)

(4) These are the categories required by the Bidirectional Behavior Algorithm in the Unicode Standard. For the property values, see Bidi Class Values. For more information, see UAX #9 Bidirectional Algorithm.

The default property values depend on the code point:

R: U+0590..U+05FF, U+07C0..U+08FF, U+FB1D..U+FB4F, U+10800..U+10FFF

(In 4.0.0, this includes the Hebrew and Cypriot Syllabary blocks, plus the reserved code points in U+07C0..U+08FF, U+FB1D..U+FB4F, U+10840..U+10FFF)

AL: U+0600..U+07BF, U+FB50..U+FDCF, U+FDF0..U+FDFF, U+FE70..U+FEFE

(In 4.0.0, this includes the Arabic, Syriac, Thaana, Arabic Presentation Forms-A, and Arabic Presentation Forms-B blocks, plus the reserved code points in U+0750..U+077F, minus the noncharacters U+FDD0..U+FDEF and the BOM U+FEFF)

L: all other code points
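The default Bidi_Class lookup can be sketched as a simple range test. The code below assumes the first range list defaults to R, the second to AL, and every other code point to L, matching the order of the defaults given after the property name. The function name `default_bidi_class` is hypothetical, and the result applies only to code points that have no explicit entry in UnicodeData.txt.

```python
# Ranges copied from the lists above; the R/AL/L assignment is an assumption
# based on the order "(L, AL, R)" and the blocks named in the parentheses.
R_RANGES = [(0x0590, 0x05FF), (0x07C0, 0x08FF), (0xFB1D, 0xFB4F), (0x10800, 0x10FFF)]
AL_RANGES = [(0x0600, 0x07BF), (0xFB50, 0xFDCF), (0xFDF0, 0xFDFF), (0xFE70, 0xFEFE)]


def default_bidi_class(cp):
    """Default Bidi_Class for a code point not listed in UnicodeData.txt."""
    if any(low <= cp <= high for low, high in R_RANGES):
        return "R"
    if any(low <= cp <= high for low, high in AL_RANGES):
        return "AL"
    return "L"


if __name__ == "__main__":
    for cp in (0x05FF, 0x0750, 0xFDD0, 0x0041):
        print(f"U+{cp:04X}", default_bidi_class(cp))
```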

Decomposition_Type (None)
Decomposition_Mapping (=)

E
S

(5) This field contains both values, with the type in angle brackets. The decomposition mappings match exactly the decomposition mappings published with the character names in the Unicode Standard. For more information, see Character Decomposition Mappings.

Numeric_Type (None)
Numeric_Value (Not a Number)

E
N

(6) If the character has the decimal digit property, as specified in Chapter 4 of the Unicode Standard, then the value of that digit is represented with an integer value in fields 6, 7, and 8.

E
N

(7) If the character has the digit property, but is not a decimal digit, then the value of that digit is represented with an integer value in fields 7 and 8. This covers digits that need special handling, such as the compatibility superscript digits.

E
N

(8) If the character has the numeric property, as specified in Chapter 4 of the Unicode Standard, the value of that character is represented with a positive or negative integer or rational number in this field. This includes fractions such as "1/5" for U+2155 VULGAR FRACTION ONE FIFTH.

Some characters have these properties based on values from the Unihan data file. See Numeric_Type, Han.
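The relationship between fields 6, 7, and 8 can be illustrated with a short sketch. It assumes a semicolon-separated UnicodeData.txt line and derives the numeric type from which of the three fields are filled in. The function name `numeric_properties` and the returned labels ("Decimal", "Digit", "Numeric", "None") are illustrative choices, and values supplied by Unihan.txt are not covered.

```python
from fractions import Fraction


def numeric_properties(line):
    """Return (numeric_type, numeric_value) for one UnicodeData.txt line."""
    fields = line.split(";")
    decimal, digit, numeric = fields[6], fields[7], fields[8]
    if decimal:
        numeric_type = "Decimal"   # fields 6, 7, and 8 all carry the value
    elif digit:
        numeric_type = "Digit"     # only fields 7 and 8 carry the value
    elif numeric:
        numeric_type = "Numeric"   # only field 8 carries the value
    else:
        return "None", None
    # Field 8 holds an integer or a rational such as "1/5".
    return numeric_type, Fraction(numeric)


if __name__ == "__main__":
    line = ("2155;VULGAR FRACTION ONE FIFTH;No;0;ON;"
            "<fraction> 0031 2044 0035;;;1/5;N;FRACTION ONE FIFTH;;;;")
    print(numeric_properties(line))
```

Using `Fraction` keeps rational values exact instead of rounding them to floating point.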

Bidi_Mirrored (N)

(9) If the character has been identified as a "mirrored" character in bidirectional text, this field has the value "Y"; otherwise "N". The list of mirrored characters is also printed in Chapter 4 of the Unicode Standard. Do not confuse this with the Bidi_Mirroring_Glyph property.

Unicode_1_Name (<none>)

(10) This is the old name as published in Unicode 1.0. This name is only provided when it is significantly different from the current name for the character. The value of field 10 for control characters does not always match the Unicode 1.0 names. Instead, field 10 contains ISO 6429 names for control functions, for printing in the code charts.

ISO_Comment (<none>)

(11) This is the ISO 10646 comment field. It appears in parentheses in the 10646 names list, or contains an asterisk to mark an Annex P note.

Simple_Uppercase_Mapping (=)

(12) Simple uppercase mapping (single character result). If a character is part of an alphabet with case distinctions, and has a simple upper case equivalent, then the upper case equivalent is in this field. See the explanation below on case distinctions. The simple mappings have a single character result, where the full mappings may have multi-character results. For more information, see Case Mappings.

Note: The simple uppercase may be omitted in the data file if the uppercase is the same as the code point itself.

Simple_Lowercase_Mapping (=)

(13) Simple lowercase mapping (single character result). Similar to Uppercase mapping.

Note: The simple lowercase may be omitted in the data file if the lowercase is the same as the code point itself.

Simple_Titlecase_Mapping (=)

(14) Simple titlecase mapping (single character result). Similar to Uppercase mapping.

Note: The simple titlecase may be omitted in the data file if the titlecase is the same as the uppercase.
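The defaults described in the three notes above can be applied when reading fields 12, 13, and 14. The following is a minimal sketch, assuming a semicolon-separated UnicodeData.txt line. The function name `simple_case_mappings` is hypothetical, and full (multi-character) mappings from SpecialCasing.txt are out of scope.

```python
def simple_case_mappings(line):
    """Return (upper, lower, title) code points for one UnicodeData.txt line."""
    fields = line.split(";")
    code_point = int(fields[0], 16)
    # An empty uppercase or lowercase field means the code point maps to itself.
    upper = int(fields[12], 16) if fields[12] else code_point
    lower = int(fields[13], 16) if fields[13] else code_point
    # An empty titlecase field means the titlecase is the same as the uppercase.
    title = int(fields[14], 16) if fields[14] else upper
    return upper, lower, title


if __name__ == "__main__":
    line = ("01C6;LATIN SMALL LETTER DZ WITH CARON;Ll;0;L;<compat> 0064 017E;"
            ";;;N;LATIN SMALL LETTER D Z HACEK;;01C4;;01C5")
    print([f"{cp:04X}" for cp in simple_case_mappings(line)])
```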

Notes

Closure: XID_Start and XID_Continue are defined by adding or removing certain special characters as per UAX #15, Annex 7. They do not remove the non-NFKD nor the non-NFKC characters; if that is desired it needs to be a separate filter. They merely ensure that:

if isIdentifier(string) then isIdentifier(NFKC(string)) and isIdentifier(NFKD(string))
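The closure condition can be spot-checked for individual strings. The sketch below uses Python's `str.isidentifier()` as a stand-in for isIdentifier. This is an assumption: Python's identifier rules are documented as being based on XID_Start and XID_Continue, but they also allow a leading underscore and rely on a later Unicode version than 4.0.0. The function name `closure_holds` is hypothetical.

```python
import unicodedata


def closure_holds(string):
    """Check the closure condition for one string.

    The condition is an implication, so it holds trivially for strings
    that are not identifiers in the first place.
    """
    if not string.isidentifier():
        return True
    return all(unicodedata.normalize(form, string).isidentifier()
               for form in ("NFKC", "NFKD"))


if __name__ == "__main__":
    for sample in ("abc", "\uFB01le", "caf\u00E9"):
        print(repr(sample), closure_holds(sample))
```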

Cf: The general category Cf characters are not included in ID_Continue nor in XID_Continue; they should continue identifiers, but be filtered out of the result.

For more information on identifiers, see Chapter 5, Implementation Guidelines, and UAX #15, Annex 7.

Stabilized properties are those that have not been found to be particularly useful in practice, and are no longer actively maintained, nor are they extended as new characters are added.

Properties

The following table lists the properties in the UCD. They are roughly organized into groups based on the usage of the property (this grouping is purely for convenience, and has no other implications). The link on each property leads to its description in the file index. The contributory properties (those of the form Other_XXX) are sets of exceptions used to generate properties in DerivedCoreProperties.txt. They are not intended for general use, such as in APIs that return property values.

Group	Properties
General	Name, Block, Age, General_Category, Script, White_Space, Alphabetic, Hangul_Syllable_Type, Noncharacter_Code_Point, Default_Ignorable_Code_Point, Deprecated, Logical_Order_Exception
Case	Uppercase, Lowercase, Lowercase_Mapping, Titlecase_Mapping, Uppercase_Mapping, Case_Folding, Simple_Lowercase_Mapping, Simple_Titlecase_Mapping, Simple_Uppercase_Mapping, Simple_Case_Folding, Special_Case_Condition, Soft_Dotted
Identifiers	ID_Continue, ID_Start, XID_Continue, XID_Start
Decomposition and Normalization	Canonical_Combining_Class, Decomposition_Mapping, Composition_Exclusion, Full_Composition_Exclusion, Decomposition_Type, FC_NFKC_Closure, NFC_Quick_Check, NFKC_Quick_Check, NFD_Quick_Check, NFKD_Quick_Check, Expands_On_NFC, Expands_On_NFD, Expands_On_NFKC, Expands_On_NFKD
Shaping and Rendering	Join_Control, Joining_Group, Joining_Type, Line_Break, East_Asian_Width
Bidi	Bidi_Control, Bidi_Mirrored, Bidi_Class, Bidi_Mirroring_Glyph
Numeric	Numeric_Value, Numeric_Type, Hex_Digit, ASCII_Hex_Digit
CJK	Ideographic, Unified_Ideograph, Radical, IDS_Binary_Operator, IDS_Trinary_Operator, Unicode_Radical_Stroke
Misc	Math, Quotation_Mark, Dash, Hyphen, Terminal_Punctuation, Diacritic, Extender, Grapheme_Base, Grapheme_Extend, Grapheme_Link, Unicode_1_Name, ISO_Comment
Contributory Properties	Other_Alphabetic, Other_Default_Ignorable_Code_Point, Other_Grapheme_Extend, Other_ID_Start, Other_Lowercase, Other_Math, Other_Uppercase

Property Values

The following gives a summary of property values for certain properties. Other property values are documented in other locations; for example, the Line_Break property values are documented in UAX #14.

General Category Values

The values in this field are abbreviations for the following values. For more information, see the Unicode Standard.

Note: The Unicode Standard does not assign information to control characters (except for certain cases). Implementations will generally also assign categories to certain control characters, notably CR and LF, according to platform conventions. See Section 5.8 "Newline Guidelines" for more information.

Abbr.	Description
Lu	Letter, Uppercase
Ll	Letter, Lowercase
Lt	Letter, Titlecase
Lm	Letter, Modifier
Lo	Letter, Other
Mn	Mark, Non-Spacing
Mc	Mark, Spacing Combining
Me	Mark, Enclosing
Nd	Number, Decimal
Nl	Number, Letter
No	Number, Other
Pc	Punctuation, Connector
Pd	Punctuation, Dash
Ps	Punctuation, Open
Pe	Punctuation, Close
Pi	Punctuation, Initial quote (may behave like Ps or Pe depending on usage)
Pf	Punctuation, Final quote (may behave like Ps or Pe depending on usage)
Po	Punctuation, Other
Sm	Symbol, Math
Sc	Symbol, Currency
Sk	Symbol, Modifier
So	Symbol, Other
Zs	Separator, Space
Zl	Separator, Line
Zp	Separator, Paragraph
Cc	Other, Control
Cf	Other, Format
Cs	Other, Surrogate
Co	Other, Private Use
Cn	Other, Not Assigned (no characters in the file have this property)

Note: The term "L&" is used to stand for Uppercase, Lowercase or Titlecase letters (Lu, Ll, or Lt) in comments. The LC value in PropertyValueAliases.txt also stands for Uppercase, Lowercase or Titlecase letters.

Bidi Class Values

Please refer to Chapter 3 for an explanation of the algorithm for Bidirectional Behavior and an explanation of the significance of these categories. An up-to-date version can be found in UAX #9: The Bidirectional Algorithm.

Type	Description
L	Left-to-Right
LRE	Left-to-Right Embedding
LRO	Left-to-Right Override
R	Right-to-Left
AL	Right-to-Left Arabic
RLE	Right-to-Left Embedding
RLO	Right-to-Left Override
PDF	Pop Directional Format
EN	European Number
ES	European Number Separator
ET	European Number Terminator
AN	Arabic Number
CS	Common Number Separator
NSM	Non-Spacing Mark
BN	Boundary Neutral
B	Paragraph Separator
S	Segment Separator
WS	Whitespace
ON	Other Neutrals

Character Decomposition Mapping

The tags supplied with certain decomposition mappings generally indicate formatting information. Where no such tag is given, the mapping is canonical. Conversely, the presence of a formatting tag also indicates that the mapping is a compatibility mapping and not a canonical mapping. In the absence of other formatting information in a compatibility mapping, the tag is used to distinguish it from canonical mappings.

In some instances a canonical mapping or a compatibility mapping may consist of a single character. For a canonical mapping, this indicates that the character is a canonical equivalent of another single character. For a compatibility mapping, this indicates that the character is a compatibility equivalent of another single character. The compatibility formatting tags used are:

Tag	Description
<font>	A font variant (e.g. a blackletter form).
<noBreak>	A no-break version of a space or hyphen.
<initial>	An initial presentation form (Arabic).
<medial>	A medial presentation form (Arabic).
<final>	A final presentation form (Arabic).
<isolated>	An isolated presentation form (Arabic).
<circle>	An encircled form.
<super>	A superscript form.
<sub>	A subscript form.
<vertical>	A vertical layout presentation form.
<wide>	A wide (or zenkaku) compatibility character.
<narrow>	A narrow (or hankaku) compatibility character.
<small>	A small variant form (CNS compatibility).
<square>	A CJK squared font variant.
<fraction>	A vulgar fraction form.
<compat>	Otherwise unspecified compatibility character.

Reminder: There is a difference between decomposition and decomposition mapping. The decomposition mappings are defined in UnicodeData.txt, while the decomposition (also termed "full decomposition") is defined in Chapter 3 to use those mappings recursively.

The canonical decomposition is formed by recursively applying the canonical mappings, then applying the canonical reordering algorithm.
The compatibility decomposition is formed by recursively applying the canonical and compatibility mappings, then applying the canonical reordering algorithm.
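The difference between a decomposition mapping and a full decomposition can be made concrete with a short sketch. It reads the mapping field through Python's `unicodedata.decomposition()`, which returns the same "tag plus code points" text as field 5, expands canonical mappings recursively, and then sorts each run of combining marks by Canonical_Combining_Class. The helper names are hypothetical. Hangul syllables, whose decomposition is algorithmic rather than listed, are not handled, and the interpreter's Unicode data is newer than 4.0.0.

```python
import unicodedata


def split_decomposition(field):
    """Split a decomposition field into (tag, code_points); tag is None if canonical."""
    parts = field.split()
    tag = None
    if parts and parts[0].startswith("<"):
        tag = parts.pop(0).strip("<>")
    return tag, [int(cp, 16) for cp in parts]


def expand_canonical(ch):
    """Recursively apply canonical mappings to a single character."""
    tag, code_points = split_decomposition(unicodedata.decomposition(ch))
    if tag is not None or not code_points:
        return [ch]  # compatibility mapping or no mapping: keep the character
    return [d for cp in code_points for d in expand_canonical(chr(cp))]


def canonical_decomposition(text):
    """Recursive canonical mappings followed by canonical reordering."""
    expanded = [d for ch in text for d in expand_canonical(ch)]
    result, run = [], []
    for ch in expanded:
        if unicodedata.combining(ch) == 0:
            # A starter ends the current run of combining marks.
            result.extend(sorted(run, key=unicodedata.combining))
            run = []
            result.append(ch)
        else:
            run.append(ch)
    # sorted() is stable, so marks with equal classes keep their order.
    result.extend(sorted(run, key=unicodedata.combining))
    return "".join(result)


if __name__ == "__main__":
    decomposed = canonical_decomposition("\u1E69")
    print([f"{ord(ch):04X}" for ch in decomposed])
```

A compatibility decomposition would follow the same outline but also expand mappings that carry a tag.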

Canonical Combining Class Values

Value	Description
0:	Spacing, split, enclosing, reordrant, and Tibetan subjoined
1:	Overlays and interior
7:	Nuktas
8:	Hiragana/Katakana voicing marks
9:	Viramas
10:	Start of fixed position classes
199:	End of fixed position classes
200:	Below left attached
202:	Below attached
204:	Below right attached
208:	Left attached (reordrant around single base character)
210:	Right attached
212:	Above left attached
214:	Above attached
216:	Above right attached
218:	Below left
220:	Below
222:	Below right
224:	Left (reordrant around single base character)
226:	Right
228:	Above left
230:	Above
232:	Above right
233:	Double below
234:	Double above
240:	Below (iota subscript)

Note: some of the combining classes in this list do not currently have members but are specified here for completeness.

Decompositions and Normalization

Decomposition is specified in Chapter 3. UAX #15: Unicode Normalization Forms specifies the interaction between decomposition and normalization. That report specifies how the decompositions defined in UnicodeData.txt are used to derive normalized forms of Unicode text.

Note that as of the 2.1.9 update of the Unicode Character Database, the decompositions in the UnicodeData.txt file can be used to recursively derive the full decomposition in canonical order, without the need to separately apply canonical reordering. However, canonical reordering of combining character sequences must still be applied in decomposition when normalizing source text which contains any combining marks.

The QuickCheck property values are as follows:

Value	File Text	Description
No	NF*_No	Characters that cannot ever occur in the respective normalization form. See Decompositions and Normalization.
Maybe	NF*_Maybe	Characters that may occur in the respective normalization form, depending on the context. See QuickCheck Note.
Yes	n/a	All other characters. This is the default value, and is not explicitly listed in the file.

For more information, see UAX #15 Annex 8.
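As an illustration of how the three values combine over a string, here is a hedged sketch of a quick check. The property lookup is passed in as a plain dictionary because Python's standard library does not expose the Quick_Check values directly. The function name `quick_check` and the sample table are hypothetical, and the test on combining-class order follows the quick-check description in UAX #15 rather than anything stated in this document.

```python
import unicodedata


def quick_check(text, qc_values):
    """Return "Yes", "No", or "Maybe" for text.

    qc_values maps characters to "No" or "Maybe"; characters that are
    missing get the default value "Yes".
    """
    result = "Yes"
    last_ccc = 0
    for ch in text:
        ccc = unicodedata.combining(ch)
        # Combining marks out of canonical order cannot occur in a normalized string.
        if ccc != 0 and last_ccc > ccc:
            return "No"
        value = qc_values.get(ch, "Yes")
        if value == "No":
            return "No"
        if value == "Maybe":
            result = "Maybe"
        last_ccc = ccc
    return result


if __name__ == "__main__":
    # Illustrative entries only; real values come from DerivedNormalizationProperties.txt.
    sample_table = {"\u0301": "Maybe", "\u0340": "No"}
    print(quick_check("a\u0301", sample_table))
```

A "Maybe" result means the string still has to be normalized and compared before its status is known.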

Case Mappings

There are a number of complications to case mappings that occur once the repertoire of characters is expanded beyond ASCII. For more information, see Chapter 3 in Unicode 4.0.

For compatibility with existing parsers, UnicodeData.txt only contains case mappings for characters where they are one-to-one mappings; it also omits information about context-sensitive case mappings. Information about these special cases can be found in a separate data file, SpecialCasing.txt.

Unihan Tags

The following is a summary of the data tags in the Unihan.txt file. Only a few of these correspond to Unicode normative or informative properties: the rest are provisional. For more information on the meaning of these tags, see the header of the data file.

Category	Property Name	Description from Unihan (abbreviated)
Numeric	kAccountingNumeric	The value of the character when used in the writing of accounting numerals.
	kOtherNumeric	The numeric value for the character in certain unusual, specialized contexts.
	kPrimaryNumeric	The value of the character when used in the writing of numbers in the standard fashion.
Variants	kSemanticVariant	The Unicode value for a semantic variant for this character. A semantic variant is an x- or y-variant with similar or identical meaning which can generally be used in place of the indicated character.
	kSimplifiedVariant	The Unicode value for the simplified Chinese variant for this character (if any).
	kSpecializedSemanticVariant	The Unicode value for a specialized semantic variant for this character. A specialized semantic variant is an x- or y-variant with similar or identical meaning only in certain contexts (such as accountants' numerals).
	kTraditionalVariant	The Unicode value(s) for the traditional Chinese variant(s) for this character.
	kZVariant	The Unicode value(s) for known z-variants of this character.
Radical/Stroke	kRSUnicode	A standard radical/stroke count for this character in the form "radical.additional strokes". A ' after the radical indicates the simplified version of the given radical.
	kRSJapanese	A Japanese radical/stroke count for this character in the form "radical.additional strokes".
	kRSKanWa	A Morohashi radical/stroke count for this character in the form "radical.additional strokes".
	kRSKangXi	A KangXi radical/stroke count for this character in the form "radical.additional strokes".
	kRSKorean	A Korean radical/stroke count for this character in the form "radical.additional strokes". A ' after the radical indicates the simplified version of the given radical.
	kTotalStrokes	The total number of strokes in the character (including the radical).
Pronunciations	kCantonese	The Cantonese pronunciation(s) for this character.
	kJapaneseKun	The Japanese pronunciation(s) of this character.
	kJapaneseOn	The Sino-Japanese pronunciation(s) of this character.
	kKorean	The Korean pronunciation(s) of this character.
	kMandarin	The Mandarin pronunciation(s) for this character in pinyin.
	kTang*	The Tang dynasty pronunciation(s) of this character, derived from _T'ang Poetic Vocabulary_.
	kVietnamese	The character's pronunciation(s) in Quốc ngữ.
Definition	kDefinition	An English definition for this character.
Frequency	kFrequency	A rough frequency measurement for the character based on analysis of Chinese USENET postings.
Grade	kGradeLevel*	The grade in the Hong Kong school system by which a student is expected to know the character.
Dictionary Position	kAlternateKangXi	An alternate possible position for the character in the KangXi dictionary.
	kAlternateMorohashi	An alternate possible position for the character in the Morohashi dictionary.
	kCihaiT*	The position of this character in the Cihai (辭海) dictionary, single volume edition, published in Hong Kong by the Zhonghua Bookstore, 1983 (reprint of the 1947 edition), ISBN 962-231-005-2.
	kCowles*	The index of this character in Roy T. Cowles, _A Pocket Dictionary of Cantonese_, Hong Kong: University Press, 1999.
	kDaeJaweon	The position of this character in the Dae Jaweon (Korean) dictionary used in the four-dictionary sorting algorithm.
	kFenn*	Data on the character from _Fenn's Chinese-English Pocket Dictionary_.
	kHanYu	The position of this character in the Hanyu Da Zidian (HDZ) Chinese character dictionary (bibliographic information below).
	kHKGlyph*	The index of the character in 常用字字形表 (二零零零年修訂本), 香港: 香港教育學院, 2000, ISBN 962-949-040-4. This publication gives the "proper" shapes for characters as used in the Hong Kong school system.
	kIRGDaeJaweon	The position of this character in the Dae Jaweon (Korean) dictionary used in the four-dictionary sorting algorithm.
	kIRGDaiKanwaZiten	The index of this character in the Dai Kanwa Ziten, aka Morohashi dictionary (Japanese) used in the four-dictionary sorting algorithm.
	kIRGHanyuDaZidian	The position of this character in the Hanyu Da Zidian (PRC) dictionary used in the four-dictionary sorting algorithm.
	kIRGKangXi	The position of this character in the KangXi dictionary used in the four-dictionary sorting algorithm.
	kKangXi	The position of this character in the KangXi dictionary used in the four-dictionary sorting algorithm.
	kKarlgren*	The index of this character in _Analytic Dictionary of Chinese and Sino-Japanese_.
	kLau*	The index of this character in _A Practical Cantonese-English Dictionary_.
	kMatthews	The index of this character in _Mathews' Chinese-English Dictionary_.
	kMeyerWempe*	The index of this character in the Student's Cantonese-English Dictionary.
	kMorohashi	The index of this character in the Dai Kanwa Ziten, aka Morohashi dictionary (Japanese) used in the four-dictionary sorting algorithm.
	kNelson	The index of this character in _The Modern Reader's Japanese-English Character Dictionary_.
	kPhonetic*	The phonetic index for the character from _Ten Thousand Characters: An Analytic Dictionary_.
	kSBGY	The position of this character in the Song Ben Guang Yun (SBGY) Medieval Chinese character dictionary (bibliographic and general information below).
	kCangjie*	The cangjie input code for the character. This incorporates data from the file cangjie-table.b5 by Christian Wittern.
Character Mapping	kBigFive	The Big Five mapping for this character in hex; note that this does not cover any of the Big Five extensions in common use, including the ETEN extensions.
	kCCCII	The CCCII mapping for this character in hex.
	kCNS1986	The CNS 11643-1986 mapping for this character in hex.
	kCNS1992	The CNS 11643-1992 mapping for this character in hex.
	kEACC	The EACC mapping for this character in hex.
	kGB0	The GB 2312-80 mapping for this character in ku/ten form.
	kGB1	The GB 12345-90 mapping for this character in ku/ten form.
	kGB3	The GB 7589-87 mapping for this character in ku/ten form.
	kGB5	The GB 7590-87 mapping for this character in ku/ten form.
	kGB7	The "General Use Characters for Modern Chinese" mapping for this character.
	kGB8	The GB 8565-89 mapping for this character in ku/ten form.
	kHKSCS	Mappings to the Big Five extended code points used for the Hong Kong Supplementary Character Set.
	kIBMJapan	The IBM Japanese mapping for this character in hex.
	kIRG_GSource	The IRG "G" source mapping for this character in hex. The IRG "G" source consists of data from the following national standards, publications, and lists from the People's Republic of China and Singapore.
	kIRG_HSource	The IRG "H" source mapping for this character in hex. The IRG "H" source consists of data from the Hong Kong Supplementary Character Set.
	kIRG_JSource	The IRG "J" source mapping for this character in hex. The IRG "J" source consists of data from the following national standards and lists from Japan.
	kIRG_KSource	The IRG "K" source mapping for this character in hex. The IRG "K" source consists of data from the following national standards and lists from the Republic of Korea (South Korea).
	kIRG_KPSource	The IRG "KP" source mapping for this character in hex. The IRG "KP" source consists of data from the following national standards and lists from the Democratic People's Republic of Korea (North Korea).
	kIRG_TSource	The IRG "T" source mapping for this character in hex. The IRG "T" source consists of data from the following national standards and lists from the Republic of China (Taiwan).
	kIRG_VSource	The IRG "V" source mapping for this character in hex. The IRG "V" source consists of data from the following national standards and lists from Vietnam.
	kJIS0213	The JIS X 0213-2000 mapping for this character in men,ku,ten form.
	kJis0	The JIS X 0208-1990 mapping for this character in ku/ten form.
	kJis1	The JIS X 0212-1990 mapping for this character in ku/ten form.
	kKPS0	The KPS 9566-97 mapping for this character in hexadecimal form.
	kKPS1	The KPS 10721-2000 mapping for this character in hexadecimal form.
	kKSC0	The KS X 1001:1992 (KS C 5601-1989) mapping for this character in ku/ten form.
	kKSC1	The KS X 1002:1991 (KS C 5657-1991) mapping for this character in ku/ten form.
	kMainlandTelegraph	The PRC telegraph code for this character, derived from "Kanzi denpou koudo henkan-hyou".
	kPseudoGB1	A "GB 12345-90" code point assigned this character for the purposes of including it within Unihan.
	kTaiwanTelegraph	The Taiwanese telegraph code for this character, derived from "Kanzi denpou koudo henkan-hyou".
	kXerox	The Xerox code for this character.
Redundant	kCompatibilityVariant*	The compatibility decomposition for this ideograph, derived from the UnicodeData.txt file.
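
The "radical.additional strokes" form used by the kRS* tags above can be split mechanically. The following Python sketch assumes a single value per call; the function name parse_radical_stroke and the sample values in the usage note are hypothetical.

```python
def parse_radical_stroke(value):
    """Split a 'radical.additional strokes' value.

    Returns (radical number, simplified-radical flag, additional stroke count).
    A trailing apostrophe on the radical marks the simplified form of that radical.
    """
    radical, strokes = value.split(".")
    simplified = radical.endswith("'")
    return int(radical.rstrip("'")), simplified, int(strokes)
```

For example, parse_radical_stroke("120'.3") gives (120, True, 3), and parse_radical_stroke("1.0") gives (1, False, 0).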

Other UCD Files

The following files in the Unicode Character Database are not used directly for Unicode properties. For more information about these files, see the referenced technical report(s), files, or section of the Unicode Standard.

".txt" File	Description	N/I	Summary
Index	Chapter 16	I	Index to Unicode characters, as printed in the Unicode Standard.
NamesList	Chapter 16	I	This file duplicates some of the material in the UnicodeData file, and adds annotations used in the character charts.
NormalizationTest	UAX #15	N	Test file for conformance to Unicode Normalization Forms.
StandardizedVariants	Chapter 15	N	Lists all the standardized variant sequences that have been defined, plus a description of the desired appearance. StandardizedVariants.html contains this information, plus a sample glyph showing the desired features.
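
As an illustration of how NormalizationTest is typically consumed, the Python sketch below checks one line of the file. The column layout (source; NFC; NFD; NFKC; NFKD) and the invariants are assumptions taken from the header of the test file and UAX #15, not from this document. The sketch relies on Python's unicodedata module, which implements a later Unicode version than 4.0. The names decode and check_line and the sample line are illustrative.

```python
import unicodedata


def decode(field):
    """Turn a space-separated list of hex code points into a string."""
    return "".join(chr(int(cp, 16)) for cp in field.split())


def check_line(line):
    """Return True if one test line satisfies the normalization invariants."""
    data = line.split("#", 1)[0].strip()  # drop the trailing comment
    c1, c2, c3, c4, c5 = [decode(field) for field in data.split(";")[:5]]

    def norm(form, strings):
        return [unicodedata.normalize(form, s) for s in strings]

    return (
        all(s == c2 for s in norm("NFC", (c1, c2, c3)))
        and all(s == c4 for s in norm("NFC", (c4, c5)))
        and all(s == c3 for s in norm("NFD", (c1, c2, c3)))
        and all(s == c5 for s in norm("NFD", (c4, c5)))
        and all(s == c4 for s in norm("NFKC", (c1, c2, c3, c4, c5)))
        and all(s == c5 for s in norm("NFKD", (c1, c2, c3, c4, c5)))
    )


# Illustrative line for U+1E0A LATIN CAPITAL LETTER D WITH DOT ABOVE.
SAMPLE = "1E0A;1E0A;0044 0307;1E0A;0044 0307; # sample"
```

A conformance run would apply check_line to every data line of the file and report the lines for which it returns False.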

Derived Extracted Properties

The following files contain other properties of the UCD that are simply separated out and listed in range format. These files are provided purely as a reformatting of existing data, with certain exceptions listed below. They are all contained in a subdirectory called extracted.
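
A line in range format can be read with a few lines of Python. This is a sketch that assumes the common layout "first..last ; value # comment" (or a single code point in place of the range). The function name parse_range_line and the sample lines in the usage note are illustrative.

```python
def parse_range_line(line):
    """Parse one range-format line into (first, last, value).

    Returns None for blank lines and comment-only lines.
    """
    data = line.split("#", 1)[0].strip()  # the comment may itself contain ".."
    if not data:
        return None
    code_field, value = [part.strip() for part in data.split(";")][:2]
    first, _, last = code_field.partition("..")
    start = int(first, 16)
    end = int(last, 16) if last else start  # a single code point is a range of one
    return start, end, value
```

For example, parse_range_line("0030..0039    ; Nd #  [10] DIGIT ZERO..DIGIT NINE") gives (0x30, 0x39, "Nd").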

Files	N/I	Definition and Generation
DerivedBidiClass*		From UnicodeData.txt, field 4
DerivedBinaryProperties*		From UnicodeData.txt, field 9. See Bidi Note.
DerivedCombiningClass*		From UnicodeData.txt, field 3
DerivedDecompositionType*		From the <tag> in UnicodeData.txt, field 5. For characters with canonical decomposition mappings (no tag), the value "canonical" is used. * The value "canonical" is normative; the others are informative.
DerivedEastAsianWidth*		From EastAsianWidth.txt, field 1
DerivedGeneralCategory*		From UnicodeData.txt, field 2
DerivedJoiningGroup*		From ArabicShaping.txt, field 2
DerivedJoiningType*		From ArabicShaping.txt, field 1
DerivedLineBreak*		From LineBreak.txt, field 1. * Some values are normative; some are informative. See UAX #14: Line Breaking Properties for more information.
DerivedNumericType*		The property value is based on the contents of UnicodeData.txt, fields 6 through 8: "decimal" if fields 6, 7, and 8 are non-empty; "digit" if only fields 7 and 8 are non-empty; "numeric" if only field 8 is non-empty.
DerivedNumericValues*		Non-binary Property. From UnicodeData.txt, field 8

Bidi Note: The BidiMirrored property and the BidiMirroring property are different. The former is a normative property that indicates whether characters are mirrored in a right-to-left context in the Unicode Bidirectional Algorithm. The latter is an informative mapping of BidiMirrored characters, where possible, to characters that normally have the corresponding mirrored glyph.
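
The derivation given above for DerivedNumericType (decimal when fields 6, 7, and 8 are non-empty; digit when 7 and 8 are; numeric when only 8 is) can be sketched in Python as follows. Field numbers are 0-based positions in a UnicodeData.txt record. The function name numeric_type and the sample records are illustrative assumptions.

```python
def numeric_type(record):
    """Derive the Numeric_Type value from fields 6 through 8 of a UnicodeData.txt record.

    Returns "decimal", "digit", "numeric", or None when the character has no numeric value.
    """
    fields = record.split(";")
    decimal_value, digit_value, numeric_value = fields[6], fields[7], fields[8]
    if decimal_value and digit_value and numeric_value:
        return "decimal"
    if digit_value and numeric_value:
        return "digit"
    if numeric_value:
        return "numeric"
    return None


# Illustrative records; check them against the actual data file.
DIGIT_ZERO = "0030;DIGIT ZERO;Nd;0;EN;;0;0;0;N;;;;;"
SUPERSCRIPT_TWO = "00B2;SUPERSCRIPT TWO;No;0;EN;<super> 0032;;2;2;N;SUPERSCRIPT DIGIT TWO;;;;"
ONE_HALF = "00BD;VULGAR FRACTION ONE HALF;No;0;ON;<compat> 0031 2044 0032;;;1/2;N;FRACTION ONE HALF;;;;"
CAPITAL_A = "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;"
```

Testing the fields from most to least restrictive keeps the three values mutually exclusive.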

Property Invariants

Values in the UCD are subject to correction as errors are found; however, some characteristics of the properties and files are considered invariants. Applications may wish to take these invariants into account when choosing how to implement character properties. The most important invariants are described in Unicode Policies. The following lists some additional invariants and more detail on some of the invariants in Unicode Policies.

UnicodeData Fields

The number of fields in UnicodeData.txt is fixed.
- Any additional information about character properties to be added in the future will appear in separate data files, rather than being added as an additional field or by subdivision or reinterpretation of existing fields.
The order of the fields is also fixed.
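
Because the field count and order are fixed, a parser can unpack each record into a fixed-size structure and reject anything else. The Python sketch below assumes 15 semicolon-separated fields; the attribute names are descriptive labels chosen for this example, not official identifiers.

```python
from typing import NamedTuple


class UnicodeDataRecord(NamedTuple):
    # Attribute names are illustrative labels for fields 0 through 14.
    code: str
    name: str
    general_category: str
    combining_class: str
    bidi_class: str
    decomposition: str
    decimal_value: str
    digit_value: str
    numeric_value: str
    mirrored: str
    unicode_1_name: str
    iso_comment: str
    simple_uppercase: str
    simple_lowercase: str
    simple_titlecase: str


def parse_record(line):
    """Split one UnicodeData.txt line, insisting on the fixed field count."""
    fields = line.rstrip("\n").split(";")
    expected = len(UnicodeDataRecord._fields)
    if len(fields) != expected:
        raise ValueError(f"expected {expected} fields, found {len(fields)}")
    return UnicodeDataRecord(*fields)
```

Failing loudly on an unexpected field count is one possible design choice; it turns a corrupted or misidentified input file into an immediate error instead of silently shifted properties.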

Combining Classes

Combining classes are limited to the values 0 to 255.
- In practice, there are far fewer than 256 values used. Implementations may take advantage of this fact for compression, since only the ordering of the non-zero values matters for the Canonical Reordering Algorithm. It is possible for up to 256 values to be used in the future; however, UTC decisions in the future may restrict the number of values to 128, since this has implementation advantages. [Signed bytes can be used without widening to ints in Java, for example.]
All characters other than those of General Category M* have the combining class 0.
- Currently, all characters other than those of General Category Mn have the value 0. However, some characters of General Category Me or Mc may be given non-zero values in the future.
- The precise values above the value 0 are not invariant--only the relative ordering of values is considered fixed. For example, it is not guaranteed in future versions that the class of U+05B4 will be precisely 14.
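
The point that only the relative ordering of non-zero classes matters can be illustrated with a small Python sketch of canonical reordering. It uses Python's unicodedata module (a later Unicode version than 4.0) as the source of combining classes. The function names canonical_reorder and compact_classes are hypothetical, and the compact numbering is just one possible compression scheme.

```python
import unicodedata


def canonical_reorder(text, ccc=unicodedata.combining):
    """Stably sort each run of characters with non-zero combining class by class."""
    chars = list(text)
    out = []
    i = 0
    while i < len(chars):
        if ccc(chars[i]) == 0:
            out.append(chars[i])
            i += 1
            continue
        j = i
        while j < len(chars) and ccc(chars[j]) != 0:
            j += 1
        out.extend(sorted(chars[i:j], key=ccc))  # sorted() is stable
        i = j
    return "".join(out)


def compact_classes(text):
    """Build a class lookup that renumbers the non-zero classes in text as 1, 2, 3, ...

    Relative order is preserved, so reordering results are unchanged.
    """
    used = sorted({unicodedata.combining(ch) for ch in text} - {0})
    rank = {value: index + 1 for index, value in enumerate(used)}
    rank[0] = 0
    return lambda ch: rank[unicodedata.combining(ch)]
```

Calling canonical_reorder(s) and canonical_reorder(s, compact_classes(s)) on the same string s produces the same result, which is why an implementation is free to store compressed class values internally.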

Decimal Digits

In Unicode 4.0 and thereafter, the General_Category value Decimal_Number (Nd) and the Numeric_Type value Decimal (de) are defined to be co-extensive; that is, the set of characters having Nd will always be the same as the set of characters having de.
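
An implementation can check this invariant against the data it ships. The Python sketch below does so using the unicodedata module, on the assumption that unicodedata.decimal returns a value exactly for characters whose Numeric_Type is Decimal. The function name nd_decimal_mismatches is hypothetical.

```python
import sys
import unicodedata


def nd_decimal_mismatches():
    """List code points where General_Category=Nd and having a decimal value disagree."""
    mismatches = []
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        is_nd = unicodedata.category(ch) == "Nd"
        is_decimal = unicodedata.decimal(ch, None) is not None
        if is_nd != is_decimal:
            mismatches.append(cp)
    return mismatches
```

If the invariant holds for the database in use, the returned list is empty.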

References

[FAQ]	Unicode Frequently Asked Questions http://www.unicode.org/faq/ For answers to common questions on technical issues.
[Glossary]	Unicode Glossary http://www.unicode.org/glossary/ For explanations of terminology used in this and other documents.
[Reports]	Unicode Technical Reports http://www.unicode.org/reports/ For information on the status and development process for technical reports, and for a list of technical reports.
[U4.0]	The Unicode Standard Version 4.0
[Versions]	Versions of the Unicode Standard http://www.unicode.org/versions/ For details on the precise contents of each version of the Unicode Standard, and how to cite them.

Modification History

This section provides a summary of the changes between update versions of the Unicode Standard. The modification lists prior to Unicode 4.0 cover only changes in UnicodeData.txt. From 4.0 onward, the consolidated modifications include the changes in other files.

Unicode 4.0

UnicodeData.txt
- Decimal Digits
  - Numeric_Type=decimal digit now aligned with General_Category=Nd
- Modifier letters*
  - The general category of 02B9..02BA, 02C6..02CF changed to Lm.
Other Files
- New Properties and Values
  - Hangul_Syllable_Type, Unicode_Radical_Stroke
  - CJK numeric values added.
  - PropertyValueAliases adds block names
  - UCD fallback props more precisely defined, for code points not explicitly in data files
  - Added script value for Braille
  - New Linebreak properties: NL, WJ
- Khmer
  - Two Khmer characters are deprecated; four others strongly discouraged.
- Special Casing
  - Fixed for Turkish, Lithuanian
- Default Ignorables
  - Hangul Filler characters
  - Soft-Hyphen, CGJ, ZWS
  - Arabic End of Ayah and Syriac Abbreviation Mark no longer DI (their shaping classes are also fixed.)
- Grapheme_Extend
  - Removes halfwidth katakana marks, most Mc (except as needed for canonical equivalence)
- Stabilized Properties
  - The Hyphen property is now stabilized.

Unicode 3.2

Modifications made for Version 3.2.0 of UnicodeData.txt include:

Addition of 1016 new entries, to cover new characters encoded in Unicode 3.2.

Updated ISO 6429 names for control functions to match the currently published version of that standard.

Changed general category for Mongolian free variation selectors (U+180B..U+180D) from Cf to Mn.

Changed general category for U+0B83 TAMIL SIGN VISARGA (aytham) from Mc to Lo.

Changed general category for U+06DD ARABIC END OF AYAH from Me to Cf.

Changed general category for U+17D7 KHMER SIGN LEK TOO from Po to Lm.

Changed general category for U+17DC KHMER SIGN AVAKRAHASANYA from Po to Lo.

Changed canonical decomposition for U+F951 from 96FB to 964B (see Corrigendum #3: U+F951 Normalization).

Unicode 3.1.1

Modifications made for Version 3.1.1 of UnicodeData.txt include:

Modification of ISO 10646 annotation regarding Greek tonos, affecting entries for U+0301 and U+030D.

Unicode 3.1

Modifications made for Version 3.1.0 of UnicodeData.txt include:

Addition of 2237 new entries, to cover new characters and new ranges of unified Han characters encoded in Unicode 3.1.
Changed General Category value of 16EE..16F0 (Runic golden numbers) from No to Nl.

Unicode 3.0.1

Modifications made for Version 3.0.1 of UnicodeData.txt include:

Added 5- and 6-digit representation of code points past U+FFFF.
Added Private Use range definitions for Planes 15 and 16.
Minor additions for the 10646 comment field.

Unicode 3.0.0

Modifications made for Version 3.0.0 of UnicodeData.txt include many new characters and a number of property changes. These are summarized in Appendix D of The Unicode Standard, Version 3.0.

Unicode 2.1.9

Modifications made for Version 2.1.9 of UnicodeData.txt include:

Corrected combining class for U+05AE HEBREW ACCENT ZINOR.
Corrected combining class for U+20E1 COMBINING LEFT RIGHT ARROW ABOVE.
Corrected combining class for U+0F35 and U+0F37 to 220.
Corrected combining class for U+0F71 to 129.
Added a decomposition for U+0F0C TIBETAN MARK DELIMITER TSHEG BSTAR.
Added decompositions for several Greek symbol letters: U+03D0..U+03D2, U+03D5, U+03D6, U+03F0..U+03F2.
Removed decompositions from the conjoining jamo block: U+1100..U+11F8.
Changes to decomposition mappings for some Tibetan vowels for consistency in normalization. (U+0F71, U+0F73, U+0F77, U+0F79, U+0F81)
Updated the decomposition mappings for several Vietnamese characters with two diacritics (U+1EAC, U+1EAD, U+1EB6, U+1EB7, U+1EC6, U+1EC7, U+1ED8, U+1ED9), so that the recursive decomposition can be generated directly in canonically reordered form (not a normative change).
Updated the decomposition mappings for several Arabic compatibility characters involving shadda (U+FC5E..U+FC62, U+FCF2..U+FCF4), and two Latin characters (U+1E1C, U+1E1D), so that the decompositions are generated directly in canonically reordered form (not a normative change).
Changed BIDI category for: U+00A0 NO-BREAK SPACE, U+2007 FIGURE SPACE, U+2028 LINE SEPARATOR.
Changed BIDI category for extenders of General Category Lm: U+3005, U+3021..U+3035, U+FF9E, U+FF9F.
Changed General Category and BIDI category for the Greek numeral signs: U+0374, U+0375.
Corrected General Category for U+FFE8 HALFWIDTH FORMS LIGHT VERTICAL.
Added Unicode 1.0 names for many Tibetan characters (informative).