W3C Unicode

Unicode in XML and other Markup Languages

Proposed DRAFT Unicode Technical Report #20

W3C Working Draft 28-September-1999

Revision (Unicode):
2
This version:
http://www.unicode.org/unicode/reports/tr20/tr20-2.html
http://www.w3.org/TR/1999/WD-unicode-xml-19990928
Latest version:
http://www.unicode.org/unicode/reports/tr20
http://www.w3.org/TR/unicode-xml
Previous version:
http://www.unicode.org/unicode/reports/tr20/tr20-1.html
(no previous public version for W3C)
Date (Unicode):
1999-09-28
Authors:
Martin Dürst (mduerst@w3.org), Mark Davis (mark@unicode.org), Hideki Hiura (hideki.hiura@eng.sun.com), and Asmus Freytag (asmus@unicode.org)

Summary/Abstract

This document contains guidelines on the use of the Unicode Standard in conjunction with markup languages such as XML.

Status of this document (Unicode Consortium)

This proposed draft is published for review purposes. This draft has been considered by the Unicode Technical Committee and approved as a proposed draft for internal review by Unicode Members and members of the W3C Internationalization WG. At its next meeting, the Unicode Technical Committee may approve, reject, or further amend this document. It is intended that this document will become a joint Unicode - W3C document.

The content of technical reports must be understood in the context of the latest version of the Unicode Standard. See http://www.unicode.org/unicode/standard/versions/ for more information.

This document does not, at this time, imply any endorsement by the Consortium's staff or member organizations. Please mail comments to unicore@unicode.org.

Status of this document (W3C)

This is a W3C Working Draft worked on jointly by the W3C Internationalization Working Group/Interest Group (Members only) and the Unicode Technical Committee. For public discussion of this working draft, please use the mailing lists www-international@w3.org and unicode@unicode.org (please crosspost to both lists). For internal discussions, please use the relevant mailing list (again with crossposting). Please send editorial comments to the authors.

The material in this draft is still at a rather early stage. The draft currently shows the approximate range of intended coverage (e.g. which kinds of characters will be addressed, and what kind of information is intended to be provided for each kind), while large parts still need more work and discussion. It is not yet clear what the exact proposal for each character will be, or how this document will relate to other W3C specifications. One potential way to proceed is to work towards publishing this document as a Note, and to reference it, normatively or otherwise, from the Character Model [CharMod] document.

Publication as a Working Draft does not imply endorsement by the W3C membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite W3C Drafts as other than "work in progress". A list of current W3C working drafts can be found at http://www.w3.org/TR.

Table of Contents

  1. Introduction
  2. General Considerations
  3. List of Characters
  4. Characters with compatibility mappings
  5. Versioning
  6. Conformance
  7. References
  8. Change History
  9. Copyright

1. Introduction

The Unicode Standard contains a large number of characters in order to cover the scripts of the world. It also contains characters for compatibility with older character encodings, and characters with control-like functions included for various reasons. It also provides specifications for use of these characters.

For document and data interchange, the Internet and the World Wide Web are more and more making use of marked-up text. In many instances, markup provides the same, or essentially similar features to those provided by formatting characters in the Unicode Standard [Unicode] for use in plain text. While there may be valid reasons to support these characters and their specifications in plain text, their use in marked-up text can conflict with the rules of the markup language.

[a more extensive overview of Unicode and markup will be added to level out the background of various audiences]

1.1 Notation

This report uses XML [XML 1.0] as a prominent and general example of markup. The XML namespace notation [Namespace] is used to indicate that a certain element is taken from a specific markup language. As an example, the prefix 'html:' indicates that this element is taken from [XHTML]. This means that the examples containing the namespace prefix 'html:' are assumed to include a namespace declaration of xmlns:html="..." [Ed. note: insert the appropriate URI for XHTML later].

Characters are denoted using the notation used in the Unicode Standard, i.e. U+ followed by their hexadecimal number such as "U+1234". In XML or HTML this would be expressed as "&#x1234;". [Should this be replaced by the XML convention? Probably not, because we don't want to see these in XML :-)]
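
As a small illustration of the two notations, the following Python sketch converts a character to the U+ notation and to a hexadecimal numeric character reference. The function names are hypothetical and are not defined by this report.

```python
def to_u_notation(ch: str) -> str:
    """Return the Unicode Standard notation (U+XXXX) for a single character."""
    return f"U+{ord(ch):04X}"


def to_char_reference(ch: str) -> str:
    """Return the hexadecimal numeric character reference used in XML or HTML."""
    return f"&#x{ord(ch):X};"


print(to_u_notation("\u1234"))      # U+1234
print(to_char_reference("\u1234"))  # &#x1234;
```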

2. General Considerations

This chapter will contain general considerations regarding control-like characters in markup. In particular, it is planned to address the following points:

3. List of Characters

The following table contains the characters currently considered not suitable for use with markup. Each category is further discussed below.

Codepoints | Names/Description | Short Comment
U+202A .. U+202E | BIDI embedding controls (LRE, RLE, LRO, RLO, PDF) | Strongly discouraged in [HTML 4.0]; RLM and LRM are allowed
U+2028 .. U+2029 | Line and paragraph separator | Use <html:br />, <html:p></html:p>, or equivalent
U+206A .. U+206B | Activate/Inhibit Symmetric swapping | Deprecated in Unicode 3.0
U+206C .. U+206D | Activate/Inhibit Arabic form shaping | Deprecated in Unicode 3.0
U+206E .. U+206F | Activate/Inhibit National digit shapes | Deprecated in Unicode 3.0
U+FFF9 .. U+FFFB | Interlinear annotation characters | Use ruby markup [Ruby]
U+FFFC | Object replacement character | Use markup, e.g. HTML <object> or HTML <img>
U-000E0000 .. U-000E007F | Language Tag codepoints (if and when they will be encoded) | Use html:lang or xml:lang
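
A tool that checks marked-up text for these characters could be sketched as follows. This is a minimal illustration in Python, assuming the code point ranges of the table above; the names UNSUITABLE_RANGES and find_unsuitable are hypothetical.

```python
# Inclusive code point ranges copied from the table above.
UNSUITABLE_RANGES = [
    (0x202A, 0x202E, "BIDI embedding controls"),
    (0x2028, 0x2029, "Line and paragraph separator"),
    (0x206A, 0x206B, "Activate/Inhibit Symmetric swapping"),
    (0x206C, 0x206D, "Activate/Inhibit Arabic form shaping"),
    (0x206E, 0x206F, "Activate/Inhibit National digit shapes"),
    (0xFFF9, 0xFFFB, "Interlinear annotation characters"),
    (0xFFFC, 0xFFFC, "Object replacement character"),
    (0xE0000, 0xE007F, "Language Tag codepoints"),
]


def find_unsuitable(text):
    """Return (index, U+ notation, category) for each listed character in text."""
    hits = []
    for index, ch in enumerate(text):
        cp = ord(ch)
        for low, high, name in UNSUITABLE_RANGES:
            if low <= cp <= high:
                hits.append((index, f"U+{cp:04X}", name))
                break
    return hits


# U+200E (LRM) is not in the table and is therefore not reported.
print(find_unsuitable("a\u202Eb\u200Ec\uFFFC"))
```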

A later version of this document will discuss each of the character categories. For each of the categories/characters, the following points may be discussed:

The following subsection gives an example:

3.1 Object Replacement Character, U+FFFC

Short description: The object replacement character is used to stand in place of an object (e.g. an image) included in a text.

Reason for inclusion: The object replacement character was included in Unicode only in order to reserve a codepoint for a very frequent application-internal use. Many text-processing applications store the text and the associated markup (or in some cases styling information) of a document in separate structures. The actual text is kept in a single linear structure; additional information is kept separately with pointers to the appropriate text positions. The overall implementation makes sure that these two structures are kept in sync. If the text contains objects such as images, it is extremely helpful for implementations to have a sentinel in the text itself; any additional information is kept separately.

Problems when used in markup: Including an object replacement character in markup text does not work because the additional information (what object to include,...) is not available.

Problems with other uses: The object replacement character is also problematic when used in plain text, because there is no way in plain text to provide the actual object information or a reference to it.

Replacement markup: The markup to be used in place of the Object Replacement Character depends on the object in question and the markup context it is used in. Typical cases are <html:img src='...' />, <html:object ...>, or <html:applet ...>. These constructs make it possible to provide all additional information needed to identify and use the object in question.

What to do if detected: In a proxy context, ignore. In a browser context, treat as either a missing image or a REPLACEMENT CHARACTER. When received in an editing context, if the actual object is accessible, replace the character by the appropriate markup for that object. Otherwise remove, ideally providing a warning.
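
The handling suggested above might look roughly like the following Python sketch. The context names ("proxy", "browser", "editor") and the function name are hypothetical; the browser branch chooses the REPLACEMENT CHARACTER (U+FFFD) option, and the editor branch assumes that the actual object is not accessible, so the character is removed with a warning.

```python
OBJECT_REPLACEMENT = "\uFFFC"
REPLACEMENT_CHARACTER = "\uFFFD"


def handle_object_replacement(text, context):
    """Return (new_text, warnings) after handling U+FFFC for the given context."""
    if context == "proxy":
        # A proxy passes the text through unchanged.
        return text, []
    if context == "browser":
        # One of the two suggested options: show a REPLACEMENT CHARACTER.
        return text.replace(OBJECT_REPLACEMENT, REPLACEMENT_CHARACTER), []
    if context == "editor":
        # Assumption: the object itself is not available, so remove and warn.
        count = text.count(OBJECT_REPLACEMENT)
        warnings = []
        if count:
            warnings.append(f"removed {count} object replacement character(s)")
        return text.replace(OBJECT_REPLACEMENT, ""), warnings
    raise ValueError(f"unknown context: {context!r}")
```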

3.2 Interlinear Annotation Characters, U+FFF9-U+FFFB

Short description: The interlinear annotation characters are used to delimit interlinear annotations in certain circumstances.

Reason for inclusion: The interlinear annotation characters were included in Unicode only in order to reserve codepoints for very frequent application-internal use. The interlinear annotation characters are used to delimit interlinear annotations in contexts where other delimiters are not available, and where non-textual means exist to carry formatting information. Many text-processing applications store the text and the associated markup (or in some cases styling information) of a document in separate structures. The actual text is kept in a single linear structure; additional information is kept separately with pointers to the appropriate text positions. This is called out-of-band information. The overall implementation makes sure that these two structures are kept in sync. If the text contains interlinear annotations, it is extremely helpful for implementations to have delimiters in the text itself, even though delimiters are not otherwise used for style markup. With this method, and unlike the case of the object replacement character, all textual information can remain in the standard text stream, but any additional formatting information is kept separately. In addition, the Interlinear Annotation Anchor serves as a placeholder for formatting information for the whole annotation object, the same way a paragraph mark can be a placeholder to attach paragraph formatting information.

Problems when used in markup: Including interlinear annotation characters in markup text does not work because the additional formatting information (how to position the annotation,...) is not available.

Problems with other uses: The interlinear annotation characters are also problematic when used in plain text, and are not intended for that purpose. In particular, on older display systems that ignore or replace the Interlinear Annotation Characters, the meaning of the text may be changed.

Replacement markup: The markup to be used in place of the Interlinear Annotation Characters depends on the formatting and nature of the interlinear annotation in question. For ruby, please see [Ruby].

What to do if detected: In a proxy context or browser context, remove U+FFF9 and remove all characters between U+FFFA and following U+FFFB. When received in an editing context, either remove in the same manner, maybe with a warning to the user, or convert into appropriate ruby markup for further editing and formatting by the user.
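
The two options above can be sketched in Python as follows. Several points are assumptions rather than statements of this report: the delimiters U+FFFA and U+FFFB are removed together with the annotation text between them, the input is plain text that still needs escaping, and the ruby, rb, and rt element names stand in for whatever [Ruby] finally specifies. The function names are hypothetical.

```python
import re
from html import escape

ANCHOR, SEPARATOR, TERMINATOR = "\uFFF9", "\uFFFA", "\uFFFB"

# Annotation part: separator, annotation text, terminator.
_ANNOTATION = re.compile(f"{SEPARATOR}[^{TERMINATOR}]*{TERMINATOR}")
# Whole construct: anchor, base text, separator, annotation text, terminator.
_ANNOTATED = re.compile(
    f"{ANCHOR}([^{SEPARATOR}{TERMINATOR}]*){SEPARATOR}([^{TERMINATOR}]*){TERMINATOR}"
)


def strip_annotations(text):
    """Proxy or browser handling: keep the base text, drop the annotation."""
    return _ANNOTATION.sub("", text).replace(ANCHOR, "")


def to_ruby(plain_text):
    """Editing handling: turn each annotated run into ruby-style markup."""
    escaped = escape(plain_text, quote=False)  # escapes &, < and > only
    return _ANNOTATED.sub(r"<ruby><rb>\1</rb><rt>\2</rt></ruby>", escaped)


sample = "\uFFF9base\uFFFAnote\uFFFB tail"
print(strip_annotations(sample))  # base tail
print(to_ruby(sample))            # <ruby><rb>base</rb><rt>note</rt></ruby> tail
```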

4. Characters with compatibility mappings

The Unicode Standard provides compatibility mappings for a number of characters. Compatibility mappings indicate a relationship to another character, but the exact nature of the relationship varies. In some cases the relationship means "is based on"; in other cases it denotes a property. When plain text is marked up, it may make sense to map some of these characters to their compatibility equivalents and suitable markup. It is important to understand the nature of the distinctions between characters and their compatibility equivalents and the contexts in which these distinctions matter. It is never advisable to apply compatibility mappings indiscriminately. This section provides guidance on when and how to apply compatibility mappings. It is organized by the "compatibility tag" associated with each compatibility mapping.
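
To illustrate why indiscriminate mapping is a problem, the following Python sketch applies Normalization Form KC to a few sample strings using the standard unicodedata module. The helper name fold_compatibility is hypothetical, and the module reflects the Unicode version shipped with the Python interpreter, not necessarily Version 3.0.

```python
import unicodedata


def fold_compatibility(text):
    """Apply all compatibility mappings at once (Normalization Form KC)."""
    return unicodedata.normalize("NFKC", text)


samples = {
    "superscript two": "x\u00B2",     # the exponent becomes an ordinary digit
    "circled digit one": "\u2460",    # the circle is lost
    "no-break space": "a\u00A0b",     # the non-breaking property is lost
}
for label, text in samples.items():
    print(f"{label}: {text!r} -> {fold_compatibility(text)!r}")
```

In each case the folded string no longer carries the distinction that the original character expressed, which is why the mapping needs to be paired with suitable markup or not applied at all.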

4.1 Overview

The following table gives an overview of the various compatibility characters, organized by "compatibility tag". The first column contains the tag value of the "compatibility tag" from the Unicode database. Although these tags use "<" and ">", they should not be confused with XML tags. Code range indicates which codepoints the entry applies to. Substitute indicates whether the codes can be substituted using the compatibility equivalent according to Normalization Form KC of [UTR 15]. Markup indicates the available markup. For some cases, instead of or in addition to markup, style information [CSS2] is needed. [Discussion about style info to be added in the future.]
Tag value | Code range | Substitute | Markup | Comment
<vertical> | all | yes | none | Presentation forms
<initial> | all | yes | none | Presentation forms
<medial> | all | yes | none | Presentation forms
<final> | all | yes | none | Presentation forms
<isolated> | all | yes | none | Presentation forms
<super> | all | yes | <sup> |
<sub> | all | yes | <sub> |
<small> | all | no | none | Precise usage unknown. Maintain, but don't generate
<no-break> | all | no | none | The compatibility mapping is merely a way to indicate the equivalent character that is not non-breaking. The distinction
<font> | all | no | none | Variant forms that are used as symbols
<compat> | 2100-2101 | no | none | Variant forms that are used as symbols
 | 2105-2106 | no | none | Variant forms that are used as symbols
 | 2121 | yes | ?hiv? | For use as single code point in vertical layout
 | 2160-2175 | yes | ?hiv? | For use as single code point in vertical layout
 | 3131-318E | no | none | Do not-conjoin
 | 2000-200A | no | none | No equivalent markup exists for spaces
 | 3200-3243 | ? | ?hiv? | String used as symbol in vertical layout
 | 249C-24B5 | ? | ?hiv? | String used as symbol in vertical layout
 | 2474-249B | yes | bullet style | Number used as symbol in vertical layout
 | 2155-215F | yes | none | As long as fraction slash is supported!
 | 00BC-00BE | yes | none | As long as fraction slash is supported!
 | all other | no | none | Maintain, semantic distinctions apply
<circled> | all | no | none | Bullets or dingbats analogous to 2776-2793
<squared> | 3358-337D | yes? | ?hiv? | For use as single code point in vertical layout
<squared> | 33E0-33FE | yes? | ?hiv? | For use as single code point in vertical layout
 | 32C0-32CB | yes? | ?hiv? | For use as single code point in vertical layout
 | 33A7 | ? | | Variant form used as symbol in vertical layout
 | 33A8 | ? | | Variant form used as symbol in vertical layout
 | 33AE-33AF | ? | | Variant forms used as symbols in vertical layout
 | 33C6 | ? | | Variant form used as symbol in vertical layout
 | 3300-3357 | yes | ?squared? | Multiline cluster for vertical layout
<narrow> | all | no | none | No equivalent markup exists
<wide> | all | no | none | No equivalent markup exists

Notes

At the time of this writing it was not known what the appropriate markup would be for squared kana clusters or horizontal in vertical symbols.
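
As a sketch of how the table might be applied, the following Python code reads the compatibility tag of a character from the Unicode database via the standard unicodedata module and substitutes markup only for the <super> and <sub> rows. The function names are hypothetical; the sketch assumes that the surrounding text contains no markup-significant characters, and it emits one element per character rather than merging adjacent ones. Tag spellings in the database shipped with Python may differ from those shown in the table.

```python
import unicodedata

# Only the rows of the table that name concrete markup are handled here.
MARKUP_FOR_TAG = {"<super>": "sup", "<sub>": "sub"}


def compatibility_tag(ch):
    """Return the compatibility tag of ch (e.g. '<super>'), or None."""
    decomposition = unicodedata.decomposition(ch)
    if decomposition.startswith("<"):
        return decomposition.split(" ", 1)[0]
    return None  # no decomposition, or a canonical one without a tag


def mark_up_super_sub(text):
    """Replace <super>/<sub> characters by their equivalent plus markup."""
    out = []
    for ch in text:
        element = MARKUP_FOR_TAG.get(compatibility_tag(ch))
        if element is None:
            out.append(ch)  # all other characters are maintained as-is
        else:
            base = unicodedata.normalize("NFKC", ch)
            out.append(f"<html:{element}>{base}</html:{element}>")
    return "".join(out)


print(mark_up_super_sub("x\u00B2"))   # x<html:sup>2</html:sup>
print(mark_up_super_sub("H\u2082O"))  # H<html:sub>2</html:sub>O
```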

4.2 Generating characters

Presentation forms and characters for which adequate representation exists as marked up text should never be generated for new data. Many of the characters with <font> tag are suitable for new data, as long as they are used in the manner they are intended, that is as symbols, with definite semantic differentiation between the different forms. They should not be used to create styled text; conversely, styled text should not be used to carry essential semantic distinctions, such as those needed in mathematics.

4.3 Bullets

[Ed. Note: this is an example of a detail section for particular compatibility characters.]

Short description: Characters with a <circled> tag or characters with <compat> tag and compatibility mapping to a parenthesized string.

Reason for inclusion: They are most frequently used for enumerated bullets, but the characters with a <circled> tag often occur as dingbats or footnote markers in tables.

Problems when used in markup: These characters do not cause undue interaction with markup.

Problems with other uses: None

Replacement markup: (bullet style) When generating marked up text these characters occur only internal to the user agent as bullet styles are rendered. When marking up plain text data they could be converted to suitable bullet styles, if such use can be properly inferred.

Compatibility mappings of the form (n) or (n.) can be kept as single characters, or replaced by bullet styles. A conversion to bullet styles allows a simple extension of the set to arbitrary numbers. This is in contrast to circled characters: Very few browsers can properly generate arbitrary circled numbers, therefore conversion to bullet styles does not easily allow an extension of the set of accessible circled numbers.

What to do if detected: In a proxy context or browser context no action needs to be taken. When received in an editing context, substitution of a bullet style may be appropriate. However, the same characters are very often used as dingbat-like symbols in tables, so the user should have the choice of whether to replace.
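
An editing tool that offers such a substitution first has to recognize the enumerator and its number. The following Python sketch does this for characters whose NFKC form is a string such as "(1)" or "1."; circled characters are deliberately left alone. The function name is hypothetical, and whether to replace remains the user's choice.

```python
import re
import unicodedata

_ENUMERATOR = re.compile(r"\((\d+)\)|(\d+)\.")


def enumerator_value(ch):
    """Return n if ch maps to '(n)' or 'n.' under NFKC, otherwise None."""
    folded = unicodedata.normalize("NFKC", ch)
    match = _ENUMERATOR.fullmatch(folded)
    if match is None:
        return None
    return int(match.group(1) or match.group(2))


print(enumerator_value("\u2474"))  # PARENTHESIZED DIGIT ONE
print(enumerator_value("\u2488"))  # DIGIT ONE FULL STOP
print(enumerator_value("\u2460"))  # CIRCLED DIGIT ONE, not treated as a bullet here
```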

4.4 [Template]

Short description:

Reason for inclusion:

Problems when used in markup:

Problems with other uses:

Replacement markup:

What to do if detected: In a proxy context or browser context...... When received in an editing context,.... .

5. Versioning

This technical report covers all relevant characters in the Unicode Standard, Version 3.0.

As the Unicode standard is updated and new characters get added, new characters that are not suitable for markup may also be added. However, the Unicode Technical Committee only introduces such characters where there is a very strong industry requirement. As markup becomes more prevalent, the need for such characters is reduced substantially. This report itself may be updated periodically to give additional background information.

For more information, see:

6. Conformance

In the context of the Unicode Standard, the material in this technical report is informative. However, other documents, particularly markup language specifications, may specify conformance including normative references to this document.

7. References

[CharMod]
Martin J. Dürst, Character Model for the World Wide Web, W3C Working Draft 25-Feb-1999, <http://www.w3.org/TR/WD-charmod>.
[CharReq]
Martin J. Dürst, Requirements for String Identity and Character Indexing Definitions for the WWW, W3C Working Draft 10-July-1998, <http://www.w3.org/TR/WD-charreq>.
[CSS2]
Bert Bos, Håkon Wium Lie, Chris Lilley, Ian Jacobs, Eds., Cascading Style Sheets, level 2 (CSS2 Specification), W3C Recommendation 12-May-1998, <http://www.w3.org/TR/REC-CSS2>.
[HTML 4.0]
Dave Raggett, Arnaud Le Hors, Ian Jacobs, Eds., HTML 4.0 Specification, W3C Recommendation 18-Dec-1997 (revised on 24-Apr-1998), <http://www.w3.org/TR/REC-html40/>.
[Namespace]
Tim Bray, Dave Hollander, Andrew Layman, Namespaces in XML, W3C Recommendation 14-Jan-1999, <http://www.w3.org/TR/REC-xml-names/>.
[Ruby]
Marcin Sawicki, Michel Suignard, Masayasu Ishikawa, Martin Dürst, Eds., Ruby, W3C Working Draft 24-Sept-1999, <http://www.w3.org/TR/1999/WD-ruby-19990924/>.
[Unicode]
The Unicode Standard, Version 3.0, Addison Wesley, Reading MA, 2000 [ISBN to be assigned].
[UTR 15]
Mark Davis, Martin Dürst, Unicode Technical Report #15, Unicode Normalization Forms, <http://www.unicode.org/unicode/reports/tr15/>.
[XHTML]
XHTML™ 1.0: The Extensible HyperText Markup Language - A Reformulation of HTML 4.0 in XML 1.0, W3C Proposed Recommendation 24-Aug-1999, <http://www.w3.org/TR/1999/PR-xhtml1-19990824/>.
[XML 1.0]
Tim Bray, Jean Paoli, C. M. Sperberg-McQueen, Eds., Extensible Markup Language (XML) 1.0, W3C Recommendation 10-February-1998, <http://www.w3.org/TR/REC-xml>.

8. Change History (last changes first)

Changes from http://www.unicode.org/unicode/reports/tr20/tr20-1.html: Completed references, linked TOC. Various wording changes. Added W3C WD stylesheet, logo, copyright, status of this document. Streamlined authors' section. (MJD)

Added material on compatibility characters. (AF)

Changes from the initial draft: Fixed the header. Fixed the numbering. Fixed the title. Put references to final version of data files based on naming conventions. Minor wording changes. Added proposed language on annotation characters to match example on FFFC. Posted for internal review by UTC and W3C (AF)

9. Copyright

Copyright © 1999-1999 jointly held by Unicode, Inc. and W3C® (MIT, INRIA, Keio), All Rights Reserved.

W3C Copyright, liability, trademark, document use and software licensing rules apply.

The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical report.

Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.


Unicode Home Page: http://www.unicode.org

Unicode Technical Reports: http://www.unicode.org/unicode/reports/