Appendix B: Accessing code point boundaries

Mark Davis, IBM

Lauren Wood, SoftQuad Software Inc.

B.1: Introduction
B.2: Methods
- StringExtend

B.1: Introduction

This appendix is an informative, not a normative, part of the Level 2 DOM specification.

Characters are represented in Unicode by numbers called code points (also called scalar values). These numbers can range from 0 up to 1,114,111 = 10FFFF₁₆ (although some of these values are illegal). Each code point can be directly encoded with a 32-bit code unit. This encoding is termed UCS-4 (or UTF-32). The vast majority of world text will, however, be represented with values less than FFFF₁₆, so another encoding is generally used for these code points in order to save space. This encoding is called UTF-16. The most frequent characters are represented in UTF-16 by a single 16-bit code unit, while characters above FFFF₁₆ use a special pair of code units called a surrogate pair. For more information, see [Unicode] or the Unicode Web site.
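The relationship between a code point above FFFF₁₆ and its surrogate pair can be illustrated with a small Java sketch. The class and method names here (SurrogatePairSketch, toSurrogatePair) are hypothetical and not part of this specification; the arithmetic is the standard UTF-16 encoding of supplementary code points.

```java
final class SurrogatePairSketch {
    // Splits a code point above 0xFFFF into its UTF-16 surrogate pair.
    // Hypothetical helper, shown for illustration only.
    static char[] toSurrogatePair(int codePoint) {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int v = codePoint - 0x10000;                // 20-bit value
        char high = (char) (0xD800 + (v >> 10));    // top 10 bits
        char low = (char) (0xDC00 + (v & 0x3FF));   // bottom 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        char[] pair = toSurrogatePair(0x1D11E);     // U+1D11E MUSICAL SYMBOL G CLEF
        System.out.printf("%04X %04X%n", (int) pair[0], (int) pair[1]); // prints "D834 DD1E"
        String s = "A" + new String(pair);
        System.out.println(s.length());                      // prints 3 (UTF-16 code units)
        System.out.println(s.codePointCount(0, s.length())); // prints 2 (code points)
    }
}
```

The string in this example holds two code points but three 16-bit code units, which is why code point indices and code unit indices can differ.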

While in practice indexing by code points as opposed to code units is not common in programs, some specifications such as XSL use code point indices. For communicating with such specifications it is recommended that the programming language provide string processing methods for converting code point indices to code unit indices and back. Some languages do not provide these functions natively; for these it is recommended that the native String type that is bound to DOMString be extended to enable this conversion. An example of how such an API might look is supplied below.

Note: Since these methods are supplied as an illustrative example of the type of functionality that is required, the names of the methods, exceptions, and interface may differ from those given here.
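As an illustration of a language that does provide such conversions natively, Java (since version 5) offers String.offsetByCodePoints and String.codePointCount. The sketch below wraps them in two hypothetical helpers (toUnitIndex and toPointIndex are assumed names, not part of this specification) to show a conversion in each direction.

```java
final class NativeConversionSketch {
    // Code point index -> UTF-16 code unit index, using the built-in method.
    static int toUnitIndex(String s, int codePointIndex) {
        return s.offsetByCodePoints(0, codePointIndex);
    }

    // UTF-16 code unit index -> code point index, using the built-in method.
    static int toPointIndex(String s, int unitIndex) {
        return s.codePointCount(0, unitIndex);
    }

    public static void main(String[] args) {
        String s = "x\uD835\uDD4Ay";          // 'x', U+1D54A (one surrogate pair), 'y'
        int unitIndex = toUnitIndex(s, 2);    // 'y' is the code point at index 2
        int pointIndex = toPointIndex(s, unitIndex);
        System.out.println(unitIndex + " " + pointIndex); // prints "3 2"
    }
}
```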

B.2: Methods

Interface StringExtend

Extensions to a language's native String class or interface

IDL Definition

interface StringExtend {
  int                findOffset16(in int offset32)
                                        raises(StringIndexOutOfBoundsException);
  int                findOffset32(in int offset16)
                                        raises(StringIndexOutOfBoundsException);
};

Methods

findOffset16

Returns the UTF-16 offset that corresponds to a UTF-32 offset. Used for random access.

Note: You can always round-trip from a UTF-32 offset to a UTF-16 offset and back. You can round-trip from a UTF-16 offset to a UTF-32 offset and back if and only if the offset16 is not in the middle of a surrogate pair. Unmatched surrogates count as a single UTF-16 value.

Parameters

offset32 of type int
    UTF-32 offset.

Return Value

int
    UTF-16 offset.

Exceptions

StringIndexOutOfBoundsException
    Raised if offset32 is out of bounds.
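One possible way to realize findOffset16 is sketched below in Java. This is a minimal sketch under stated assumptions, not a normative implementation: it is written as a static helper that takes the string explicitly (the class name FindOffset16Sketch and the extra source parameter are assumptions), and it treats an offset32 equal to the number of code points as valid, returning the length of the string.

```java
final class FindOffset16Sketch {
    // Walks the string one code point at a time and returns the UTF-16
    // offset at which the code point with index offset32 begins.
    static int findOffset16(String source, int offset32) {
        if (offset32 < 0) {
            throw new StringIndexOutOfBoundsException(offset32);
        }
        int offset16 = 0;
        for (int remaining = offset32; remaining > 0; remaining--) {
            if (offset16 >= source.length()) {
                throw new StringIndexOutOfBoundsException(offset32);
            }
            char c = source.charAt(offset16);
            boolean pair = Character.isHighSurrogate(c)
                    && offset16 + 1 < source.length()
                    && Character.isLowSurrogate(source.charAt(offset16 + 1));
            // A matched pair is one code point spanning two code units;
            // an unmatched surrogate counts as a single value.
            offset16 += pair ? 2 : 1;
        }
        return offset16;
    }
}
```

For the string "a\uD83D\uDE00b" (three code points, four code units), this sketch maps the UTF-32 offsets 0, 1, 2, 3 to the UTF-16 offsets 0, 1, 3, 4.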

findOffset32

Returns the UTF-32 offset corresponding to a UTF-16 offset. Used for random access. To find the UTF-32 length of a string, use:

len32 = source.findOffset32(source.length());

Note: If the UTF-16 offset falls in the middle of a surrogate pair, then the UTF-32 offset of the end of the pair is returned; that is, the index of the character after the end of the pair. You can always round-trip from a UTF-32 offset to a UTF-16 offset and back. You can round-trip from a UTF-16 offset to a UTF-32 offset and back if and only if the offset16 is not in the middle of a surrogate pair. Unmatched surrogates count as a single UTF-16 value.

Parameters

offset16 of type int
    UTF-16 offset.

Return Value

int
    UTF-32 offset.

Exceptions

StringIndexOutOfBoundsException
    Raised if offset16 is out of bounds.
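The behavior described for findOffset32, including the case where the offset falls in the middle of a surrogate pair, might be sketched in Java as follows. As before, this is only an illustration: the class name FindOffset32Sketch and the explicit source parameter are assumptions, and an offset16 equal to the length of the string is accepted so that the UTF-32 length can be computed.

```java
final class FindOffset32Sketch {
    // Counts the code points that start before offset16. If offset16 points
    // into the middle of a surrogate pair, the whole pair is counted, so the
    // result is the UTF-32 offset just past the pair.
    static int findOffset32(String source, int offset16) {
        if (offset16 < 0 || offset16 > source.length()) {
            throw new StringIndexOutOfBoundsException(offset16);
        }
        int offset32 = 0;
        int i = 0;
        while (i < offset16) {
            char c = source.charAt(i);
            boolean pair = Character.isHighSurrogate(c)
                    && i + 1 < source.length()
                    && Character.isLowSurrogate(source.charAt(i + 1));
            i += pair ? 2 : 1;   // an unmatched surrogate counts as one value
            offset32++;
        }
        return offset32;
    }
}
```

For the string "a\uD83D\uDE00b", the UTF-16 offsets 0, 1, 2, 3, 4 map to the UTF-32 offsets 0, 1, 2, 2, 3. Offset 2 lies in the middle of the pair and yields the same result as offset 3, which is why a round trip starting from that UTF-16 offset does not return the original value.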
