HEX BYTE PICTURES FOR UNICODE Frank da Cruz The Kermit Project Columbia University New York City USA fdc@columbia.edu http://www.columbia.edu/kermit/ Tue Nov 10 00:00:00 1998 THIS IS A PREFORMATTED PLAIN-TEXT ASCII DOCUMENT. IT IS DESIGNED TO BE VIEWED AS-IS IN A FIXED-PITCH FONT. ITS WIDEST LINE IS 79 COLUMNS. IT CONTAINS NO TABS. IF IT LOOKS MESSY TO YOU, PLEASE FEEL FREE TO PICK UP A CLEAN COPY OF THIS OR THE RELATED PROPOSALS BY ANONYMOUS FTP: HEX BYTE PICTURES FOR UNICODE (plain text) ftp://kermit.columbia.edu/kermit/ucsterminal/hex.txt ADDITIONAL CONTROL PICTURES FOR UNICODE (plain text) ftp://kermit.columbia.edu/kermit/ucsterminal/control.txt TERMINAL GRAPHICS FOR UNICODE (plain text) ftp://kermit.columbia.edu/kermit/ucsterminal/ucsterminal.txt Glyph Map (PDF, contributed by Michael Everson) ftp://kermit.columbia.edu/kermit/ucsterminal/terminal-emulation.pdf Clarification of SNI Glyphs (Microsoft Word 7.0) ftp://kermit.columbia.edu/kermit/ucsterminal/sni-charsets.doc Discussion (plain text) ftp://kermit.columbia.edu/kermit/ucsterminal/mail.txt (Note, the Exhibits are on paper and not available at the FTP site.) ABSTRACT A set of characters is proposed for encoding 8-bit values and for displaying them in single cells for debugging and analysis purposes. Please refer to the TERMINAL GRAPHICS FOR UNICODE proposal for a discussion of terminal emulation, including motivation for supporting it in Unicode, as well as for acknowledgements to those who helped with this set of proposals. CONTENTS 1. THE CASE FOR HEX BYTE CHARACTERS 1.1. Hex Byte Pictures in Terminal Emulation 1.2. Hex Byte Pictures for Debugging 1.3. Hex Byte Glyphs for Unknown Characters 1.4. Hex Byte Characters for Data Exchange 1.5. Standard Codes Are Needed 2. CHARACTER AND GLYPH REPERTOIRE 3. REFERENCES 4. EXHIBITS NOTATION . Numbers in (parentheses) are footnote references, keyed to footnotes at the bottom of the section in which they appear. . Numbers in [brackets] are keyed to the References in Section 3. . Letter-Digit in brackets refers to an Exhibit in Section 4. For consistency, the References and Exhibits are the same as those in the accompanying, even though most of the items are not referenced here. 1. THE CASE FOR HEX BYTE CHARACTERS A set of 256 hex byte-value picture characters is proposed for compatibility with existing terminals, line monitors, and protocol analyzers; for use in debugging of Unicode applications; and for data exchange with non-Unicode applications. 1.1. Hex Byte Pictures in Terminal Emulation Certain real physical terminals can show byte values as 2 hexadecimal digits in a single screen cell. These include: . DEC VT220 [5,6] in Display Controls Mode, uses the 32 hex byte pictures, 80-9F, to represent the 32 C1 control characters [A1-A2]. . DEC VT320 and above [7,8,9] use hex bytes 80-83 and 98-9A when displaying C1 control values in these ranges (and mnemonics for the others) [B1,B2,C1]. . Siemens Nixdorf 97801 includes 00 through 1F in its "IBM" character set [E4], and 80-9F in its character ROM [E6]. To emulate these terminals accurately, therefore, requires 32 hex-byte glyphs, 00-1F and 80-9F. 1.2. Hex Byte Pictures for Debugging The widespread use of hex byte glyphs by protocol analyzers (e.g. see [N1]) and line monitors (increasingly PC-based) suggests a possible motivation for encoding all 256 possible hex bytes. Once encoded, these glyphs could also be used in terminal-emulator debug screens, word processors, file dump and analysis programs, Web browsers, and so on, for unambiguously showing the value of a given byte (or byte pair = Unicode character = 2 hex byte glyphs) in a data stream, buffer, or file. 1.3. Hex Byte Glyphs for Unknown Characters Hex byte characters offer a solution to the increasingly common problem of unmappable characters when converting to Unicode from another character set. Presently, unmappable characters are handled (in Web browsers, word processors, etc) in most cases by substituting or displaying the U+FFFD Replacement Character. In many cases this is adequate for display purposes. But developers, help-desk and support personnel, and even end-users could benefit from seeing the actual value. It could aid them, for example, in identifying the source character set and choosing the correct mapping, or in sending precise problem reports to misbehaving websites, etc. Mapping unknown characters to Unicode characters keyed to their specific byte values would allow corrections to be made in partially converted documents, e.g. by search and replace in a Unicode editor or other Unicode-based text utility. Displaying the actual hex byte value in a single character cell allows (in most cases) a mixture of valid and invalid characters in monospaced screen displays without disrupting the formatting, e.g. of tabular information. 1.4. Hex Byte Characters for Data Exchange When textual information is transferred from a non-Unicode host or application to a Unicode one, and the mapping from source to destination character set is incomplete or unknown, substitution of hex byte-value characters for the unknown source characters allows round-trip integrity without a need for the higher-level protocols that would otherwise be necessary, and which would no doubt proliferate and cause much unneeded labor and confusion. 1.5. Standard Codes Are Needed A standard and uniform set of hex byte value characters and associated glyphs would allow any maker of Unicode-base software software to include debug / trace / dump or unmappable-character preservation capabilities simply by using standard Unicode characters (which would presumably find their way into standard fonts) for this purpose, rather than having to create mutually incompatible custom encodings and fonts. This would allow copying and pasting into other applications, including into tech-support email, with the reasonable expectation that the hex bytes would arrive intact and this, in turn, should promote faster problem resolution and increased standards compliance. Without a standard encoding, problem resolution and technical support in this area will remain the ordeal they are today, especially for the naive end-user. 2. CHARACTER AND GLYPH REPERTOIRE One glyph is required for each hex byte code 00 through FF; 256 glyphs in all, as shown in Table 2.1, in which the "Code" column shows the temporary reference value for this document, E100-E1FF. Ideally (for efficiency in real-time debugging/display applications), the final 8 bits of the actual code would correspond to the 8-bit value represented by the corresponding glyph, as they do in the sample codes. Table 2.1: Hex Byte Characters Code Byte Description E100 00 Symbol for Hex Byte 00 E101 01 Symbol for Hex Byte 01 : : : E1FF FF Symbol for Hex Byte FF (1) These characters should have the following properties: Case: No Combining Class: 0 Combining Jamo: No Directionality: Other Neutral (ON) Jamo Short Name: No Numeric Value: No (2) Private Use: No Surrogate: No Mirrored: No Mathematical: No Notes: (1) Hex byte values can collide with control-character names: FF, D1, D2, D3, D4, etc, from the control-pictures sets proposed in ADDITIONAL CONTROL PICTURES FOR UNICODE. If both hex bytes and control pictures are implemented, the font designer should ensure they are distinct enough visually that they will not be confused. (2) I do not have a strong opinion as to whether these characters should have the Numeric Value property; a case could be made either way. To prevent cell-boundary ambiguity, the font designer should employ some visual device to bind the two hex digits together in an unmistakable way, for example by arranging them diagonally within the character cell as shown in Figure 2.1. Figure 2.1: Suggested Glyph Format +--+ +--+ +--+ +--+ +--+ +--+ +--+ +--+ +--+ +--+--+ |0 | |0 | |0 | ... |0 | |1 | |1 | |1 | ... |E | |F | ... |F |F | | 1| | 2| | 3| | F| | 0| | 1| | 2| | F| | 0| | E| F| +--+ +--+ +--+ +--+ +--+ +--+ +--+ +--+ +--+ +--+--+ Summary: 256 new characters, U+E100 through U+E1FF. Status: Controversial. Should this proposal be rejected, a smaller selection of hex bytes is still required for the C1 control pictures set and for SNI "IBM" character-set glyphs: 00-1F and 80-9F (32 characters). 3. REFERENCES [1] American National Standards Institute, ANSI X3.4-1986, Code for Information Interchange (ASCII), 1986. [2] Data General, Programming the Display Terminal: Models D217, D413, and D463, Westboro, MA, 1991. [3] Digital Equipment Corporation, VT100 User Guide, EK-VT100-UG-002, Maynard, MA, 1979. [4] Digital Equipment Corporation, VT102 Video Terminal User Guide, EK-VT102-UG-003, Maynard, MA, 1982. [5] Digital Equipment Corporation, VT220 Owner's Manual, EK-VT220-UG-003, Maynard, MA, 1984. [6] Digital Equipment Corporation, VT220 Series Programmer Reference Manual, EK-VT240-RM-002, Maynard, MA, 1984. [7] Digital Equipment Corporation, VT330/VT340 Programmer Reference Manual, Volume 1: Text Programming, ED-VT3XX-TP-002, Maynard, MA, 1988. [8] Digital Equipment Corporation, Installing and Using the VT420 Video Terminal EK-VT420-UG.002, Maynard, MA, 1988. [9] Digital Equipment Corporation, VT520/VT525 Video Terminal Programmer Inforamtion, EK-VT520-RM.A01, Maynard, MA, 1994. [10] Heathkit Manual for the Video Terminal Model H19, The Heath Company, Benton Harbor, MI, 1979. [11] Hewlett Packard 2621A/P Interactive Terminal Owner's Manual, 1978. [12] Hewlett Packard 2648A Graphics Terminal Reference Manual, 1977. [13] IBM System/360 Principles of Operation, GA22-6821-8, Poughkeepsie, NY, 1970. [14] IBM National Language Design Guide, Volume 2: National Language Support Reference Manual, 4th Edition, SE09-8002-03, North York ON, 1994. [15] IBM 3270 Information Display System, Component Description, GA27-2749-10, 1980. [16] IBM 3164 ASCII Color Display Station Description, GA18-2317-1, 1986. [17] ISO International Standard 2022, Information processing -- ISO 7-bit and 8-bit coded character sets -- Code extension techniques, Third Edition, Geneva, 1986. [18] ISO/IEC International Standard 6429, Information technology -- Control functions for coded character sets, Third Edition, Geneva, 1992. [19] ISO/IEC 10646-1, International Standard 10646, Information Processing -- Multiple-Octet Coded Character Set, 1993-now. [20] Perkin Elmer Model 1100 User's Manual, Randolph, NJ, 1978. [21] Siemens Nixdorf, Bildschirmeinheit 97801-5xx Schnittstellen, Benutzerhandbuch, München, 1991. [22] Televideo 922 Video Terminal Display Operator's Manual, Sunnyvale, CA, 1984. [23] Televideo 965 Video Terminal Display Operator's Manual, Sunnyvale, CA, 1988. [24] The Unicode Standard, Version 2.0, Addison-Wesley Developers Press, 1996. [25] Wyse WY-60 Programmer's Guide, Wyse Technology, San Jose, CA, 1987. [26] Wyse WY-370 Programmer's Guide, Wyse Technology, San Jose, CA, 1990. [27] IBM 3270 Information Display System, Data Stream Programmer's Reference, GA23-0059-06, 1991. [28] ISO International Register of Coded Characters to Be Used with Escape Sequences, European Computer Manufacturers Association (ECMA), Geneva, 1985-present. [29] IBM Character Data Representation Architecture, Level 1 Registry, IBM Canada Ltd., National Language Technical Centre, Ontario, SC09-1391-00, 1990 (superseded by: IBM Character Data Representation Architecture, Registration and Registry, IBM Canada Ltd., Toronto, SC09-2190-00, 1995). [30] Knuth, Donald, "TeX and METAFONT, New Directions in Typesetting", American Mathematical Society / Digital Press, Bedford MA, 1979. [31] Apple Computer Corporation, Inside Macintosh, 1984. [32] HDS-3200 Terminal Series Owner's Manual, Philadelphia PA, 1987. [33] Zenith Data Systems Video Terminal Z-19-CN Operation Manual, Saint Joseph, MI, 1981. [34] Interview 30A/40A Operator's Field Reference Guide, Atlantic Research Corporation, ATLC-107-919-101, Alexandria, VA, 1982. 4. EXHIBITS The following exhibits, available only on paper, are reproduced from the terminal manuals indicated by the numeric reference number. Each exhibit is 1 page unless otherwise indicated. [A1] VT220 Display Controls Font (Left Half) [5]. [A2] VT220 Display Controls Font (Right Half) [5]. [A3] VT220 DEC Special Graphics Character Set [5]. [B1] VT320 Display Controls Font (Left Half) [7]. [B2] VT320 Display Controls Font (Right Half) [7]. [C1] VT420 Display Controls Font (Both Halves) [8]. [C2] VT420 DEC Technical Character Set [8]. [C3] HDS-3200 DEC Technical Character Set [32]. [D1] Data General US ASCII Character Set [2]. [D2] Data General Word-Processing, Greek, and Math Character Set [2]. [D3] Data General Line Drawing Character Set [2]. [D4] Data General Special Graphics Character Set [2]. [D5] Data General VT Multinational Character Set [2]. [D6] Data General VT Special Graphics Character Set [2]. [D7] Data General ISO 8859/1.2 Character Set [2]. [E1] Siemens Nixdorf 97801 ISO 8859-1 Character Set [21]. [E2] Siemens Nixdorf 97801 Klammern (Brackets) Character Set [21]. [E3] Siemens Nixdorf 97801 Facet Character Set [21]. [E4] Siemens Nixdorf 97801 IBM Character Set [21]. [E5] Siemens Nixdorf 97801 Math Character Set [21]. [E6] Siemens Nixdorf 97801 Character Generator (8 pages) [21]. [F1] Wyse 60 Native, Multinational, PC, and ASCII Character Sets [25]. [F2] Wyse 60 Graphics 1, 2, and 3 Character Sets [25]. [F3] Wyse 60 Standard ANSI, ANSI Graphics, and UK ANSI Character Sets [25]. [G1] Wyse 370 Controls Display Mode (74Hz) [26]. [G2] Wyse 370 Controls Display Mode (60Hz) [26]. [G3] Wyse 370 C0, ASCII, and Special Graphics Character Sets [26]. [G4] Wyse 370 C1, Multinational, and Latin-1 Character Sets [26]. [H1] IBM 3270 Operator Information Area Symbols (10 pages) [15]. [I1] TeX Standard Extension Font [30]. [J1] Apple Symbol Font (2 pages) [31]. [K1] Hewlett Packard 2621A/P National Terminal Character Set [11]. [L1] Heath/Zenith-19 Graphic Symbols (2 pages) [33]. [M1] Televideo 922 ASCII, Supplemental, Special Character Sets (4 pages) [22]. [N1] Sample screen from a data analyzer showing hex display [34]. (End)