======================================
iTXt proposal 19990110
Adam M. Costello
======================================

Based on comments by many subscribers of png-list.

See below for rationale and a related proposal regarding registered
keywords.  The two proposals will be voted on as one package.

If the proposal passes, the iTXt chunk will be added to the core PNG
specification, in a new paragraph between the present iCCP and pHYs
chunk specifications.  The "rationale" section will be added to the
Rationale appendix, replacing the current last paragraph under "text
strings" of the core PNG specification.  The second proposal will be
added to the tEXt chunk specification, between the 2nd and 3rd
paragraphs from the bottom.

=============
iTXt Proposal:

iTXt: International textual data

    Keyword:             1-79 bytes (character string)
    Null separator:      1 byte
    Compression flag:    1 byte
    Compression method:  1 byte
    Language tag:        0 or more bytes (character string)
    Null separator:      1 byte
    Translated keyword:  0 or more bytes
    Null separator:      1 byte
    Text:                0 or more bytes

The keyword is case-sensitive and subject to the same restrictions as a
tEXt keyword: it must contain only printable Latin-1 [ISO/IEC-8859-1]
characters (33-126 and 161-255) and spaces (32), but no leading,
trailing, or consecutive spaces.

The compression flag is 0 for uncompressed text, 1 for compressed text.
Only the text field may be compressed.  The only value presently
defined for the compression method byte is 0, meaning zlib datastream
with deflate compression.  For uncompressed text, encoders should set
the compression method to 0 and decoders should ignore it.

The language tag [RFC-1766] indicates the human language used by the
translated keyword and the text.  Unlike the keyword, the language tag
is case-insensitive.  It is a US-ASCII string consisting of
hyphen-separated words of 1-8 letters each (for example: cn, en-uk,
no-bok, x-klingon).  If the first word is two letters long, it is an
ISO language code [ISO-639].
If the language tag is empty, the language is unspecified.

The translated keyword and text both use the UTF-8 encoding of the
Unicode character set [ISO/IEC-10646-1], and neither may contain a zero
byte (null character).  The text, unlike the other strings, is not
null-terminated; its length is implied by the chunk length.

Line breaks should not appear in the translated keyword.  In the text,
a newline should be represented by a single line feed character
(decimal 10).  The remaining control characters (1-9, 11-31, 127-159)
are discouraged in both the translated keyword and the text.  Note that
in UTF-8 there is a difference between the *characters* 128-159 (which
are discouraged) and the *bytes* 128-159 (which are often necessary).

The translated keyword, if not empty, should contain a translation of
the keyword into the language indicated by the language tag, and
applications displaying the keyword should display the translated
keyword in addition.

==========
Rationale:

Keyword:

    Why not Unicode?

        Unicode is too fancy for the keyword, which is intended for
        both machine and human consumption.  Even applications without
        Unicode support should at least be able to understand the
        keyword (to selectively delete chunks, for example).

    Latin-1 vs. ASCII

        UTF-8 is used elsewhere in this chunk, and ASCII, unlike
        Latin-1, is compatible with UTF-8.  There is a translated
        keyword, so restricting the keyword to ASCII would not be a
        hardship.  So why use Latin-1?  Because all other existing
        chunks containing keywords use Latin-1, so applications can
        reuse code they already contain.

Compression flag and compression method:

    Why not combine them?

        We have deliberately avoided defining a null compression
        method in the past (for tEXt/zTXt), so that there would be no
        temptation to use it in IHDR.

Language tag:

    It is not always clear how to render Unicode text unless it is
    known what language is represented by the text.
    Also, multiple iTXt chunks containing the same message in
    different languages could be present, and a decoder could
    automatically select the one most appropriate for its user.

Translated keyword:

    Registered keywords, like "Description", are registered only once,
    in a single language (probably English), so that they can be
    recognized automatically.  To be intelligible to speakers of
    another language, a translation must be provided.

Text:

    Unicode vs. MIME charset name

        Including a MIME charset name would be more general, and allow
        the use of legacy character sets.  But support for Unicode is
        growing, and allowing only Unicode is conceptually simpler and
        likely to eventually lead to greater interoperability.

    UTF-8 vs. UCS-2 vs. UCS-4

        UCS-2 is short-sighted.  Neither UCS-2 nor UCS-4 is compatible
        with ASCII.  UTF-8 is both backward compatible with ASCII and
        forward compatible with UCS-4, and is generally the preferred
        encoding for interchange (as opposed to internal
        representation).

===========================
registered keyword proposal:

All *registered* textual keywords in tEXt and all other chunk types are
limited to the US-ASCII characters A-Z, a-z, 0-9, space, and the
following 20 symbols:

    !"%&'()*+,-./:;<=>?_

but not the remaining 12 symbols:

    #$@[\]^`{|}~

This restricted set is the ISO-646 "invariant" character set [ISO-646].
These characters have the same numeric codes in all ISO character sets,
including all national variants of ASCII.
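
For illustration only, here is a rough Python sketch of how a decoder
might split the data field of an iTXt chunk according to the layout
proposed above, and how a registry tool might check a candidate keyword
against the invariant character set.  The names ITXt, check_keyword,
parse_itxt, and uses_registered_charset are hypothetical and are not
part of either proposal; the sketch assumes the chunk length, type, and
CRC have already been handled elsewhere.

```python
import string
import zlib
from typing import NamedTuple


class ITXt(NamedTuple):
    """Decoded fields of one iTXt chunk (hypothetical container)."""
    keyword: str             # Latin-1, 1-79 bytes
    language_tag: str        # US-ASCII, may be empty
    translated_keyword: str  # UTF-8, may be empty
    text: str                # UTF-8, may be empty


def check_keyword(raw: bytes) -> None:
    """Apply the tEXt keyword restrictions quoted in the proposal."""
    if not 1 <= len(raw) <= 79:
        raise ValueError("keyword must be 1-79 bytes")
    if any(not (32 <= b <= 126 or 161 <= b <= 255) for b in raw):
        raise ValueError("keyword contains a non-printable Latin-1 byte")
    if raw.startswith(b" ") or raw.endswith(b" ") or b"  " in raw:
        raise ValueError("leading, trailing, or consecutive spaces")


def parse_itxt(data: bytes) -> ITXt:
    """Split the data field of an iTXt chunk into its four strings."""
    keyword_raw, sep, rest = data.partition(b"\x00")
    if not sep:
        raise ValueError("keyword is not null-terminated")
    check_keyword(keyword_raw)
    if len(rest) < 2:
        raise ValueError("missing compression flag or method")
    flag, method, rest = rest[0], rest[1], rest[2:]
    language_raw, sep, rest = rest.partition(b"\x00")
    if not sep:
        raise ValueError("language tag is not null-terminated")
    # The text is not null-terminated: it is whatever remains.
    translated_raw, sep, text_raw = rest.partition(b"\x00")
    if not sep:
        raise ValueError("translated keyword is not null-terminated")
    if flag == 1:
        if method != 0:
            raise ValueError("unknown compression method")
        text_raw = zlib.decompress(text_raw)
    elif flag != 0:
        raise ValueError("compression flag must be 0 or 1")
    # With flag 0 the method byte is deliberately ignored.
    if b"\x00" in text_raw:
        raise ValueError("text may not contain a zero byte")
    return ITXt(
        keyword_raw.decode("latin-1"),
        language_raw.decode("ascii"),
        translated_raw.decode("utf-8"),
        text_raw.decode("utf-8"),
    )


# Letters, digits, space, and the 20 symbols listed in the proposal.
REGISTERED_CHARS = frozenset(
    string.ascii_letters + string.digits + " " + "!\"%&'()*+,-./:;<=>?_"
)


def uses_registered_charset(keyword: str) -> bool:
    """True if every character is in the ISO-646 invariant set."""
    return all(ch in REGISTERED_CHARS for ch in keyword)
```

A decoder could call parse_itxt on each iTXt chunk and then compare
language_tag case-insensitively against the user's preferred language.
The sketch does not reject the discouraged control characters, since
the proposal discourages them rather than forbidding them.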