======================================
iTXt proposal 19990110
Adam M. Costello <amc@cs.berkeley.edu>
======================================

Based on comments by many subscribers of png-list.  See below for the
rationale and a related proposal regarding registered keywords.
The two proposals will be voted on as one package.  If the proposal
passes, the iTXt chunk will be added to the core PNG specification,
in a new paragraph between the present iCCP and pHYs chunk specifications.
The "rationale" section will be added to the Rationale appendix,
replacing the current last paragraph under "text strings" of the core
PNG specification.  The second proposal will be added to the tEXt
chunk specification, between the 2nd and 3rd paragraphs from the
bottom.


=============
iTXt Proposal:

iTXt: International textual data

   Keyword:             1-79 bytes (character string)
   Null separator:      1 byte
   Compression flag:    1 byte
   Compression method:  1 byte
   Language tag:        0 or more bytes (character string)
   Null separator:      1 byte
   Translated keyword:  0 or more bytes
   Null separator:      1 byte
   Text:                0 or more bytes

The keyword is case-sensitive and subject to the same restrictions as a
tEXt keyword: it must contain only printable Latin-1 [ISO/IEC-8859-1]
characters (33-126 and 161-255) and spaces (32), but no leading,
trailing, or consecutive spaces.
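
The keyword restrictions above can be checked mechanically.  The
following is a minimal sketch, not part of the proposal; the function
name is_valid_keyword is hypothetical, and the keyword is assumed to be
given as raw bytes exactly as stored in the chunk.

```python
def is_valid_keyword(keyword: bytes) -> bool:
    """Check a keyword against the restrictions stated in the text."""
    # The keyword field is 1-79 bytes long.
    if not 1 <= len(keyword) <= 79:
        return False
    # Only space (32) and printable Latin-1 (33-126, 161-255) are allowed.
    for b in keyword:
        if not (b == 32 or 33 <= b <= 126 or 161 <= b <= 255):
            return False
    # No leading or trailing space.
    if keyword.startswith(b" ") or keyword.endswith(b" "):
        return False
    # No consecutive spaces.
    if b"  " in keyword:
        return False
    return True
```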

The compression flag is 0 for uncompressed text, 1 for compressed text.
Only the text field may be compressed.  The only value presently defined
for the compression method byte is 0, meaning zlib datastream with
deflate compression.  For uncompressed text, encoders should set the
compression method to 0 and decoders should ignore it.
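
As an illustration of how a decoder might treat the two compression
bytes, here is a minimal sketch using Python's standard zlib module.
The function name decode_text_field is hypothetical, and raising
ValueError for unknown values is an assumption; the proposal does not
say how a decoder should report them.

```python
import zlib


def decode_text_field(flag: int, method: int, raw: bytes) -> bytes:
    """Return the text field's UTF-8 bytes, decompressing if needed."""
    if flag == 0:
        # Uncompressed text: the method byte is ignored by decoders.
        return raw
    if flag == 1:
        # Method 0 (zlib datastream with deflate) is the only one defined.
        if method != 0:
            raise ValueError("unknown compression method: %d" % method)
        return zlib.decompress(raw)
    raise ValueError("invalid compression flag: %d" % flag)
```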

The language tag [RFC-1766] indicates the human language used by the
translated keyword and the text.  Unlike the keyword, the language
tag is case-insensitive.  It is a US-ASCII string consisting of
hyphen-separated words of 1-8 letters each (for example: cn, en-uk,
no-bok, x-klingon).  If the first word is two letters long, it is an ISO
language code [ISO-639].  If the language tag is empty, the language is
unspecified.
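
The syntax described above (hyphen-separated words of 1-8 letters,
compared without regard to case) can be sketched as follows.  The
names is_valid_language_tag and same_language_tag are hypothetical.
The sketch checks only the syntax; it does not verify that a
two-letter first word is actually listed in ISO-639.

```python
import re

# One or more words of 1-8 US-ASCII letters, separated by single hyphens.
_TAG_RE = re.compile(r"[A-Za-z]{1,8}(-[A-Za-z]{1,8})*")


def is_valid_language_tag(tag: str) -> bool:
    """Syntax check only; an empty tag means the language is unspecified."""
    return tag == "" or _TAG_RE.fullmatch(tag) is not None


def same_language_tag(a: str, b: str) -> bool:
    """Language tags are case-insensitive, unlike the keyword."""
    return a.lower() == b.lower()
```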

The translated keyword and text both use the UTF-8 encoding of the
Unicode character set [ISO/IEC-10646-1], and neither may contain a zero
byte (null character).  The text, unlike the other strings, is not
null-terminated; its length is implied by the chunk length.

Line breaks should not appear in the translated keyword.  In the text, a
newline should be represented by a single line feed character (decimal
10).  The remaining control characters (1-9, 11-31, 127-159) are
discouraged in both the translated keyword and the text.  Note that in
UTF-8 there is a difference between the *characters* 128-159 (which are
discouraged) and the *bytes* 128-159 (which are often necessary).
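
The distinction between characters and bytes can be seen by decoding
before checking.  The following is a minimal sketch; the function name
discouraged_characters is hypothetical.  It decodes the UTF-8 bytes and
then tests character codes, so a byte such as 129 that occurs inside a
multi-byte sequence is not flagged.

```python
from typing import List


def discouraged_characters(utf8_bytes: bytes) -> List[str]:
    """Return the discouraged control characters found in UTF-8 text."""
    found = []
    for ch in utf8_bytes.decode("utf-8"):
        code = ord(ch)
        # Line feed (10) is the permitted newline; the other control
        # characters (1-9, 11-31, 127-159) are discouraged.
        if 1 <= code <= 9 or 11 <= code <= 31 or 127 <= code <= 159:
            found.append(ch)
    return found
```

For example, the UTF-8 encoding of U+0141 is the byte pair 197, 129.
The byte 129 is necessary there, and the function reports nothing for
it, whereas the character U+0085 (encoded as 194, 133) is reported.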

The translated keyword, if not empty, should contain a translation
of the keyword into the language indicated by the language tag, and
applications displaying the keyword should display the translated
keyword in addition.
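
Putting the field layout together, a decoder might split the chunk data
as in the following minimal sketch.  The names ITXt and parse_itxt are
hypothetical, the input is assumed to be the chunk's data field only
(without length, type, or CRC), and the use of ValueError for malformed
input is an assumption rather than something the proposal specifies.

```python
import zlib
from typing import NamedTuple


class ITXt(NamedTuple):
    keyword: str
    language_tag: str
    translated_keyword: str
    text: str


def parse_itxt(data: bytes) -> ITXt:
    """Split iTXt chunk data into its fields and decode them."""
    # The keyword runs up to the first null separator.
    keyword, sep, rest = data.partition(b"\x00")
    if not sep or not 1 <= len(keyword) <= 79:
        raise ValueError("missing or badly sized keyword")
    if len(rest) < 2:
        raise ValueError("missing compression flag or method")
    flag, method = rest[0], rest[1]
    # Language tag and translated keyword are each null-terminated.
    language, sep, rest = rest[2:].partition(b"\x00")
    if not sep:
        raise ValueError("unterminated language tag")
    translated, sep, text = rest.partition(b"\x00")
    if not sep:
        raise ValueError("unterminated translated keyword")
    # The text is not null-terminated; it is whatever remains.
    if flag == 1:
        if method != 0:
            raise ValueError("unknown compression method")
        text = zlib.decompress(text)
    elif flag != 0:
        raise ValueError("invalid compression flag")
    return ITXt(
        keyword.decode("latin-1"),
        language.decode("ascii"),
        translated.decode("utf-8"),
        text.decode("utf-8"),
    )
```

Because the text is the last field, compressed data containing zero
bytes causes no ambiguity: only the first three separators are searched
for.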


==========
Rationale:

Keyword: Why not Unicode?

    Unicode is too fancy for the keyword, which is intended for both machine
    and human consumption.  Even applications without Unicode support
    should at least be able to understand the keyword (to selectively delete
    chunks, for example).

Keyword: Latin-1 vs. ASCII

    UTF-8 is used elsewhere in this chunk, and ASCII, unlike Latin-1,
    is compatible with UTF-8.  There is a translated keyword, so
    restricting the keyword to ASCII would not be a hardship.  So why
    use Latin-1?  Because all other existing chunks containing keywords
    use Latin-1, so applications can reuse code they already contain.

Compression flag and compression method: Why not combine them?

    We have deliberately avoided defining a null compression method in
    the past (for tEXt/zTXt), so that there would be no temptation to
    use it in IHDR.

Language tag:

    It is not always clear how to render Unicode text unless it is known
    what language is represented by the text.  Also, multiple iTXt
    chunks containing the same message in different languages could
    be present, and a decoder could automatically select the one most
    appropriate for its user.
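
    As an illustration of such automatic selection, here is a minimal
    sketch.  The function name select_text and the matching policy
    (exact tag first, then first word of the tag, then a chunk with an
    empty tag) are assumptions; the proposal does not prescribe any
    particular policy.

```python
def select_text(chunks, preferred):
    """Pick a text from (language_tag, text) pairs.

    `preferred` lists the user's language tags in order of preference.
    """
    for want in preferred:
        want = want.lower()
        # First try a case-insensitive match on the whole tag.
        for tag, text in chunks:
            if tag.lower() == want:
                return text
        # Then accept any chunk whose first word matches (e.g. de-at ~ de).
        for tag, text in chunks:
            if tag and tag.lower().split("-")[0] == want.split("-")[0]:
                return text
    # Fall back to a chunk whose language is unspecified, if any.
    for tag, text in chunks:
        if tag == "":
            return text
    # Otherwise take the first chunk, or None when there are no chunks.
    return chunks[0][1] if chunks else None
```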

Translated keyword:

    Registered keywords, like "Description", are registered only once,
    in a single language (probably English), so that they can be
    recognized automatically.  To be intelligible to speakers of another
    language, a translation must be provided.

Text: Unicode vs. MIME charset name

    Including a MIME charset name would be more general, and allow the
    use of legacy character sets.  But support for Unicode is growing,
    and allowing only Unicode is conceptually simpler and likely to
    eventually lead to greater interoperability.

UTF-8 vs. UCS-2 vs. UCS-4

    UCS-2 is short-sighted.  Neither UCS-2 nor UCS-4 is compatible with
    ASCII.  UTF-8 is both backward compatible with ASCII and forward
    compatible with UCS-4, and is generally the preferred encoding for
    interchange (as opposed to internal representation).
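
    The compatibility claims can be observed directly.  The following
    is a minimal sketch using Python's built-in codecs.  Python has no
    codec named UCS-2, so "utf-16-be" is used here as a stand-in, on
    the assumption that the two coincide for the characters shown;
    "utf-32-be" stands in for UCS-4.

```python
# ASCII text is unchanged by UTF-8, so ASCII-only tools can still read it.
ascii_text = "Title"
utf8 = ascii_text.encode("utf-8")
ucs2_like = ascii_text.encode("utf-16-be")   # stand-in for UCS-2
ucs4_like = ascii_text.encode("utf-32-be")   # stand-in for UCS-4

same_as_ascii = utf8 == ascii_text.encode("ascii")

# The 16- and 32-bit forms are not byte-compatible with ASCII.
ucs2_differs = ucs2_like != ascii_text.encode("ascii")
ucs4_differs = ucs4_like != ascii_text.encode("ascii")

# UTF-8 can also carry a character beyond the 16-bit range (U+10000).
beyond_16_bit = "\U00010000".encode("utf-8")
```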


===========================
Registered keyword proposal:

All *registered* textual keywords in tEXt and all other chunk types are
limited to the US-ASCII characters A-Z, a-z, 0-9, space, and the following
20 symbols:

   !"%&'()*+,-./:;<=>?_

but not the remaining 12 symbols:

   #$@[\]^`{|}~

This restricted set is the ISO-646 "invariant" character set [ISO-646].
These characters have the same numeric codes in all ISO character sets,
including all national variants of ASCII.
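
A candidate registered keyword can be tested against this restricted
set as in the following minimal sketch.  The names INVARIANT_SET and
uses_invariant_set are hypothetical.  The check covers only the
character set, not the length or spacing rules for keywords.

```python
import string

# A-Z, a-z, 0-9, space, and the 20 permitted symbols listed above.
INVARIANT_SET = frozenset(
    string.ascii_letters + string.digits + " " + "!\"%&'()*+,-./:;<=>?_"
)


def uses_invariant_set(keyword: str) -> bool:
    """True if every character is in the restricted ISO-646 set."""
    return all(ch in INVARIANT_SET for ch in keyword)
```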