Internationalized domain names (idn) wg meeting notes
IETF Pittsburgh, Aug 2000

Notes done by David Conrad. Thanks, David.

Agenda bashing -- no changes.

1. Marc Blanchet: WG update

1.1 New rev of charter since last meeting. Major changes:
  - specifying standards track protocol based on requirements
  - fundamental requirement to not disturb existing dns
  - WG must identify consequences of resulting protocol
  - WG needs to ensure good communication with interested groups
  - goals and milestones have been modified
  - finishing early would be nice

1.2 New working group web site: http://www.i-d-n.net
  - complement of main IETF web site
  - official IDN-WG site
  - managed by Marc Blanchet

1.3 RFC 2026 reiteration
  - per POISED, contribution means: presentation, email,
    internet-draft, comment, etc.

2. Requirements Draft (James Seng for Zita Wenzel)

James, as WG co-chair, is no longer editing the requirements draft.
No presentation; will go through the ID and highlight the important
bits.

Version 3 removes many of the requirements of version 2, which was
felt to have too many (35). Likely no proposal could meet all
requirements in v2. We spent 3 months going through the requirements
to see what could be removed, what would be nice, etc.

We have come to a consensus that we should use Unicode as the base
character set. Any proposal which uses localized encoding will not
meet IDN requirements.

New section to clarify the difference between hostnames and domain
names.

Graphic representation of DNS architecture/infrastructure from Harald
included. Focus our energy on the big box in the diagram (forwarding,
caching, parent-zone, and root server). Will consider the other
boxes, but they are not the focus.

KM: most important parts aren't in the picture. If you concentrate on
    wire protocol and don't consider users then the effort will fail.
    Must consider wider picture. Thorny issues lie in non-protocol
    interactions.
JK: Computers don't care. WG is important due to the interaction of
    people with computers.
JS: We won't ignore the other aspects, but must remain focused on
    what must be done, not on what is outside of WG scope. IAB has an
    RFC on internationalization that addresses things the WG should
    consider. If we can't solve the basics, then we can't go on to
    the next steps.
DC: There is a standard that we make that isn't over the wire. In
    constrained circumstances -- business card model -- we must deal
    with non-protocol stuff. What can go on business cards will
    affect what we're doing.

Requirements:
  * IDN must not break existing DNS. Minimize changes.
  * must preserve basic concepts and facilities, must maintain
    single, global, universal, and consistent hierarchical namespace.
  * new addition: no restriction on Unicode codepoint in wire format,
    but restrictions can be imposed elsewhere (registration, etc.).
  * domain names must resolve correctly.
  * document recommends Unicode only. If multiple character sets are
    allowed, each charset must be tagged and conform to rfc2277. We
    don't want to try to invent a new unicode system.
  * canonicalization must be done for internationalized domain names.
    What normalization form should be adopted? (C, D, KC, KD, new
    form KR?). Where should normalization be done (server or
    client)? Canonicalization/normalization rules should be locale
    independent.
  * Zone files should remain easily editable.
  * Protocol must work with DNSSEC.
  * Protocol must work with v4 and v6 and all features of the DNS.

AB(?): Which Unicode version? (3.0), bidirectionality? (yes)
KM: Fundamental assumption appears to be to change the DNS -- I don't
    see that as appropriate for a requirements document. The
    interactions you care about are app to app, app to user, and user
    to app -- none affect the DNS. Stuff that happens at higher
    layers is much more important than what happens on the wire.
JS: Does the reqs doc give the impression that the DNS is to be
    changed?
KM: There are implications, yes. The focus is on the DNS protocol but
    the problem is higher up.
MB: Will your concerns be sent to the mailing list?
KM: Yes.
JS: Didn't mean to give the impression that the DNS was going to
    change.
HA: when thinking of the DNS as a set of services, if we are to keep
    sane, then we should think of internationalized equivalents as
    new services that are to be made available, not as changes to
    existing services. Mapping of name to address should have two
    services -- map as we know it and map as the future may require.
    We shouldn't expect to convert applications by switching lower
    layers. The new services might not work exactly the same way as
    the existing services.
MB: will you write a draft about this idea?
JK: you can assume a draft will appear.
KM: I agree with Harald. I believe there is a whole set of missing
    requirements for incremental deployment. You have to have the
    least possible disruption. Changes must be independent of each
    other.
JS: see requirement 10.
JI: this is a problem we should not be solving. What problem are we
    trying to solve?
JS: this should go to the mailing list.

3. RACE (Paul Hoffman)
   draft-ietf-idn-race

How to do an ascii compatible representation of internationalized
characters. This proposal does not specify how it is to be used.
Fully compatible with today's DNS.

3 step process:
  - compress input text
  - convert compressed string to base32
  - mark with a prefix (currently 'ra--'); prefix will change

Each name part must be at most 63 octets to conform with the existing
DNS. RACE favors names that are all in one row:
  - can get up to 35 characters if single row
  - can get up to 17 characters if two or more rows and one of the
    rows is non-zero
  - can get 17 to 33 characters if using two or more rows but also
    using row 0

RACE is an ace format in ace-1 in the comparison document. Includes
an identifying mechanism for ace-2, namely ace-2.1.1.

HA: have you considered using UTR-6? Yes, you don't get a lot of
    advantage and UTR-6 does a lot of bit shifting which will be hard
    to implement.
KM: do you define a canonicalization form? No.
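As a rough illustration of the three RACE steps listed above, the
following Python sketch compresses, base32-encodes, and prefixes a
label. It is a simplification under stated assumptions, not the
draft's algorithm: it handles only the single-row case and an
uncompressed fallback, the marker byte UNCOMPRESSED and the function
names are hypothetical, and the draft's escaping rules and the
two-row-with-row-0 case are omitted. Under these assumptions the
63-octet limit gives the same 35 and 17 character figures quoted in
the notes.

```python
import base64

PREFIX = "ra--"        # prefix quoted in the notes; expected to change
MAX_LABEL = 63         # DNS label limit in octets
UNCOMPRESSED = 0xD8    # assumed marker byte meaning "no compression"


def compress(label: str) -> bytes:
    """Step 1: drop the shared high octet when all characters share a row."""
    units = [ord(ch) for ch in label]
    if any(u > 0xFFFF or 0xD800 <= u <= 0xDFFF for u in units):
        raise ValueError("sketch handles non-surrogate BMP characters only")
    rows = {u >> 8 for u in units}
    if len(rows) == 1:
        # Single row: one row octet, then one low octet per character.
        row = rows.pop()
        return bytes([row]) + bytes(u & 0xFF for u in units)
    # Otherwise: marker octet, then two octets per character.
    return bytes([UNCOMPRESSED]) + label.encode("utf-16-be")


def to_ace(label: str) -> str:
    """Steps 2 and 3: base32-encode the compressed octets, add the prefix."""
    b32 = base64.b32encode(compress(label)).decode("ascii")
    ace = PREFIX + b32.rstrip("=").lower()
    if len(ace) > MAX_LABEL:
        raise ValueError("encoded label exceeds 63 octets")
    return ace
```

Calling to_ace() on a 35-character label drawn from a single row
succeeds, while a 36-character one raises ValueError; with two
non-zero rows the limit drops to 17 characters.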
KM: are there multiple outputs? No. Another reason not to use UTR-6.
AG: strange to propose ways of compressing into 63 ascii since the
    wire format doesn't care -- the 63 limitation is at the resolver.
    Yes.
AG: restrictions are likely not per label.
JK: applications are likely to make bogus assumptions.
BS: not using ACE on the wire? Yes.
BS: what is ace expecting to receive? It is expecting unicode code
    points. Input to the compression is utf-16.

4. UDNS (Paul Hoffman for Dan Oscarsson)
   draft-idn-udns-00.txt

Attempts to be a full protocol specification. How do you flag idn
awareness in dns queries so idns can be handed back? If not flagged,
you must not give back internationalized names.

How to flag: use the IN bit in the DNS query. Last unused bit in the
second word. Arguably safe.

Proposes UCS normalization form C encoded in utf-8 with an ACE for
backwards compatibility.

DC: how does the length limit issue affect idn?
PH: UTF-8 restricts length of non-English idn's.
OG: deployment problems due to forwarding or recursive servers --
    some servers blindly copy those bits.
PH: Right.
MA: Broken servers are broken servers. Don't try to work with them.
JS: On length issues, Thai names are very very long.
DC: some length limit is a fact of life.
PH: yup.

5. ICU (Hyewon Shin)
   draft-ietf-idn-icu

  - uses IN bit to identify queries
  - use UTF-8 as wire format
  - case folding/canonicalization before transmission
  - IN bit indicates whether the query is from an IDNS
    resolver/server or not, and reduces overhead of canonicalization
  - unicode as CCS, utf-8 as CES
  - all domain name queries should be encoded into Unicode before
    being used in resolvers
  - resolvers convert the queries into UTF-8
  - case folding is locale independent, done before transmission,
    and indicated by the IN bit
  - valid query formats are identified with the IN bit

JS: change the title of your internet draft -- calling it the
    architecture of internationalized domain names is misleading.
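To make the length remarks under item 4 concrete for a UTF-8 wire
format such as the one ICU proposes, here is a minimal Python sketch.
The function names are hypothetical and neither draft specifies this
code; it assumes the existing 63-octet label limit still applies and
uses Python's built-in Unicode case folding as a stand-in for
locale-independent folding.

```python
MAX_LABEL_OCTETS = 63  # existing DNS label limit, assumed to still apply


def fold_and_encode(label: str) -> bytes:
    """Locale-independent case folding, then UTF-8 as the wire encoding."""
    return label.casefold().encode("utf-8")


def fits(label: str) -> bool:
    """True if the folded, UTF-8 encoded label fits in one DNS label."""
    return len(fold_and_encode(label)) <= MAX_LABEL_OCTETS


def max_chars(sample_char: str) -> int:
    """How many copies of sample_char fit in one label when sent as UTF-8."""
    return MAX_LABEL_OCTETS // len(sample_char.encode("utf-8"))
```

An ASCII letter takes one octet, so 63 fit in a label; a Thai or Han
character takes three octets in UTF-8, so max_chars() drops to 21,
which is the restriction PH and JS refer to.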
PH: you talk about case folding, but you don't talk about
    canonicalization of the more complex stuff (a+umlaut vs.
    a-umlaut). Will non-canonicalized names get passed to the
    resolver? Canonicalization is not addressed. It would be done at
    the same place as case folding.
BS: the resolver does the UTF-8 encoding -- what is the application
    sending to the resolver? We assumed unicode.
DC: proposing the creation of a parallel DNS service? Yes.
DC: do you discuss interworking with existing DNS? Not yet.

6. Microsoft's approach (Stuart Kwan)
   draft-skwan-utf8-dns-04.txt

Microsoft had a requirement to move people off WINS. WINS allowed the
use of Unicode names.

KM: who imposed the requirement?
SK: WINS didn't scale.

In Win2K, the client can initiate a query with a unicode name; the
resolver converts names to UTF-8 (does not downcase). On the server
side, database load downcases. On query, downcase and do a
byte-for-byte comparison.

Very few changes since the -00 draft. Win2K implementation hasn't
changed. Biggest flaw: there is no normalization. Not sure what is
the best. Would like it to be published as informational.

Michael Patton: make editorial changes to update about what you've
    learned.
SK: there is a big emphasis to only use these names when absolutely
    necessary. But we'll update the draft as requested.
PH: What is WSALookupServiceBegin/Next? That doesn't exist in the
    draft.
SK: Application gives us unicode and we turn it to utf-8.
PH: so this sends utf-8 over the net.
SK: yes. This tends to be self-correcting.
PH: needs to be discussed in the document.
SK: OK.
BS: any experience with existing applications?
SK: userbase is too large to poll, but nobody has complained.
SK: Microsoft will implement the idn standard when ready.

7. IDNE (Marc Blanchet)

Until a month ago, no proposal used EDNS.

Rationale:
  - use 8 bit
  - use only one character set and encoding
  - transformation on client side
  - versioning control to adapt new chars, languages, etc.
  - use standard dns extension mechanism

Description:
  - chars in labels are UTF8
  - strings in labels are pre-processed by nameprep
  - idn labels use the EDNS extension mechanism
    - ELT 0b000010 is used
    - size of idn label
    - idn label encoded in utf-8
  - idne labels can be mixed with std13
  - regular compression scheme is supported

Current maximums:
  - label = 63
  - dn = 255
IDNE maximums are:
  - label = 255
  - dn = 1023
Rationale:
  - utf8 encodes 1 i18n char in up to 4 octets, so multiply by 4
  - idne udp packet size must support 1220 octets, equivalent to the
    ipv6 minimal MTU
  - sender must announce via OPT
  - idn protocol version in opt pseudo-rr rdata field with option
    code
  - this doc with nameprep defines v1 of idn
  - permits idn revisions

idn api:
  - getipnodebyname and getipnodebyaddr specified in RFC 2553
  - idn flag to be added
  - no more return codes seem to be needed

Transition and deployment:
  - idne depends on edns
  - need for an ace for short term?
    - depends on speed of edns deployment
    - v6 and dnssec require edns
    - 2 protocols make things more complex. One can be chosen forever
    - names defined in the ACE must be represented in IDNE

Enhancements?
  Yergeau proposed major and minor revision numbers
  - minor for incremental table changes that do not require new
    algorithms, so no code change, just load a new table
  - major for major revisions that need code change
  Language tagging?
  Compression needed?

MA: extending total overall length of a name is problematic.
MB: yes, but application must be IDN aware.
JS: language tagging using plane 14?
MB: yes, since using edns gives more space.
Are you using stateful encoding?
MB: No.
OG: Very good first start. Since you use EDNS, you only use modern
    servers and you can determine if downstream servers can work with
    EDNS.
DC: Your statement that near term may replace the long term is very
    insightful. A lot of pressure now to deploy.
MB: Yes.

8. Name Preparation (Paul Hoffman)
   draft-ietf-idn-nameprep

Requirements:
  - output of a single unambiguous string given an input
  - lets users enter anything that might look right to them
  - typical user should be able to follow logic of preparation

Current order:
  - check for prohibited input (many)
  - fold case
  - canonicalize with normalization form KC

Possible alternative:
  - check for prohibited input (a few, just for case)
  - fold case
  - canonicalize with normalization form KC
  - check for prohibited output

Open issues:
  - prohibit on input or output?
  - allow characters that would be ignored, e.g. hebrew vowels are
    optional?
  - include folding that is specific to the language of the name (not
    just script used)? How would language information be known?

KM: locale specific feedback mechanism may imply the DNS is simply
    unsuitable to do internationalization.
PH: Yes. Currently no documents xxx

Where do we do name preparation? 3 places possible:
  - application
  - resolver
  - dns service
There are reasons to do it in each and really good reasons to not do
it in each. Document is neutral.

TH: seems hard to get the error conditions out of the 4 step model.
    Is there any other group who can solve this, since we don't have
    the expertise to do this?
PH: No one has stood up to this task.
JK: There are a few organizations who have looked at this and run
    away.
PH: individuals at those organizations have indicated they'd help.
DC: dns service will guarantee failure -- there will be enough
    infrastructure change that it'll take years to deploy. What is
    the goal? DNS has no semantics on the strings. Probably in the
    resolver.
JS: reverse logic -- forbid characters by default, permit specific
    characters.
PH: they are equivalent.
JS: easier to check what you want than what you don't want.

=====================================================================

1. Using DNS for Canonicalization data

  - IDN working group has no consensus on how to apply local
    canonicalization rules.
  - Unrealistic for all systems to have all local rules
  - Dynamically learning rules is desired.

Items that should be defined as local canonicalization rules:
  - list of characters that can be used as internationalized domain
    name labels
  - case folding or mappings
  - common normalization/canonicalization rules to be adopted
  - order of normalization/canonicalization rules
  - How to get this information: use the DNS
  - Provide the mappings via the DNS

Advantages:
  - Rules can be administered by domain authorities
  - version of rules can be controlled via serial number
  - caching effect works

Disadvantages:
  - increases dns queries
  - hard to adapt to intermittently connected sites
  - CJK has lots of data
  - simple rules only

How to provide:
  - define usable characters as txt rrs
  - define meta information as txt rr
    - normalization rules
    - version of normalization rules
  - use idn.arpa domain for table definition
      norm-form           - early normalization method name
      norm-form-version
      norm-form-url
      "."   - a character the same as the label
      ".^"  - a character the same as label but not allowed as the
              first (e.g., '-')
      ".$"  - same as above but can't be used as the last
      "a"   is the character itself

How it works:
  - query for tld.idn.arpa -- if fails, no rule adopted
  - look for common normalization method and its version
  - look up each character

Issues:
  - what are internationalized TLDs -- need special methods to
    canonicalize TLDs themselves if they were not ASCII-only
  - reducing the number of queries -- each character generates a
    query
  - examination of meta labels
  - escape syntax to use symbols in ASCII as labels

Using DNAME to reduce queries:
  - folding characters into sequence, queries can be reduced
  - add DNAME and CNAME for each canonicalized character
  - use DNAME instead of TXT

Issues for this method:
  - both servers and clients must be able to use DNAME
  - servers for IDN.ARPA must be recursive
    -- do not resolve overhead
    -- restriction on the number of aliases
  - easily exceeds packet size limits -- need EDNS0
  - clients must analyze response -- move overhead

OG: both DNAME and CNAME are terminal nodes. Also CNAME can't be used
    with anything else but security records. You assume servers don't
    do case folding but they do. DNAME can't point up in the
    hierarchy.
LJL: suppose I want to register a name in .JP. Rules are connected to
    the TLD and not to the language -- the rules should be connected
    to the language, not the nation. Rules are fixed per name. Using
    TXT: won't the resolver get confused? IDN.ARPA being recursive
    won't work. How does this work with DNSSEC?
YY: TLD defines the rules for registration.
JS: can we move it to the mailing list? The TLD defines what
    characters are valid.
HA: what is the advantage of this per-character approach? Why not put
    posix locale def into a domain?
YY: DNS is not the best method of doing this, but this is used for
    the DNS, so only the DNS is being used.
HA: revise the proposal without storing character data in the DNS.
    Also think about how this works with clients that do not have the
    code and what the size of the client code will be and how often
    you expect the client to upgrade. I think this approach has some
    good things, needs more discussion.

2. Han ideograph for IDN (James Seng)

Why I did this draft:
  - lack of information on han ideographs (HI) in IETF
  - HI is very complex -- over 103,000 characters, each having their
    own pronunciation, etc.
  - draft also talks about CJK
  - encourage discussion and encourage others to write drafts on
    their scripts

HI are CJK, composed of radicals which are made of simple strokes.
HI originated from China.
HI commonly used in China, Japan, Korea, Taiwan, Hong Kong,
Singapore, Malaysia.

Case folding:
  - conversion between ideographs can be done in various ways.
    -- character based == word to word
    -- lexicon and context based == translation
  related issues:
  - Unicode does CJK unification
  - 27,786 HI in Unicode
  - Unicode normalization & canonicalization defined in UTR15 and 21,
    but handling is limited

zvariants are HI which share the same etymology but the glyph varies
in some minor way -- should be considered equivalent.

Chinese:
  - Originated from pictographs, but not all ideographs are
    pictographs
  - because of origin, each HI has a meaning
  - Chinese was simplified in 1950 (simplified, SC); the original is
    known as traditional (TC)
  There are 2244 SC in the last official count and Unicode has 2145.
  There are multiple TC for one SC.
  SC-to-TC is almost impossible (need context information).
  TC-to-SC may be workable -- may not be perfect, but it can work.

TH: we should do code point to code point because mapping will make
    things far too hard due to the need for contextual information.
JS: please read the draft -- just discussing the issues. TC and SC
    aren't usually mixed.
TH: not true. SC and TC are seldom used in the same phrase. You can
    solve mapping using CNAME and DNAME.

Korean:
  Hangul is more commonly used now instead of Chinese derived
  characters. Hangul doesn't have meaning like Chinese ideographs.
  Have their own ideographs with simplified forms.

Japanese:
  Kanji, Hiragana and Katakana.
  Kanji is based on Chinese.
  Hiragana is a syllabary.
  Japanese in written form is a vocal script which maps how it is
  pronounced fairly accurately.
  Most verbs and nouns are written in Kanji.
  Depending on context, pronunciation may be different.
  Conversion between hiragana and kanji is not practical.
  Has their own ideographs (kokuji) with simplified forms.

Ideograph Description:
  The same characters can be constructed in multiple ways.

Mechanism:
  HI may or may not be folded for the comparison of domain names.
  Folding may occur at
  - DNS clients or by user agents
  - DNS servers
  - registration time
  In particular, folding during registration time is critical for
  operational reasons even if we do not adopt any Han folding.

HA: to summarize, one of the real problems is that when someone
    presents a domain name, they feel they have the right to all the
    variants, e.g., a chinese name can be represented with different
    characters in Korean chinese chars and Japanese chinese chars.
PH: saying people will expect certain things. We shouldn't listen to
    what will be legal or not -- just focus on languages.

3. DNSII-MDNP (Edmon Chung, David Leung)

Written an internet-draft, but missed deadline. Been working on this
for more than a year.

Goal: put all the characters into the internet
  - must pass the business card problem
  - must not break anything
  - must be flexible
  - less impact on the client -- pain should be in the resolver

dnsii protocol has two parts:
  - identifier -- inserted before a label -- first two bits, use of
    the bit sequence "10". Prefer 10 over edns (01) since the 3rd and
    4th bits are for future expansion. EDNS reduced possibilities for
    future expansion
  - packet label -- a 12 bit number used to determine the encoding
    scheme
  - valid encodings specified in a list. There should be no
    ambiguity.

From RFC 2277: all protocols must identify for all character data
which charset is in use.

Compression and edns will still work as expected. Should not require
any adjustment to dnssec or ipv6.

Charset encoding:
  - uses iso10646
  - flexible to encompass other encoding schemes
  - all legal symbols

Canonicalization:
  - applications should
  - servers must
  - use form C recommended

Han folding is similar to treating color and colour identically.

We have working code. This approach is patented.

HA: nameservers everywhere must be able to convert between all 400
    character sets, right? An implementation decision.
HA: What do you do when it encounters a character set it doesn't
    understand?
    The fall back should be back into UTF-7; if still can't be found,
    return an error.
HA: So the client must know how to convert? No. Take it to the
    mailing list.
OG: two observations. I strongly encourage everyone: using EDNS label
    types allows clients to discover capabilities on the server.
    Don't worry about saving a few bits. Think more about how ideas
    should be expressed.
DC: One of the considerations of EDNS we worry that the second byte
    is a second count.
OG: send a note to namedroppers on protocol issues.
PF: two questions: re-emphasize what Harald says. From the email
    world, best to use as limited characters as possible. Do client
    side conversion into simple character sets. Second: what about
    future character sets -- you will have to fall back all the time.
DC: We encourage the use of Unicode.
PF: For this to not become a local solution it should be done as
    close as possible to the client.

4. Evaluation of proposed Encodings for IDN (Yashuhiro Morishita)

mDNKit -- multilingual domain name evaluation kit

Objectives:
  - evaluation of technology
  - promotion of standardization
  - technical contribution

Developed by JPNIC, released Jul 13.

Components:
  - dnsproxy server
  - codeset converter
  - common library for handling multilingual domain names
  - patches for 8-bit clean bind and tools to override unix shared
    libraries

Evaluated drafts:
  - race
  - skwan-utf8
  - jseng-utf5

Evaluation points, limitations of usage:
  - name length
  - interoperability with dns
  - ease of operation
  - usage of multilingual domain names

race and utf5 are ascii compatible encodings
  - available hostname length shorter than utf-8
  - compression, but needs all characters in a single row

utf8 has incompatibilities with present DNS
  - hostname not 8bit clean
  - applications use 'checknames'

ace strings need an identifier to distinguish them from normal ascii
strings -- race uses ra--; utf-5 uses zld.

Charset encoding/conversion tools are essential.
race is currently the best method, suitable for transition.
utf8 is incompatible with current dns.

Todo:
  - interop testing
  - evaluating more drafts

LM: if you evaluate cut and paste, in most of these systems cut and
    paste doesn't work very well. Have any of these systems had
    usable cut and paste?
YM: race is the best current method for cut and paste since it uses
    only ascii.
LM: if you have a display method that shows race in ASCII and you
    cut/paste it, how do you have interoperability between
    application and display?
JS: properly implemented system will use MIME on the paste.

5. NuBIND Implementation (Bill Semich)

Original goal was to internationalize BIND with maximum support for
internet standards: rfc2277, 2279, iso-10646, Unicode UTR15 and 21.

3 components:
  - rdns
  - lute (transitional)
  - uvce

JK: intellectual property?
BS: not submitting this as an IETF submission. "nubind" name is
    trademarked.

Current implementation status:
  - nubind operational for 8 months
    - 5 in second-level domain 'eu.nu'
    - 3 in a slave server for .NU
  - many current and potential external implementation problems that
    will need to be dealt with
    - mail servers, others unexpectedly fail
    - legacy dns servers
  security considerations
    - ssl and x.509 certs must be modified to work with 8 bit dn
    - inverse lookups
  application problems
    - browsers: no standard support
  client environment problems
    - dhcp server configuration
    - host client with an idn resolver setting
    - all current unix resolvers support ascii only
  http server problems

Postpone implementation of idn in the DNS until minimal impact
standards or alternatives are accepted. Minimize impact on DNS.
Why use the internet infrastructure to achieve application goals?

JS: will you submit an ID?
BS: Can submit.
TN: need to remove copyright notice in presentation.
BS: take it offline.

6. Comparison of IDN proposals (Paul Hoffman)
   draft-ietf-idn-compare

Wrapup of technical presentations. Talking about comparison doc.
Basic idea of doc is to describe significant features that we need to
think about. Includes pros and cons and what features are really
needed.

Sections of doc:
  - architecture
  - names in binary
  - names in an ACE
  - prohibited characters
  - canonicalization
  - transitions
  - root server considerations
  - security considerations

arch:
  - just send binary
  - send binary or ace
  - just send ace
  will be updated to add details about what is sent between
  app<->resolver, resolver<->server, server<->server

names in binary:
  - utf-8 or labeled charsets
  - distinguished binary from current format
  will be updated to add where the different markings will be used

names in ascii:
  - format
  - how to distinguish ace

prohibited characters:
  - identical or near-identical characters
  - separators
  - non-displaying and non-spacing characters
  - private use characters
  - punctuation
  - symbols

BM: had a machine called . There is a distinction that needs to be
    made between hostname and domain names.
PH: Yes.
MA: reference RFC is 952, not 1035.
JS: lot of confusion on this issue.
HA: this is in the requirements doc.

canonicalization:
  - type of canonicalization (normalization form C or Form KC)
  - other canonicalization (case folding (ASCII/non-ASCII), han
    folding). If you want a good description, see the current Unicode
    standard.
  - where is canonicalization done
  - location of canonicalization may determine how quickly idn can be
    deployed.

transitions:
  - always do current plus new architecture
  - transition period
  draft will be updated to add specific details for transition and
  what needs to be transitioned

EN: are there drafts that talk about transitions?
PH: No. Drafts on transitions do not need to be associated with a
    proposal.
PF: need clear distinction between Unicode consortium work and this
    WG.
PH: part of transition will include how to get groups outside the us
    to transition with us. ISO has groups which determine code points
    in the repertoire.
    Unicode consortium does not add code points to ISO standards, ISO
    does. As such, we don't need to liaise with ISO -- just need to
    be aware of what ISO does in this space.

Root server considerations:
  - don't want RSes to blow up
  - how quickly can we have real IDNs in the TLDs

JS: RS ops worried about operational implications, e.g., how the RS
    op will verify data is correct.

Security considerations:
  - don't want to reduce general security of the DNS
  - biggest issues include IDN names in digital certs and name
    spoofing

Expected changes:
  - add ideas from new drafts
  - add comments about the drafts from the list
  - give more detail on canonicalization
  - additional details about effects on apps, resolvers, and servers

Please specify categories -- will help readers see which parts of IDN
the draft covers. Please talk about this on the mailing list. Draft
will be updated within a few weeks.

BS: might be important to list patent and IP issues.
PH: maybe. Will defer to the AD.
EN: we already have a process to do this.
PH: might be worthwhile to list IPR the IETF has been notified of.
MB: I can put it on the website.
EN: use a generic notice, not a listing of IPR.

7. Working Group Next Steps (Marc Blanchet)

Requirements doc:
  - contentious issues? incomplete?
  - ready for WG last call?
  - RFC informational?
  Need minor revisions. Pretty ready to move to last call.

EN: there are some items that need to be resolved.
HA: need to look at Keith's comments. The chairs/author should
    declare that the comments on the draft should be 'identify
    problem, old text, new text'. No comments will be accepted that
    aren't in this form. Have a hard deadline (2 weeks).
MB: document editors agree?
JS: yes.
MB: OK.

Comparison document:
  - keep it going by enhancing it
  - RFC informational?

JS: should consider transition period.
HA: could discuss the transition properties without a proposal, but
    mechanisms will depend on proposals.
MB: wg agreement to keep it going.
3 types of solutions:
  - do not change the DNS, application layer solution
  - do change the DNS
  - directory based solution (with or without changes to the DNS)
Want one protocol at the end.

Convergence process:
  - have all current authors work together on a converged solution?
  - use comparison document as the seed doc for discussion

EN: list 3 solutions, but no proposals to converge. Should focus on
    the dns proposal.
OG: good way to proceed. Might be better to 'cherry pick' from all
    the proposals. Maybe use authors as the design team.
JK (pretending to be Zita): she wants to reiterate taking discussions
    to the list. Also agrees with Harald's proposal.
HA: in the solution space, trying to converge on the best dns based
    solution. If we can't come up with a solution that meets the
    requirements or can't be deployed, then we look at other
    approaches.
BS: a long term solution may be more appropriate to look at than
    short term.
JS: other solutions not using the DNS may exist, e.g. directory based
    solution.

Everyone agree to the process? Need to set up the design team --
please send mail to MB.

AB:  Aaron Brunner
AG:  Andreas Gustafsson
BM:  Bill Manning
BS:  Bill Semich
DC:  Dave Crocker
EN:  Erik Nordmark
HA:  Harald Alvestrand
KM:  Keith Moore
JK:  John Klensin
JI:  John Ioannidis
JS:  James Seng
LJL: Lars-Johan Liman
LM:  Larry Masinter
MA:  Mark Andrews
MB:  Marc Blanchet
OG:  Olafur Gudmundsson
PF:  Patrick Falstrom
PH:  Paul Hoffman
TH:  Ted Hardie
TN:  Thomas Narten