CURRENT MEETING REPORT

Minutes of the Inter-Domain Multicast Routing Working Group (idmr)

Reported by Bill Fenner, Xerox PARC, and Tony Ballardie, University College London

First Session

Deborah Estrin spoke first about interoperability mechanisms between PIM sparse mode and a multicast backbone, whether that backbone is Level 2 or the current DVMRP. All groups use local RPs, and the border routers join towards the local RPs in order to inject traffic into the backbone. A single designated border router registers externally-sourced packets to the internal RP. Internal RPs advertise themselves on a special bootstrap group whose RP is elected via PIM Query messages. Open issues include whether to explicitly notify the designated border router of group memberships inside the domain or to do "flood and prune," and how to use a domain of this type for transit traffic.

Deborah then spoke about how to connect such domains with a PIM-SM backbone. The changes in this case are limited to the local RPs. When there is a PIM-SM backbone, there is a hierarchy of RPs; it is not yet clear whether two levels of hierarchy are enough or more are needed. Join messages are modified to include a hierarchy level, and crossing tree branches are handled by having the highest-level branch "win": if a level-1 Join message meets a level-2 tree branch, it does not change the incoming interface. The local RPs join towards the next-level RPs. Candidate-RP lists are distributed using the same bootstrap mechanism as in the local domain.

Mark Handley then spoke about Hierarchical PIM, or HPIM. HPIM is similar to Deborah's scheme (and Deborah's scheme was derived from Mark's proposal), but Mark expects HPIM to have five to six levels of hierarchy. Each RP knows its own level and knows the candidate-RP list for the next level up. The group address can be hashed into the list of RPs to determine the RP for that group, which removes the requirement to store RP/group mappings (a sketch of this mechanism appears below). The candidate-RP list is not meant to change often, but when it does, the old RP joins to the new RP to keep the tree intact and initiates transfers of its receivers to the new RP.

HPIM also makes the following "gratuitous" changes to PIM, which are not directly related to the hierarchy of RPs. Join messages establish bidirectional forwarding state, not unidirectional, and they are acknowledged. RP failure is detected by timeouts and is handled by using an alternate hash function to determine a fallback RP. If a domain has local receivers, a sender in the same domain uses their (bidirectional) state; if not, it sends a "sender join" hop-by-hop towards the RP until it hits the tree; "sender join" state is unidirectional. The loop avoidance mechanism can result in unnecessary traffic flow but is relatively simple.

Outstanding issues include that the hierarchy and RPs potentially require too much configuration, although there may be a way to configure such things automatically. The hash function can lead to a sub-optimal top-level RP, but since tree links are bidirectional the top-level RP can potentially be pruned off. More thinking needs to be done on changing RPs dynamically based on traffic load, and on RP selection algorithms.

The second half of the meeting was on providers' experiences with deploying PIM in their networks. First, Matt Crawford from Fermilab spoke about his experiences with PIM and High Energy Physics (HEP).
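For illustration, a hash-based group-to-RP mapping of the kind Mark described might look like the following minimal Python sketch. The choice of MD5, the highest-digest-wins rule, and the function names are assumptions for illustration, not details taken from the HPIM proposal.

    import hashlib

    def select_rp(group, candidate_rps, salt=b""):
        # Hash each (group, candidate RP) pair and pick the candidate
        # with the highest digest. Because the computation is
        # deterministic, every router maps a group to the same RP
        # without storing any explicit RP/group mappings.
        def score(rp):
            return hashlib.md5(salt + group.encode() + rp.encode()).digest()
        return max(candidate_rps, key=score)

    def fallback_rp(group, candidate_rps, failed_rp):
        # On RP failure (detected by timeout), apply an alternate hash
        # function -- modelled here as a different salt -- over the
        # surviving candidates so all routers agree on a fallback RP.
        live = [rp for rp in candidate_rps if rp != failed_rp]
        return select_rp(group, live, salt=b"fallback")

For example, select_rp("224.2.0.1", ["rp-a", "rp-b", "rp-c"]) yields the same answer on every router holding the same candidate-RP list, which is what eliminates the stored mappings.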
Matt explained that HEP generally has large collaborations, and these collaborations meet several times per week using multicast conferencing tools. In addition, there are accelerator controllers which can multicast their readings, allowing multiple users to see results at once. Matt said that he liked PIM because all he had to do was turn it on in his routers and it "just worked," and because he owns many routers but few workstations. Steve Deering stood up and commented that Matt would probably have the same experience with any multicast routing protocol implemented in a router, and that the choice of PIM was simply determined by the brand of router. Matt more or less agreed.

Petri Helenius from Santa Monica Software spoke about MBONE deployment in Finland. PIM Dense Mode is deployed throughout Finland, and DVMRP pruning is implemented at DVMRP interoperability points, meaning that multicast should prune all the way back to Sweden. The emphasis was again that they had country-wide multicast reachability without requiring "kernel hackers" at each site. The issues Petri brought up included that PIM-DM is wasteful of state for sparse-membership groups, that NBMA issues have not yet been resolved, and that configuration guidelines are needed for DVMRP interoperation and for suggested topologies.

David Meyer from the University of Oregon then spoke about having a multi-homed campus. He wanted ubiquitous multicast integrated with the existing infrastructure, as well as redundant external connectivity and integration with Internet service providers. His network uses sparse mode PIM, but has the problem that his unicast topology is too rich: PIM cannot (yet?) perform the RPF check over a multipath link, so even with multiple T1s between two sites he can use only one. He glues sparse mode domains together with dense mode domains to work around policy and RP placement issues. He is providing multicast to 200 subnets, and it is nearly ubiquitous in his network.

Second Session

Day two began with a presentation on multicast traceroute ('mtrace'), recently released as an IDMR Internet-Draft. All MBONE users were strongly encouraged to implement mtrace, since it makes debugging problems so much easier. Currently, an estimated 58% of the MBONE implements 'mtrace'. 'mtrace' has been designed under the assumption that the underlying multicast routing protocol is based on RPF (a sketch of the RPF check appears below). Hence, a call was made to the designers of shared-tree protocols (CBT, PIM-SM) for feedback on whether and how 'mtrace' can work with shared trees.

A call was made requesting that IGMPv2 be moved forward to Proposed Standard. There were no objections. However, it is currently unclear whether it will actually move forward to Proposed Standard or be published as an Informational RFC. The Routing Area Director will decide on the best course of action, given that the group voiced no objections regarding its standardization.

A CBT protocol update was given. The protocol has been considerably streamlined since the previous draft release (June 1995). A new draft reflecting the very latest proposed changes should be submitted within the next week. At this stage, the CBT designers consider the protocol to be stable (as far as the functional specification is concerned). However, it remains to be fully specified how CBT interoperates with DVMRP (and, for that matter, other protocols). The CBT designers are working on this and expect to announce an interoperability document shortly.
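Since both David's multipath problem and mtrace's design assumption come down to the reverse-path forwarding (RPF) check, a minimal sketch may be useful. The following Python fragment is illustrative only; unicast_route() is an assumed helper standing in for the router's unicast routing table lookup.

    def rpf_check(source, arrival_iface, unicast_route):
        # Accept a multicast packet only if it arrived on the one
        # interface this router would itself use to reach the source.
        # Because unicast_route() is assumed to return a single
        # interface, a multipath topology (e.g., parallel T1s) can
        # never pass the check on more than one of the equal links.
        return arrival_iface == unicast_route(source)

The same single-interface property is what mtrace relies on: at each hop there is exactly one RPF interface towards the source, so a trace can walk the distribution tree backwards deterministically.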
Returning to the CBT update, the following is a summary of the protocol changes:

o Multi-access LAN designated router (DR) election has been re-invented. The CBT default DR is the same router as the IGMP querier; hence, no protocol is required for CBT DR election. This assumes that any CBT-capable subnetwork has only the CBT multicast protocol running over it. If this assumption is not made, then the IGMP querier could be a multicast router of another scheme, and interoperability between CBT and all other protocols would need to be defined for that scenario. The working group is currently trying to establish a protocol-independent interoperability mechanism, to avoid each protocol having to define interoperability mechanisms with every other protocol. For the moment, then, it is safe to assume that any one subnetwork is running only a single multicast routing protocol.

o The core tree (the part of the CBT tree linking all cores together) is now built "on demand." This requires all group members and potential new members to agree on the identity of only the primary core router. The primary core, together with a list of alternate (secondary) cores, is distributed throughout the network by some mechanism yet to be determined (e.g., HPIM, advertisements, etc.). The on-demand core tree building works as follows: any secondary core that receives a join first acks the join, then sends a join-request with code "rejoin" to the primary core so the tree becomes fully connected. The primary core only ever listens for incoming joins; it never needs to join any entity itself.

o Native mode. This assumes CBT routers operate in a CBT-only "cloud," i.e., multicast routers of other schemes are not active within the same cloud. This allows for much faster packet switching times, helped also by the fact that no RPF check is necessary.

o Maintenance message "aggregates." Rather than have a CBT-ECHO-REQUEST/REPLY sent for each child/parent on a per-group basis, the protocol now aggregates these messages so that only a single request/response pair is sent for any child-parent pair (a sketch follows this summary). This is especially attractive in those parts of the network where links are likely to be shared across groups. Considerable bandwidth savings are possible with this mechanism.

o Rejoins and loop detection. The dual mechanisms of rejoining and loop detection have been made simpler and more straightforward. Under the new scheme, a rejoining node first receives an ack before it (rather than the node sending the ack) generates a rejoin-nactive (loop detection packet). This avoids having the router that sends the ack perform any packet 'translation'.

o Proxy-ack. Proxy-acks appeared in the draft (spec) released immediately prior to IDMR Dallas, but in a meeting of the CBT designers immediately before the IDMR sessions, a case was identified in which proxy-acks should not be used. They are no longer present in the protocol.

Overall, the CBT protocol has been streamlined considerably. There are now far fewer control messages, resulting in a further simplified protocol engine. An aggregated join mechanism is currently being worked out such that, subsequent to a router/link failure, groups with overlapping core(s) can send a single "aggregated" join to re-establish connectivity, rather than have each group generate a join individually.

One approach to achieving multi-protocol interoperability at the L1/L2 hierarchical boundary was presented.
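As an illustration of the maintenance-message aggregation described in the summary above, the following is a minimal Python sketch. Only the one-keepalive-per-link behaviour is taken from the minutes; the class shape and names are invented for illustration.

    class CbtLink:
        """One child-parent adjacency, shared by many groups' tree branches."""

        def __init__(self, parent):
            self.parent = parent
            self.groups = set()  # groups whose tree branch uses this link

        def add_group(self, group):
            self.groups.add(group)

        def send_keepalive(self, send):
            # Previously: one CBT-ECHO-REQUEST per group on this link.
            # Now: a single request/response pair keeps every group's
            # branch alive, so maintenance bandwidth no longer grows
            # with the number of groups sharing the link.
            if self.groups:
                send(self.parent, "CBT-ECHO-REQUEST")

The saving is largest exactly where the minutes note it: on links shared by many groups, keepalive cost drops from per-group to per-link.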
This L1/L2 technique involves an L2 encapsulation, but does not require any exchange of routing information between L1 and L2, which contributes significantly to the simplicity of the approach. Ideally, however, an L1/L2 interface requiring no encapsulation is considered very desirable, and the group is therefore rethinking its (previously announced) approach to hierarchical multicast. Nevertheless, the encapsulation approach will be released as an IDMR Internet-Draft.

Finally, a pragmatic approach to bi-level multicast in the DIS environment was presented. This work has been conducted at BBN under ARPA's "Real-time Information Transfer and Networking" program in support of distributed simulation. DIS applications are not like teleconferencing applications: DIS assumes very large numbers of groups (10^4 or more) and requires very low join latency (under 0.5 seconds).

The bi-level environment consists of a "constructed multicast service" (CMS) built on top of an "underlying multicast service" (UMS). The CMS provides control traffic, which is data to the UMS. Bi-level routers (BLRs) peer directly with each other and use IGMP to determine CMS groups. This state is sent to all other BLRs, so each BLR knows where group members are located. UMS groups are used to distribute data for CMS groups; BLRs join both UMS groups and CMS groups.

The communication between BLRs about group memberships needs a high degree of reliability. Joins and leaves use sequence numbering and timestamps, and each BLR sends MD5 hash messages to its upstream neighbour summarizing its current state. Rather than "hard state" or "soft state," this state is said to be "firm": like soft state, it is sent periodically; like hard state, deltas are sent. Unlike soft state, however, the information sent is a cryptographically strong checksum of the desired state rather than the entire state (a sketch appears at the end of these minutes).

A bi-level multicast prototype has been implemented and tested. The UMS was provided by Bay Networks routers running DVMRP, connected by ATM and linking six sites in the DC area. A simulation exercise was carried out using ca. 700 multicast groups. Some sites saw 2000 join events over ca. 30 minutes, which averages out at about one join per second. Official documentation from BBN (gdt@bbn.com), describing the concepts and protocol of bi-level multicasting in detail, should be forthcoming shortly.

Once again, DIS demonstrated the need for a next-generation multicast protocol that can better support its requirements.
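As an illustration of the "firm state" exchange described above, the following is a minimal Python sketch. The canonical serialization, function names, and resynchronization rule are assumptions for illustration; BBN's forthcoming documentation is the authoritative description.

    import hashlib

    def state_digest(memberships):
        # memberships: dict mapping group -> set of member identifiers.
        # Sorting both levels makes the serialization canonical, so two
        # BLRs holding identical state always compute identical digests.
        canonical = ";".join(
            "%s:%s" % (group, ",".join(sorted(members)))
            for group, members in sorted(memberships.items())
        )
        return hashlib.md5(canonical.encode()).hexdigest()

    def in_sync(local_state, digest_from_neighbour):
        # Sent periodically like soft state, but carrying only the MD5
        # checksum rather than the full state; a mismatch after the
        # deltas have been applied signals that a full retransfer of
        # state is needed.
        return state_digest(local_state) == digest_from_neighbour

For example, a BLR whose neighbour reports a digest different from its own locally computed one knows that a delta was lost or misapplied, and can request the complete membership state rather than exchanging it on every period.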