CURRENT_MEETING_REPORT_

Reported by Steve Casner/Precept Software

Minutes of the Audio/Video Transport Working Group (AVT)

Thanks to Joerg Ott and Carsten Bormann for their notes on the discussion, which served as input for these minutes.

Overview

A meeting of the AVT Working Group was a late addition to the schedule at IETF in Stockholm to allow a face-to-face discussion following the recent e-mail exchanges about coordination with ITU-T Study Group 15 on the use of RTP. The second half of the second MMUSIC session was used for this purpose. In addition to this primary topic, a few other questions, listed below, were discussed.

Coordination With ITU

Joerg Ott gave a brief summary of the discussion at the SG-15 meeting held in Stockholm in May to discuss the H.323 and H.22Z recommendations for interworking LAN-based audio/video terminals with H.320 ISDN terminals. The scenario involves a gateway to provide address and protocol translations at several levels, with audio/video data transfer and multiplexing being only one.

At that meeting several viewpoints were expressed with regard to RTP, ranging from defining a new protocol (H.22Z) that was only ``inspired'' by RTP, to using RTP as-is and defining a new setup protocol to go with it. At that time, SG-15 decided not to use RTP because of several problems they perceived. However, at a subsequent meeting organized by Rich Baker at PictureTel and in e-mail discussion, this decision was reversed. It appears that the current position is to pursue use of RTP and the RTP A/V Profile as defined unless it turns out that this scheme will not work.

There remain many questions about how connection setup will be done, but the specific problems regarding the use of RTP seem readily answerable:

   o RTP's presentation timestamp is not sufficient; a transport timestamp should be available for QoS measurement.

     The RTP timestamp was intended to be useful for QoS measurement (via the jitter field in the RTCP reception report).
     We believe it will work; if it does not work for ITU purposes it will not work for ours either. The mechanism needs to be demonstrated in practice during the Proposed Standard stage. Further details on use of the jitter measure with video formats are given in the next section.

   o H.323 needs to work over protocols other than IP (e.g., IPX).

     This is not a problem. RTP has no specific dependencies on IP; it requires only framing and multiplexing of RTP/RTCP from the layers below.

   o Provision of lip-sync if audio and video streams do not originate from the same source.

     RTCP includes timestamps that allow playing in synchrony any sources that can reference a common clock. It is suggested that absolute (wall clock) time be used as that reference when possible, and that the Network Time Protocol may be used to provide synchronization of the system clock to absolute time. If some system has no notion of absolute time, it can use elapsed time instead if all the sources to be synchronized can count the same elapsed time. If no reference clock is available, it seems unlikely that any alternative transport protocol could provide synchronization either.

   o Lack of stability in the RTP, profile and payload data format documents which are only in draft form.

     While there have been a number of changes during the time RTP has been designed, these documents have now reached a stable state. The main RTP specification and RTP profile have been submitted for IESG Last Call already, and the H.261 payload format specification will be submitted for Last Call immediately after IETF. These should be published as Proposed Standard RFCs by September.

   o Distinguishing multiple streams from the same source.

     Each ``RTP session'' is intended to carry only one medium. Multiple media should not be multiplexed in one RTP session based on the payload format code.
     Multiple streams from the same source may be sent in separate RTP sessions (destination transport addresses), in which case the SSRC may or may not be the same for each session (it is not required because the linkage is provided through the RTP CNAME). It is also possible for one host to send multiple SSRCs in one RTP session, for example to transmit video from two different cameras.

   o RTCP is insufficient for H.323 call setup.

     True. The RTP specification says that the use of additional control protocols may be required.

   o Lack of ITU control over payload format codes in the RTP Audio/Video Profile.

     The current plan is to proceed with the RTP profile as specified, which includes the additional code points that were requested for ITU-T standard encodings. There should be no problem adding new ITU-T standard encodings in the future since we will also want to use them. Interoperability will be maximized if this profile is found to be sufficient for H.323 purposes as well, but if not, another profile could be defined to provide a payload format code space dedicated to the ITU.

It seems most important to get the RTP specifications published first to establish them as a stable base. During the Proposed Standard stage of the IETF standardization process, if the current specifications are found to be inadequate either for general use on the Internet as planned or more specifically for the interoperation planned by H.323, then the necessary changes may be introduced before going to the Draft Standard stage. However, it is not expected that any substantial changes will be required.

Jitter Measurements For Video Formats

It is valid to ask about measurements on video formats where the same timestamp is used for all packets in a frame. In some sense, it is the network that imposes the variation in delay implied when transmission of the video packets is spread over the frame interval rather than occurring all at once, so it is reasonable to include it in the jitter calculation.
On the other hand, it is expected that the jitter measure will be primarily used to compare the behavior observed by different receivers. The jitter measure can also be calculated by the sender for the traffic as transmitted and then compared to the jitter reported by a receiver. This allows cancelling out the jitter introduced by using the same timestamp for all packets of a video frame.

If the first packet of a frame were marked in some payload format-independent way, then it would be possible to calculate the jitter using only those packets, which are sent with minimum delay after the frame is sampled. However, since the packets of a frame may represent a burst, later packets in the frame may experience more delay, so measuring only the first might not be accurate.

For the MPEG video format, the transmission order of I, P and B frames is not the same as the presentation order. This introduces significant additional noise into the jitter calculation. It is possible to correct for this by observing the I, P and B bits in the MPEG header at the receiver and adjusting the timestamps accordingly before doing the jitter calculation. Don Hoffman at Sun reports that they have prototyped this scheme. This works fine for receivers, but a profile- and payload format-independent monitor would not have this information.

Other RTP Questions

In addition to the ITU coordination questions, there were a few questions brought up recently on the working group mailing list that were discussed in this meeting.

   o The latest RTP profile draft specifies a 90000 Hz clock for the RTP timestamp in all video payload formats to replace the 65536 Hz clock rate used previously. This matches the choice made by the designers of the MPEG encoding to be a multiple of all of the video frame rates in common use, and is the choice recently made by the authors and implementors of the RTP payload format for H.261 video.
     It is requested that the authors of the other video payload format specifications update those specifications to reflect the new clock rate unless there is some reason that the old clock rate must be used. No objections were voiced and no other comments on the RTP profile were offered.

   o The RTP payload format for MPEG video specifies two formats, one of which encapsulates MPEG Transport Systems format. In that format, the position of video frame boundaries is not known to the process doing the RTP encapsulation. Instead, the RTP marker bit is used to indicate the start of a ``payload unit''. Note that choosing the start rather than end is at odds with the convention for other video formats, but is more convenient. There is no advantage to marking the end as there is with the other video formats. No objections were raised.

   o Vineet Kumar asked how multiple audio streams fed through a mixer could be synchronized with a video stream that was not sent through the mixer. The answer is that the audio streams can all be synchronized and the mixed output emitted with the same timing if the sources all have synchronized clocks. If not, then RTP does not solve this problem.

   o Feedback was invited on points where the RTP specification may not be as clear or explicit as is needed. These should be sent to the authors or to the working group mailing list (rem-conf@es.net).

Future Activities

As mentioned above, the main RTP specification and RTP profile should be published as Proposed Standard RFCs in September. All video payload formats should be posted for Last Call as soon as possible and then published as RFCs as well. This will complete the working group's charter.
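
Illustrative Note on the Jitter Comparison

The discussion under Jitter Measurements For Video Formats can be made concrete with a small sketch. It is not part of the meeting record. It assumes the running estimator given in the RTP specification, J = J + (|D| - J)/16, where D is the change in (arrival time - RTP timestamp) between consecutive packets. The function names, the 30 frames/s rate, the three packets per frame and the constant 500-tick network delay are all hypothetical values chosen only for illustration.

```python
def interarrival_jitter(packets):
    """Running jitter estimate over (rtp_timestamp, arrival_time) pairs.

    Both values are in RTP timestamp units; packets are in arrival order.
    Assumes the estimator J = J + (|D| - J)/16 from the RTP specification.
    """
    jitter = 0.0
    prev_transit = None
    for rtp_timestamp, arrival_time in packets:
        transit = arrival_time - rtp_timestamp
        if prev_transit is not None:
            d = abs(transit - prev_transit)
            jitter += (d - jitter) / 16.0
        prev_transit = transit
    return jitter


def video_stream(frames, packets_per_frame, network_delay=0):
    """Hypothetical video stream on a 90000 Hz clock at 30 frames/s.

    All packets of a frame share one timestamp but are sent spread
    evenly over the frame interval; network_delay is a constant offset.
    """
    ticks_per_frame = 90000 // 30
    spacing = ticks_per_frame // packets_per_frame
    stream = []
    for frame in range(frames):
        timestamp = frame * ticks_per_frame
        for k in range(packets_per_frame):
            stream.append((timestamp, timestamp + k * spacing + network_delay))
    return stream


# Jitter of the traffic as transmitted (computed by the sender).
sent = video_stream(frames=100, packets_per_frame=3)
sender_jitter = interarrival_jitter(sent)

# Jitter seen by a receiver behind a constant-delay network.
received = video_stream(frames=100, packets_per_frame=3, network_delay=500)
receiver_jitter = interarrival_jitter(received)

# Jitter computed from the first packet of each frame only.
first_only_jitter = interarrival_jitter(received[::3])

print(sender_jitter, receiver_jitter, first_only_jitter)
```

In this idealized case the receiver's figure equals the sender's, so subtracting the two shows that the network contributed nothing, and the first-packet-only estimate is zero. With real traffic the later packets of a burst may be delayed more than the first, which is the reason given above for doubting the first-packet-only measure.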