Internet Engineering Task Force              Kretschmer-AT&T/Basso-AT&T
INTERNET DRAFT                           Civanlar-AT&T/Quackenbush-AT&T
File:draft-ietf-avt-rtp-mpeg2aac-01.txt                     Snyder-AT&T
                                                       October 22, 1999
                                                Expires: April 22, 2000


               RTP Payload Format for MPEG-2 AAC Streams


STATUS OF THIS MEMO

   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of RFC2026.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

Abstract

   This document describes a payload format for transporting MPEG-2 AAC
   encoded data using RTP. MPEG-2 AAC is a recent standard from ISO/IEC
   for coding multi-channel audio data. This payload format increases
   the packet loss resilience of AAC coded audio transport above that
   of 'RTP Payload Format for MPEG1/MPEG2 Video (RFC 2250)' [5] by
   incorporating AAC properties into the payload format. Also, the
   MPEG-2 AAC bitstream format is not backwards compatible with other
   MPEG-2 audio formats.

   Several services provided by RTP are beneficial for MPEG-2 AAC
   encoded data transport over the Internet. Additionally, the use of
   RTP allows for the synchronization of MPEG-2 AAC with other
   real-time streams.

Kretschmer/Basso/Civanlar/Quackenbush/Snyder                   [Page 1]

INTERNET-DRAFT  RTP Payload Format for MPEG-2 AAC Streams  October 1999

1. Introduction

   The ISO/IEC MPEG-2 Advanced Audio Coding (AAC) [1] technology
   delivers CD-like or better multichannel audio quality at rates
   around 64 kbps/channel. It has a flexible bitstream syntax that
   supports from 1 to 48 audio channels, up to 16 subwoofer channels
   and up to 16 embedded data channels. AAC supports a wide range of
   sampling frequencies (from 16 kHz to 96 kHz) and an extremely wide
   range of bitrates. AAC can support applications ranging from
   professional or home theater sound systems to Internet music
   broadcast systems.

   The benefits of using RTP for MPEG-2 AAC data stream transport
   include:

   i.   Increased packet loss resilience based on application layer
        framing

   ii.  The ability to synchronize MPEG-2 AAC streams with other RTP
        payloads

   iii. Monitoring of MPEG-2 AAC delivery performance through RTCP

   iv.  Combining MPEG-2 AAC and other real-time data streams received
        from multiple end-systems into a set of consolidated streams
        through RTP mixers

   v.   Converting data types, etc. through the use of RTP translators

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [3].

1.1 Overview of MPEG-2 AAC

   AAC combines the coding efficiencies of a high resolution filter
   bank, a powerful model of audio perception, backward-adaptive
   prediction, joint channel coding, and Huffman coding to achieve
   high-quality signal compression. In 1998 the MPEG Audio subgroup
   tested the family of MPEG audio coders (see
   http://www.tnt.uni-hannover.de/project/mpeg/audio/public/w2006.pdf).
   The test results indicate that for a stereo signal, AAC at
   96 kBit/s has audio quality comparable to MPEG-2 Layer 3 ("mp3") at
   128 kBit/s.

   AAC is a block oriented, variable rate coding algorithm. An AAC
   encoder takes 1024 samples per channel at a time (a 'block') as
   input, and the compressed representation is variable in size.
   Rate control can be used at the encoder to generate a constant-rate
   bitstream. Each block of AAC compressed bits is called a "raw data
   block", and can be decoded "stand-alone", that is, without
   information from prior raw data blocks. This feature is particularly
   useful for the delivery of AAC over lossy packet networks since the
   loss of a packet does not directly affect the decodability of the
   adjacent packets.

1.2 Bitstream Syntax

   The syntax of an AAC bitstream is as follows:

      <AAC_bitstream>  => <raw_data_block><AAC_bitstream>
      <raw_data_block> => [<element>]<END><PAD>

   where <AAC_bitstream> indicates the AAC bitstream, <token> indicates
   intermediate tokens, <TOKEN> indicates terminal tokens and []
   indicates one or more occurrences. <END> is a token that indicates
   the end of a raw_data_block and <PAD> is a variable length token
   that forces the total length of a raw_data_block to be an integral
   number of bytes. In general, intermediate tokens are not an integral
   number of bytes in length.

   The <element> tokens are a string of bits of variable length, and
   they can be any of the following:

      <SCE>   a single audio channel
      <CPE>   a stereo presentation (2 channels)
      <CCE>   a mechanism for multi-channel compression
      <LFE>   a special effects channel
      <DSE>   "user data"
      <PCE>   a mechanism for describing the bitstream content
      <FIL>   a mechanism to use bits (for constant rate channels)

   The same <element> type can occur several times in a single
   raw_data_block. For example, the raw_data_block for a 5.1 surround
   sound signal would be:

      <SCE><CPE><CPE><LFE>...

   corresponding to the center, left and right, left surround and right
   surround, and effects channels. Multiple occurrences of the same
   <element> type are disambiguated by means of a unique 4-bit id
   inside the <element>.

2. Issues covered by this Payload Format

2.1 Repair Information to reconstruct lost AAC Frames

   A smart AAC decoder can mitigate the effects of lost packets using
   techniques such as interpolation in the spectral domain.
   However, if the raw_data_block in a packet is perceptually
   significant and also highly unpredictable (e.g. the onset of a
   cymbal crash) then the sender may choose to add repair information
   associated with that raw_data_block. We will call RepairData the
   variable size array containing such information. The RepairData in
   a given packet is typically associated with a raw_data_block. The
   association between the raw_data_block and the RepairData is
   obtained by means of a specific field called RSEQ. RepairData as
   defined here is a valid AAC raw_data_block.

   As an example, the RepairData can be a highly compressed monophonic
   version of the signal being transmitted. An AAC stereo signal coded
   at a rate of 96 kBit/s corresponds to an average raw_data_block size
   of 279 bytes (assuming 1024-sample blocks at a 44.1 kHz sampling
   rate). A RepairData version of that block, compressed to 16 kBit/s,
   would be 46 bytes in length. Given that perceptually critical blocks
   might occur only once per 100 or more blocks, the average rate
   increase associated with this type of RepairData can be very low.

   Generally, the RepairData for a given AAC frame X SHALL be carried
   by a different RTP packet than the one that carries X. The usage of
   the RepairData information is similar to the one proposed in [4].
   The OPTIONAL RepairData MAY be provided for every frame.

   RepairData can be generated in many ways, including using two
   encoders, decoding followed by coding, or processing the original
   bitstream.

2.2 Fragmentation of AAC Frames

   It is desirable to limit the size of the AAC frame to less than the
   path-MTU. If this is not possible, the frame can be fragmented
   across several RTP packets. Fragmentation MUST occur at <element>
   boundaries. An RTP packet contains either an integer number of
   complete AAC frames or fragments of a single AAC frame only.

   Subsequent packets containing a fragmented AAC frame have a much
   simpler header that is just two bytes long. They can be identified
   by a REPAIRLEN value of 127.
   In this case UBITS indicates the number of unused bits in the first
   byte of the fragment when the fragment is not byte-aligned. The
   total length of the fragment can be determined from the total length
   of the packet excluding the RTP header. The RESERVED field is
   reserved for future use.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |X|REPAIRLEN    |UBITS|RESERVED |AAC FRAGMENT                   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               |
   |                                                               |
   |                                                               |
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

2.3 Predictability of AAC Frames

   AAC frame predictability allows adaptive handling of packet losses
   and/or given bandwidth constraints. Every AAC frame belongs to one
   of the following three predictability classes:

   - 0: not predictable
   - 1: one side predictable (either L-predictable or R-predictable)
   - 2: two side predictable
   - 3: reserved

   An AAC frame that belongs to class 0 cannot be predicted from any
   other AAC frame in the bitstream. An AAC frame that belongs to
   class 1 can be predicted either from the previous (R-predictable) or
   from the following (L-predictable) AAC frame, but not from both. An
   AAC frame that belongs to class 2 can be predicted from the
   preceding or following AAC frame or from both.

   Predictability information is coded for every RTP AAC packet in the
   Predictability Quantifier (PQ), which is 2 bits in length. For a
   given RTP packet such PQs are organized in a predictability vector
   which represents a moving window of PQs, starting with the current
   packet's PQ followed by the preceding packets' PQs.

2.4 Grouping and Interleaving of AAC Frames

   It is often desirable to group an integer number of AAC frames into
   one RTP packet. The predictability of such an RTP packet is the
   predictability of the AAC frame in the RTP packet which is least
   predictable.
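
   The rule above can be sketched in a few lines of Python. This is an
   illustration only, assuming the classes of the grouped frames are
   available as integers 0 to 2. The function name
   packet_predictability is hypothetical and not part of the payload
   format.

```python
def packet_predictability(frame_classes):
    """Return the predictability class of a packet that groups frames.

    A lower class number means less predictable (0 = not predictable),
    so the least predictable frame is the one with the smallest class.
    """
    classes = list(frame_classes)
    if not classes:
        raise ValueError("a packet must contain at least one AAC frame")
    for c in classes:
        if c not in (0, 1, 2):  # class 3 is reserved
            raise ValueError(f"invalid predictability class: {c}")
    return min(classes)


# Hypothetical grouping of a class 1 frame with a class 2 frame,
# used only to show the effect of mixing classes in one packet.
print(packet_predictability([1, 2]))  # prints 1
```

   Mixing classes pulls the packet down to the lowest class present.
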
   AAC frames belonging to the same predictability class MAY be grouped
   into one RTP packet. Note that if frames of different
   predictabilities are grouped, much of the usefulness of the
   predictability information is lost. The sequence numbers SEQ of the
   AAC frames and RSEQ of REPAIRDATA are used to restore the proper
   order on the receiver side.

   Grouping AAC frames into a single RTP packet is OPTIONAL. Grouping
   adds delay, and some applications may want very low delay.

2.5 Example RTP Packet Sequence

   The example below shows a sequence of AAC frames (a...p) and their
   assigned predictability classes.

   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | a | b | c | d | e | f | g | h | i | j | k | l | m | n | o | p |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | 2 | 2 | 2 | 1 | 0 | 1 | 2 | 2 | 2 | 2 | 2 | 1 | 0 | 1 | 2 | 2 |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   The AAC frames MAY be grouped according to their predictability.
   R(x) is the RepairData information sent within the RTP packet:

   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |a g j|b h k|c i o| d f |  e  | l n |  m  |  p  |           |
   +=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
   |     |     |R(e) |     |     |R(m) |     |     |     |     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

3. RTP AAC Payload Format

   The AAC specific RTP payload consists of a 32 or 64 bit header, a
   RepairData array which is variable in size, and a variable number of
   AAC frames. The header contains a vector of Predictability
   Quantifiers (PQ) specifying the packets' predictability classes. The
   X bit specifies whether the header contains 12 or 28 PQs.
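
   The fixed header described so far could be decoded along the
   following lines. This is a sketch under stated assumptions, not a
   normative parser: it assumes network bit order, with X in the most
   significant bit of the first byte, a 7-bit REPAIRLEN after it, and
   the PQs packed most significant bits first so that the current
   packet's PQ comes first. The function name parse_aac_header is
   hypothetical. Packets with REPAIRLEN = 127 use the two-byte fragment
   header of Section 2.2 and are rejected here.

```python
def parse_aac_header(payload: bytes):
    """Split the fixed header into X, REPAIRLEN and the list of PQs.

    Returns (x, repairlen, pqs, header_len), where header_len is the
    number of header bytes consumed (4 or 8).
    """
    if len(payload) < 4:
        raise ValueError("payload shorter than the 32-bit header")
    x = payload[0] >> 7            # bit 0: vector extension flag
    repairlen = payload[0] & 0x7F  # bits 1-7 (assumed 7-bit field)
    if repairlen == 127:
        raise ValueError("fragment continuation packet, no PQ vector")
    header_len = 8 if x else 4
    if len(payload) < header_len:
        raise ValueError("X is set but the second header word is missing")
    # The vector is 24 bits (12 PQs) or 56 bits (28 PQs) wide.
    vector = int.from_bytes(payload[1:header_len], "big")
    n_pq = 28 if x else 12
    # Assumption: the first PQ sits in the two most significant bits.
    pqs = [(vector >> (2 * (n_pq - 1 - i))) & 0x3 for i in range(n_pq)]
    return x, repairlen, pqs, header_len


# Example: X = 0, REPAIRLEN = 0, PQ vector starting with 2, 1, 0.
example = bytes([0x00, 0x90, 0x00, 0x00])
x, repairlen, pqs, header_len = parse_aac_header(example)
print(x, repairlen, pqs[:3], header_len)  # prints 0 0 [2, 1, 0] 4
```

   A receiver that does not use the predictability hints can skip
   header_len bytes and ignore the returned list of PQs.
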
   At the beginning of a session, if fewer packets have been
   transmitted/received than there are PQs in the header, then the
   extra PQs are invalid and MUST be set to 0 (on the sender side) and
   MUST be ignored (on the receiver side).

   REPAIRLEN specifies the length of the RepairData array expressed in
   32bit words. REPAIRLEN MUST be between 0 and 95 32bit words. Values
   greater than 95 are escape sequences and imply that no REPAIRDATA is
   present. REPAIRLEN MUST be set to 0 if the RepairData array is
   empty.

   Every REPAIRDATA array is preceded by a sequence number RSEQ and a
   length specifier RLEN. REPAIRDATA is OPTIONAL and can also be
   ignored. If REPAIRDATA is present then the first byte contains the
   type information RTYPE. It designates the type of the REPAIRDATA
   being used. Currently, only a duplicate AAC frame encoded at a lower
   bitrate is defined. This field allows for the addition of new repair
   types.

   If a sender does not provide packet predictability information it
   MUST set all PQs to 0. A client can ignore the information provided
   by PQs since PQs are not required for decoding the actual AAC frame.
   PQs provide hints that enable an intelligent decoder to improve the
   audio quality when packets are lost.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |X|REPAIRLEN    |PRD VECTOR                                     |   Header
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |PRD VECTOR (continued), if X==1                                |
   +=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
   |RTYPE          |RSEQ           |RLEN           |REPAIRDATA 1   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+               |
   |                               .                               |   Repair
   |                               .                               |   Data
   |                               .                               |
   |               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |               |RSEQ           |RLEN           |REPAIRDATA N   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+               |
   |                                                               |
   |                                                               |
   |                                                               |
   +=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
   |SEQ                    |LEN                    |AAC FRAME 1    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+               |
   |                               .                               |
   |                               .                               |
   |                               .                               |   AAC
   |               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+   Frames
   |               |SEQ                    |LEN                    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |AAC FRAME N                                                    |
   |                                                               |
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   PRD VECTOR: Predictability vector. It contains either 12 or 28
      Predictability Quantifiers (PQ). The size of a PQ element is
      2 bits. The first PQ refers to the predictability class of the
      current packet. The following PQs refer to the most recent
      previous packets. So, the vector looks like this:

         {PQ(t), PQ(t-1), PQ(t-2)...}

      The predictability class of a packet is that of the least
      predictable AAC frame that is contained in the packet.

   X: Vector Extension. If set, the predictability vector uses 56
      instead of 24 bits. Hence, another 32bit word is required.

   RTYPE: The type of REPAIRDATA. Currently, only a value of 0 is
      defined, which refers to a more highly compressed AAC frame, for
      example one encoded in mono at 16 kBit/s. Any such frame MUST be
      encoded at the same sample rate. Future implementations should be
      assigned the values 1...127, while values between 128 and 191 are
      reserved. Values between 192 and 255 are designated for
      experimental purposes.

   REPAIRLEN: The total number of 32bit words containing Repair Data
      for previous frames. If REPAIRLEN=0 or REPAIRLEN>95 then there is
      no repair information. REPAIRLEN MUST be between 0 and 95 32bit
      words. If REPAIRLEN equals 127 the packet contains a middle or
      last fragment of a fragmented AAC frame.

   RSEQ: The SEQ number of the AAC frame REPAIRDATA belongs to.

   RLEN: The length in bytes of REPAIRDATA.

   REPAIRDATA: An 8bit aligned data array containing RepairData. This
      information can be ignored and is not mandatory. It SHOULD be
      provided to support the reconstruction of lost AAC frames using
      fewer bits than the original AAC frame. For an RTYPE of 0 the
      REPAIRDATA will be a valid AAC frame.

   SEQ: 12 bit. The sequence number of the AAC frame. The application
      has to make sure that the sequence numbers of interleaved frames
      do not overlap.

   LEN: 12 bit. The length of the actual AAC frame.

3.1 RTP Header Fields Usage

   The RTP header fields are used as follows:

   Payload Type (PT): The assignment of an RTP payload type for this
      new packet format is outside the scope of this document, and will
      not be specified here. It is expected that the RTP profile for a
      particular class of applications will assign a payload type for
      this encoding, or if that is not done then a payload type in the
      dynamic range shall be chosen.

   Marker (M) bit: Set to one to mark the last fragment (or only
      fragment) of an AAC frame.

   Extension (X) bit: Defined by the RTP profile used.

   Timestamp (TS): 32-bit 90 kHz timestamp representing the sampling
      time of the first sample of the first AAC frame in the packet. It
      is recommended that all packets that make up a fragmented AAC
      frame carry the same timestamp. Timestamps start at a random
      value to improve security.

   SSRC: Set as described in RFC 1889 [2].

   CC and CSRC fields are used as described in RFC 1889 [2].

   RTCP SHOULD be used as defined in RFC 1889 [2].

4. Security Considerations

   RTP packets using the payload format defined in this specification
   are subject to the security considerations discussed in the RTP
   specification [2]. This implies that confidentiality of the media
   streams is achieved by encryption.
   Because the data compression used with this payload format is
   applied end-to-end, encryption may be performed on the compressed
   data so there is no conflict between the two operations.

   This payload type does not exhibit any significant non-uniformity in
   the receiver side computational complexity for packet processing
   that could cause a potential denial-of-service threat.

5. References

   [1] ISO/IEC 13818-7, Advanced Audio Coding (AAC).

   [2] Schulzrinne, Casner, Frederick, Jacobson, RTP: A Transport
       Protocol for Real Time Applications, RFC 1889, Internet
       Engineering Task Force, January 1996.

   [3] S. Bradner, Key words for use in RFCs to Indicate Requirement
       Levels, RFC 2119, March 1997.

   [4] Perkins, Kouvelas, Hodson, Hardman, Handley, Bolot,
       Vega-Garcia, Fosse-Parisis, RTP Payload for Redundant Audio
       Data, draft-ietf-avt-redundancy-revised-00.txt.

   [5] D. Hoffman, G. Fernando, V. Goyal, M. Civanlar, RTP Payload
       Format for MPEG1/MPEG2 Video, RFC 2250, Internet Engineering
       Task Force, January 1998.

6. Authors' Addresses

   Mathias Kretschmer
   AT&T Labs - Research
   180 Park Ave.
   Florham Park, NJ 07932
   USA
   e-mail: mathias@research.att.com

   Andrea Basso
   AT&T Labs - Research
   100 Schultz Drive
   Red Bank, NJ 07701
   USA
   e-mail: basso@research.att.com

   M. Reha Civanlar
   AT&T Labs - Research
   100 Schultz Drive
   Red Bank, NJ 07701
   USA
   e-mail: civanlar@research.att.com

   Schuyler R. Quackenbush
   AT&T Labs - Research
   180 Park Ave.
   Florham Park, NJ 07932
   USA
   e-mail: srq@research.att.com

   James H. Snyder
   AT&T Labs - Research
   180 Park Ave.
   Florham Park, NJ 07932
   USA
   e-mail: jhs@research.att.com