Internet Engineering Task Force           Kretschmer-AT&T/Basso-AT&T
INTERNET DRAFT                            Civanlar-AT&T/Quackenbush-AT&T
File:draft-ietf-avt-rtp-mpeg2aac-01.txt   Snyder-AT&T
                                          October 22, 1999
                                          Expires: April 22, 2000
                                                                        

                RTP Payload Format for MPEG-2 AAC Streams


                         STATUS OF THIS MEMO

This document is an Internet-Draft and is in full conformance with all
provisions of Section 10 of RFC2026.

Internet-Drafts are working documents of the Internet Engineering Task
Force (IETF), its areas, and its working groups.  Note that other
groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time.  It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.


                                 Abstract

This document describes a payload format for transporting MPEG-2 AAC
encoded data using RTP. MPEG-2 AAC is a recent standard from ISO/IEC
for coding multi-channel audio data. This payload format increases the
packet loss resilience of AAC coded audio transport above that of 'RTP
Payload Format for MPEG1/MPEG2 Video (RFC 2250)' [5] by incorporating
AAC properties into the payload format. Also, the MPEG-2 AAC
bitstream format is not backwards compatible with other MPEG-2 audio
formats. Several services provided by RTP are beneficial for MPEG-2
AAC encoded data transport over the Internet. Additionally, the use of
RTP allows for the synchronization of MPEG-2 AAC with other real-time
streams.


Kretschmer/Basso/Civanlar/Quackenbush/Snyder                    [Page 1]
INTERNET-DRAFT  RTP Payload Format for MPEG-2 AAC Streams  October 1999

1. Introduction

The ISO/IEC MPEG-2 Advanced Audio Coding (AAC) [1] technology delivers
CD-like or better multichannel audio quality at rates around 64
kbps/channel. It has a flexible bitstream syntax that supports from 1
to 48 audio channels, up to 16 subwoofer channels and up to 16
embedded data channels.  AAC supports a wide range of sampling
frequencies (from 16 kHz to 96 kHz) and an extremely wide range of
bitrates. AAC can support applications ranging from professional or
home theater sound systems to Internet music broadcast systems.

The benefits of using RTP for MPEG-2 AAC data stream transport include:

    i. Increased packet loss resilience based on application layer
    framing.

    ii. Synchronization of MPEG-2 AAC streams with other RTP payloads.

    iii. Monitoring of MPEG-2 AAC delivery performance through RTCP.

    iv. Combination of MPEG-2 AAC and other real-time data streams
    received from multiple end-systems into a set of consolidated
    streams through RTP mixers.

    v. Conversion of data types, etc. through the use of RTP translators.

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119 [3].


1.1 Overview of MPEG-2 AAC

AAC combines the coding efficiencies of a high resolution filter bank,
a powerful model of audio perception, backward-adaptive prediction,
joint channel coding, and Huffman coding to achieve high-quality
signal compression.  In 1998 the MPEG Audio subgroup tested the family
of MPEG audio coders (see
http://www.tnt.uni-hannover.de/project/mpeg/audio/public/w2006.pdf).
The test results indicate that for a stereo signal, AAC at 96 kBit/s
has audio quality comparable to MPEG-2 Layer 3 ("mp3") at 128 kBit/s.

AAC is a block oriented, variable rate coding algorithm.  An AAC
encoder takes 1024 samples per channel at a time (a 'block') as input
and the compressed representation is variable in size.

Rate control can be used at the encoder to generate a constant-rate
bitstream. Each block of AAC compressed bits is called a "raw data
block", and can be decoded "stand-alone", that is, without information
from prior raw data blocks. This feature is particularly useful for
the delivery of AAC over lossy packet networks since the loss of a
packet does not directly affect the decodability of the adjacent
packets.



1.2 Bitstream Syntax

The syntax of an AAC bitstream is as follows:

<bitstream>        => <raw_data_block><bitstream> 
<raw_data_block>   => [<element>]<END><PAD>

where <bitstream> indicates the AAC bitstream, <lowercase> indicates
intermediate tokens, <UPPERCASE> indicates terminal tokens and []
indicates one or more occurrences. <END> is a token that indicates the
end of a raw_data_block and <PAD> is a variable length token that
forces the total length of a raw_data_block to be an integral number
of bytes. In general, intermediate tokens are not an integral number of
bytes in length.

The <element> tokens are a string of bits of variable length, and they
can be any of the following:

<single_channel_element>     a single audio channel
<channel_pair_element>       a stereo presentation (2 channels)
<coupling_channel_element>   a mechanism for multi-channel compression
<lfe_channel_element>        a special effects channel
<data_stream_element>        "user data"
<program_config_element>     a mechanism for describing the bitstream 
                             content
<fill_element>               a mechanism to use bits (for constant rate
                             channels)

Each <element> type can occur several times in a single
raw_data_block. For example, the raw_data_block for a 5.1 surround
sound signal would be:
  
<single_channel_element><channel_pair_element>...
<channel_pair_element><lfe_channel_element><END>

corresponding to the center, left and right, left surround and right
surround, and effects channels. Multiple occurrences of the
<channel_pair_element> are disambiguated by means of a unique 4-bit
id inside the <channel_pair_element>.



2. Issues covered by this Payload Format

2.1 Repair Information to reconstruct lost AAC Frames

A smart AAC decoder can mitigate the effects of lost packets using
techniques such as interpolation in the spectral domain. However, if
the raw_data_block in a packet is perceptually significant and also
highly unpredictable (e.g. the onset of a cymbal crash) then the
sender may choose to add repair information associated with that
raw_data_block. We will call RepairData the variable size array
containing such information.  The RepairData in a given packet is
typically associated with a raw_data_block. The association between
the raw_data_block and the RepairData is obtained by means of a
specific field called RSEQ.

RepairData as defined here is a valid AAC raw_data_block.  As an
example, the RepairData can be a highly compressed monophonic version
of the signal being transmitted. An AAC stereo signal coded at a rate
of 96 kBit/s corresponds to an average raw_data_block size of 279 bytes.
A RepairData version of that block, compressed to 16 kBit/s would be 46
bytes in length.  Given that perceptually critical blocks might occur
only once per 100 or more blocks, the average rate increase associated
with this type of RepairData can be very low. Generally, the
RepairData for a given AAC frame X SHALL be carried by a different RTP
packet than the one that carries X.
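
The block sizes quoted above can be reproduced with a short
calculation. The sketch below is illustrative only. It assumes a
sampling rate of 44.1 kHz, which this document does not state but
which is consistent with the 279 and 46 byte figures, and it ignores
the RTYPE, RSEQ and RLEN bytes that accompany the RepairData. The
function name avg_block_bytes is hypothetical.

```python
SAMPLES_PER_BLOCK = 1024  # one AAC block per channel (Section 1.1)


def avg_block_bytes(bitrate_bps, sample_rate_hz=44_100):
    """Average raw_data_block size in bytes at a given bitrate.

    The 44.1 kHz default is an assumption, not a value from the draft.
    """
    block_duration_s = SAMPLES_PER_BLOCK / sample_rate_hz
    return bitrate_bps * block_duration_s / 8


primary = avg_block_bytes(96_000)   # stereo stream at 96 kBit/s
repair = avg_block_bytes(16_000)    # mono RepairData at 16 kBit/s

# One repair block for every 100 primary blocks.
overhead = repair / (100 * primary)

print(round(primary), round(repair))
print(f"average rate increase: {overhead:.2%}")
```

Under these assumptions the average rate increase stays well below
one percent.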

The usage of the RepairData information is similar to the one proposed
in [4].  The OPTIONAL RepairData MAY be provided for every frame.
RepairData can be generated in many ways including using two encoders,
decoding followed by coding or processing the original bitstream.


2.2 Fragmentation of AAC Frames

It is desirable to limit the size of the AAC frame to less than the
path-MTU. If this is not possible, the frame can be fragmented across
several RTP packets. Fragmentation MUST occur at <element> boundaries.
An RTP packet contains either an integer number of complete AAC frames
or fragments of a single AAC frame only. The second and later packets
of a fragmented AAC frame carry a much simpler header that is just two
bytes long. They can be identified by a REPAIRLEN value of 127. In
this case, UBITS gives the number of unused bits in the first byte of
the fragment when the fragment is not byte-aligned. The total length
of the fragment can be determined from the total length of the packet
excluding the RTP header. The RESERVED field is reserved for future
use.



0                   1                   2                   3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|X|REPAIRLEN    |UBITS|RESERVED |AAC FRAGMENT                   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               |
|                                                               |
|                                                               |
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
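
A receiver could decode the two-byte fragment header shown above
roughly as follows. This is a minimal sketch rather than a normative
parser. The field widths (X: 1 bit, REPAIRLEN: 7 bits, UBITS: 3 bits,
RESERVED: 5 bits) are read off the diagram, bit 0 is assumed to be the
most significant bit of the first byte, and the names FragmentHeader
and parse_fragment_header are hypothetical.

```python
from typing import NamedTuple

FRAGMENT_REPAIRLEN = 127  # marks a continuation fragment (Section 2.2)


class FragmentHeader(NamedTuple):
    x: int          # 1 bit
    repairlen: int  # 7 bits, 127 for a continuation fragment
    ubits: int      # 3 bits, unused bits in the first fragment byte
    reserved: int   # 5 bits, reserved for future use


def parse_fragment_header(payload: bytes) -> FragmentHeader:
    """Decode the 2-byte header of a continuation fragment."""
    if len(payload) < 2:
        raise ValueError("payload shorter than the 2-byte header")
    b0, b1 = payload[0], payload[1]
    hdr = FragmentHeader(
        x=b0 >> 7,
        repairlen=b0 & 0x7F,
        ubits=b1 >> 5,
        reserved=b1 & 0x1F,
    )
    if hdr.repairlen != FRAGMENT_REPAIRLEN:
        raise ValueError("not a continuation fragment")
    return hdr


# Example: X=0, REPAIRLEN=127, UBITS=3, followed by two fragment bytes.
packet_payload = bytes([0x7F, 0b011_00000]) + b"\xAA\xBB"
header = parse_fragment_header(packet_payload)
fragment = packet_payload[2:]  # everything after the header
```

The fragment length is not carried in the header; as stated above, it
follows from the packet length minus the RTP header.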


2.3 Predictability of AAC Frames

AAC frame predictability allows adaptive handling of packet losses
and/or given bandwidth constraints. Every AAC frame belongs to one of
the following three predictability classes (the value 3 is reserved):

 - 0: not predictable
 - 1: one side predictable (either L-predictable or R-predictable)
 - 2: two side predictable
 - 3: reserved

An AAC frame that belongs to class 0 cannot be predicted from any other 
AAC frame in the bitstream.

An AAC frame that belongs to class 1 can be predicted either from the
previous (R-predictable) or from the following (L-predictable) AAC
frame, but not from both.

An AAC frame that belongs to class 2 can be predicted from the
preceding or following AAC frame or from both.

Predictability information is coded for every RTP AAC packet in the
Predictability Quantifier (PQ) which is 2 bits in length. For a given
RTP packet such PQs are organized in a predictability vector which
represents a moving window of PQs, starting with the current packet's
PQ followed by preceding packets' PQs.
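
One way a sender might maintain this moving window is sketched below.
The class name PredictabilityWindow is hypothetical. The window sizes
of 12 and 28 entries and the rule that unused entries are set to 0 are
taken from Section 3.

```python
from collections import deque


class PredictabilityWindow:
    """Moving window of per-packet PQ values, newest first."""

    def __init__(self, extended=False):
        # 12 PQs in the short header, 28 in the extended one.
        self.size = 28 if extended else 12
        self.history = deque(maxlen=self.size)

    def push(self, pq):
        """Record the PQ of the packet about to be sent."""
        if pq not in (0, 1, 2):
            raise ValueError("PQ must be 0, 1 or 2 (3 is reserved)")
        # appendleft keeps the newest PQ first; once the deque is
        # full, the oldest entry falls off the right-hand end.
        self.history.appendleft(pq)

    def vector(self):
        """Return {PQ(t), PQ(t-1), ...}, zero-padded to full size."""
        padding = [0] * (self.size - len(self.history))
        return list(self.history) + padding


window = PredictabilityWindow()
window.push(2)   # first packet of the session
window.push(1)   # second packet
current_vector = window.vector()
```

After the two calls above, the vector starts with the current
packet's PQ (1), followed by the previous packet's PQ (2), and the
remaining ten entries are 0.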


2.4 Grouping and Interleaving of AAC Frames

It is often desirable to group an integer number of AAC frames. The
predictability of such an RTP packet is the predictability of the AAC
frame in the RTP packet which is least predictable. AAC frames
belonging to the same predictability class MAY be grouped into one RTP
packet. Note that if frames of different predictabilities are grouped,
much of the usefulness of the predictability information is lost. The
sequence numbers SEQ of the AAC frames and RSEQ of REPAIRDATA are used
to restore the proper order on the receiver side.

Grouping AAC frames into a single RTP packet is OPTIONAL. Grouping
adds delay, and some applications may require very low delay.
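
A receiver could restore the frame order from the SEQ values along
the following lines. This is only a sketch: it relies on the 12-bit
width of SEQ given in Section 3, and it assumes that the receiver
knows the sequence number of the oldest frame still outstanding
(base_seq) and that all buffered frames lie within one wrap of it.
The function name restore_order is hypothetical.

```python
SEQ_MODULUS = 1 << 12  # SEQ is a 12-bit field


def restore_order(frames, base_seq):
    """Sort (seq, frame) pairs into playout order.

    Distances are measured modulo 4096 from base_seq so that a wrap
    from 4095 to 0 does not disturb the order.
    """
    return sorted(
        frames,
        key=lambda item: (item[0] - base_seq) % SEQ_MODULUS,
    )


# Frames as they might arrive after interleaving and a SEQ wrap.
received = [(0, b"c"), (4094, b"a"), (1, b"d"), (4095, b"b")]
ordered = restore_order(received, base_seq=4094)
```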



2.5 Example RTP Packet Sequence

The example below shows a sequence of AAC frames (a...p) and their
assigned predictability classes.
  
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| a | b | c | d | e | f | g | h | i | j | k | l | m | n | o | p |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| 2 | 2 | 2 | 1 | 0 | 1 | 2 | 2 | 2 | 2 | 2 | 1 | 0 | 1 | 2 | 2 |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

The AAC frames MAY be grouped according to their predictability.
R(x) is the RepairData information for frame x sent within the RTP
packet:

+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|a g j|b h k|c i o| d f |  e  | l n |  m  |  p  |           |
+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
|     |     |R(e) |     |     |R(m) |     |     |     |     |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
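
The predictability class of each packet in this example follows from
the rule in Section 2.4 that a packet takes the class of its least
predictable frame. The sketch below recomputes those classes from the
two tables above. The names FRAME_CLASS, PACKETS and packet_class are
illustrative only.

```python
# Frame classes from the first table (frames a...p).
FRAME_CLASS = dict(zip(
    "abcdefghijklmnop",
    [2, 2, 2, 1, 0, 1, 2, 2, 2, 2, 2, 1, 0, 1, 2, 2],
))

# Frame grouping from the second table, in transmission order.
PACKETS = ["agj", "bhk", "cio", "df", "e", "ln", "m", "p"]


def packet_class(frames):
    """Class of a packet: that of its least predictable frame."""
    return min(FRAME_CLASS[frame] for frame in frames)


classes = [packet_class(frames) for frames in PACKETS]

# PQ history as the last packet would carry it, newest first.
newest_first = list(reversed(classes))

print(classes)
```

Because every packet here groups frames of a single class, no
predictability information is lost by grouping.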


3. RTP AAC Payload Format

The AAC specific RTP payload consists of a 32- or 64-bit header, a
RepairData array which is variable in size, and a variable number of
AAC frames.

The header contains a vector of Predictability Quantifiers (PQ)
specifying the packets' predictability classes.

The X bit specifies if the header contains 12 or 28 PQs.  At the
beginning of a session, if fewer packets have been transmitted/
received than there are PQs in the header then the extra PQs are
invalid and MUST be set to 0 (on the sender side) and MUST be ignored
(on the receiver side).

REPAIRLEN specifies the length of the RepairData array expressed in
32-bit words. REPAIRLEN MUST be between 0 and 95 32-bit words. Values
greater than 95 are escape sequences and imply that no REPAIRDATA is
present. REPAIRLEN MUST be set to 0 if the RepairData array is
empty. Every REPAIRDATA array is preceded by a sequence number RSEQ
and a length specifier RLEN. REPAIRDATA is OPTIONAL and can be
ignored by the receiver.
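
The possible REPAIRLEN values can be summarized in a small helper.
This is a sketch, not part of the format: the function name
classify_repairlen and the string labels are hypothetical. The value
127 is treated as the fragment marker described in Section 2.2, and
the remaining values above 95 are reported as escape values without
further interpretation, since this document does not define them.

```python
def classify_repairlen(repairlen):
    """Return (kind, repair_bytes) for a 7-bit REPAIRLEN value."""
    if not 0 <= repairlen <= 127:
        raise ValueError("REPAIRLEN is a 7-bit field")
    if repairlen == 0:
        return ("no_repair", 0)
    if repairlen <= 95:
        # REPAIRLEN counts 32-bit words.
        return ("repair", repairlen * 4)
    if repairlen == 127:
        return ("fragment", 0)
    # 96...126: escape values, no REPAIRDATA present.
    return ("escape", 0)


largest = classify_repairlen(95)
```

With 95 words of four bytes each, the RepairData array can occupy at
most 380 bytes of a packet.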

If REPAIRDATA is present then the first byte contains the type
information RTYPE. It designates the type of the REPAIRDATA being
used. Currently, only a duplicate AAC frame encoded at a lower bitrate
is defined. This field allows for the addition of new repair types.

If a sender does not provide packet predictability information, it
MUST set all PQs to 0. A client can ignore the information provided by
PQs since PQs are not required for decoding the actual AAC frame. PQs
provide hints that enable an intelligent decoder to improve the audio
quality when packets are lost.


0                   1                   2                   3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|X|REPAIRLEN    |PRD VECTOR                                     | Header
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|PRD VECTOR (continued), if X==1                                |
+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
|RTYPE          |RSEQ           |RLEN           |REPAIRDATA 1   |  
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               |
|                               .                               | Repair
|                               .                               | Data
|                               .                               |
|               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|               |RSEQ           |RLEN           |REPAIRDATA N   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+               |
|                                                               |
|                                                               |
|                                                               |
+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
|SEQ                    |LEN                    |AAC FRAME 1    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+               |
|                               .                               |
|                               .                               |
|                               .                               |  AAC
|               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+  Frames
|               |SEQ                    |LEN                    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
|AAC FRAME N                                                    |
|                                                               |
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
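
The fixed header at the top of the diagram above could be packed and
parsed along the following lines. This is a minimal sketch, not a
reference implementation. It assumes the usual network convention
that bit 0 of the diagram is the most significant bit of the first
byte, and the names PayloadHeader, build_payload_header and
parse_payload_header are hypothetical.

```python
from typing import List, NamedTuple


class PayloadHeader(NamedTuple):
    extended: bool   # X bit: 28 PQs instead of 12
    repairlen: int   # 7-bit REPAIRLEN
    pqs: List[int]   # 2-bit PQs, current packet first
    size: int        # header size in bytes (4 or 8)


def build_payload_header(repairlen: int, pqs: List[int]) -> bytes:
    """Pack X, REPAIRLEN and the PRD vector into 4 or 8 bytes."""
    if len(pqs) not in (12, 28):
        raise ValueError("the PRD vector holds 12 or 28 PQs")
    if not 0 <= repairlen <= 127:
        raise ValueError("REPAIRLEN is a 7-bit field")
    vector = 0
    for pq in pqs:
        vector = (vector << 2) | (pq & 0x3)
    x_bit = 0x80 if len(pqs) == 28 else 0x00
    # 12 PQs fill 3 bytes, 28 PQs fill 7 bytes.
    return bytes([x_bit | repairlen]) + vector.to_bytes(len(pqs) // 4, "big")


def parse_payload_header(payload: bytes) -> PayloadHeader:
    """Decode the 32- or 64-bit header at the start of the payload."""
    if len(payload) < 4:
        raise ValueError("payload shorter than the 32-bit header")
    extended = bool(payload[0] & 0x80)
    size = 8 if extended else 4
    if len(payload) < size:
        raise ValueError("X is set but the second word is missing")
    vector_bits = (size - 1) * 8  # 24 or 56
    vector = int.from_bytes(payload[1:size], "big")
    pqs = [(vector >> shift) & 0x3
           for shift in range(vector_bits - 2, -1, -2)]
    return PayloadHeader(extended, payload[0] & 0x7F, pqs, size)


raw = build_payload_header(0, [2, 1, 0] + [0] * 9)
parsed = parse_payload_header(raw)
```

A receiver that does not use predictability information can skip the
vector and only read X and REPAIRLEN from the first byte.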


PRD VECTOR: Predictability vector. It contains either 12 or 28
            Predictability Quantifiers (PQ). The size of a PQ element
            is 2 bits. The first PQ refers to the predictability class
            of the current packet. The following PQs refer to the most
            recent previous packets. So, the vector looks like this:
            {PQ(t), PQ(t-1), PQ(t-2)...}
            The predictability class of a packet is that of the least
            predictable AAC frame that is contained in the packet.

X:          Vector Extension. If set, the predictability vector uses
            56 instead of 24 bits. Hence, another 32-bit word is
            required.

RTYPE:      The type of REPAIRDATA. Currently, only a value of 0 is
            defined, which refers to a more highly compressed AAC
            frame, for example one encoded in mono at 16 kBit/s. Any
            such frame MUST be encoded at the same sample rate as the
            original frame. Future repair types should be assigned the
            values 1...127, while values between 128 and 191 are
            reserved. Values between 192 and 255 are designated for
            experimental purposes.



REPAIRLEN:  The total number of 32-bit words containing Repair Data for
            previous frames. If REPAIRLEN=0 or REPAIRLEN>95 then there
            is no repair information. REPAIRLEN MUST be between 0 and
            95 32-bit words. If REPAIRLEN equals 127 the packet
            contains a middle or last fragment of a fragmented AAC
            frame.

RSEQ:       The SEQ number of the AAC frame that the REPAIRDATA
            belongs to.

RLEN:       The length in bytes of REPAIRDATA.

REPAIRDATA: An 8-bit aligned data array containing RepairData. This
            information can be ignored and is not mandatory. It SHOULD
            be provided to support the reconstruction of lost AAC
            frames using fewer bits than the original AAC frame. For an
            RTYPE of 0 the REPAIRDATA will be a valid AAC frame.

SEQ:        12 bit. The sequence number of the AAC frame. The
            application has to make sure that the sequence numbers of
            interleaved frames do not overlap.

LEN:        12 bit. The length of the actual AAC frame.
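
The SEQ and LEN fields together form a 3-byte prefix in front of each
AAC frame. The sketch below shows how a list of frames might be packed
and split again. It assumes that LEN counts bytes, which this document
does not state explicitly, and that SEQ occupies the upper 12 bits of
the prefix as drawn in the diagram. The names pack_frame and
split_frames are hypothetical.

```python
def pack_frame(seq: int, frame: bytes) -> bytes:
    """Prefix an AAC frame with its 12-bit SEQ and 12-bit LEN."""
    if not 0 <= seq < 4096:
        raise ValueError("SEQ is a 12-bit field")
    if len(frame) >= 4096:
        raise ValueError("frame too long for a 12-bit LEN")
    # LEN is assumed to be in bytes (not specified by the draft).
    prefix = (seq << 12) | len(frame)
    return prefix.to_bytes(3, "big") + frame


def split_frames(data: bytes):
    """Split the AAC frame area into (seq, frame) pairs."""
    frames = []
    pos = 0
    while pos < len(data):
        if len(data) - pos < 3:
            raise ValueError("truncated SEQ/LEN prefix")
        prefix = int.from_bytes(data[pos:pos + 3], "big")
        seq, length = prefix >> 12, prefix & 0xFFF
        pos += 3
        if pos + length > len(data):
            raise ValueError("frame runs past the end of the packet")
        frames.append((seq, data[pos:pos + length]))
        pos += length
    return frames


area = pack_frame(5, b"abc") + pack_frame(4095, b"\x01\x02")
frames = split_frames(area)
```

The 12-bit LEN limits a single frame to 4095 units, which under the
byte assumption is larger than typical path-MTU values.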


3.1 RTP Header Fields Usage:

The RTP header fields are used as follows:

Payload Type (PT): The assignment of an RTP payload type for this new
packet format is outside the scope of this document, and will not be
specified here. It is expected that the RTP profile for a particular
class of applications will assign a payload type for this encoding, or
if that is not done then a payload type in the dynamic range shall be
chosen.

Marker (M) bit: Set to one to mark the last fragment (or only
fragment) of an AAC frame.

Extension (X) bit: Defined by the RTP profile used.

Timestamp (TS): 32-bit 90 kHz timestamp representing the sampling time
of the first sample of the first AAC frame in the packet.  It is
recommended that the same timestamp be used for all packets that make
up a fragmented AAC frame. Timestamps start at a random value to
improve security.
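
Since each AAC frame covers 1024 samples per channel, the timestamp
advance per frame on the 90 kHz clock depends on the audio sampling
rate and is not always an integer. The following sketch illustrates
one way to compute timestamps without accumulating rounding error.
The function names are hypothetical, and rounding to the nearest tick
is an assumption of this example, not a rule of the payload format.

```python
from fractions import Fraction

RTP_CLOCK_HZ = 90_000
SAMPLES_PER_FRAME = 1024


def ticks_per_frame(sample_rate_hz):
    """Exact duration of one AAC frame in 90 kHz ticks."""
    return Fraction(SAMPLES_PER_FRAME * RTP_CLOCK_HZ, sample_rate_hz)


def frame_timestamp(initial_ts, frame_index, sample_rate_hz):
    """RTP timestamp of the frame_index-th frame of a stream.

    The offset is computed from the frame index, not by repeated
    addition, so rounding errors do not build up. The result wraps
    at 32 bits.
    """
    offset = round(frame_index * ticks_per_frame(sample_rate_hz))
    return (initial_ts + offset) % 2**32


step_48k = ticks_per_frame(48_000)      # a whole number of ticks
step_44k = ticks_per_frame(44_100)      # not a whole number of ticks
wrapped = frame_timestamp(2**32 - 1000, 1, 48_000)
```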

SSRC: Set as described in RFC 1889 [2].

CC and CSRC fields are used as described in RFC 1889 [2].

RTCP SHOULD be used as defined in RFC 1889 [2].



4. Security Considerations

RTP packets using the payload format defined in this specification are
subject to the security considerations discussed in the RTP
specification [2]. This implies that confidentiality of the media
streams is achieved by encryption. Because the data compression used
with this payload format is applied end-to-end, encryption may be
performed on the compressed data so there is no conflict between the
two operations.


This payload type does not exhibit any significant non-uniformity in
the receiver side computational complexity for packet processing that
could cause a potential denial-of-service threat.


5. References

  [1] ISO/IEC 13818-7, Advanced Audio Coding (AAC).

  [2] H. Schulzrinne, S. Casner, R. Frederick, V. Jacobson, RTP: A
  Transport Protocol for Real-Time Applications, RFC 1889,
  Internet Engineering Task Force, January 1996.

  [3] S. Bradner, Key words for use in RFCs to Indicate
  Requirement Levels, RFC 2119, March 1997.

  [4] Perkins, Kouvelas, Hodson, Hardman, Handley, Bolot, Vega-Garcia,
  Fosse-Parisis, RTP Payload for Redundant Audio Data,
  draft-ietf-avt-redundancy-revised-00.txt.

  [5] D. Hoffman, G. Fernando, V. Goyal, M. Civanlar,
  RTP Payload Format for MPEG1/MPEG2 Video, RFC 2250,
  Internet Engineering Task Force, January 1998.


6. Authors' Addresses

Mathias Kretschmer
AT&T Labs - Research
180 Park Ave.
Florham Park, NJ 07932
USA
e-mail: mathias@research.att.com

Andrea Basso
AT&T Labs - Research
100 Schultz Drive
Red Bank, NJ 07701
USA
e-mail: basso@research.att.com



M. Reha Civanlar
AT&T Labs - Research
100 Schultz Drive
Red Bank, NJ 07701
USA
e-mail: civanlar@research.att.com

Schuyler R. Quackenbush
AT&T Labs - Research
180 Park Ave.
Florham Park, NJ 07932
USA
e-mail: srq@research.att.com

James H. Snyder
AT&T Labs - Research
180 Park Ave.
Florham Park, NJ 07932
USA
e-mail: jhs@research.att.com

