AMD Alchemy™
Au1200™ Processor
System Architecture

by: Jim Eno

Advanced Micro Devices
9500 Arboretum Blvd, Suite 400
Austin, TX 78759

© 2005 Advanced Micro Devices, Inc. All rights reserved.

Trademarks
AMD, the AMD Arrow logo, and combinations thereof, and Au1200 are trademarks of Advanced Micro Devices, Inc.
MIPS32 is a trademark of MIPS Technologies, Inc.
Windows is a registered trademark of Microsoft Corporation.
Other product names used in this publication are for identification purposes only and may be trademarks of their respective companies.
Abstract

AMD Alchemy™ has distilled an elegant personal media player (PMP) solution in the form of a system-on-a-chip that delivers D1 (720x480) support for MPEG1, MPEG2, MPEG4, DivX 3/4/5, and H.263 multimedia formats. Designed to replace expensive DSP processors and devices that populate multi-processor designs, the Au1200™ processor consolidates the characteristics of low-power design and high-quality video processing into a single chip that sets full multimedia content free to be truly portable.

The Au1200 processor integrates a unified memory controller for 2.5-V and 1.8-V DDR SDRAM, Internet access peripherals, a Media Acceleration Engine (MAE), an LCD controller, a CCD/CMOS Camera Interface Module, USB 2.0 (high/full/low speed and OTG support), and hardware AES-128 data encryption with a high-performance, low-power MIPS32™-compatible core that lends itself to simplicity of design at multiple performance points.

Like other processors in the AMD Alchemy™ family, the Au1200 processor runs Windows® CE, Linux, and other operating systems on frugal sips of power.
Introduction

While observing and discussing the commercial and technical success of personal media devices, AMD Alchemy™ engineers agreed that the future of PMP (personal media player) devices would be shaped by increasing consumer demands for diverse multimedia content. Specifically, they reasoned that the "next step" for PMPs would revolve around the addition of versatile, high-quality, high-performance video capabilities.

In current products, D1 content must be transcoded to a lower bit-rate and resolution, because PMPs either do not have the processing resources available to reproduce full-size/rate content, or are too power hungry. Transcoding taxes even desktop PCs having powerful CPUs, and often leaves little headroom available for other tasks. A contrasting consideration is that an abundance of MPEG2 content is currently available, yet no battery-powered handheld devices can play such un-transcoded content because it is too processor-intensive.

Design Priorities

The engineering team considered the inherent characteristics of a system-on-a-chip that would specifically support a low-cost, high-performance handheld multimedia player. The design goals that emerged required the team to develop a low-power/high-performance single chip solution that eliminates any need for a PC (or other device) to transcode content. The solution must support native D1 (720x480) video decoding for MPEG2, MPEG4, WMV9, H.263 and DivX that can be scaled up to 1024 x 768. Operating characteristics must meet ~300 mW @ 400MHz while running a media playback application. The solution must provide a simple programming model in a standard development environment that eliminates DSP complexities, and provides ample application headroom that supports a positive user experience. Small physical size must be maintained.
First Steps and the Media Acceleration Engine

The computational demands of decoding high-quality MPEG/WMV video content exceeded the capabilities of the low-power/high-performance processors that were of initial interest. The engineering team instead developed a hardware accelerator component capable of decoding popular block-based video formats, natively.

Eliminating the transcoding step was a blessing that allowed development efforts to address the demands of the video decoding process head-on. Because the popular MPEG/WMV9/H.263 formats each represent well-established standards, they could also serve as real-world benchmarks by which successful rendering of full-size full frame-rate video could be measured.

Mated on a single chip with the MIPS32™ processor core, the Media Acceleration Engine (MAE) is a low-power, low-cost hardware solution that eliminates any need to transcode content. During video decompression, the core processor is only required to pass variable length decoding (VLD) data to the MAE.

Figure 1. Au1200™ Processor Core/MAE Video Decompression

For a more thorough discussion of MAE development and design characteristics refer to the Media Acceleration Engine section of this white paper suite.

The MAE autonomously performs inverse quantization, inverse discrete cosine transform, motion compensation, WMV9 overlap smoothing and deblocking, color space conversion, scaling, and filtering tasks that typically occupy the lion’s share of CPU time. This division of work not only frees the core from the work of fully decoding compressed video, it also bestows extra versatility upon the MAE. Because MAE operation is predicated on the VLD data it receives from the core, the Au1200 processor is well-suited to decoding multiple block-based video formats.
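The first two MAE stages named above can be illustrated in software. The following is a minimal sketch, assuming a single uniform quantizer step and the textbook 8x8 inverse DCT; the function names are hypothetical, and real formats use per-coefficient quantization rules that are omitted here. It is not a description of the MAE hardware.

```python
import math

N = 8  # 8x8 blocks, as used by the block-based MPEG-family formats


def inverse_quantize(levels, step):
    """Simplified uniform inverse quantization: scale every level by one step size."""
    return [[level * step for level in row] for row in levels]


def idct_8x8(coeffs):
    """Direct, unoptimized 2-D inverse DCT of an 8x8 coefficient block."""
    def c(k):
        return 1 / math.sqrt(2) if k == 0 else 1.0

    out = [[0.0] * N for _ in range(N)]
    for x in range(N):
        for y in range(N):
            total = 0.0
            for u in range(N):
                for v in range(N):
                    total += (c(u) * c(v) * coeffs[u][v]
                              * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                              * math.cos((2 * y + 1) * v * math.pi / (2 * N)))
            out[x][y] = total / 4  # 2/N scaling factor for N = 8
    return out


# A block that carries only a DC level decodes to a flat block of samples.
levels = [[0] * N for _ in range(N)]
levels[0][0] = 4
block = idct_8x8(inverse_quantize(levels, step=16))
```

With a DC level of 4 and a step of 16, every output sample is 8. The four nested loops show why this stage is a natural candidate for dedicated hardware rather than core cycles.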
Keeping Bandwidth High and Physical Area Low

The MAE/core processor system-on-a-chip resolved the initial issue of ensuring decoding capability for the popular video formats. However, much work remained to meet the PMP use model’s needs for low power consumption, to maintain necessary bandwidth, and to keep physical area, and therefore developer costs, low.

The engineering team designed a host of solutions that augment Au1200™ core processor functions. Some of these facilitate the video decoding process and augment application headroom. Some support content accessibility. Others ensure compatibility with the widest possible range of peripheral devices.

- A Scalable, Unified Memory Architecture
- A DDR Controller and Video Subsystem Memory Architecture
- Descriptor-based DMA Controller
- A Camera Interface Module
- LCD Controller
- A Static Bus Controller
- Cryptography Engine
- Power-Saving Operating Modes

Unified Memory Architecture

The single-chip solution set the engineering team free to discard the expense and engineering problems that are inherent in the multiple memory controller designs that are typical of CPU/DSP solutions. They also eliminated the SDR SDRAM bottleneck that plagues many (if not most) other PMP development solutions.

Instead, AMD Alchemy™ engineers designed a low-power, scalable, and unified memory architecture based on DDR (1Gbit device density) memory. The Au1200™ processor supports up to 512MB 2.5-V DDR1, 1.8-V DDR2, and Mobile DDR memory at speeds up to 500MHz (DDR2) in 16- or 32-bit implementations. Of significant note is that at the time of this writing DDR is less expensive than SDR SDRAM.

- Choosing 1.8-V DDR2 and Mobile DDR will allow developers to achieve lower power drain characteristics.

- While a 32-bit implementation will serve the PMP use model’s appetite for video and graphics data, developers may opt to employ 16-bit DDR for those lower-end products where memory bandwidth can be traded for cost considerations.

DDR Controller and Video Subsystem Memory Architecture

One challenging aspect of this SOC design was that the MAE’s appetite of 64 bits per clock cycle exceeded the capacity of the system bus (SBus), which saturates at 32 bits per clock cycle, as shown in Figure 2. Other bussing possibilities did not exist for the MAE for these reasons:

- The core runs at 333, 400, or 500 MHz, and the SBus runs nominally at one half speed of the core. The peripheral bus (PBus) runs at one-half speed of the SBus. All are synchronous.
- High-performance master devices and burstable slave devices are located on the SBus.
- The PBus is limited to use with slave-only devices.
- Schedule requirements precluded extending SBus data to 64 bits, and legacy blocks had to remain unchanged.

To balance the needs of the MAE with the capacity of the SBus, AMD Alchemy™ engineers designed a 64-bit side bus (called the RBUS) to serve the video subsystem (MAE & LCD). In addition, they designed a 32-bit DDR controller that exploits the operating characteristics of DDR by combining two 32-bit words per cycle to saturate the 64-bit RBUS, as Figure 2 illustrates.

Figure 2. Leveraging RBUS / DDR Characteristics
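The clock ratios and bus widths described in this section can be turned into a quick peak-rate comparison. This is an illustrative sketch only: the function names are hypothetical, and the assumption that the 64-bit path is clocked at the SBus rate is made for comparison purposes and is not stated in this paper.

```python
def bus_clocks_mhz(core_mhz):
    """SBus runs at half the core clock; PBus runs at half the SBus clock."""
    sbus = core_mhz / 2
    pbus = sbus / 2
    return sbus, pbus


def peak_mbytes_per_s(clock_mhz, bits_per_clock):
    """Peak transfer rate in millions of bytes per second."""
    return clock_mhz * bits_per_clock / 8


for core in (333, 400, 500):
    sbus, pbus = bus_clocks_mhz(core)
    sbus_peak = peak_mbytes_per_s(sbus, 32)
    # Assumption: the 64-bit path is clocked like the SBus; only the width differs.
    wide_peak = peak_mbytes_per_s(sbus, 64)
    print(f"core {core} MHz: SBus {sbus:g} MHz, PBus {pbus:g} MHz, "
          f"32-bit peak {sbus_peak:g} MB/s, 64-bit peak {wide_peak:g} MB/s")

# A 32-bit DDR interface transfers on both clock edges: 2 x 32 = 64 bits per clock.
DDR_BITS_PER_CLOCK = 2 * 32
```

At any given clock, the 64-bit path carries twice the data of the 32-bit SBus, which is the gap the RBUS was designed to close.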
Descriptor-based DMA Controller

Functions that depend upon specialized registers and buffers to support memory transfers typically add physical bulk to processor designs. However, to minimize physical area and optimize bandwidth, AMD Alchemy engineers designed a descriptor-based direct memory access (DDMA) Controller with unique abilities that optimize data transfers and keep the Au1200 processor physical area requirements low. The DDMA Controller autonomously manages multiple sequential data transfers via a linked list of transfer descriptors, providing memory to memory transfers; memory to/from FIFO for on-chip or off-chip peripherals; and memory to/from Burst FIFO – including NAND Flash and off-chip peripherals. This design provides 16 DMA channels arbitrated as high- and low-priority pools that can be addressed using round robin or weighted priority techniques.

The DDMA Controller also augments DMA transfers with operating modes and data management strategies that are tailored to tasks, and to the processor’s intended use model that stresses the significance of portability, such as:

- Increment and decrement modes:
  Increment mode supports transfers of any byte count from any byte alignment, and decrement mode supports the requirement of secure digital devices for backwards data feed, without having to perform data reversal in buffer memory.

- Intelligent transfer management:
  The DDMA supports conditional data transfers by use of compare-and-branch and subroutine descriptors. These descriptors allow branching to commonly used descriptors with automatic return. They also provide the ability to poll an on-chip or off-chip register, and then branch to the next descriptor when the compare condition is valid.
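A descriptor chain with decrement mode and a compare-and-branch step can be modeled in a few lines. The sketch below is a software analogy only: the descriptor field names, the `run_chain` walker, and the `status` register are all hypothetical and do not reflect the actual Au1200 descriptor layout. Subroutine descriptors are not modeled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Transfer:
    src: int                     # starting source address
    dst: int                     # starting destination address
    count: int                   # number of bytes to move
    decrement: bool = False      # walk the source address downward
    next: Optional[int] = None   # index of the next descriptor, None ends the chain


@dataclass
class CompareBranch:
    reg: str                     # register to poll
    mask: int
    value: int
    next: Optional[int] = None   # taken once (register & mask) == value


def run_chain(descriptors, start, memory, read_reg, max_polls=1000):
    """Walk a linked list of descriptors without further software involvement."""
    index, polls = start, 0
    while index is not None:
        d = descriptors[index]
        if isinstance(d, Transfer):
            step = -1 if d.decrement else 1
            for i in range(d.count):
                memory[d.dst + i] = memory[d.src + step * i]
            index = d.next
        elif (read_reg(d.reg) & d.mask) == d.value:
            index = d.next       # compare condition is valid: branch onward
        else:
            polls += 1           # condition not met yet: poll again
            if polls >= max_polls:
                raise TimeoutError("compare condition never became valid")


# Hypothetical status register that reports ready on the third read.
reads = {"n": 0}


def read_reg(name):
    reads["n"] += 1
    return 1 if reads["n"] >= 3 else 0


memory = bytearray(b"ABCD" + bytes(8))
chain = [
    CompareBranch(reg="status", mask=0x1, value=0x1, next=1),
    Transfer(src=0, dst=4, count=4, next=2),
    Transfer(src=3, dst=8, count=4, decrement=True),
]
run_chain(chain, 0, memory, read_reg)
```

After the chain runs, bytes 4 to 7 hold `ABCD` and bytes 8 to 11 hold `DCBA`: the decrement-mode transfer delivers the data backwards without a separate reversal pass in buffer memory.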

CIM – Camera Interface Module

The nearly ubiquitous demand for imaging capabilities in portable consumer electronics led AMD Alchemy engineers to design the Au1200 processor with a Camera Interface Module supporting data input modes for CCD/CMOS sensors and CCIR656. Both formats are served by the MAE, which provides Bayer pattern demosaicing for CCD/CMOS sensors as well as scaling, color space conversion, and filtering for all CIM parallel operating modes. CIM input data is buffered in three separate FIFOs and moved to memory by the DDMA Controller. A Raw Data mode moves CIM input data into memory unchanged.

More information about the CIM can be found in the Camera Interface Module section of this white paper suite.
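For readers unfamiliar with Bayer pattern demosaicing, the following sketch shows the simplest possible variant: each 2x2 sensor cell is collapsed into one RGB pixel. It assumes an RGGB cell order and halves the resolution; both choices are assumptions made for illustration, and the algorithm the MAE actually uses is not described here.

```python
def demosaic_rggb_half(raw):
    """Collapse each 2x2 RGGB cell of a raw sensor image into one (R, G, B) pixel."""
    height, width = len(raw), len(raw[0])
    rgb = []
    for y in range(0, height - 1, 2):
        row = []
        for x in range(0, width - 1, 2):
            r = raw[y][x]                                # top-left sample is red
            g = (raw[y][x + 1] + raw[y + 1][x]) // 2     # average the two green samples
            b = raw[y + 1][x + 1]                        # bottom-right sample is blue
            row.append((r, g, b))
        rgb.append(row)
    return rgb


raw = [
    [200, 100, 10, 50],
    [120, 30, 70, 90],
]
pixels = demosaic_rggb_half(raw)
```

The 4x2 raw input yields two RGB pixels, `(200, 110, 30)` and `(10, 60, 90)`.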

LCD Controller

The LCD controller in the Au1200™ processor provides developers with full 32-bit aRGB capabilities in each of four prioritized overlay windows.

- No frame buffer modification is necessary for overlay window repositioning.
- Gamma correction for each window matches video display with graphics.
- A global background color helps aesthetically unify display panel contents while helping to minimize processing demands.
- A four-color alpha-capable hardware cursor is provided.
- An 8-Kbit palette RAM frame buffer suited to portable device idle modes can support low-power system information display, such as alarm times.
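Compositing prioritized aRGB overlay windows over a global background color can be sketched per pixel as follows. This is a generic "source over" blend, assuming 0xAARRGGBB packing and windows listed from lowest to highest priority; the function names are hypothetical, and the blend arithmetic the LCD controller actually performs is not specified in this paper.

```python
def unpack_argb(pixel):
    """Split a 32-bit 0xAARRGGBB value into (alpha, red, green, blue)."""
    return ((pixel >> 24) & 0xFF, (pixel >> 16) & 0xFF,
            (pixel >> 8) & 0xFF, pixel & 0xFF)


def blend_over(dst_rgb, src_argb):
    """Blend one aRGB source pixel over an opaque RGB destination pixel."""
    a, r, g, b = unpack_argb(src_argb)
    return tuple((s * a + d * (255 - a) + 127) // 255
                 for s, d in zip((r, g, b), dst_rgb))


def composite(background_rgb, window_pixels):
    """Blend window pixels, lowest priority first; None means the window does not cover this pixel."""
    out = background_rgb
    for pixel in window_pixels:
        if pixel is not None:
            out = blend_over(out, pixel)
    return out


# Opaque red in the lowest window, half-transparent blue in a higher one.
result = composite((0, 0, 0), [0xFFFF0000, None, 0x800000FF, None])
```

Here `result` is `(127, 0, 128)`. Because the blend happens at display time, repositioning a window changes only which pixels it covers, not the frame buffer contents.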

Static Bus

The static bus enhances Au1200 processor versatility and scalability by accommodating many devices that share interface similarities, namely: RAM, ROM, NOR flash, PCMCIA/CF, IDE, and NAND flash. Additionally, the static bus supports 10/100 Ethernet connectivity. A multiplexed address scheme and ALE protocol reduce the number of address pins while still allowing up to thirty bits of addressable memory.

IDE interface

The IDE interface connects directly to the IDE drive and offers design scalability by supporting simple DMA mode, PIO mode, and multi-mode DMA transfers, all of which provide an equivalent throughput (~80 Mbps peak). The design goals for the PMP use model place a significant degree of importance upon the inclusion of an on-chip IDE interface. However, the available throughput for the IDE modes via the static bus is more than adequate to the task of delivering video and audio data to the system.
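The headroom claim can be checked with simple arithmetic. The sketch below compares the ~80 Mbps peak quoted above with a 9.8 Mbps stream, which is the maximum video bit rate of DVD-Video and is used here only as an illustrative assumption for demanding MPEG2 content; the function name is hypothetical.

```python
IDE_PEAK_MBPS = 80.0        # approximate peak throughput quoted in the text
EXAMPLE_STREAM_MBPS = 9.8   # assumption: DVD-Video maximum video bit rate


def headroom(stream_mbps, link_mbps=IDE_PEAK_MBPS):
    """How many times faster the link is than the stream it must carry."""
    return link_mbps / stream_mbps


ratio = headroom(EXAMPLE_STREAM_MBPS)
busy_fraction = EXAMPLE_STREAM_MBPS / IDE_PEAK_MBPS
print(f"headroom: {ratio:.1f}x, link busy {busy_fraction:.1%} of the time at peak rate")
```

Under these assumptions the interface is roughly eight times faster than the stream, so the drive needs to transfer for only about an eighth of the playback time.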

AES Encryption/Decryption

The AES Cryptography Engine accelerates digital rights management (DRM) algorithms used for content protection. The Au1200™ processor supports the U.S. government’s 128-bit Advanced Encryption Standard in ECB, CBC, CFB, and OFB modes. The data transfer capabilities of the DDMA Controller augment AES functions. When power vs. throughput is a consideration, four bandwidth choices deliver 44, 22, 11, and 5.5 Mbps operation.
Power-saving Operating Modes

The Au1200 processor power-saving modes reflect the PMP use model’s power requirements.

Sleep Mode

Sleep mode is a low-power use state in which memory contents can be maintained by invoking DDR self-refresh. System initialization will be required, but can be optimized via saved state information.

- To optimize sleep state for lowest power, the internal power supply (VDDI) should be disabled during sleep. Time out of sleep is programmable, bound by VDDI rise times of 5 ms, 20 ms, and 100 ms.

- For faster wake-up, VDDI can be kept active. However, power usage is then dominated by VDDI leakage, and time out of sleep is bound by a PLL lock time of 5-30 µs.

Hibernate Mode

Hibernate mode allows the system to be powered off. A separate battery back-up on XPWR32 keeps the time-of-year (TOY) clock alive to allow for a periodic wakeup mechanism.

DDR Memory Power Management

The Au1200™ processor supports DDR memory power management for DDR1 and DDR2, but does not include the DDR2 protocol performance enhancements.

Three configurable power-down modes that support power-savings occur automatically, with no software interaction required:

- During idle periods, the DDR controller will automatically drive the clock-enable signal (DCKE) low.

- During idle periods, the DDR controller will automatically precharge on idle and drive DCKE low, which puts the DDR into “power down” mode, with all DDR banks closed.

- During idle periods, the DDR controller will drive DCKE low and wait for a specified idle time. When the specified idle time is reached, the DDR controller will precharge and drive DCKE low.
LCD Controller Power Saving Mode

The LCD controller supports power-savings modes with an 8-Kbit palette RAM frame buffer that can be loaded with low-resolution images. This LCD mode will allow refreshes to occur out of palette RAM while DDR is put in self-refresh. This reduces power consumption by allowing frame buffer accesses while DDR memory is in a low-power use state.
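A quick capacity check shows what kind of low-resolution image can live in an 8-Kbit buffer. The sketch assumes 1 Kbit = 1024 bits and tightly packed pixels; the function name and the example image sizes are hypothetical, and the pixel formats the LCD controller accepts in this mode are not stated here.

```python
PALETTE_RAM_BITS = 8 * 1024  # "8-Kbit", assuming 1 Kbit = 1024 bits


def fits_in_palette_ram(width, height, bits_per_pixel):
    """True if a tightly packed image of this size fits in the palette RAM."""
    return width * height * bits_per_pixel <= PALETTE_RAM_BITS


for width, height, bpp in [(128, 64, 1), (64, 32, 4), (160, 120, 1)]:
    verdict = "fits" if fits_in_palette_ram(width, height, bpp) else "does not fit"
    print(f"{width}x{height} at {bpp} bpp: {verdict}")
```

A 128x64 monochrome image exactly fills the buffer, which is in line with simple status displays such as alarm times.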

The Au1200™ Processor and the PMP Use Model

To stay focused on maintaining high usability, AMD Alchemy engineers developed customer-driven use models for a PMP featuring high-quality audio/video and a nimble, responsive user interface. Content sources would include video files downloaded from a Web-based movie distribution service or a portable digital video recorder and stored on an IDE device.

In this example of a PMP application, Figure 3 shows data flow diagrams for the Au1200 processor alongside the PMP use model.

Figure 3. Data Flow Diagrams for an Au1200™ processor - based PMP

<table>
<thead>
<tr>
<th><strong>PVR/PC &gt; PMP use model</strong></th>
<th><strong>Data flow diagram</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>User:</strong> “I’m going on a trip... On the plane, I want to watch that new action movie everyone’s talking about. I’ve just downloaded the movie from the on-line rental site, and I’ll send it over to my PMP.”</td>
<td><img src="image" alt="Data Flow Diagram" /></td>
</tr>
<tr>
<td><strong>PMP Action:</strong> Move multimedia content from a personal video recorder or PC via USB 2.0 to an IDE drive on the PMP for portable viewing.</td>
<td></td>
</tr>
</tbody>
</table>
2. **User:** “Now that I’m relaxed on the plane, I think I’ll start that movie!”

**PMP Action A:** The DDMA Controller reads the multimedia content (MPEG1, MPEG2, MPEG4, DivX 3/4/5, and WMV9) from the IDE drive and writes compressed video data into DDR memory.

**PMP Action B:** The Au1 core reads compressed video from DDR memory and writes macroblock data back to DDR memory.

**PMP Action C:** MAE hardware reads DDR memory, decompresses video to RGB, and writes RGB display data back into DDR memory.

**PMP Action D:** The LCD controller fetches RGB frame buffer to display.
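The display-side memory traffic in Actions C and D can be estimated with a short calculation. The sketch assumes D1 frames stored at 32 bits per pixel, video decoded at 30 frames per second, and a panel refreshed at 60 Hz; the frame rate, refresh rate, and constant names are assumptions for illustration and are not taken from this paper.

```python
WIDTH, HEIGHT = 720, 480   # D1 resolution
BYTES_PER_PIXEL = 4        # 32-bit aRGB
VIDEO_FPS = 30             # assumption: decoded frame rate
REFRESH_HZ = 60            # assumption: LCD refresh rate

frame_bytes = WIDTH * HEIGHT * BYTES_PER_PIXEL
mae_write_rate = frame_bytes * VIDEO_FPS    # Action C: MAE writes RGB frames to DDR
lcd_read_rate = frame_bytes * REFRESH_HZ    # Action D: LCD controller fetches frames
total_rate = mae_write_rate + lcd_read_rate

print(f"frame size: {frame_bytes:,} bytes")
print(f"MAE writes: {mae_write_rate / 1e6:.1f} MB/s")
print(f"LCD reads:  {lcd_read_rate / 1e6:.1f} MB/s")
print(f"total:      {total_rate / 1e6:.1f} MB/s")
```

Under these assumptions the display path alone moves about 124 MB/s through DDR memory, before counting the compressed stream and macroblock data of Actions A and B.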
Choosing the Right Tool

Hardware designers often wrestle with dilemmas that arise from conflicting needs to contain system cost or maximize system features. Because the Au1200™ processor is the first system-on-a-chip designed specifically to meet the needs of the PMP use model, developers will find that it replaces sacrifice with versatility, platform rigidity with scalability, and complexity with a simple programming model.

- Processor block functions have been designed so that access to their individual capabilities is maintained for the broad array of tasks presented by the PMP use model. For example, because the MAE is segmented into two functional entities, when the front end that provides essential hardware-based video decoding tasks is idle, the back end can independently provide scaling “on the fly”, color space conversion, and filtering tasks that are essential to camera interface module and LCD controller functions.

- Off-chip design scalability is facilitated by the unified memory architecture that supports the use of 16-bit DDR for small screen sizes (320 x 240) or for instances of use where D1 content is not a requirement.

- The Au1200 processor eliminates the need for multiple toolchains. The MIPS32™ core is supported by a single robust and well-established toolchain that reduces the cost of royalties, purchase of additional development software, and the complexities attendant with multiple-processor designs.
Block Diagram of the Au1200™ Processor

Figure 4 represents the Au1200 processor and its component hardware blocks. Note that the memory architecture is unified, and the RBUS serves the MAE and LCD controller blocks with 64 bits per clock cycle.

Figure 4. AMD Alchemy Au1200™ Processor Block Diagram