# A Dual-Issue RISC Processor for Multimedia Signal Processing

H. Sato, E. Holmann, T. Yoshida, M. Matsuo, T. Kengaku

Mitsubishi Electric Corporation, System LSI Laboratory, 4-1 Mizuhara, Itami, 664 Japan

sato@lsi.melco.co.jp

#### ABSTRACT

This paper presents the architecture of a newly developed dual-issue RISC processor, the D10V, that achieves high-throughput signal processing capability while maintaining flexibility for general purpose applications. To achieve adequate performance for signal processing, the processor operates a MAC unit and a memory access unit in parallel, with support for two-word data memory access. As the results of several benchmarks illustrate, the D10V competes favorably with, and in some instances outperforms, conventional DSPs.

#### 1. INTRODUCTION

Today's digital signal processing applications require state-of-the-art processors to meet the stringent requirements of their implementation environments. These requirements include high performance and low power dissipation. In addition, low overall system cost and ease of programming are important goals that the system designer must meet.

The typical approach in the design of a signal processing system has been to incorporate a digital signal processing (DSP) chip and a master controlling processor (MCU) on a board. Several manufacturers have taken this approach further by integrating the DSP and MCU into a single chip. In these dual-engine systems, the DSP acts as a co-processor that handles all the computationally intensive tasks while the MCU serves as the system controller. Although current DSPs have more than adequate computational performance, they are hard to program for general purpose applications and are typically inefficient for control tasks.

An emerging approach has been to enhance existing basic RISC processors with signal processing functions [1]. These processors typically retain the RISC flavor but fail to achieve the computational performance required in many signal processing applications. A different approach has been to create a new hybrid architecture [2,3,4]; these architectures achieve high signal processing capability, but they closely resemble the dual-engine methodology, in which the DSP unit functions as a co-processor to the integer unit, and are thus difficult to program.

A promising approach for a processor for signal processing applications is a basic dual-issue RISC architecture. This approach has the advantages of being very easy to program for general purpose applications; allowing higher clock frequencies and thus higher throughput because of the RISC design methodology; and having the potential for a high degree of resource utilization by constraining the instruction word to two instructions.

#### 2. ARCHITECTURE

Figure 1 illustrates the architecture of the D10V, a dual-issue RISC processor designed for both general purpose and signal processing applications. The D10V has two asymmetric execution units and a register file. Both execution units execute basic arithmetic and logical operations. Branch and load/store operations are executed only in the Memory Unit. Multiply, shift, and other arithmetic operations that involve the accumulators are executed only in the Integer Unit. The register file functions as a buffer memory between the Memory Unit and the Integer Unit when the D10V executes digital signal processing applications.

The Memory Unit (MU), on the left side, includes a program sequence controller, a memory access controller, and a 16-bit ALU. The MU is connected to a unified data memory through a 32-bit wide data bus. For memory operations, single cycle data moves are supported for signed and unsigned bytes, words, and two-word elements. Addressing modes include modulo operation, post-increment, post-decrement, and pre-decrement. The program sequence controller includes a block repeat circuit that provides loops with zero delay penalty, and it functions concurrently with other operations executed on the MU.
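The combination of post-increment and modulo addressing can be illustrated with a small behavioral sketch. The paper does not give the details of the modulo mode, so the function name, the `mod_start`/`mod_end` bounds, and the assumption of byte addressing (a two-word access advancing the pointer by 4) are all hypothetical.

```python
def post_increment_modulo(ptr, step, mod_start, mod_end):
    """Return (effective_address, updated_pointer) for one access.

    Hypothetical model: the access uses the old pointer value, then the
    pointer advances by `step` bytes and wraps inside the inclusive
    range [mod_start, mod_end].
    """
    size = mod_end - mod_start + 1
    effective = ptr
    updated = mod_start + (ptr - mod_start + step) % size
    return effective, updated & 0xFFFF  # pointers are 16-bit registers


# Walk an 8-byte circular buffer with two-word (4-byte) accesses.
ptr = 0x0100
trace = []
for _ in range(3):
    addr, ptr = post_increment_modulo(ptr, 4, 0x0100, 0x0107)
    trace.append(addr)
```

After three accesses the pointer has wrapped once, so the third access reads the same address as the first.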



Figure 1. Block Diagram

The Integer Unit (IU), on the right side, includes a 16 x 16-bit multiplier, a 40 x 16-bit barrel shifter, an ALU that executes 40-bit arithmetic operations, and two 40-bit accumulators that incorporate saturation logic. The multiplier can operate on both integer and fixed-point format operands. The multiplier product can be fed into the ALU; the multiplier and the ALU are pipelined when using MAC instructions.
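A behavioral sketch of one multiply-and-accumulate step may help make the 16 x 16 + 40 bit datapath concrete. The saturation bounds follow from the 40-bit accumulator width. The `fractional` flag, which shifts the product left by one bit for fixed-point operands, is an assumption about how the fixed-point format is handled and is not taken from the paper.

```python
ACC_MAX = (1 << 39) - 1   # largest signed 40-bit value
ACC_MIN = -(1 << 39)      # smallest signed 40-bit value


def to_signed16(value):
    """Interpret the low 16 bits of `value` as a two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def mac(acc, a, b, fractional=False):
    """Return acc + a*b, saturated to the signed 40-bit range.

    `a` and `b` are 16-bit register contents. With fractional=True the
    product is shifted left by one bit (an assumed Q15 convention).
    """
    product = to_signed16(a) * to_signed16(b)
    if fractional:
        product <<= 1
    total = acc + product
    return max(ACC_MIN, min(ACC_MAX, total))
```

The eight guard bits above the 32-bit product allow many products to be summed before saturation is reached.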

The register file comprises sixteen 16-bit general purpose registers and is shared between the MU and the IU. Each register has four output and three input ports. Three of these seven ports are connected to the IU; one of the three is connected to a 32-bit bus for 32-bit wide data moves or calculations that involve the accumulators. The other four ports are connected to the MU; two of the four are connected to 32-bit buses for two-word load/store operations. For these double-width buses, even and odd numbered registers are connected to the upper and lower order 16 bits of the 32-bit bus, respectively.
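The even/odd register pairing on the 32-bit buses can be sketched as follows. The mapping (even register to the upper 16 bits, odd register to the lower 16 bits) is as stated above; the function names and the list-based register file are illustrative only.

```python
def read_pair(regs, even_index):
    """Drive a 32-bit bus from registers (even_index, even_index + 1)."""
    if even_index % 2 != 0:
        raise ValueError("a register pair must start at an even index")
    upper = regs[even_index] & 0xFFFF       # even register -> upper half
    lower = regs[even_index + 1] & 0xFFFF   # odd register  -> lower half
    return (upper << 16) | lower


def write_pair(regs, even_index, word32):
    """Latch a 32-bit bus value into registers (even_index, even_index + 1)."""
    if even_index % 2 != 0:
        raise ValueError("a register pair must start at an even index")
    regs[even_index] = (word32 >> 16) & 0xFFFF
    regs[even_index + 1] = word32 & 0xFFFF


regs = [0] * 16                  # sixteen 16-bit general purpose registers
write_pair(regs, 0, 0x12345678)  # a two-word load into R0 and R1
```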

#### 3. INSTRUCTION

The instruction decode unit decodes 32-bit long instructions. The M-Decoder and I-Decoder decode the MU-side and IU-side sub-instructions of the VLIW instruction, respectively. Figure 2 shows the instruction format and issuing order of sub-instructions for the D10V. A 32-bit D10V instruction consists of two 15-bit "containers" and two format-specifying (FM) bits. The FM bits specify the instruction format and the issuing order of the sub-instructions in the containers.

For FM=11, a single long RISC sub-instruction with a 16-bit immediate field is encoded in both containers and is executed in the MU or IU depending on the instruction. For the other three values of FM (00, 01, 10), two short RISC sub-instructions are encoded in the left and right containers. In the general case, the MU and IU execute the sub-instructions in the left container and right container, respectively. For FM=00, the two sub-instructions are executed in parallel. For FM=01 or FM=10, the two sub-instructions are executed in consecutive cycles, which reduces the number of NOPs that data dependencies between the two sub-instructions would otherwise require. For the serial execution encoding, the first sub-instruction is executed in the MU when FM=01 or in the IU when FM=10. The second sub-instruction is then executed, in the following cycle, in the MU, except for instructions involving the multiplier, the shifter, and the accumulators, which are executed in the IU.

Figure 2. Instruction Format and Issuing Order of Sub-instructions

Figure 3. Pipeline Scheme

Figure 4. Instruction Execution Pipeline
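The FM encoding described in this section can be sketched as a small decoder that splits a 32-bit instruction word and reports the issue order. The bit positions are an assumption, since the text gives only the field widths: the FM bits are placed in the two most significant bits, followed by the left and then the right container. Reading FM=01 as left-then-right and FM=10 as right-then-left is likewise an interpretation of the serial encoding.

```python
def decode_issue_order(word):
    """Split a 32-bit instruction word and list what issues in each cycle.

    Returns a list of cycles; each cycle is a list of (container, bits)
    tuples. The bit layout is assumed, not taken from the paper.
    """
    fm = (word >> 30) & 0b11
    left = (word >> 15) & 0x7FFF    # 15-bit left container
    right = word & 0x7FFF           # 15-bit right container

    if fm == 0b11:
        # One long sub-instruction occupying both containers (30 bits).
        return [[("long", (left << 15) | right)]]
    if fm == 0b00:
        # Both short sub-instructions issue in the same cycle.
        return [[("left", left), ("right", right)]]
    if fm == 0b01:
        # Serial: left container first, right container next cycle.
        return [[("left", left)], [("right", right)]]
    # fm == 0b10, serial: right container first, left container next cycle.
    return [[("right", right)], [("left", left)]]


serial = decode_issue_order((0b01 << 30) | (0x1234 << 15) | 0x0567)
```

Here `serial` holds two cycles, with the left container issued first.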

The instruction execution pipeline of the D10V consists of four stages, as illustrated in Figure 3 and Figure 4. Many integer and jump instructions (Op. instructions) are executed within three stages; these instructions are executed in the execution stage (E-stage). Other instructions, such as load/store and multiply-and-accumulate (MAC) instructions, are executed in four stages. Load/store instructions are executed in the MU: they calculate the operand address in the address generation stage (A-stage) and access memory in the memory access stage (M-stage). Load-byte instructions are an exception and need a write-back stage (W-stage) after the M-stage. MAC instructions are executed in the IU: two 16-bit operands are multiplied in the first execution stage (Ef-stage), and the product is accumulated in the second execution stage (Es-stage). When MAC instructions are decoded consecutively, a new multiply is initiated every clock cycle.
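The effect of pipelining the multiply (Ef-stage) and the accumulate (Es-stage) can be illustrated with a simplified cycle model. This is only a sketch: it models the two execution stages alone, ignores fetch and decode, and omits saturation, and the function name is hypothetical.

```python
def mac_pipeline(pairs):
    """Run back-to-back MACs through a two-stage Ef/Es pipeline.

    Returns (accumulator, execution_cycles). A new multiply starts in
    every cycle while operands remain, so N MACs need N + 1 cycles.
    """
    acc = 0
    product_latch = None          # product handed from Ef to Es
    pending = list(pairs)
    cycles = 0
    while pending or product_latch is not None:
        # Es-stage: accumulate the product computed in the previous cycle.
        if product_latch is not None:
            acc += product_latch
        # Ef-stage: start the next multiply, if any operands remain.
        if pending:
            a, b = pending.pop(0)
            product_latch = a * b
        else:
            product_latch = None
        cycles += 1
    return acc, cycles


result = mac_pipeline([(1, 2), (3, 4), (5, 6)])
```

Three MACs complete in four execution cycles in this model, against six if each multiply had to wait for the previous accumulate.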

#### 4. OPERATION

For general purpose applications, the D10V functions as a dual-issue RISC processor and can benefit from compiler advances developed for superscalar machines. For digital signal processing applications, the MU and IU operate in parallel and function as an address generation unit and a data arithmetic unit, respectively, much as in a conventional DSP. Thus, the D10V delivers five operations per cycle for sustained high throughput: a two-word load (counted as two loads), an address pointer update, a multiply, and an accumulate.

Figure 5. Data Flow of FIR Filtering

Figure 6. Pipeline Operation of FIR Filtering

Figure 5 illustrates the data flow for a single precision FIR filtering operation with coefficients h(i), input x(n), and output y(n), which is a typical example of a digital signal processing application. For this operation, the MU and IU work as follows: (1) an address pointer R8 is used with post-incrementation to load two data elements, x(n-i) and x(n-i+1), from the unified data memory into two registers, R0 and R1; (2) an address pointer R9, with post-incrementation, is used to load two coefficients, h(i) and h(i+1), into R4 and R5; (3) R0 and R4 are multiplied and accumulated in A0; and (4) R1 and R5 are multiplied and accumulated. Using an additional four registers, these operations are scheduled with zero pipeline stalls, as illustrated in Figure 6. This is the core block for FIR filtering, which is executed within a block repeat loop with zero overhead.
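The four steps above can be written out as a behavioral sketch of one output sample. The variable names mirror the registers named in the text (R8, R9, R0, R1, R4, R5, A0). The memory layout is an assumption: `delay_line[k]` is taken to hold the sample that pairs with `coeffs[k]`, the tap count is assumed even, and saturation and the software pipelining of Figure 6 are omitted.

```python
def fir_sample(delay_line, coeffs):
    """Compute one FIR output, two taps per loop iteration.

    Hypothetical layout: delay_line[k] is the input sample multiplied
    by coeffs[k], so both pointers advance in the same direction.
    """
    if len(coeffs) % 2 != 0 or len(delay_line) < len(coeffs):
        raise ValueError("need an even tap count and enough samples")

    a0 = 0          # accumulator A0
    r8 = 0          # data pointer (index into delay_line)
    r9 = 0          # coefficient pointer (index into coeffs)
    for _ in range(len(coeffs) // 2):      # block repeat loop body
        r0, r1 = delay_line[r8], delay_line[r8 + 1]   # (1) two-word load
        r8 += 2                                       #     post-increment
        r4, r5 = coeffs[r9], coeffs[r9 + 1]           # (2) two-word load
        r9 += 2                                       #     post-increment
        a0 += r0 * r4                                 # (3) MAC into A0
        a0 += r1 * r5                                 # (4) MAC into A0
    return a0


y = fir_sample([1, 2, 3, 4], [1, 0, -1, 2])
```

Each loop iteration corresponds to two MACs fed by two two-word loads, which is what allows the MU and IU to stay busy in every cycle.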

Table 1. Summary of hardware features.

| Category                                  | Feature                                                                   | Specification             |
|-------------------------------------------|---------------------------------------------------------------------------|---------------------------|
| Performance                               | Clock frequency                                                           | 52 MHz                    |
| Performance                               | Peak performance                                                          | 104 MIPS                  |
| Architecture                              | 2-way VLIW                                                                | 32-bit long               |
| Architecture                              | GP register                                                               | 16 x 16-bit               |
| Architecture                              | Accumulator                                                               | 2 x 40-bit                |
| Architecture                              | Data bus                                                                  | 32-bit wide               |
| Enhancement for digital signal processing | Pipelined MAC                                                             | $(16 \times 16 + 40)$ bit |
| Enhancement for digital signal processing | Zero delay penalty loop                                                   |                           |
| Enhancement for digital signal processing | Powerful addressing mode (register indexed, modulo, post-increment, etc.) |                           |

#### 5. PERFORMANCE

Table 1 summarizes the D10V's hardware features. The CPU achieves a peak performance of 104 MIPS when it operates at a 52 MHz clock. Figure 7 illustrates performance statistics of the D10V for various signal processing benchmarks [5]. We compare the required clock cycles and execution time for all benchmarks with those of conventional 16-bit DSPs. The comparison shows that the D10V competes favorably with, and in some instances outperforms, conventional DSPs. In these benchmarks, 76% of the sub-instructions are executed in parallel and 74% of data memory accesses are two-word accesses. This shows that the proposed approach, a dual-issue RISC, is well suited to digital signal processing applications.
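The peak figure follows directly from the dual-issue format, assuming that each of the two sub-instructions issued in a cycle is counted as one instruction:

$$
\text{Peak performance} = 2\ \frac{\text{sub-instructions}}{\text{cycle}} \times 52 \times 10^{6}\ \frac{\text{cycles}}{\text{s}} = 104\ \text{MIPS}
$$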

#### 6. CONCLUSION

A dual-issue RISC processor, the D10V, has been developed for both general purpose and signal processing applications. Several enhancements to a basic RISC architecture are incorporated in the D10V to improve its performance. Single cycle data moves are supported for signed and unsigned bytes, words, and two-word elements. The core includes a pipelined multiply-and-accumulate instruction with two 40-bit accumulators, allowing a new multiply to be initiated every clock cycle, and a block repeat instruction for loops with zero delay penalty. Several digital signal processing benchmarks illustrate that the D10V functions as an efficient digital signal processor as well as a general purpose one.

Figure 7. Benchmarks for the D10V

#### ACKNOWLEDGMENTS

The authors would like to acknowledge the technical contributions of K. Nakata, T. Asai, T. Hiraki, T. Maruyama, and T. Tatsumi. We would also like to thank Dr. K. Saitoh for his encouragement.

#### REFERENCES

- [1] K. Nadehara, M. Hayashida, and I. Kuroda. "A Low-Power, 32-bit RISC Processor with Signal Processing Capability and its Multiply-Adder." In *VLSI Signal Processing, VIII*, pp. 51-60, October 16-18, 1995.
- [2] Dan Mansur. "Future Communications Processors Will Fuse DSP and RISC." *Electronic Design*, pp. 99-102, January 8, 1996.
- [3] Jim Turley. "Hitachi Adds FP, DSP Units to SuperH Chips." *Microprocessor Report*, vol. 9, no. 16, pp. 10-11, December 4, 1995.
- [4] Markus Levy. "RISC vs. RISC: Comparing μP Architectures." *EDN*, pp. 81-96, April 11, 1996.
- [5] Berkeley Design Technology, Inc. *Buyer's Guide to DSP Processors* (1995 edition). Fremont, California: Berkeley Design Technology, Inc., 1995.