# NEW SYSTOLIC ARRAYS FOR COMPUTATION OF THE 1-D DISCRETE WAVELET TRANSFORM

Sung Bum Pan and Rae-Hong Park

Department of Electronic Engineering, Sogang University C.P.O. Box 1142, Seoul 100-611, Korea psb@eevision1.sogang.ac.kr

# ABSTRACT

This paper proposes a new systolic array architectures for computation of the 1-D discrete wavelet transform (DWT). The proposed systolic array consists of L processing element (PE) arrays, where L denotes the number of levels. The proposed PE array computes only the product terms that are required for further computation in higher levels, and the outputs of lowpass and highpass filters are computed in alternate clock cycles. Therefore, the proposed architectures can compute the one-level DWT using a single architecture. Note that the proposed architectures do not require extra processing units whereas the conventional architectures need ones. The required time and hardware cost for computation of the DWT using the proposed systolic arrays are comparable to that of the conventional architectures. The proposed architectures can also be applied to subband decomposition by simply changing the filter coefficients.

#### 1. INTRODUCTION

Among many discrete orthogonal transforms, the discrete cosine transform (DCT), discrete sine transform (DST), and discrete Hartley transform (DHT) are very effective in transform coding applications to speech and image signals [1]-[2]. Computation of the discrete transforms is intensive, therefore, various VLSI architectures have been proposed [3]-[4].

Recently, special attention has been paid to the discrete wavelet transform (DWT) that can decompose a signal into a number of frequency bands [5]. Applications of the DWT are numerous such as speech analysis, computer vision, image coding, and so on. Computational complexity of the DWT is also high.

Several VLSI architectures have been proposed for computation of the 1-D DWT [6]-[9]. The first architecture was presented by Knowles that used many large multiplexers for storing intermediate results [6]. Parhi and Nishitani proposed the folded architecture that has shorter latency [8], however it requires high hardware cost with complex routing and control networks. Also, this architecture is not easily scalable for the filter size. Vishwanath *et al.* proposed systolic arrays implemented by the recursive pyramid algorithm (RPA) that require the routing network [9].

This paper proposes new systolic array architectures for computation of the 1-D DWT. The rest of the paper is structured as follows. Section II describes the proposed systolic array architectures for analysis and synthesis of the 1-D DWT. In Section III, the performance of the proposed and conventional 1-D DWT architectures is analyzed, and conclusions are given in Section IV.

## 2. PROPOSED SYSTOLIC ARRAY ARCHITECTURES

The proposed systolic array architectures for computation of the 1-D DWT are described. They are independent of the size of a sequence and the nature of wavelet.

For simplicity, 4-tap filters with three-level hierarchy is assumed in description of the proposed architecture. Fig. 1(a) shows a three-level analysis wavelet decomposition and Fig. 1(b) shows a three-level wavelet synthesis, where  $\downarrow 2 (\uparrow 2)$  denotes 2:1 decimation (interpolation) and G(H) represents a highpass (lowpass) wavelet filter. Note that a denotes an input sequence of the first-level with u' and v' being the lowpass and highpass output sequences of a decomposition, respectively, decimated by a factor of two. The lowpass component v', in turn, is the input of the second-level decomposition, with w' and x' denoting the decimated highpass and lowpass output sequences, respectively. Similarly, y' and z' are obtained. Three-level decomposition of the input sequence a results in decomposed sequences u', w', y', and z'. The sampling rates of the input se-

This research was supported by the Sogang University Research Grants in 1996.



Figure 1: Three-level 1-D DWT. (a) Analysis. (b) Synthesis.

quences a, v' and x' of the first, second, and third levels are given by f, f/2, and f/4, respectively, as shown in Fig. 1(a).

The transfer functions of the filters are represented as

$$G(z) = g_0 + g_1 z^{-1} + \dots + g_{M-1} z^{-(M-1)}$$
  

$$H(z) = h_0 + h_1 z^{-1} + \dots + h_{M-1} z^{-(M-1)}$$
(1)

where M represents the number of taps of the filters. G(z) and H(z) denote the highpass and lowpass system functions, i.e., z-transforms of the causal impulse response sequences  $\{g_n\}$  and  $\{h_n\}$ , respectively. The computational procedure of the first level 1-D DWT is described as

$$v_{0} = a_{0}h_{0}$$

$$v_{1} = a_{1}h_{0} + a_{0}h_{1}$$

$$v_{2} = a_{2}h_{0} + a_{1}h_{1} + a_{0}h_{2}$$

$$v_{3} = a_{3}h_{0} + a_{2}h_{1} + a_{1}h_{2} + a_{0}h_{3}$$

$$v_{4} = a_{4}h_{0} + a_{3}h_{1} + a_{2}h_{2} + a_{1}h_{3}$$

$$\vdots$$

where  $\{a_n\}$  denotes an input sequence and  $\{v_n\}$  represents the output sequence of the lowpass filter H(z).

Note that M = 4 is assumed and computation of the output  $v_{2n+1}$ ,  $n \ge 0$ , is not required since it is not used in the second and third levels because of 2:1 decimation. Therefore, only the product terms  $a_{2n}h_{2m}$  and  $a_{2n+1}h_{2m+1}$ ,  $m, n \ge 0$ , are computed. Similarly, we can obtain the sequence  $\{u_n\}$ , where only the product terms  $a_{2n}g_{2m}$  and  $a_{2n+1}g_{2m+1}$ ,  $m, n \ge 0$ , are computed. Thus, the output sequences  $v_{2n}$  and  $u_{2n}$ ,  $n \ge 0$ , are alternately obtained as

$$\begin{array}{rclrcl} v_0' = & v_0 = & a_0 h_0 \\ u_0' = & u_0 = & a_0 g_0 \\ v_1' = & v_2 = & a_2 h_0 + a_1 h_1 + a_0 h_2 \\ u_1' = & u_2 = & a_2 g_0 + a_1 g_1 + a_0 g_2 \\ v_2' = & v_4 = & a_4 h_0 + a_3 h_1 + a_2 h_2 + a_1 h_3 \\ u_2' = & u_4 = & a_4 g_0 + a_3 g_1 + a_2 g_2 + a_1 g_3 \\ & \vdots \end{array}$$

i.e., the first-level output sequences  $v'_n = v_{2n}$  and  $u'_n = u_{2n}$  are computed in alternate clock cycles. Same is the true for computation in the second and third levels. Therefore, the lowpass and highpass output sequences are computed by a single architecture, which is motivated by the inherent 2:1 decimation operation in Fig. 1(a).

Fig. 2 shows two proposed systolic array architectures for realization of the first level wavelet analysis



Figure 2: Systolic array architectures for one-level analysis filter. (a) Type I. (b) Type II.

filter, in which 2:1 decimation is not explicitly shown. Type I architecture consists of multiplication, delay, and addition PE's whereas Type II architecture consists of only multiplication PE's. Note that the complexity of the multiplication PE in Type II architecture is higher than that of Type I architecture. The coefficients  $h_n$  and  $g_n$  of a lowpass filter H(z) and highpass filter G(z) are stored in a multiplication PE in both realizations. Computation of  $\{v'_n\}$  and  $\{u'_n\}$  is done in alternate clock cycles. In Type I architecture, the A PE sums two inputs, and the D PE denotes the delay, where constants  $\alpha$ ,  $\beta$ , and  $\gamma$  represent the numbers of delays, specified by

$$\begin{array}{rcl} \alpha & = & 2^{l-1} \\ \beta & = & 2^l - 2, \\ \gamma & = & \alpha + \beta \end{array} \qquad \qquad 1 \leq l \leq L \\ \end{array}$$

with l denoting the lth level and L representing the number of levels employed in a wavelet decomposition. In our example with L = 3, in the first, second, and third level,  $\alpha$  is equal to 1, 2, and 4,  $\beta$  is equal to 0, 2, and 6, and  $\gamma$  is equal to 1, 4 and 10, respectively.

Fig. 3 shows the function definition of the PE's used in Type I and Type II architectures. We assume that the first clock cycle corresponds to that at  $t_0$ . The output c of the multiplication PE MI shown in Fig. 3(a) is differently given in every other clock cycles, with ah (ag) at even (odd) clock cycles. Fig. 3(b) shows the multiplication PE MII that transfers the input a with



Figure 3: Function definition of the PE's used in Fig. 2. (a)  $M_IPE$ . (b)  $M_{II}PE$ . (c) D PE. (d) A PE.



Figure 4: Systolic array architecture for a three-level wavelet decomposition.

a unit clock delay and computes the two outputs c and f. Outputs c and f are differently given in every other clock cycles, with ah + d + e (ag + d + e) and ah (ag), respectively, at even (odd) clock cycles. Fig. 3(c) shows the register that transfers the input a to output b with a unit clock delay and Fig. 3(d) shows a two-input adder.

For realization of a three-level wavelet analysis filter, three PE arrays are cascaded as shown in Fig. 4. Each PE array can be implemented by the systolic array architecture shown in Fig. 2(a) or 2(b). The input sequence  $\{a_n\}$  is fed into the first PE array, and the lowpass and highpass frequency outputs of the first level are computed by the first PE array. The decimated output  $\{u'\}$  of the highpass filter is fed out, and the output  $\{v'\}$  of the low frequency component is fed into the second PE array for the second-level decomposition. Note that at each level, the PE array implemented by Type I architecture has different  $\alpha$ ,  $\beta$ , and  $\gamma$  values (see eq. (2)) depending on the level index l. Note that the proposed analysis architecture can be realized for any L and M.

Fig. 1(b) shows a three-level wavelet synthesis. The computational procedure of the first level wavelet synthesis filter is similar to the first level wavelet analysis filter, as shown in Fig. 5. For realization of a three-level wavelet synthesis filter, three PE arrays are cascaded similarly to the case of a three-level wavelet analysis filter shown in Fig. 4. Also the proposed synthesis architecture can be realized for any L and M.

#### 3. PERFORMANCE ANALYSIS

Table 1 shows the performance comparison, in terms of the number of PE's and computation time, of the conventional and proposed architectures for computation of the 1-D DWT. Constants L, M, and T denote the number of levels, the number of filter taps, and the time required for one-level wavelet decomposition, respectively.

PE's used in the proposed and conventional archi-



Figure 5: Systolic array architectures for one-level synthesis filter. (a) Type I. (b) Type II.

| Table 1:  | Performance | e Comp   | arison o         | of Various | Architec- |
|-----------|-------------|----------|------------------|------------|-----------|
| tures for | Computati   | on of th | e 1 <b>-</b> D D | WT         |           |

|                        | Number of PE's | Computation time |
|------------------------|----------------|------------------|
| Lee <i>et al</i> .     | 2LM            | LT               |
| Parhi and<br>Nishitani | 2M             | T                |
| Vishwanath et al.      | 2M             | 2T               |
| Proposed               | LM             | 2T               |

tectures consist of a number of multipliers and adders, therefore, the PE complexity of the proposed 1-D DWT architecture is similar to that of the conventional architectures. The proposed systolic array architectures need LM multiplication PE's, whereas architectures by Lee et al. [7] and Vishwanath et al. [9] need 2LM and 2M multiplication PE's, respectively. However, the architecture of Vishwanath et al. needs routing network and memory unit. In terms of the computation time and the required number of multiplication PE's, Parhi and Nishitani's architecture [8] is better than the other architectures, however, this architecture requires extra memory unit and control unit. Note that the conventional architectures need memory unit, routing network, or other overhead processing units whereas the proposed systolic array architectures do not.

### 4. CONCLUSIONS

This paper proposes efficient systolic array architectures for computation of the 1-D DWT. The proposed systolic array for computation of the 1-D DWT consists of only L PE arrays. The proposed PE array computes only the product terms that are required for further computation in higher levels. Therefore, the proposed architectures can compute the one-level DWT using a single architecture. The proposed architectures can be applied to subband decomposition by simply changing the filter coefficients. Further research will focus on development of real-time hardware implementation.

### 5. REFERENCES

- N. Ahmed, T. Natarajan, and K. R. Rao, "Discrete cosine transform," *IEEE Trans. Commun.*, vol. COM-23, no. 1, pp. 90-93, Jan. 1974.
- [2] R. N. Bracewell, *The Hartley Transforms*. England: Oxford University Press, 1986.
- [3] N. I. Cho and S. U. Lee, "DCT algorithms for VLSI parallel implementations," *IEEE Trans. Acoust., Speech, Signal Processing*, vol. ASSP-38, no. 1, pp. 121-127, Jan. 1990.
- [4] S. B. Pan and R.-H. Park, "Unified systolic arrays for computation of the DCT/DST/DHT," *IEEE Trans. Circuits Syst. Video Tech.*, accepted for publication, 1997.
- [5] S. G. Mallat, "Multifrequency channel decompositions of images and wavelet models," *IEEE Trans. Acoust., Speech, Signal Processing*, vol. ASSP-37, no. 12, pp. 2091-2110, Dec. 1989.
- [6] G. Knowles, "VLSI architecture for the discrete wavelet transform," *Electron. Lett.*, vol. 26, no. 15, pp. 1184-1185, July 1990.
- [7] H. J. Lee, J. C. Liu, A. K. Chan, and C. K. Chui, "Parallel implementation of wavelet decomposition/reconstruction algorithms," in *Proc. SPIE Wavelet Applications*, Orlando, FL, vol. 2242, pp. 248-259, Apr. 1994.
- [8] K. K. Parhi and T. Nishitani, "VLSI architectures for discrete wavelet transforms," *IEEE Trans. VLSI Systems*, vol. 1, no. 2, pp. 191-202, June 1993.
- [9] M. Vishwanath, R. M. Owens, and M. J. Irwin, "VLSI architectures for the discrete wavelet transform," *IEEE Trans. Circuits Syst. II*, vol. CAS-42, no. 5, pp. 305-316, May 1995.