3. Build the fixed-point FFT project and verify the results. Comparing the results
with the floating-point complex radix-2 FFT results obtained by running it on
the PC.
4. The FFT output samples are squared and placed in a data buffer named
spectrum []. Use CCS to plot the results by displaying the spectrum []and
re1 []buffer.
5. Profile the DSP run-time clock cycles for 128-point and 1024-point FFTs. Record
the memory usage of the fixed-point functions bit_rev()and fft().
7.5.2 Experiment 7B ± Radix-2 Complex FFT Using Assembly
Language
Although using intrinsics can improve the DSP performance, the assembly language
implementation has been proven to have the fastest execution speed and memory
efficiency for most applications, especially for computational intensive algorithms
such as FFT. The development time for assembly code, however, will be much
longer than that of C code. In addition, the maintenance and upgrade of assembly
code are usually more difficult. In this experiment, we will use C55x assembly routines
for computing the same radix-2 FFT algorithm as the fixed-point C function used
in Experiment 7A. The assembly FFT routine is listed in Table 7.5. This routine is
written based on the C function used for Experiment 7A, and it follows the C55x
C calling convention. For readability, the assembly code has been written to mimic
the C function of Experiment 7A closely. It optimizes the memory usage but not
the run-time efficiency. By unrolling the loop and taking advantage of the FFT butterfly
characteristics, the FFT execution speed can be further improved with the expense of
the memory space, see the exercise problems at the end of this chapter.
In fft.asm, the local variables are defined as structure using the stack relative
addressing mode when the assembly routine is called. The last memory location con-
tains the return address of the caller function. Since the status registers ST1and ST3 will
be modified, we use two stack locations to store the contents of these status registers at
entry. The status registers will be restored upon returning to the caller function. The
complex temporary variable is stored in two consecutive memory locations by using a
bracket with the numerical number to indicate the number of memory locations for the
integer data type.
The FFT implementation is carried out in three nested loops. The butterfly computa-
tion is implemented in the inner loop and the group loop is in the middle, while the
stages are managed by the outer loop. Among these three loops, the butterfly loop is
repeated most often. We use the local block repeat instruction, rptblocal, for the
butterfly loop and the middle loop to minimize the loop overhead. We also use parallel
instructions, modulo addressing, and dual memory access instructions to further
improve the efficiency of butterfly computation. By limiting the size of the loop, we
can place the middle loop inside the DSP instruction buffer queue (IBQ) as well. The
FFT computation is improved since the two inner loops are only fetched once from the
program memory each time we compute the groups and butterflies. The twiddle-factor
EXPERIMENTS USING THE TMS320C55X
341