FFT Processing
Building blocks for frequency-domain transforms.
Note
For utilities that operate on blocks in the frequency domain, see Spectral Processing.
FFTs
- class tiliqua.dsp.fft.FFT(*args, src_loc_at=0, **kwargs)
Fixed-point Fast Fourier Transform.
Overview
This core computes the DFT of complex, fixed-point samples (specifically, power-of-two chunks, represented by
BlockofCQ). It can be run in ‘forward’ or ‘inverse’ mode, and is generally designed to match the behaviour ofscipy.fft(norm="forward").Input and output samples are clocked in FIFO order. Block processing only begins once
szelements (an entire block) has been clocked atself.i. Some time later, the core clocks the transformed block outself.o, before clocking in the next block. See the ‘Design’ section for further details on throughput.Switching modes
When the core is idle (nothing is clocked into the core and
i.readyis strobed), theifftsignal may be set to 1 to put the core into ‘inverse FFT’ mode. Otherwise, it is in ‘forward FFT’ mode.In forward mode, this FFT core normalizes the outputs by 1/N, and the twiddle factors are not conjugated.
In ‘inverse FFT’ mode, there is no normalization, and the twiddle factors are conjugated as required.
Design
There are many tradeoffs an FFT implementation can make. This one aims for low area and DSP tile usage. It is an iterative implementation that only performs 2 multiplies at a time. The core FFT loop takes 6 cycles, where 2 of these are arithmetic and the rest are memory operations. The algorithm implemented here is ‘Cooley-Tukey, Decimation-in-Time, Radix-2’.
Latency, resource usage
As this core only performs 2 multiplies at a time, it only requires 2 DSP tiles, assuming the
shapespecified is small enough. For 16-bit samples at an FFT size of 1024, the extra bits needed to account for overflow requires 4x 18-bit multipliers on an ECP5.Outputs take
sz*log2(sz)*6system clocks to be computed after all the inputs have been clocked in. That is, an FFT of size 1024 would take aboutlog2(1024)*6 == 60system clocks per sample, for a maximum throughput around 1Msample/sec if we had a 60MHz master clock (in reality a bit less accounting for the input and output clocking times).This core instantiates 3 memories, which all scale with
sz. A ROM for the twiddle factor (coefficient) memory, and 2 RAMs which are shared for input and output sample storage as well as intermediate calculations. An interesting addition to this core in the future may be to support external memory for huge FFT sizes.Internals
Note
Reading this is not necessary to use the core. I advise reading up a bit on how FFT algorithms work before diving deeper into this. It’s just a few notes to anyone interested in understanding this core more deeply.
This core has 3 main ‘phases’ in its overall state machine - an input phase (
LOAD1andLOAD2), calculation/loop phases (FFTLOOPtoWRITE-D) and an output phase (READ-OUTPUT).The loop phase has an inner counter
idx, for the index of the current sample compared to the total block size, and an outer counterstagewhich goes from 0 tolog2(sz). Oncestagereacheslog2(sz), the calculations are complete. The input and output phases are simpler to understand, so we’ll focus on this loop phase.On every loop iteration, we read all needed values from memory and calculate the butterfly sum/difference outputs. Muxes are shown to represent the A / B ping-ponging between which memory they fetch from in even and odd stages (Stage 0
X->A, Y->B, Stage 1X->B, Y->Aand so on up to Stagelog2(sz)).┌─────────┐ ┌─────────┐ ┌─────────┐ │ X RAM │-\ /-│ Y RAM │ │ W ROM │ └────┬────┘ X └────┬────┘ └────┬────┘ ▼ / \ ▼ │ MUX<──── ────>MUX │ │ │ │ A B │ │ │ │ │ ┌─────┴─────┐ │ └────────>┤ Butterfly ├<────────┘ │ S=A+B×W │ │ D=A-B×W │ └─────┬─────┘ │S,DOnce the calculation is complete, the S, D values are written back in a similar ping-pong way to either the X or Y RAMs, depending on which ‘stage’ we are in. There are 6 states inside the FFT loop required for each iteration of this calculation. Some of them do arithmetic and memory accesses in the same state - in general the timing of each state is optimized to hit around 80MHz on a slowest speed-grade ECP5.
- Members:
i (
In(stream.Signature(Block(CQ(self.shape))))) – Incoming stream of blocks of complex samples.o (
Out(stream.Signature(Block(CQ(self.shape))))) – Outgoing stream of blocks of complex samples.ifft (
In(1, init=default_ifft)) –0for forward FFT with1/Nnormalization.1for inverse FFT. May only be changed when the core is idle (‘ready’ and not midway through a block!).
- __init__(shape=fixed.SQ(1, 15), sz=1024, default_ifft=False)
- shapefixed.SQ
Shape of the fixed-point types used for inputs and outputs. This is the shape of a single number, this core wraps it in
Block(CQ(shape)).- szint
Size of the FFT/IFFT blocks. Must be a power of 2.
- default_ifftbool
Default state of the
self.ifftsignal, can be used to create an IFFT instead of an FFT by default, without needing to explicitly connect the signal.
Windowing
- class tiliqua.dsp.fft.Window(*args, src_loc_at=0, **kwargs)
Pointwise window function.
Apply a real window function of size
szto blocks ofshape. The window function is synchronized to the blockpayload.first.Design
This core is iterative and uses a single multiplier. It does not store any samples, the windowed samples are available at the output 2 clocks after they are provided at the input.
- Members:
i (
In(stream.Signature(Block(self.shape)))) – Incoming stream of blocks of real samples. at intervals matching theszof this component.o (
Out(stream.Signature(Block(self.shape)))) – Outgoing stream of blocks of real samples. at intervals matching theszof this component.
- class Function(value, names=_not_given, *values, module=None, qualname=None, type=None, start=1, boundary=None)
Enumeration representing different window functions for
Window.- HANN(sz)
- SQRT_HANN(sz)
- RECT(sz)
- __init__(shape, sz, window_function=Function.SQRT_HANN)
- shapefixed.SQ
Shape of the fixed-point types used for inputs and outputs. This is the shape of a single number, this core wraps it in
Block(shape).- szint
Size of the window function.
- window_functionFunction
Function used to calculate window function constants (type of window).
Overlap / Add
- class tiliqua.dsp.fft.ComputeOverlappingBlocks(*args, src_loc_at=0, **kwargs)
Real sample stream to (overlapping)
Blockstream conversion.This core is a building block for the
STFTProcessorfamily of components. It takes a continuous (non-delineated) stream of samples, and sends delineated blocks of samples, which may be overlapping, for use by further processing.Here is a quick example to illustrate. From the input perspective, each sample would correspond to the following output samples (taking sz=8, n_overlap=4):
Input: 0 1 2 3 4 5 6 7 8 9 A B C D E F ... Block 1: 0 1 2 3 4 5 6 7 Block 2: 4 5 6 7 8 9 A B Block 3: 8 9 A B C D E F
Looking at this from the output perspective:
payload.sample: 0 1 2 3 4 5 6 7 4 5 6 7 8 9 A B 8 9 A B C D E F payload.first: 1 1 1
To avoid backpressure on the input stream, the incoming sample rate should be much slower than the system clock, and rate of processing the output blocks (which is easily achievable with audio signals).
Design
This core uses a single
SyncFIFOfor storage of sizen_overlap, which is continuously filled and drained between blocks to create overlapping outputs.- Members:
i (
In(stream.Signature(self.shape))) – Incoming (continuous) stream of real samples.o (
Out(stream.Signature(Block(self.shape)))) – Outgoing stream of blocks of real samples. at intervals matching theszof this component.
- __init__(shape, sz, n_overlap)
- shapefixed.SQ
Shape of the fixed-point types used for inputs and outputs. This is the shape of a single number, this core wraps it in
Block(shape).- szint
Size of the output blocks.
- n_overlapint
Number of elements shared between adjacent output blocks.
- class tiliqua.dsp.fft.OverlapAddBlocks(*args, src_loc_at=0, **kwargs)
Convert
Blockstream to continuous samples by ‘Overlap-Add’.This core is a building block for the
STFTProcessorfamily of components. It takes overlapping, delineated blocks of samples, adds up the overlapping elements and sends a stream of (non-delineated) continuous samples.Again, an example for (sz=8, n_overlap=4). From the input perspective, say we have:
payload.sample: 0 1 2 3 4 5 6 7 8 9 A B C D E F G H I J ... payload.first: 1 1 1
This core would send out the following stream:
Block 1: 0 1 2 3 4 5 6 7 Block 2: 8 9 A B C D E F Block 3: G H I J ... Out: 0 1 2 3 4 5 6 7 C D E J + + + + + + + + + + + + X X X X 8 9 A B G H I F ... (X == zero padding)From above, it should be clear that each output sample is formed by summing the ‘overlapping’ parts of each input block, and that there are less output samples than input samples as a result.
Note that connecting a
ComputeOverlappingBlocksstraight to anOverlapAddBlockswill not reconstruct the original signal, unless the samples are windowed (tapered) for perfect reconstruction. For an example, see theSTFTProcessorfamily of components.Design
This core uses two
SyncFIFOfor storage of sizesz, which are both continuously filled and drained as necessary to add up the correct overlapping parts.- Members:
i (
In(stream.Signature(Block(self.shape)))) – Incoming stream of blocks of overlapping real samples.o (
Out(stream.Signature(self.shape))) – Outgoing stream of continuous real samples.
- __init__(shape, sz, n_overlap)
- shapefixed.SQ
Shape of the fixed-point types used for inputs and outputs. This is the shape of a single number, this core wraps it in
Block(shape).- szint
Size of the input blocks.
- n_overlapint
Number of elements shared between adjacent input blocks.
STFTs
- class tiliqua.dsp.fft.STFTProcessor(*args, src_loc_at=0, **kwargs)
Short-Time Fourier Transform (‘Overlap-Add’) with shared FFT core.
This core performs both analysis and re-synthesis of time-domain samples by passing them through the frequency domain for spectral processing by user logic.
The general flow of operations:
i(samples in) ->ComputeOverlappingBlocks->Window->FFT->o_freqi_freq->FFT(inverse) ->Window->OverlapAddBlocks->o(samples out)
If
o_freqandi_freqfrequency-domain streams are directly connected together with no processing logic, this core will perfectly resynthesize the original signal.Window.Function.SQRT_HANNwindowing is used in both the analysis and synthesis stages.Design
Unlike
STFTProcessorPipelined, this implementation saves area by time-multiplexingFFTandWindowcores between analysis and synthesis operations, at the cost of reduced throughput. For audio purposes, this core is easily still fast enough for real-time processing.At a high level, the components are connected in a different sequence depending on which state the core is in. The core contains a single SyncFIFO used to buffer all the outputs of user (frequency domain) processing logic. This is required to avoid backpressure on the user logic while it is disconnected from the IFFT stage.
- Members:
i (
In(stream.Signature(shape))) – Continuous time-domain input streamo (
Out(stream.Signature(shape))) – Continuous time-domain output stream (reconstructed)o_freq (
Out(stream.Signature(Block(CQ(shape))))) – Frequency-domain output blocks for processing by user logici_freq (
In(stream.Signature(Block(CQ(shape))))) – Frequency-domain input blocks after processing by user logic
- __init__(shape, sz)
- shapefixed.SQ
Shape of the fixed-point samples used for inputs and outputs.
- szint
Size of the frequency-domain blocks used internally and exposed for frequency domain processing.
o_freqandi_freqare delineated by blocks with apayload.firststrobe everyszelements.
- class tiliqua.dsp.fft.STFTAnalyzer(*args, src_loc_at=0, **kwargs)
Short-Time Fourier Transform for spectral analysis.
This core allows frequency-domain analysis of time-domain samples by passing them through the following signal flow:
i(samples in) ->ComputeOverlappingBlocks->Window->FFT->o
For a more resource-efficient STFT that performs both analysis and synthesis, see
STFTProcessor.- Members:
i (
In(stream.Signature(shape))) – Continuous time-domain input streamo (
Out(stream.Signature(Block(CQ(shape))))) – Frequency-domain output blocks for processing by user logic
- __init__(shape, sz, window_function=Window.Function.HANN)
- shapefixed.SQ
Shape of the fixed-point samples used for inputs and outputs.
- szint
Size of the frequency-domain blocks used internally and exposed for frequency domain processing.
- windowWindow.Function
Window function applied to time-domain blocks the FFT.
- class tiliqua.dsp.fft.STFTSynthesizer(*args, src_loc_at=0, **kwargs)
Short-Time Fourier Transform for spectral synthesis.
This core allows frequency-domain synthesis of time-domain samples by passing them through the following signal flow:
i_freq->FFT(inverse) ->Window->OverlapAddBlocks->o(samples out)
For a more resource-efficient STFT that performs both analysis and synthesis, see
STFTProcessor.- Members:
i (
In(stream.Signature(Block(CQ(shape))))) – Frequency-domain input blocks used for synthesis.o (
Out(stream.Signature(shape))) – Continuous time-domain output stream
- __init__(shape, sz, window_function=Window.Function.HANN)
- shapefixed.SQ
Shape of the fixed-point samples used for inputs and outputs.
- szint
Size of the frequency-domain blocks used internally and exposed for frequency domain processing.
- windowWindow.Function
Window function applied to time-domain blocks after the IFFT, but before overlap-add.
- class tiliqua.dsp.fft.STFTProcessorPipelined(*args, src_loc_at=0, **kwargs)
Short-Time Fourier Transform (‘Overlap-Add’) with separate FFT/IFFT cores.
This core performs both analysis and re-synthesis of time-domain samples by passing them through the frequency domain for spectral processing by user logic.
This is a simpler-to-understand version of
STFTProcessoras it does not share theWindoworFFTcores, making it higher throughput, but also higher resource usage, because it can perform the FFT and IFFT at the same time.This core is mostly used for testing the
STFTAnalyzerandSTFTSynthesizercomponents in isolation, and doesn’t make much sense to use in real projects compared to just using theSTFTProcessorabove, which is much more resource efficient for audio applications. Unless you have super high sample rates.- Members:
i (
In(stream.Signature(shape))) – Continuous time-domain input streamo (
Out(stream.Signature(shape))) – Continuous time-domain output stream (reconstructed)o_freq (
Out(stream.Signature(Block(CQ(shape))))) – Frequency-domain output blocks for processing by user logici_freq (
In(stream.Signature(Block(CQ(shape))))) – Frequency-domain input blocks after processing by user logic
- __init__(shape, sz)
- shapefixed.SQ
Shape of the fixed-point samples used for inputs and outputs.
- szint
Size of the frequency-domain blocks used internally and exposed for frequency domain processing.