Rocket issues custom instructions; Gemmini handles tiled matrix operations through its controller.
On-chip operand SRAM. Matmul A/B operands must be staged here before the systolic array consumes them.
Matmul outputs land here as wider partial sums, then can be scaled, activated, normalized, or moved out.
16×16 weight-stationary mesh in this project config; this is where QK and PV compute runs.
Scratchpad feeds operands into the systolic array.
Accumulator receives QK/PV matmul results and supports scale / normalization.
With ACC_ROWS=2048, each WS buffer half has 1024 usable rows.
The denominator needs the whole row, so the standard softmax path cannot freely stream partial K-blocks.
| Tensor | Shape per head | Where it naturally lives | Why it matters |
|---|---|---|---|
| Q per head | SEQ × 64 | Scratchpad input | Left operand for QK^T |
| K per head | SEQ × 64 | Scratchpad input | Right operand for QK^T |
| V per head | SEQ × 64 | Scratchpad input | Right operand for PV |
| QK scores | SEQ × SEQ | Accumulator | Produced by systolic Q × K^T |
| Softmax weights | SEQ × SEQ | Scratchpad for PV | Consumed as PV's A operand |
| PV output | SEQ × 64 | Accumulator / DRAM | Final per-head output |
Standard tiled matmuls produce the per-head inputs.
Normal input/output traffic.
seq=128: 960K cycles
Scores are produced and normalized on chip.
A is temporary, but baseline writes it to DRAM.
First half of round trip
PV immediately consumes the same A matrix.
A returns from DRAM just after it was written.
Second half of round trip
Rest of the Transformer block.
Not the first target in this project.
seq=128: 497K cycles
Scores: S = Q × K^T
Weights: A = softmax(S)
Output: O = A × V
For BERT-base: 12 heads, so the attention matrix cost grows with 12 × SEQ².
Attention matrix A is an intermediate tensor:
| Seq | Attention Matrix | Wasted DRAM |
|---|---|---|
| 128 | 12 × 128² = 196K | 393 KB |
| 256 | 12 × 256² = 786K | 1.57 MB |
| 512 | 12 × 512² = 3.1M | 6.3 MB |
This traffic grows with O(seq²), exactly like the attention matrix itself.
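
The table's figures follow from simple arithmetic. A minimal sketch, assuming one byte per attention weight (int8) and counting the write plus the read-back as the wasted traffic; the function names are illustrative and not part of the project code.

```c
#include <stdint.h>

#define NUM_HEADS 12u  /* BERT-base, as stated above */

/* Number of attention weights across all heads for one sequence length. */
static uint64_t attn_elems(uint64_t seq_len) {
    return NUM_HEADS * seq_len * seq_len;
}

/* Assumption: 1 byte per weight; round trip = write to DRAM + read back. */
static uint64_t attn_round_trip_bytes(uint64_t seq_len) {
    return 2u * attn_elems(seq_len);
}
```

For seq=128 this gives 196,608 weights and 393,216 bytes of round-trip traffic, matching the first table row.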
The first attempt keeps the attention weights inside Gemmini:
QK and softmax still run as Gemmini steps; only the destination changes.
Hardware: route softmax mvout output directly into scratchpad.
Software: write with GEMMINI_LOCAL_STORE_FLAG; run PV with A=NULL.
Baseline (2 DRAM transfers):
Fused (0 DRAM transfers):
GEMMINI_LOCAL_STORE_FLAG

| Seq Len | Baseline Attention | Exp 1 Attention | Speedup | Full Pipeline | Status |
|---|---|---|---|---|---|
| 128 | 611,290 | 357,667 | 1.71× faster | 1.88M (vs 2.15M) | WIN |
| 256 | 1,037,836 | 6,680,878 | 6.4× slower | ~9.54M (vs 3.97M) | LOSE |
| 512 | 2,623,055 | CRASH @ 204M cycles — TL assertion | CRASH | ||
DRAM round-trip cost (786 KB DMA) > local store overhead (~50K cycles for 3,072 mvout).
Attention: −41% · Full: −12.4%
Both J-tile count (SOFTMAX constraint) and Q-block count (Q_BLOCK=16) grow linearly with seq_len. They multiply.
mvout: 3,072 → 12,288 → 49,152
O(seq) × O(seq) = O(seq²) mvout storm
6.68M
Baseline: 1.04M
6.4× slower in attention core
QK+PV compute is only ~1.47M.
The rest is dispatch overhead.
Crash @ 204M
TileLink assertion
Local-store path overstressed
The bypass is correct, but the control path can't handle 590K fine-grained writes.
| | seq=128 | seq=256 | seq=512 |
|---|---|---|---|
| mvout/head | 3,072 | 12,288 | 49,152 |
| mvout total (12 heads) | 36,864 | 147,456 | 589,824 |
| Result | 1.71× faster | 6.4× slower | Crash |
Same root cause: Q_BLOCK=16 + full-row softmax = O(seq²) mvout commands. No amount of tuning fixes the algorithmic scaling.
tiled_matmul(q_block, SEQ_LEN, HEAD_DIM,
Q, K, NULL,
(void*)(uintptr_t)GEMMINI_LOCAL_STORE_FLAG,
// C is a local-store flag, not a DRAM address
HIDDEN_DIM, HIDDEN_DIM, 0, SEQ_LEN,
MVIN_SCALE_IDENTITY, MVIN_SCALE_IDENTITY, MVIN_SCALE_IDENTITY,
SOFTMAX, ACC_SCALE_IDENTITY, 0,
false,
1, SEQ_LEN / DIM, QK_TILE_K, // tile_J = full seq
false, false, true,
false, 0, WS);
gemmini_fence();
// Softmax output now lands in scratchpad
// No attention buffer is written to DRAM
tiled_matmul(q_block, HEAD_DIM, SEQ_LEN,
NULL, V, NULL, O, // A=NULL: use existing SP weights
SEQ_LEN, HIDDEN_DIM, 0, HIDDEN_DIM,
MVIN_SCALE_IDENTITY, MVIN_SCALE_IDENTITY, MVIN_SCALE_IDENTITY,
NO_ACTIVATION, ACC_SCALE_IDENTITY, 0,
false,
1, PV_TILE_J, PV_TILE_K,
false, false, false,
false, 0, WS);
gemmini_fence();
// PV skips mvin of A; only final O is written out
| | Baseline | Fused (Exp 1.2) |
|---|---|---|
| QK^T output dest | C = attn_buf (DRAM addr) | C = (void*)GEMMINI_LOCAL_STORE_FLAG |
| Softmax mvout goes to | DRAM (via DMA Writer) | Scratchpad SRAM (via local store wire) |
| PV A operand | DRAM (via mvin: attn_buf→SP) | Scratchpad (A=NULL, already there) |
| Attn matrix in DRAM | 12 × 128² = 196 KB | 0 — never leaves chip |
Why? Softmax is row-wise:
The denominator needs all elements in the row. Partial J tiles do not have enough information.
tile_J = SEQ_LEN / DIM for built-in softmax.
The local-store bypass changes where softmax output goes, but not how the built-in softmax path is driven.
| Good | Bad |
|---|---|
| Avoids DRAM round-trip | Still many store commands |
| PV can use A=NULL | Still full-row softmax |
Context: After the QK^T matmul, the accumulator holds scores in int32. Built-in SOFTMAX runs inside the accumulator when data is read out via mvout, processing one row at a time, horizontally across columns.
One Row in the Accumulator (SEQ_LEN columns, int32)
N = SEQ_LEN / DIM = SEQ_LEN / 16 | Each J-tile = 16 columns processed in parallel by the accumulator lanes
| Pass | Operation | Detail |
|---|---|---|
| 1 | Find max | Scan all J-tiles left→right, track max value |
| 2 | Exp + Sum | Re-scan: exp(x − max) per element, accumulate Σ |
| 3 | Normalize | Re-scan: exp / Σ → scale → write out as int8 weights |
Three passes because the accumulator needs the full-row max before exp, and the full-row sum before normalize.
The denominator Σⱼ couples every column in the row. You cannot normalize J-tile 0 until you have scanned ALL J-tiles to compute the global sum.
Result: each mvout must process the entire row — 3 passes × N J-tiles — before any weight is written out.
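
The three passes can be sketched in plain C. This is an illustration of the data dependency, not the accumulator's actual logic: it assumes Q16 fixed point, uses 2^(x − max) as a stand-in for exp(x − max) so that no math library is needed, and assumes scores are bounded so `max - row[j]` does not overflow. All names are hypothetical.

```c
#include <stdint.h>
#include <stddef.h>

#define FRAC_BITS 16  /* assumption: Q16 fixed point for this sketch */

/* Stand-in for exp(x - max): returns 2^(-d) in Q16 for d = max - x >= 0. */
static uint32_t pow2_neg_q16(int32_t d) {
    return (d > FRAC_BITS) ? 0u : (((uint32_t)1 << FRAC_BITS) >> d);
}

/* Three full-row passes: max, sum, normalize to int8 weights in [0, 127]. */
static void softmax_row_3pass(const int32_t *row, size_t n, int8_t *out) {
    /* Pass 1: the max must cover the whole row before any exp is taken. */
    int32_t max = row[0];
    for (size_t j = 1; j < n; j++)
        if (row[j] > max) max = row[j];

    /* Pass 2: the sum depends on the final max. */
    uint64_t sum = 0;
    for (size_t j = 0; j < n; j++)
        sum += pow2_neg_q16(max - row[j]);

    /* Pass 3: normalization depends on the final sum (sum >= 1.0 in Q16). */
    for (size_t j = 0; j < n; j++)
        out[j] = (int8_t)(((uint64_t)pow2_neg_q16(max - row[j]) * 127u) / sum);
}
```

No output element can be written until passes 1 and 2 have covered every column, which is the coupling described above.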
Horizontally: softmax is a reduction (max, sum) across all columns, then element-wise normalize. The accumulator must traverse the entire row. Vertically: rows are independent — which is why Q_BLOCK lets us batch rows together. But horizontal dependency is the bottleneck that online softmax later fixes.
Exp 1 computes only 16 query rows at a time:
Q_BLOCK = DIM = 16, so each scores tile = 16 rows × SEQ_LEN columns
After that tile finishes, the built-in SOFTMAX/local-store path still has to emit the normalized attention weights for those 16 rows across the entire J dimension.
Important: Q_BLOCK=16 is not the thing that creates mvout. The softmax output path creates the mvout-style store. Q_BLOCK=16 makes us repeat that store loop many times.
For each Q-block:
    QK^T produces 16 × SEQ_LEN scores
    For each row in the Q-block:          // 16 rows
        For each softmax/store pass:      // 3 passes
            For each J-block:             // SEQ_LEN / 16
                gemmini_extended_mvout(...)

mvout per Q-block = 16 × 3 × (SEQ_LEN / 16)
                  = 48 × (SEQ_LEN / 16)
| Seq | Q-blocks/head | mvout/Q-block | mvout/head |
|---|---|---|---|
| 128 | 8 | 384 | 3,072 |
| 256 | 16 | 768 | 12,288 |
| 512 | 32 | 1,536 | 49,152 |
The repeated full-row store is the problem. The bypass removes DRAM traffic, but the CPU still drives thousands of small RoCC store operations.
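
The counts in the table can be reproduced with a few lines of C. A minimal sketch using the constants stated in this deck (DIM=16, 3 passes, 12 heads); the helper names are illustrative.

```c
#include <stdint.h>

#define DIM            16u  /* systolic array width and Q_BLOCK in Exp 1 */
#define SOFTMAX_PASSES 3u   /* max, exp + sum, normalize */
#define NUM_HEADS      12u  /* BERT-base */

/* 16 rows x 3 passes x (SEQ_LEN / 16) J-blocks. */
static uint32_t mvout_per_q_block(uint32_t seq_len) {
    return DIM * SOFTMAX_PASSES * (seq_len / DIM);
}

/* One head needs SEQ_LEN / 16 Q-blocks, so the two factors multiply. */
static uint32_t mvout_per_head(uint32_t seq_len) {
    return (seq_len / DIM) * mvout_per_q_block(seq_len);
}

static uint32_t mvout_total(uint32_t seq_len) {
    return NUM_HEADS * mvout_per_head(seq_len);
}
```

Doubling seq_len doubles both factors in `mvout_per_head`, so the command count quadruples: 3,072 → 12,288 → 49,152 per head.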
The denominator Σⱼ couples all K/J positions in the row.
Result: built-in softmax wants full-row processing.
Keep running state per row:
Each J-tile updates the state, so columns stream tile-by-tile (16 at a time).
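
One standard way to write that running state, with m as the running max and s as the running sum as used elsewhere in this deck. The tile index t and the score symbol x_j are notation introduced here for illustration.

```latex
\begin{aligned}
m^{(t)} &= \max\Bigl(m^{(t-1)},\ \max_{j \in \text{tile } t} x_j\Bigr), & m^{(0)} &= -\infty \\
s^{(t)} &= s^{(t-1)}\, e^{\,m^{(t-1)} - m^{(t)}} + \sum_{j \in \text{tile } t} e^{\,x_j - m^{(t)}}, & s^{(0)} &= 0 \\
a_j &= \frac{e^{\,x_j - m^{(T)}}}{s^{(T)}} & &\text{after the last tile } T
\end{aligned}
```

When the max does not change, the rescale factor is 1 and the update reduces to adding the new terms.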
Goal: stop materializing and re-driving the full attention row through many small commands.
This proved the idea: online softmax can be implemented in Gemmini hardware. But it also exposed why modifying the Accumulator is not a clean final design.
Key point: the Accumulator is driven by existing mvout/config command fields. It is not a standalone attention engine.
gemmini_extended_mvout(...)
gemmini_config_norm(..., stat_id)
| Property | Value |
|---|---|
| Input | Accumulator row |
| State slots | NORM_STAT_IDS=2 |
| Compute | max → exp → sum → scale |
| Width | DIM=16 lanes |
Original: 8 commands = 3 bits
Online: 13 commands = 4 bits
stat_id=0/1: existing Accumulator state slots
Experiment 2 tried to reuse one slot for online max / sum
mvout to DRAM
Exp 1 local-store branch to scratchpad
So the Accumulator approach touches three shared mechanisms at once: command decode, state selection, and the existing output dataflow.
Original NormCmd: 8 commands = 3 bits.
Add online softmax: +5 commands → 13 commands = 4 bits.
NormCmd packed into mvout addresses makes NORM_CMD_SHIFT fragile. If SW and RTL disagree, existing SOFTMAX / LayerNorm decoding can break.
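
A small sketch of why the field width matters. The shift value and helper names below are hypothetical and chosen only for illustration; they are not Gemmini's actual NORM_CMD_SHIFT or address layout.

```c
#include <stdint.h>

/* Hypothetical bit position of the command field inside an address word. */
#define SKETCH_NORM_CMD_SHIFT 26u

/* Smallest field width that can encode n_cmds distinct commands. */
static unsigned bits_for(unsigned n_cmds) {
    unsigned b = 0;
    while ((1u << b) < n_cmds) b++;
    return b;
}

/* Software side: place cmd in a field of the given width. */
static uint32_t pack_cmd(uint32_t addr_bits, uint32_t cmd, unsigned width) {
    uint32_t mask = (1u << width) - 1u;
    return addr_bits | ((cmd & mask) << SKETCH_NORM_CMD_SHIFT);
}

/* Hardware side: decode the field, possibly with a different width. */
static uint32_t unpack_cmd(uint32_t word, unsigned width) {
    return (word >> SKETCH_NORM_CMD_SHIFT) & ((1u << width) - 1u);
}
```

If software packs command 12 into a 4-bit field but the decoder still reads 3 bits, it sees command 4, which is the kind of silent mismatch described above.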
Accumulator config: NORM_STAT_IDS=2.
Online softmax plan: reserve stat_id=1 for running max / sum; use stat_id=0 for default PV / mvin.
One wrong config can overwrite the online state mid-attention.
The Accumulator works at DIM=16 lanes. But each attention head has HEAD_DIM=64, so O-rescale becomes 4 commands per row.
To make this work, we keep touching shared Accumulator / mvout / scratchpad behavior. That can change existing SOFTMAX, LayerNorm, residual, and scaling behavior.
Given that: we decided to design a dedicated OnlineAttention hardware module instead of overloading the Accumulator.
QK^T output is [Q_rows × SEQ_LEN]. Hardware grid is DIM=16, so J is split into SEQ_LEN/DIM tiles of 16 columns:
Built-in SOFTMAX needs all J at once for the denominator. Each J-tile → 1 mvout command. With Q_BLOCK=16:
Splitting K means incomplete rows — the softmax denominator is partial. Built-in SOFTMAX can't handle this at all.
Online softmax can handle K-splitting via running max/sum + rescale, but for seq ≤ 512 we never need it:
K-tiling matters for seq ≥ 1024 — future work.
Exp 1 failed because J-tiling × small Q_BLOCK = O(seq²) RoCC dispatch. OnlineAttention's BATCH_WEIGHTS streams all J-tiles internally with running max/sum — one RoCC call per Q-block regardless of seq_len. K_BLOCK = SEQ_LEN eliminates K-tiling entirely (single pass).
Even with K_BLOCK=SEQ_LEN, hardware processes only DIM=16 columns at a time. Online softmax streams J-tiles with running stats:
New J-tile arrives (16 columns at a time):
Check if max increased. If yes → rescale old sum before adding new terms.
| Step | Scores (4 cols) | m | s |
|---|---|---|---|
| Init | — | −∞ | 0 |
| J-tile 1 | [3, 1, 4, 1] | 4 | Σexp(score−4) |
| J-tile 2 | [7, 2, 1, 3] | 7 | s_old×exp(4−7) + Σexp(score−7) |
| max ↑ | old sum rescaled by exp(old_max − new_max) | ||
| J-tile 3 | [2, 5, 1, 2] | 7 | s_old + Σexp(score−7) |
| max = | no rescale — just add new exp terms | ||
After all J-tiles: weight = exp(score − m) / s. BATCH_WEIGHTS does this for all rows + all J-tiles in one RoCC command.
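
The walk-through above can be replayed in C. This is a sketch under stated assumptions, not the OnlineAttention RTL: it uses Q16 fixed point and 2^(x − m) as a stand-in for exp(x − m) so the arithmetic is exact and needs no math library, a `started` flag replaces m = −∞, and scores are assumed bounded. All names are hypothetical.

```c
#include <stdint.h>
#include <stddef.h>

#define FRAC_BITS 16  /* assumption: Q16 fixed point for this sketch */

/* Stand-in for exp(x - m): returns 2^(-d) in Q16 for d = m - x >= 0. */
static uint32_t pow2_neg_q16(int32_t d) {
    return (d > FRAC_BITS) ? 0u : (((uint32_t)1 << FRAC_BITS) >> d);
}

typedef struct {
    int32_t  m;        /* running max */
    uint32_t s;        /* running sum in Q16, relative to m */
    int      started;  /* 0 until the first tile arrives (plays the role of m = -inf) */
} row_state_t;

static void row_state_init(row_state_t *st) {
    st->m = 0;
    st->s = 0;
    st->started = 0;
}

/* Fold one J-tile of n scores into the running state. */
static void online_update(row_state_t *st, const int32_t *tile, size_t n) {
    int32_t tile_max = tile[0];
    for (size_t j = 1; j < n; j++)
        if (tile[j] > tile_max) tile_max = tile[j];

    if (!st->started) {
        st->m = tile_max;
        st->started = 1;
    } else if (tile_max > st->m) {
        /* Max increased: rescale the old sum by 2^(old_max - new_max). */
        int32_t d = tile_max - st->m;
        st->s = (d > 31) ? 0u : (st->s >> d);
        st->m = tile_max;
    }
    for (size_t j = 0; j < n; j++)
        st->s += pow2_neg_q16(st->m - tile[j]);
}

/* After the last tile: int8 weight in [0, 127] for one score. */
static int8_t online_weight(const row_state_t *st, int32_t score) {
    return (int8_t)(((uint64_t)pow2_neg_q16(st->m - score) * 127u) / st->s);
}
```

Feeding the three tiles from the table yields m = 4, then 7, then 7, and the final sum equals what a single full-row pass with max 7 would produce.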
| Design choice | Why |
|---|---|
| Scores are int32 | Systolic array outputs int32. QK^T scores land directly, with no FP conversion. |
| Weights are int8 | Scratchpad stores int8. PV reads A=inputType=int8. Output must fit in [0, 127]. |
| No FP math library | Baremetal C on RISC-V. No expf(). Hardware computes exp() in fixed-point combinational logic. |
| Scale before iexp | ONLINE_BERT_SCALE=0.05 compresses scores so fixed-point exp doesn't saturate. |
| Saturation = zero weight | z ≥ 32 → iexp(z) = 0. Very negative → ~0. Correct softmax behavior. |
Papers assume FP32. In hardware, number formats come from the datapath.
Every op — multiply, max, exp, sum, normalize — is fixed-point int32.
Computes 2^(ax² + bx + c), a second-order polynomial approximation of exp():
x (int32) → ax² + bx + c → 2^result (int32)
| Config register | Purpose | Set via |
|---|---|---|
| iexp_qln2_reg | ln(2) in fixed-point | gemmini_oa_config(0, ...) |
| iexp_qln2_inv_reg | 1/ln(2) for base conversion | gemmini_oa_config(1, ...) |
| iexp_qb_reg | Coefficient b | gemmini_oa_config(2, ...) |
| iexp_qc_reg | Coefficient c | gemmini_oa_config(3, ...) |
The only compute in OnlineAttention. Everything else is state management and data movement.
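
To make the role of the four registers concrete, here is a software sketch in the style of the I-BERT integer exponential, which the register names suggest: split off a power of two, then approximate the remainder with a second-order polynomial. The input scale of 0.001, the constants, and every name below are assumptions for illustration; they are not this project's register values and the code does not claim to match AccumulatorScale.iexp() bit for bit.

```c
#include <stdint.h>

/* Assumed input scale S = 0.001: q represents x = q * S, with x <= 0. */
typedef struct {
    int32_t qln2;      /* ln(2) / S               = 693       */
    int32_t qln2_inv;  /* 65536 / qln2            = 94        */
    int32_t qb;        /* 1.353 / S               = 1353      */
    int32_t qc;        /* 0.344 / (0.3585 * S^2)  = 959553    */
} iexp_cfg_t;

static const iexp_cfg_t IEXP_CFG_SKETCH = { 693, 94, 1353, 959553 };

/* Returns r such that r * 0.3585 * S^2 approximates exp(q * S). */
static int32_t iexp_sketch(int32_t q, const iexp_cfg_t *c) {
    if (q > 0) q = 0;  /* softmax inputs are score - max, so never positive */

    /* z = floor(-x / ln 2): how many halvings to apply at the end. */
    int64_t z = (-(int64_t)q * c->qln2_inv) >> 16;
    if (z >= 32) return 0;  /* saturate: weight is effectively zero */

    /* Remainder p = x + z * ln 2, roughly in (-ln 2, 0]. */
    int64_t qp = (int64_t)q + z * c->qln2;

    /* exp(p) ~ 0.3585 * (p + 1.353)^2 + 0.344, with 0.3585 folded into the output scale. */
    int64_t t = qp + c->qb;
    int64_t poly = t * t + c->qc;

    return (int32_t)(poly >> z);  /* multiply by 2^(-z) */
}
```

With these assumed constants, `iexp_sketch(0, ...)` returns 2,790,162, which is about 1.0003 after multiplying by the output scale 0.3585 × 10⁻⁶, and any input with z ≥ 32 returns 0, consistent with the saturation rule in the design table.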
OnlineAttention.scala — ~700 lines of Chisel. One module, one RoCC function (funct=23), one 7-state FSM. Lives between accumulator and scratchpad.
| Step | Operation |
|---|---|
| ① | Read 16 int32 scores from accumulator |
| ② | 16-way max reduction → chunk_max |
| ③ | Update running max / sum (rescale if max increased) |
| ④ | Re-read scores, 16 × iexp(score − max) in parallel |
| ⑤ | Clamp to [0,127], pack 16 × int8 |
Streams J-tiles (16 columns at a time). Per row: tracks running max m and sum s. When max increases, rescales old sum by exp(old_max − new_max). No full-row wait — columns arrive tile-by-tile.
Re-reads same scores. Subtracts final max. 16 parallel iexp units. Clamps to [0,127], packs 16×int8, writes to scratchpad. One cycle per 16-element chunk.
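
The clamp-and-pack step is simple enough to show directly. A minimal sketch, assuming the normalized values arrive as int32 and that one chunk is DIM=16 elements; the function names are illustrative.

```c
#include <stdint.h>

#define DIM 16  /* elements per chunk, matching the array width */

/* Clamp one normalized value into the int8 weight range [0, 127]. */
static int8_t clamp_weight(int32_t v) {
    if (v < 0)   return 0;
    if (v > 127) return 127;
    return (int8_t)v;
}

/* Pack one 16-element chunk of int32 values into 16 int8 weights. */
static void pack_chunk(const int32_t in[DIM], int8_t out[DIM]) {
    for (int i = 0; i < DIM; i++)
        out[i] = clamp_weight(in[i]);
}
```

In hardware the 16 lanes are handled in parallel; the loop here only models the per-lane behavior.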
gemmini_fence().
State: 3 KB of registers (256 rows × 3 × int32)
iexp: 16 parallel combinational units, reuses AccumulatorScale.iexp()
Concrete example: one Q row, 48 K positions streamed as three 16-element chunks. Two numbers of state per row.
| Chunk | 16 scores (conceptual) | chunk_max | m update | s update |
|---|---|---|---|---|
| 1 | [3, 1, 4, 1, ..., 9, 7, 9, 3] | 9 | −∞ → 9 | s = Σ exp(score − 9) (first chunk, compute normally) |
| 2 | [2, 11, 1, 8, ..., 9, 2, 6, 5] | 11 | 9 → 11 | s = s_old × exp(9−11) + Σ exp(score − 11) (max increased: rescale old sum first!) |
| 3 | [3, 5, 8, 10, 7, 9, 3, ..., 6, 4, 3] | 10 | 11 (no change) | s += Σ exp(score − 11) (max unchanged: just add new terms) |
Both do the same math: QK^T + softmax + PV. The difference is where data lives.
attn_buf[heads][SEQ][SEQ] — the O(N²) DRAM buffer. No SOFTMAX activation flag. No separate weight MVIN/MVOUT.
gemmini_oa_fused_batch(): one RoCC command. A=NULL in PV tells hardware to read weights from scratchpad. Scores live in accumulator between QK^T and softmax.
Attention matrix never touches DRAM. Scores stay in accumulator. Weights go accumulator → scratchpad directly.
| Seq | Q_BLOCK | QKV | Attention | Wo | Norm | Total |
|---|---|---|---|---|---|---|
| 128 | 128 | 913K | 245K | 309K | 164K | 1,631K |
| 256 | 256 | 1,804K | 486K | 609K | 369K | 3,269K |
| 512 | 128 | 3,582K | 1,882K | 1,204K | 759K | 7,427K |
Attention grows from 15% to 25% of total — O(N²) in action.
Six key phases. Each attacks a different bottleneck. All numbers are at seq=128 for a fair comparison.
| Phase | What Changed | Total | vs Baseline |
|---|---|---|---|
| Baseline | Official Gemmini: QK^T + SOFTMAX + PV, attn matrix through DRAM | 2,149K | 1.00× |
| Exp 1.2 | Local-store fusion. Q_BLOCK=16. No online softmax. Works at 128, fails at scale | 1,882K | 1.14× |
| 4G→4H | Per-row ops → batch ops. 242→9 RoCC/K-block (~27× reduction). First integrated pipeline | 30,244K→15,353K | 0.07→0.14× |
| 4I | K_BLOCK=128 (single-pass). Eliminate K-block loop + RESCALE. Online softmax proves correct | 6,995K | 0.31× |
| 4L→4M | Q_BLOCK=128. Single Q-block per head. Remove CPU from critical path. DRAM internal MVIN | 3,103K→1,793K | 0.69→1.20× |
| Exp 5 | Optimal Q_BLOCK (sweep-verified per seq). OP_FUSED_BATCH. Clean measurement (zero printf) | 1,631K | 1.32× |
3 RoCC commands per head. Zero CPU extraction. Zero DRAM for attn matrix.
QK^T produces a [seq_len × seq_len] score matrix. Q_BLOCK = how many Q rows we process per batch.
Each Q-block does: QK^T → online softmax → PV. More Q-blocks = more RoCC commands = more dispatch overhead.
Why only sweep Q_BLOCK? Online softmax handles K streaming — K_BLOCK = SEQ_LEN always. Only Q_BLOCK is a free variable.
| Q_BLOCK | Q-blocks @ seq=256 | RoCC cmds/head | ACC tiles needed |
|---|---|---|---|
| 16 | 16 | 48 | 4 |
| 64 | 4 | 12 | 16 |
| 128 | 2 | 6 | 32 |
| 256 | 1 | 3 | 64 (full half) |
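
The first three columns of this table follow from one division. A minimal sketch, assuming 3 RoCC commands per Q-block as the table implies (QK^T, online softmax, PV); the helper names are illustrative, and the ACC-tile column is not modeled.

```c
#include <stdint.h>

#define ROCC_CMDS_PER_Q_BLOCK 3u  /* assumption: QK^T, online softmax, PV */

/* Number of Q-blocks needed to cover seq_len rows (rounds up). */
static uint32_t q_blocks(uint32_t seq_len, uint32_t q_block) {
    return (seq_len + q_block - 1u) / q_block;
}

static uint32_t rocc_cmds_per_head(uint32_t seq_len, uint32_t q_block) {
    return ROCC_CMDS_PER_Q_BLOCK * q_blocks(seq_len, q_block);
}
```

At seq=256, going from Q_BLOCK=16 to Q_BLOCK=256 cuts the per-head command count from 48 to 3.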
Scores fit easily in ACC (4 tiles). But many RoCC commands — each with dispatch overhead. Exp 1: Q_BLOCK=16 → 384 matmuls at seq=256 → 6.4× slower.
Few RoCC commands (3/head). But ACC overflow: scores exceed one ACC half → DMA spill to DRAM. At seq=512: 8× overflow → tipping point.
Experiment 5 sweeps Q_BLOCK at each seq_len. Clean measurement — zero printf, config outside timing. K_BLOCK = SEQ_LEN always.
seq=128 (K_BLOCK=128)

| Q_BLOCK | Q-blocks | Attention | vs Best |
|---|---|---|---|
| 16 | 8 | 432K | +79% |
| 64 | 2 | 262K | +7.5% |
| 128 | 1 | 244K | — |
seq=256 (K_BLOCK=256)

| Q_BLOCK | Q-blocks | Attention | vs Best |
|---|---|---|---|
| 16 | 16 | 1,338K | +175% |
| 128 | 2 | 512K | +5.4% |
| 256 ★ | 1 | 486K | — |
seq=512 (K_BLOCK=512)

| Q_BLOCK | Q-blocks | Attention | vs Best |
|---|---|---|---|
| 16 | 32 | 3,492K | +92% |
| 128 | 4 | 1,825K | — |
| 256 | 2 | 1,883K | +3.2% |
onlineAttentionMaxRows=256 enables single Q-block without sub-block splitting.

Experiment 5: clean measurement (zero printf, config outside timing, OP_FUSED_BATCH). Optimal Q_BLOCK per seq_len.
| Seq | Experiment | QKV | Attention | Wo | Norm | Total | vs Baseline |
|---|---|---|---|---|---|---|---|
| 128 | Baseline | 960K | 611K | 309K | 188K | 2,149K | 1.00× |
| 128 | Exp 1 (local store) | 962K | 358K | 309K | 176K | 1,882K | 1.14× |
| 128 | Exp 5 | 913K | 245K | 309K | 164K | 1,631K | 1.32× |
| 256 | Baseline | 1,851K | 1,038K | 610K | 395K | 3,973K | 1.00× |
| 256 | Exp 1 (local store) | ~1,851K | 6,681K | ~610K | ~395K | ~9,536K | 0.42× |
| 256 | Exp 5 | 1,804K | 486K | 609K | 369K | 3,269K | 1.22× |
| 512 | Baseline | 3,629K | 2,623K | 1,204K | 801K | 8,336K | 1.00× |
| 512 | Exp 1 (local store) | CRASH @ 204M cycles (TL assertion) | | | | | |
| 512 | Exp 5 | 3,582K | 1,882K | 1,204K | 759K | 7,427K | 1.12× |
| | 128→256 | 256→512 |
|---|---|---|
| Baseline | 1.85× | 2.10× |
| Exp 1 | 5.07× | CRASH |
| Exp 5 | 2.00× | 2.27× |
| Metric | Baseline | Exp 5 | Improvement |
|---|---|---|---|
| seq=128 Total | 2,149K | 1,631K | −24.1% |
| seq=256 Total | 3,973K | 3,269K | −17.7% |
| seq=512 Total | 8,336K | 7,427K | −10.9% |
| Attn matrix DRAM | O(seq²) per head | Zero | Eliminated |
| Scalability | O(seq²) DRAM | O(seq) streaming | Arbitrary seq |
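
The improvement and speedup figures are derived from the cycle totals as follows. A minimal sketch; the function names are illustrative, and inputs are the totals in thousands of cycles from the tables above.

```c
/* Speedup relative to baseline, e.g. 1.32 for 2,149K -> 1,631K. */
static double speedup(double baseline_cycles, double new_cycles) {
    return baseline_cycles / new_cycles;
}

/* Percentage reduction in cycles, e.g. about 24.1 for 2,149K -> 1,631K. */
static double reduction_pct(double baseline_cycles, double new_cycles) {
    return 100.0 * (baseline_cycles - new_cycles) / baseline_cycles;
}
```

The gain shrinks from 24.1% at seq=128 to 10.9% at seq=512, in line with the per-sequence totals reported above.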
| Item | Detail |
|---|---|
| OnlineAttention.scala | ~500 lines Chisel |
| Controller + Scratchpad changes | ~85 lines |
| State registers | 256 × 3 × int32 = 3 KB |
| iexp units | 16 parallel combinational (reused from AccumulatorScale) |
| RoCC functs | 23 (cmd) + 24 (reserved for macro-sequencer) |
Chain QK^T → softmax → PV into one hardware-triggered sequence
Fuse attention output directly into FFN input
Generalize the pattern to any producer → consumer pair
GEMMINI_LOCAL_CHAIN flag on output tensor
Map Flash Attention tiling onto Gemmini's systolic array