# Architecture: OnlineAttention on Gemmini

## System context

Gemmini attaches to a Rocket core via **RoCC**. Operand tiles live in **scratchpad** (int8); matmul partial sums live in the **accumulator** (int32). Row-wise softmax historically runs on the `mvout` path through the **Normalizer**.

For attention, the pain point is the **accumulator → DRAM → scratchpad** round trip between softmax and *PV*, even though the softmax weights are consumed immediately by the *PV* matmul.

![Gemmini system](assets/gemmini-system.png)

Config used throughout: `IBertGemminiRocketConfig` — DIM=16, ACC_ROWS=2048, WS dataflow, `onlineAttentionMaxRows=256`.

## OnlineAttention module

**File:** `hardware/gemmini/src/main/scala/gemmini/OnlineAttention.scala`  
**RoCC:** `funct=23` (`ONLINE_ATTN_CMD`); `funct=24` is reserved for a future macro-sequencer.

### Per-row state

| Register | Role |
|----------|------|
| `max_state` | Running row maximum (online softmax) |
| `sum_state` | Running exp sum, rescaled on max update |
| `rescale_mul` | Fixed-point rescale factor for partial outputs |
| `valid` | Row has seen at least one K-block |

State is kept for up to `onlineAttentionMaxRows` rows (256 in the final config).

### Key operations

| Op | Purpose |
|----|---------|
| `BATCH_UPDATE` / `BATCH_WEIGHTS` | Scan acc score tiles; update state; write int8 weights to SP |
| **`OP_FUSED_BATCH`** | Single RoCC cmd: UPDATE → WEIGHTS per row (Exp 5) |
| `RESET` | Clear state between heads |
| `LOAD_STATE` / `SAVE_STATE` | Persist state across Q-blocks |
| `ACCUMULATE` | Merge partial PV outputs with rescale |
| `CONFIG` | iexp scale constants |

16 parallel `AccumulatorScale.iexp()` units process DIM-wide score chunks.
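The `ACCUMULATE` merge in the table above can be sketched in fixed point as follows. The Q16 format for `rescale_mul`, the rounding mode, and the function name `oa_accumulate_row` are assumptions for illustration; the actual bit widths in `OnlineAttention.scala` may differ.

```c
#include <stddef.h>
#include <stdint.h>

#define OA_RESCALE_FRAC_BITS 16 /* assumed Q16 format; real width may differ */

/* Merge one partial PV row into the running output row:
 *   out[i] = round(out[i] * rescale_mul) + partial[i]
 * where rescale_mul_q16 is a Q16 fraction (65536 represents 1.0).
 * Assumes arithmetic right shift of negative values, as on GCC and Clang. */
static void oa_accumulate_row(int32_t *out, const int32_t *partial,
                              size_t n, uint32_t rescale_mul_q16) {
    const int64_t half = (int64_t)1 << (OA_RESCALE_FRAC_BITS - 1);
    for (size_t i = 0; i < n; i++) {
        /* Widen to 64 bits so the product cannot overflow. */
        int64_t scaled = (int64_t)out[i] * (int64_t)rescale_mul_q16 + half;
        out[i] = (int32_t)(scaled >> OA_RESCALE_FRAC_BITS) + partial[i];
    }
}
```

With a factor of 1.0 (the row maximum did not change) the merge reduces to a plain add; a smaller factor shrinks earlier partial outputs so they are expressed relative to the new maximum.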

### Integration (`Controller.scala`)

- RoCC dispatch: `funct=23` → `online_attn.io.cmd`
- **Accumulator port sharing:** Arbiter on read_req; busy-based mux on read_resp; Arbiter on write
- **Scratchpad write sharing:** OnlineAttention has priority on SP write ports when active
- **Busy OR-tree** with execute controller

### Scratchpad raw bypass (`Scratchpad.scala`)

Standalone accumulator reads (non-DMA) bypass `write_norm_q` / `acc_scale` pipeline. Required because `gemmini_fence()` flushes the norm queue — OnlineAttention reads int32 scores via `bits.data`, not `bits.full_data` (int8 lanes).

### DMACommandTracker tolerance

After large QKV matmuls, stale DMA responses can arrive for freed tracker entries. Tolerant mode silently ignores them instead of asserting — required for stable attention + Wo sequences.
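The difference between strict and tolerant handling can be modeled in a few lines of C. This is a behavioral sketch only: the entry layout, the `tolerant` flag, and the `stale_dropped` counter are hypothetical and do not mirror the Chisel implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TRACKER_ENTRIES 4 /* illustrative size */

typedef struct {
    bool     valid;      /* entry is allocated to an in-flight command */
    uint32_t bytes_left; /* bytes still expected for this command */
} tracker_entry_t;

typedef struct {
    tracker_entry_t entries[TRACKER_ENTRIES];
    bool     tolerant;      /* ignore responses that hit freed entries */
    uint32_t stale_dropped; /* hypothetical debug counter */
} dma_tracker_t;

/* Handle one DMA response. Returns true if it was applied to a live entry. */
static bool tracker_on_response(dma_tracker_t *t, unsigned id, uint32_t bytes) {
    assert(id < TRACKER_ENTRIES);
    tracker_entry_t *e = &t->entries[id];

    if (!e->valid) {
        /* Strict mode treats a stale response as a bug; tolerant mode drops it. */
        assert(t->tolerant && "response for a freed tracker entry");
        t->stale_dropped++;
        return false;
    }

    e->bytes_left = (bytes >= e->bytes_left) ? 0 : e->bytes_left - bytes;
    if (e->bytes_left == 0)
        e->valid = false; /* command complete, free the entry */
    return true;
}
```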

## Experiment 5 software pipeline (per head)

```
1. QK^T:  gemmini_loop_ws → accumulator (int32 scores)
2. OP_FUSED_BATCH: online max/sum + iexp → int8 weights in scratchpad
3. PV:    gemmini_loop_ws, A=NULL, V from DRAM internal MVIN
```

Three RoCC commands per head in the attention core (plus matmul loops). Zero CPU score extraction in the timed path.

## Software API (`gemmini.h`)

```c
gemmini_oa_reset();
gemmini_oa_fused_batch(num_rows, total_cols, sp_addr, scores_acc_addr);
gemmini_oa_config(reg_sel, value);
// Also: batch_update, batch_weights, load_state, save_state, accumulate, rescale
```

RoCC encoding: `k_ONLINE_ATTN` (funct 23).
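For orientation, the sketch below packs a RoCC instruction word with `funct7 = 23` using the standard RoCC field layout (funct7, rs2, rs1, xd, xs1, xs2, rd, opcode). The `custom3` opcode value and the helper name `rocc_encode` are assumptions; in practice the macros in `gemmini.h` emit the instruction and choose the registers.

```c
#include <stdint.h>

#define RISCV_OPCODE_CUSTOM3 0x7Bu /* assumed: the custom opcode Gemmini is built with */
#define ONLINE_ATTN_FUNCT    23u   /* k_ONLINE_ATTN */

/* Pack the fields of a RoCC instruction into a 32-bit word:
 *   [31:25] funct7  [24:20] rs2  [19:15] rs1
 *   [14] xd  [13] xs1  [12] xs2  [11:7] rd  [6:0] opcode */
static uint32_t rocc_encode(uint32_t funct7, uint32_t rs2, uint32_t rs1,
                            uint32_t xd, uint32_t xs1, uint32_t xs2,
                            uint32_t rd, uint32_t opcode) {
    return ((funct7 & 0x7Fu) << 25) | ((rs2 & 0x1Fu) << 20) |
           ((rs1 & 0x1Fu) << 15) | ((xd & 1u) << 14) |
           ((xs1 & 1u) << 13) | ((xs2 & 1u) << 12) |
           ((rd & 0x1Fu) << 7) | (opcode & 0x7Fu);
}
```

A command that passes two source registers and returns nothing would set `xs1 = xs2 = 1` and `xd = 0`; the accelerator then dispatches on the top seven bits, which is where `funct=23` is matched.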

## What we did *not* use

**Experiment 2/3** extended the shared Normalizer with `ONLINE_*` NormCmds — abandoned (ISA conflicts, stat_id hacks). Those commands are **removed** in the final tree; online softmax lives only in OnlineAttention.