# Experiment Timeline

## Overview

| Phase | Name | Outcome |
|-------|------|---------|
| Baseline | Official Gemmini attention | N×N attention matrix is materialized in DRAM |
| **Exp 1** | Local-store fusion | Wins @ seq=128; fails to scale |
| Exp 2/3 | Normalizer online softmax | **Deprecated**; replaced by Exp 4 |
| **Exp 4** | OnlineAttention HW | BERT-scale integration (phases 4B–4M) |
| **Exp 5** | Clean profiling | Definitive reported numbers |

All fair comparisons use the **same** modified Gemmini Verilator binary. The baseline software uses the stock `SOFTMAX` path, not the Exp 2 Normalizer commands.

---

## Baseline

**Benchmark:** `benchmarks/exp1/official_attention_baseline.c`

Pipeline: QKV → QK^T + HW softmax → **mvout attn to DRAM** → mvin for PV → Wo → LayerNorm.

At seq=128: ~2.15M total cycles, with ~393 KiB/layer of DRAM round-trip traffic wasted on the int8 attention weights.
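
The round-trip figure can be sanity-checked with a little arithmetic. The sketch below counts one mvout plus one mvin of an int8 N×N matrix per head. The head count is an assumption (12, as in BERT-base); this timeline does not state it, and `attn_roundtrip_bytes` is a hypothetical helper, not part of the benchmark.

```c
#include <stdint.h>

/* Bytes moved when an int8 N x N attention matrix is written to DRAM
 * (mvout) and read back (mvin) once per head.
 * num_heads is an assumption: the timeline does not state the head count. */
uint64_t attn_roundtrip_bytes(uint64_t seq_len, uint64_t num_heads)
{
    const uint64_t bytes_per_elem = 1;  /* int8 */
    uint64_t one_way = seq_len * seq_len * bytes_per_elem * num_heads;
    return 2 * one_way;                 /* mvout + mvin */
}
```

With `seq_len = 128` and 12 heads this returns 393,216 bytes, i.e. 384 KiB or about 393 kB, which lines up with the ~393 figure above if that is read in decimal kilobytes.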

---

## Experiment 1 — Fused data movement

**Benchmark:** `benchmarks/exp1/fused_attention_local_benchmark.c`

**Idea:** Route the softmax output to the scratchpad (`GEMMINI_LOCAL_STORE_FLAG`) so the PV matmul can pass `A=NULL` instead of reloading the weights from DRAM.

| Seq | Attention core | Total | Notes |
|-----|----------------|-------|-------|
| 128 | 358K | 1.88M | Attention core ~1.51× faster than baseline |
| 256 | ~6.68M | ~9.5M | Full-row softmax + Q_BLOCK=16 → O(N²) dispatches |
| 512 | — | **CRASH** | TileLink / control-path stress |

**Lesson:** Fusing *data movement* was correct. The failure came from the **softmax mechanism** (full-row Normalizer) combined with **Q_BLOCK=16**, not from the fusion idea itself.
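
To see why small blocks scale badly, here is a toy counting model. It assumes one software dispatch per (Q block, K block) pair, which is a simplification: the real number of RoCC commands per block is not stated in this timeline, and `block_dispatches` is a hypothetical helper.

```c
#include <stdint.h>

/* Ceiling division for positive integers. */
uint64_t ceil_div(uint64_t a, uint64_t b)
{
    return (a + b - 1) / b;
}

/* Toy model (an assumption, not a measurement): one software dispatch
 * per (Q block, K block) pair of the seq_len x seq_len score matrix. */
uint64_t block_dispatches(uint64_t seq_len, uint64_t q_block, uint64_t k_block)
{
    return ceil_div(seq_len, q_block) * ceil_div(seq_len, k_block);
}
```

Under this model, 16×16 blocking needs 64 dispatches at seq=128, 256 at seq=256 and 1024 at seq=512, so the count grows quadratically with sequence length. With Q_BLOCK=128 and K_BLOCK=seq, seq=512 needs only 4.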

---

## Experiments 2/3 — Deprecated

Attempted online softmax via Gemmini's existing Normalizer + `OnlineSoftmax.scala`. Problems:

- NormCmd bit-width expansion broke baseline commands
- Isolating `stat_id` required hacks
- DIM=16 forced 4× rescale passes

**Decision:** Remove all `ONLINE_*` NormCmds; build dedicated **OnlineAttention** module (Exp 4).

---

## Experiment 4 — OnlineAttention integration

Phased bring-up on git branch `experiment-2` (historical name — holds Exp 1–5 work):

| Phase | Milestone |
|-------|-----------|
| 4B | OnlineAttention FSM + 16 iexp units |
| 4C–4E | UPDATE, WEIGHTS, RESCALE micro-tests |
| 4F | Complete DIM=16 micro-kernel |
| 4G–4I | BERT-scale integration, batch RoCC ops |
| **4M** | DRAM internal MVIN, K_BLOCK=seq — best Exp 4 @ seq=128 (~1.79M) |

**Benchmark (canonical Exp 4):** `benchmarks/exp4/online_attn_bert_k128_m4m.c`
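
For readers unfamiliar with online softmax, the following is a floating-point reference sketch of the general technique for one query row. It is not the OnlineAttention RTL and does not model its integer arithmetic. `exp_nonpos` is a crude software stand-in for an exponential unit, all function names are hypothetical, and the mapping of the comments to the UPDATE, WEIGHTS and RESCALE names in the phase table is a guess.

```c
#include <stddef.h>

/* Stand-in for an exponential unit: approximates exp(x) for x <= 0 as
 * (1 + x/2^20)^(2^20). Not the hardware iexp algorithm. */
double exp_nonpos(double x)
{
    if (x < -40.0) return 0.0;          /* negligible contribution */
    double y = 1.0 + x / 1048576.0;     /* 2^20 */
    for (int i = 0; i < 20; ++i) y *= y;
    return y;
}

/* One query row of attention, streaming over keys in blocks of k_block.
 * scores[j] is the (already scaled) dot product q . k_j, v is n x d in
 * row-major order, and out receives d values. */
void online_attention_row(const double *scores, const double *v,
                          size_t n, size_t d, size_t k_block, double *out)
{
    double m = -1e30;                   /* running max */
    double l = 0.0;                     /* running sum of exp(score - m) */
    for (size_t c = 0; c < d; ++c) out[c] = 0.0;
    if (n == 0 || k_block == 0) return;

    for (size_t start = 0; start < n; start += k_block) {
        size_t end = (start + k_block < n) ? start + k_block : n;

        /* UPDATE: extend the running max with this block. */
        double m_new = m;
        for (size_t j = start; j < end; ++j)
            if (scores[j] > m_new) m_new = scores[j];

        /* RESCALE: bring the old sum and accumulator to the new max. */
        double scale = exp_nonpos(m - m_new);
        l *= scale;
        for (size_t c = 0; c < d; ++c) out[c] *= scale;

        /* WEIGHTS: unnormalized softmax weights, accumulated into P*V. */
        for (size_t j = start; j < end; ++j) {
            double w = exp_nonpos(scores[j] - m_new);
            l += w;
            for (size_t c = 0; c < d; ++c) out[c] += w * v[j * d + c];
        }
        m = m_new;
    }

    /* Final normalization by the softmax denominator. */
    for (size_t c = 0; c < d; ++c) out[c] /= l;
}
```

Only one block of scores is live at a time, so the full N×N matrix never needs to be stored. Up to approximation error in `exp_nonpos`, the result does not depend on `k_block`.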

---

## Experiment 5 — Definitive measurement

**Benchmarks:**
- `benchmarks/exp5/online_attn_exp5_full.c` — full sublayer
- `benchmarks/exp5/online_attn_exp5_sweep.c` — Q_BLOCK design-space

Improvements over Exp 4:
- `OP_FUSED_BATCH` (one RoCC cmd for UPDATE+WEIGHTS)
- Zero `printf` inside timed regions
- Optimal Q_BLOCK per seq from sweep
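
The "zero `printf`" rule can be sketched as follows: record cycle counts into a buffer while timing, and print only after every timed region has finished. All names here (`read_cycles_sketch`, `time_region`, `report_samples`, `example_kernel`) are hypothetical and are not taken from the Exp 5 benchmarks. The `rdcycle` path assumes an RV64 target, and the `clock()` fallback exists only so the sketch builds on a host machine.

```c
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Cycle counter: rdcycle on RISC-V (RV64 assumed), clock() elsewhere. */
uint64_t read_cycles_sketch(void)
{
#if defined(__riscv)
    unsigned long c;
    __asm__ volatile ("rdcycle %0" : "=r"(c));
    return (uint64_t)c;
#else
    return (uint64_t)clock();
#endif
}

#define MAX_SAMPLES 8
typedef struct { const char *label; uint64_t cycles; } sample_t;

static sample_t samples[MAX_SAMPLES];
static int num_samples = 0;

/* Placeholder workload standing in for the real attention call. */
void example_kernel(void)
{
    volatile unsigned sink = 0;
    for (unsigned i = 0; i < 1000; ++i) sink += i;
}

/* Time `kernel` and buffer the result; nothing is printed here. */
uint64_t time_region(const char *label, void (*kernel)(void))
{
    uint64_t start = read_cycles_sketch();
    kernel();
    uint64_t elapsed = read_cycles_sketch() - start;
    if (num_samples < MAX_SAMPLES) {
        samples[num_samples].label = label;
        samples[num_samples].cycles = elapsed;
        ++num_samples;
    }
    return elapsed;
}

/* Call once, after all timed regions have finished. */
void report_samples(void)
{
    for (int i = 0; i < num_samples; ++i)
        printf("%s: %llu cycles\n", samples[i].label,
               (unsigned long long)samples[i].cycles);
}
```

A caller would run `time_region("attn", example_kernel);` for each region and call `report_samples();` once at the end, so console I/O never falls between the two counter reads.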

### Q_BLOCK sweep (attention core only)

| Seq | Optimal Q_BLOCK | Cycle change vs Q_BLOCK=16 |
|-----|-----------------|---------------|
| 128 | 128 | −36.4% |
| 256 | 256 | −50.7% |
| 512 | 128 | −47.9% (Q_BLOCK=256 regresses: ACC overflow produces a U-shaped curve) |
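
The sweep reports percentage changes while the full-pipeline results use speedup factors. Assuming the negative percentages mean a reduction in attention-core cycles, the conversion between the two is a one-liner; `reduction_to_speedup` is a hypothetical helper.

```c
/* Convert "cycles reduced by r percent" into an "x times faster" factor.
 * Valid for reductions below 100 percent. */
double reduction_to_speedup(double reduction_percent)
{
    return 1.0 / (1.0 - reduction_percent / 100.0);
}
```

On that reading, the three rows correspond to roughly 1.57×, 2.03× and 1.92× faster attention cores than with Q_BLOCK=16.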

### Full pipeline results

| Seq | Total cycles | Speedup vs baseline |
|-----|--------------|-------------|
| 128 | **1.63M** | 1.32× |
| 256 | **3.27M** | 1.22× |
| 512 | **7.43M** | 1.12× |

The attention matrix **never** touches DRAM in Exp 5.
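
The speedup column can be cross-checked against the cycle counts quoted elsewhere in this timeline. The helpers below are hypothetical and only restate the arithmetic: speedup is baseline cycles divided by new cycles.

```c
/* Speedup factor: how many times faster the new run is than the baseline. */
double speedup(double baseline_cycles, double new_cycles)
{
    return baseline_cycles / new_cycles;
}

/* Back-calculate the baseline cycle count implied by a total and a speedup. */
double implied_baseline(double total_cycles, double speedup_factor)
{
    return total_cycles * speedup_factor;
}
```

`speedup(2.15e6, 1.63e6)` is about 1.32, matching the seq=128 row. The same ratio puts Exp 1 (1.88M) at roughly 1.14× and Exp 4M (~1.79M) at roughly 1.20×, assuming the ~1.79M figure is a total cycle count. `implied_baseline` suggests baselines near 3.99M and 8.32M cycles at seq=256 and seq=512; those values are back-calculated from rounded numbers, not measured.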

---

## Three design lessons

1. **Q_BLOCK granularity is king** — dispatch overhead dominates once DRAM traffic is removed.
2. **Fused data movement was right early** — Experiment 1 validated the traffic analysis.
3. **Dedicated hardware beats overloading shared units** — Normalizer vs OnlineAttention.
