Summary of findings
1. TensorRT version (10.3 vs 10.7)
On the matched FP16 NMS-free compact export (P3 vs P1), TensorRT 10.7 increases end-to-end latency by +1.44 ms. Nsight Compute indicates individual xmma kernels execute faster per launch on 10.7 (41.1 µs vs 62.8 µs for the matched kernel), but the engine launches that kernel about six times as often (30,006 vs 5,001 launches). This points to a graph-scheduling regression rather than uniformly slower kernels.
On dense-output routes, engine time is effectively unchanged (+0.01–0.02 ms infer). The remaining ~0.5 ms total difference is attributable to host and runtime boundary overhead.
2. INT8 quantization
INT8 reduces TensorRT inference time by approximately 0.6 ms on matched dense pairs (e.g. P9 vs P2: 4.44 ms → 3.81 ms). However, dense post-processing (~2.8 ms, about 1.7 ms more than on Route C) outweighs that saving: INT8 dense (P2, 8.80 ms) remains slower than FP16 compact (P1, 7.80 ms).
INT8 achieves the lowest measured end-to-end latency only when combined with compact Route C (P5, 7.64 ms), at a measurable accuracy cost (mAP50-95: 0.397 → 0.362).
All runtime figures are five-run averages over 5,000 COCO val2017 images at batch-1. Framework-reported inference time excludes preprocess, output layout, and Python post-processing.
Experimental design
- Platform: Jetson Orin Nano Super, MAXN power mode
- TensorRT 10.3 (container) vs 10.7 (host)
- Model export: Ultralytics 8.4.39 → TensorRT engines
- Control baselines: YOLOv8n dense pipelines (P6, P7, P14, P15)
- Excluded: P11 and P13 (INT8 with engine NMS, Route B); the engine build failed on both TensorRT versions
Output routes
YOLO26 is NMS-free at the model level, but the exported engine contract still determines post-processing cost.
| Route | Output contract | Selection mechanism |
|---|---|---|
| A — dense | [1, 84, 8400] | Application-side decode |
| B — engine NMS | Compact detections | TensorRT NMS plugin |
| C — NMS-free | Compact end-to-end | TopK / Gather in engine graph |
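To make the cost of Route A concrete, here is a minimal sketch of what application-side decoding of a dense `[1, 84, 8400]` output can look like. It is not the project's code. It assumes the common layout in which the 84 channels are four box values (cx, cy, w, h) followed by 80 COCO class scores; the function name `decode_dense` and the 0.25 threshold are illustrative. Pure Python is used for readability, whereas a real pipeline would vectorize this.

```python
from typing import List, Tuple

# x1, y1, x2, y2, score, class_id
Detection = Tuple[float, float, float, float, float, int]


def decode_dense(pred: List[List[float]], conf_thres: float = 0.25) -> List[Detection]:
    """Decode a channel-major dense output (84 rows x N anchors) into boxes.

    Assumed layout (not confirmed by the report): rows 0-3 hold cx, cy, w, h
    and the remaining rows hold per-class scores.
    """
    num_anchors = len(pred[0])
    detections: List[Detection] = []
    for a in range(num_anchors):
        # Scan every class row to find the best class for this anchor.
        best_cls, best_score = 0, pred[4][a]
        for c in range(5, len(pred)):
            if pred[c][a] > best_score:
                best_cls, best_score = c - 4, pred[c][a]
        if best_score < conf_thres:
            continue
        cx, cy, w, h = pred[0][a], pred[1][a], pred[2][a], pred[3][a]
        # Convert centre/size to corner coordinates.
        detections.append(
            (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2, best_score, best_cls)
        )
    # Highest-confidence detections first.
    detections.sort(key=lambda d: d[4], reverse=True)
    return detections
```

Every anchor is visited on the host regardless of how many objects are in the image, which is the work that Routes B and C move into the engine graph.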
Route comparison
Controlled comparison: YOLO26n, FP16, TensorRT 10.3. The output route has a greater impact on end-to-end latency than precision alone (for reference, switching Route A from FP16 to INT8 saves 0.72 ms total, P9 vs P2).
| Route | Pipeline | Infer (ms) | Post (ms) | Total (ms) |
|---|---|---|---|---|
| C | P1 | 4.49 | 1.10 | 7.80 |
| B | P12 | 5.44 | 1.10 | 8.72 |
| A | P9 | 4.44 | 2.86 | 9.52 |
P9 matches P1 on engine time (4.44 ms vs 4.49 ms); the 1.72 ms end-to-end penalty comes almost entirely from post-processing (2.86 ms vs 1.10 ms). NSYS reports 23.8 additional kernel launches per image on the dense route.
Full pipeline matrix
Five-run average · batch-1 · 640×640 · COCO val2017
| ID | Model | TRT | Precision | Route | Pre (ms) | Infer (ms) | Post (ms) | Total (ms) | mAP50 | mAP50-95 |
|---|---|---|---|---|---|---|---|---|---|---|
| P1 | YOLO26n | 10.3 | FP16 | C | 2.20 | 4.49 | 1.10 | 7.80 | .551 | .397 |
| P2 | YOLO26n | 10.3 | INT8 | A | 2.20 | 3.81 | 2.80 | 8.80 | .538 | .373 |
| P3 | YOLO26n | 10.7 | FP16 | C | 2.42 | 5.64 | 1.16 | 9.24 | .551 | .397 |
| P4 | YOLO26n | 10.7 | INT8 | A | 2.40 | 3.82 | 3.10 | 9.30 | .538 | .373 |
| P5 | YOLO26n | 10.7 | INT8 | C | 2.40 | 4.11 | 1.14 | 7.64 | .517 | .362 |
| P6 | YOLOv8n | 10.3 | FP16 | A | 2.20 | 4.19 | 2.86 | 9.26 | .518 | .368 |
| P7 | YOLOv8n | 10.3 | INT8 | A | 2.20 | 3.32 | 2.84 | 8.34 | .503 | .354 |
| P8 | YOLO26n | 10.7 | FP16 | A | 2.40 | 4.46 | 3.18 | 10.06 | .562 | .403 |
| P9 | YOLO26n | 10.3 | FP16 | A | 2.20 | 4.44 | 2.86 | 9.52 | .562 | .403 |
| P10 | YOLO26n | 10.7 | FP16 | B | 2.42 | 5.49 | 1.20 | 9.12 | .439 | .333 |
| P11 | YOLO26n | 10.7 | INT8 | B | — | — | — | build failed | — | — |
| P12 | YOLO26n | 10.3 | FP16 | B | 2.20 | 5.44 | 1.10 | 8.72 | .439 | .332 |
| P13 | YOLO26n | 10.3 | INT8 | B | — | — | — | build failed | — | — |
| P14 | YOLOv8n | 10.7 | FP16 | A | 2.40 | 4.17 | 3.02 | 9.62 | .518 | .368 |
| P15 | YOLOv8n | 10.7 | INT8 | A | 2.40 | 3.41 | 3.00 | 8.80 | .497 | .350 |
INT8 precision analysis
INT8 consistently reduces engine inference time. On Route A (dense output), post-processing limits end-to-end benefit.
| Configuration | Infer (ms) | Post (ms) | Total (ms) | Observation |
|---|---|---|---|---|
| P9 FP16 dense (TRT 10.3) | 4.44 | 2.86 | 9.52 | Dense baseline |
| P2 INT8 dense (TRT 10.3) | 3.81 | 2.80 | 8.80 | Lower infer; post-dominated |
| P1 FP16 Route C (TRT 10.3) | 4.49 | 1.10 | 7.80 | Route outperforms INT8 dense |
| P5 INT8 Route C (TRT 10.7) | 4.11 | 1.14 | 7.64 | Lowest measured latency |
Accuracy on Route C: mAP50-95 decreases from 0.397 (P1, FP16) to 0.362 (P5, INT8).
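The trade-off between P1 and P5 can be put in relative terms with a small calculation. This is a sketch using only the figures quoted above; the helper `pct_change` is illustrative. Note that P1 and P5 also differ in TensorRT version (10.3 vs 10.7), so the comparison is not a pure precision ablation.

```python
def pct_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


# P1 (FP16, Route C, TRT 10.3) -> P5 (INT8, Route C, TRT 10.7)
latency_pct = pct_change(7.80, 7.64)   # total latency, ms
map_pct = pct_change(0.397, 0.362)     # mAP50-95

print(f"latency {latency_pct:+.1f}%  mAP50-95 {map_pct:+.1f}%")
```

On these numbers the latency gain is about 2% while the relative mAP50-95 loss is close to 9%.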
TensorRT version comparison
Matched-pair analysis across TensorRT 10.3 and 10.7. A single engine-level regression appears on Route C FP16; other pairs show boundary overhead.
| Pair | Δ infer (ms) | Δ total (ms) | Interpretation |
|---|---|---|---|
| P3−P1 (C FP16) | +1.15 | +1.44 | Engine scheduling regression |
| P4−P2 (A INT8) | +0.01 | +0.50 | Boundary overhead |
| P8−P9 (A FP16) | +0.02 | +0.54 | Boundary overhead |
| P10−P12 (B FP16) | +0.05 | +0.40 | Boundary overhead |
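The deltas in this table can be recomputed from the full pipeline matrix. The sketch below does so and also reports the part of the total delta that falls outside the engine (Δ total minus Δ infer). The names `MATRIX`, `PAIRS`, and `pair_deltas` are illustrative, not taken from the project code.

```python
# Infer and Total (ms) copied from the full pipeline matrix.
MATRIX = {
    "P1": (4.49, 7.80), "P2": (3.81, 8.80), "P3": (5.64, 9.24),
    "P4": (3.82, 9.30), "P8": (4.46, 10.06), "P9": (4.44, 9.52),
    "P10": (5.49, 9.12), "P12": (5.44, 8.72),
}
# (TensorRT 10.7 pipeline, TensorRT 10.3 pipeline), same model, precision and route.
PAIRS = [("P3", "P1"), ("P4", "P2"), ("P8", "P9"), ("P10", "P12")]


def pair_deltas(new_id, old_id):
    """Return (delta_infer, delta_total, delta_outside_engine) in ms, rounded to 2 dp."""
    new_infer, new_total = MATRIX[new_id]
    old_infer, old_total = MATRIX[old_id]
    d_infer = new_infer - old_infer
    d_total = new_total - old_total
    return round(d_infer, 2), round(d_total, 2), round(d_total - d_infer, 2)


for new_id, old_id in PAIRS:
    d_infer, d_total, d_rest = pair_deltas(new_id, old_id)
    print(f"{new_id}-{old_id}: infer {d_infer:+.2f}  total {d_total:+.2f}  "
          f"outside engine {d_rest:+.2f}")
```

Only the Route C FP16 pair has an infer delta that accounts for most of its total delta; for the other three pairs nearly all of the difference sits outside the engine.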
NSYS kernel-family analysis (P3 vs P1)
Aggregated over 5,000 images. Total kernel time increases by 5,605 ms on P3 relative to P1, or about 1.12 ms per image, in line with the +1.15 ms infer delta.
| Family | P1 (ms) | P3 (ms) | Δ (ms) |
|---|---|---|---|
| xmma | 11431.7 | 15651.5 | +4219.8 |
| pointwise | 1438.5 | 2604.5 | +1166.0 |
| CUTENSOR | 2281.0 | 2829.6 | +548.6 |
Device-to-device copies: P1 ~0 MiB vs P3 3,907 MiB across 5,008 calls.
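Because the NSYS figures are aggregates, dividing by the run size makes them easier to compare with the per-image latencies elsewhere in this summary. The sketch below assumes the 5,000-image run size stated above; `FAMILY_MS` and `per_image_delta_ms` are illustrative names.

```python
NUM_IMAGES = 5000  # images per profiled run, as stated above

# Aggregate kernel time (ms) per family, copied from the NSYS table: (P1, P3).
FAMILY_MS = {
    "xmma": (11431.7, 15651.5),
    "pointwise": (1438.5, 2604.5),
    "CUTENSOR": (2281.0, 2829.6),
}


def per_image_delta_ms(p1_ms, p3_ms, num_images=NUM_IMAGES):
    """Average extra kernel time per image on P3 relative to P1."""
    return (p3_ms - p1_ms) / num_images


for family, (p1_ms, p3_ms) in FAMILY_MS.items():
    print(f"{family:10s} {per_image_delta_ms(p1_ms, p3_ms):+.3f} ms/image")

# Average size of a device-to-device copy on P3 (3,907 MiB over 5,008 calls).
d2d_mib_per_call = 3907 / 5008
print(f"D2D copies {d2d_mib_per_call:.2f} MiB per call")
```

The xmma family alone contributes roughly 0.84 ms per image, and the P3 copies average a little under 0.8 MiB each at about one call per image.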
Nsight Compute analysis
TensorRT 10.3 favours tilesize128x32x32 (110 registers); 10.7 prefers 256x128x64 (208 registers). P3 exposes 55 xmma kernel variants vs 37 on P1.
| Metric | P1 | P3 |
|---|---|---|
| Matched xmma duration | 62.8 µs | 41.1 µs |
| L2 hit | 67.6% | 89.2% |
| Launches (that kernel) | 5001 | 30006 |
Pointwise kernel /model.0/act: 113.8 µs at 42% occupancy with local-memory use on P3, versus 76.4 µs at 94% occupancy on P1. This is the only per-kernel regression.
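For the matched xmma kernel in the table above, multiplying duration by launch count shows how a kernel that is faster per launch can still cost more in aggregate. This is a sketch that assumes the reported duration is the mean per launch; `total_kernel_ms` is an illustrative helper.

```python
def total_kernel_ms(duration_us, launches):
    """Aggregate time of one kernel: per-launch duration times launch count."""
    return duration_us * launches / 1000.0


p1_ms = total_kernel_ms(62.8, 5001)    # P1, TensorRT 10.3
p3_ms = total_kernel_ms(41.1, 30006)   # P3, TensorRT 10.7

print(f"P1: {p1_ms:.1f} ms  P3: {p3_ms:.1f} ms  ratio: {p3_ms / p1_ms:.2f}x")
```

Under that assumption the kernel takes about 314 ms in total on P1 and about 1,233 ms on P3, roughly four times as much despite the shorter per-launch duration.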
Full NCU tables and methodology: report (PDF) · code & data on GitHub.
Deployment recommendations
Latency-constrained deployment: INT8 with Route C on TensorRT 10.7 (P5, 7.64 ms). Validate accuracy on target data or recalibrate INT8 quantization.
Accuracy-constrained deployment: FP16 with Route C on TensorRT 10.3 (P1, mAP50-95 0.397). Re-profile end-to-end before any TensorRT upgrade.
General principle: Neither a TensorRT upgrade nor INT8 quantization should be adopted without measuring the complete inference pipeline on the intended output contract.
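The two recommendations above amount to picking the lowest-latency pipeline that meets an accuracy floor. A minimal sketch of that selection over the measured YOLO26n rows follows; the function `fastest_pipeline` and the 0.39 floor are hypothetical and only serve to illustrate the rule.

```python
# (pipeline, total latency in ms, mAP50-95) for the YOLO26n rows that built successfully.
YOLO26N = [
    ("P1", 7.80, 0.397), ("P2", 8.80, 0.373), ("P3", 9.24, 0.397),
    ("P4", 9.30, 0.373), ("P5", 7.64, 0.362), ("P8", 10.06, 0.403),
    ("P9", 9.52, 0.403), ("P10", 9.12, 0.333), ("P12", 8.72, 0.332),
]


def fastest_pipeline(rows, min_map=0.0):
    """Return the lowest-latency row whose mAP50-95 meets the floor, or None."""
    eligible = [row for row in rows if row[2] >= min_map]
    return min(eligible, key=lambda row: row[1]) if eligible else None


print(fastest_pipeline(YOLO26N))                # no accuracy floor
print(fastest_pipeline(YOLO26N, min_map=0.39))  # hypothetical accuracy floor
```

With no floor the rule selects P5, and with a floor of 0.39 it selects P1. These numbers apply to this platform and dataset only, so the table would need to be re-measured on the target hardware and data.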
Full analysis: final report (PDF) · GitHub repository.
Artifacts
Full write-up, reproducibility materials, and profiling scripts.
Grade: A+ (53/60) · Proposal 5/5 · Reproducibility & Code Quality 10/10