Summary of findings

1. TensorRT version (10.3 vs 10.7)

On the matched FP16 NMS-free compact export (P3 vs P1), TensorRT 10.7 increases end-to-end latency by 1.44 ms (7.80 → 9.24 ms). Nsight Compute indicates that individual xmma kernels execute faster per launch on 10.7, but the engine issues six times as many launches of the matched kernel (30,006 vs 5,001). This points to a graph-scheduling regression rather than uniformly slower kernels.

On dense-output routes, engine time is effectively unchanged (+0.01–0.02 ms infer). The remaining ~0.5 ms total difference is attributable to host and runtime boundary overhead.

2. INT8 quantization

INT8 reduces TensorRT inference time by approximately 0.6 ms on matched dense pairs (e.g. P9 vs P2). However, dense post-processing (~2.8 ms, versus ~1.1 ms on the compact routes) outweighs that gain: INT8 dense (P2, 8.80 ms) remains slower than FP16 compact (P1, 7.80 ms).

INT8 achieves the lowest measured end-to-end latency only when combined with compact Route C (P5, 7.64 ms), at a measurable accuracy cost (mAP50-95: 0.397 → 0.362).

All runtime figures are five-run averages over 5,000 COCO val2017 images at batch-1. Framework-reported inference time excludes preprocess, output layout, and Python post-processing.

Experimental design

  • Platform: Jetson Orin Nano Super, MAXN power mode
  • TensorRT 10.3 (container) vs 10.7 (host)
  • Model export: Ultralytics 8.4.39 → TensorRT engines
  • Control baselines: YOLOv8n dense pipelines (P6, P7, P14, P15)
  • P11, P13 (INT8 engine NMS): engine build failed on both TRT versions

Output routes

YOLO26 is NMS-free at the model level, but the exported engine contract still determines post-processing cost.

| Route | Output contract | Selection mechanism |
|---|---|---|
| A (dense) | [1, 84, 8400] | Application-side decode |
| B (engine NMS) | Compact detections | TensorRT NMS plugin |
| C (NMS-free) | Compact end-to-end | TopK / Gather in engine graph |
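
To make the Route A cost concrete, the following is a minimal sketch of an application-side decode for a dense [1, 84, 8400] output. It assumes the 84 channels are 4 box values (cx, cy, w, h) followed by 80 class scores, and that selection is a confidence threshold plus top-k with no NMS pass; the source does not spell out either detail. The names `decode_dense`, `conf_threshold`, and `max_det` are hypothetical, and plain Python is used for readability where a real pipeline would vectorise.

```python
from typing import List, Sequence, Tuple

# (x1, y1, x2, y2, score, class_index)
Detection = Tuple[float, float, float, float, float, int]


def decode_dense(
    output: Sequence[Sequence[float]],
    conf_threshold: float = 0.25,
    max_det: int = 300,
) -> List[Detection]:
    """Decode one image's dense output, indexed as output[channel][anchor].

    ASSUMED layout: channels 0-3 are cx, cy, w, h; channels 4+ are class scores.
    """
    num_anchors = len(output[0])
    class_rows = output[4:]
    detections: List[Detection] = []
    for i in range(num_anchors):
        # Pick the best-scoring class for this anchor.
        best_class = max(range(len(class_rows)), key=lambda c: class_rows[c][i])
        best_score = class_rows[best_class][i]
        if best_score < conf_threshold:
            continue
        cx, cy, w, h = (output[c][i] for c in range(4))
        # Convert centre/size boxes to corner format.
        detections.append(
            (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2, best_score, best_class)
        )
    # Keep the highest-confidence candidates only (ASSUMED: no NMS step).
    detections.sort(key=lambda d: d[4], reverse=True)
    return detections[:max_det]
```

Every one of the 8,400 anchors is visited per image, which is the kind of work that Routes B and C move into the engine graph.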

Route comparison

Controlled comparison: YOLO26n, FP16, TensorRT 10.3. Output route has greater impact on end-to-end latency than precision alone.

Route C · P1 · 7.80 ms
Route B · P12 · 8.72 ms
Route A · P9 · 9.52 ms

| Route | Pipeline | Infer (ms) | Post (ms) | Total (ms) |
|---|---|---|---|---|
| C | P1 | 4.49 | 1.10 | 7.80 |
| B | P12 | 5.44 | 1.10 | 8.72 |
| A | P9 | 4.44 | 2.86 | 9.52 |

P9 matches P1 on engine time; the 1.72 ms end-to-end penalty is entirely post-processing. NSYS reports 23.8 additional kernel launches per image on the dense route.
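
The decomposition behind that statement can be checked with a few lines of arithmetic. The sketch below uses the P1 and P9 stage timings reported in this document (pre-processing taken from the full pipeline matrix); the helper name `stage_deltas` is illustrative only.

```python
# Stage timings in ms for Route C (P1) and Route A (P9), both FP16 on TensorRT 10.3.
STAGES = {
    "P1": {"pre": 2.20, "infer": 4.49, "post": 1.10, "total": 7.80},
    "P9": {"pre": 2.20, "infer": 4.44, "post": 2.86, "total": 9.52},
}


def stage_deltas(slow: str, fast: str) -> dict:
    """Per-stage difference (slow minus fast) in ms, rounded to 2 decimals."""
    return {
        stage: round(STAGES[slow][stage] - STAGES[fast][stage], 2)
        for stage in STAGES[slow]
    }


deltas = stage_deltas("P9", "P1")
print(deltas)  # {'pre': 0.0, 'infer': -0.05, 'post': 1.76, 'total': 1.72}
```

Engine time is marginally lower on the dense route, so the post-processing difference alone accounts for the end-to-end gap.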

Full pipeline matrix

Five-run average · batch-1 · 640×640 · COCO val2017

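| ID | Model | TRT | Prec | Route | Pre (ms) | Infer (ms) | Post (ms) | Total (ms) | mAP50 | mAP50-95 |
|---|---|---|---|---|---|---|---|---|---|---|
| P1 | YOLO26n | 10.3 | FP16 | C | 2.20 | 4.49 | 1.10 | 7.80 | 0.551 | 0.397 |
| P2 | YOLO26n | 10.3 | INT8 | A | 2.20 | 3.81 | 2.80 | 8.80 | 0.538 | 0.373 |
| P3 | YOLO26n | 10.7 | FP16 | C | 2.42 | 5.64 | 1.16 | 9.24 | 0.551 | 0.397 |
| P4 | YOLO26n | 10.7 | INT8 | A | 2.40 | 3.82 | 3.10 | 9.30 | 0.538 | 0.373 |
| P5 | YOLO26n | 10.7 | INT8 | C | 2.40 | 4.11 | 1.14 | 7.64 | 0.517 | 0.362 |
| P6 | YOLOv8n | 10.3 | FP16 | A | 2.20 | 4.19 | 2.86 | 9.26 | 0.518 | 0.368 |
| P7 | YOLOv8n | 10.3 | INT8 | A | 2.20 | 3.32 | 2.84 | 8.34 | 0.503 | 0.354 |
| P8 | YOLO26n | 10.7 | FP16 | A | 2.40 | 4.46 | 3.18 | 10.06 | 0.562 | 0.403 |
| P9 | YOLO26n | 10.3 | FP16 | A | 2.20 | 4.44 | 2.86 | 9.52 | 0.562 | 0.403 |
| P10 | YOLO26n | 10.7 | FP16 | B | 2.42 | 5.49 | 1.20 | 9.12 | 0.439 | 0.333 |
| P11 | YOLO26n | 10.7 | INT8 | B | build failed | - | - | - | - | - |
| P12 | YOLO26n | 10.3 | FP16 | B | 2.20 | 5.44 | 1.10 | 8.72 | 0.439 | 0.332 |
| P13 | YOLO26n | 10.3 | INT8 | B | build failed | - | - | - | - | - |
| P14 | YOLOv8n | 10.7 | FP16 | A | 2.40 | 4.17 | 3.02 | 9.62 | 0.518 | 0.368 |
| P15 | YOLOv8n | 10.7 | INT8 | A | 2.40 | 3.41 | 3.00 | 8.80 | 0.497 | 0.350 |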

INT8 precision analysis

INT8 consistently reduces engine inference time. On Route A (dense output), post-processing limits end-to-end benefit.

| Configuration | Infer (ms) | Post (ms) | Total (ms) | Observation |
|---|---|---|---|---|
| P9 FP16 dense (TRT 10.3) | 4.44 | 2.86 | 9.52 | Dense baseline |
| P2 INT8 dense (TRT 10.3) | 3.81 | 2.80 | 8.80 | Lower infer; post-dominated |
| P1 FP16 Route C (TRT 10.3) | 4.49 | 1.10 | 7.80 | Route outperforms INT8 dense |
| P5 INT8 Route C (TRT 10.7) | 4.11 | 1.14 | 7.64 | Lowest measured latency |

Accuracy on Route C: mAP50-95 decreases from 0.397 (P1, FP16) to 0.362 (P5, INT8).
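
As a rough way to weigh that trade-off, the relative changes from P1 to P5 can be computed directly from the reported figures. This is only an illustrative calculation: P1 ran on TensorRT 10.3 and P5 on 10.7, so precision is not the only variable between them, and `relative_change` is a hypothetical helper.

```python
def relative_change(before: float, after: float) -> float:
    """Signed relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


latency_pct = relative_change(7.80, 7.64)   # end-to-end latency, P1 -> P5
map_pct = relative_change(0.397, 0.362)     # mAP50-95, P1 -> P5

print(f"latency {latency_pct:+.1f}%, mAP50-95 {map_pct:+.1f}%")
# latency -2.1%, mAP50-95 -8.8%
```

On these numbers the latency gain is about 2% while the mAP50-95 loss is close to 9%.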

TensorRT version comparison

Matched-pair analysis across TensorRT 10.3 and 10.7. A single engine-level regression appears on Route C FP16; other pairs show boundary overhead.

| Pair | Δ infer (ms) | Δ total (ms) | Interpretation |
|---|---|---|---|
| P3−P1 (C FP16) | +1.15 | +1.44 | Engine scheduling regression |
| P4−P2 (A INT8) | +0.01 | +0.50 | Boundary overhead |
| P8−P9 (A FP16) | +0.02 | +0.54 | Boundary overhead |
| P10−P12 (B FP16) | +0.05 | +0.40 | Boundary overhead |
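
The deltas above follow directly from the full pipeline matrix. The sketch below recomputes them and separates the engine part from everything else; the 0.1 ms threshold used to label a pair is an arbitrary illustrative choice, not a value from the study, and the function names are hypothetical.

```python
# (infer_ms, total_ms) per pipeline, copied from the full pipeline matrix.
PIPELINES = {
    "P1": (4.49, 7.80), "P3": (5.64, 9.24),
    "P2": (3.81, 8.80), "P4": (3.82, 9.30),
    "P9": (4.44, 9.52), "P8": (4.46, 10.06),
    "P12": (5.44, 8.72), "P10": (5.49, 9.12),
}
# (TensorRT 10.7 pipeline, matched TensorRT 10.3 pipeline)
PAIRS = [("P3", "P1"), ("P4", "P2"), ("P8", "P9"), ("P10", "P12")]


def pair_delta(new: str, old: str) -> tuple:
    """Return (delta_infer, delta_total, delta_outside_engine) in ms."""
    d_infer = round(PIPELINES[new][0] - PIPELINES[old][0], 2)
    d_total = round(PIPELINES[new][1] - PIPELINES[old][1], 2)
    return d_infer, d_total, round(d_total - d_infer, 2)


def classify(d_infer: float, engine_threshold_ms: float = 0.1) -> str:
    """Label a pair; the threshold is a hypothetical cut-off for illustration."""
    return "engine regression" if d_infer > engine_threshold_ms else "boundary overhead"


for new, old in PAIRS:
    d_infer, d_total, d_rest = pair_delta(new, old)
    print(f"{new}-{old}: infer {d_infer:+.2f}, total {d_total:+.2f}, "
          f"outside engine {d_rest:+.2f} -> {classify(d_infer)}")
```

Only P3−P1 has a difference that sits mostly inside the engine; for the other three pairs nearly all of the change lies outside it.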

NSYS kernel-family analysis (P3 vs P1)

Aggregated over 5,000 images. Total kernel time increases by 5,605 ms on P3 relative to P1.

| Family | P1 (ms) | P3 (ms) | Δ (ms) |
|---|---|---|---|
| xmma | 11431.7 | 15651.5 | +4219.8 |
| pointwise | 1438.5 | 2604.5 | +1166.0 |
| CUTENSOR | 2281.0 | 2829.6 | +548.6 |

Device-to-device copies: P1 ~0 MiB vs P3 3,907 MiB across 5,008 calls.
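
A short consistency check on these aggregates, assuming the 5,605 ms figure covers all kernel families and that the aggregation spans exactly 5,000 images. Under those assumptions the three listed families account for more than the total increase, which would imply that the families not listed decreased slightly in aggregate; that residual is an inference from the arithmetic, not a figure reported in the source.

```python
# Aggregate kernel time in ms over the run: family -> (P1, P3).
FAMILIES = {
    "xmma": (11431.7, 15651.5),
    "pointwise": (1438.5, 2604.5),
    "CUTENSOR": (2281.0, 2829.6),
}
TOTAL_DELTA_MS = 5605.0   # reported total kernel-time increase, P3 vs P1
NUM_IMAGES = 5000         # ASSUMED to match the aggregation window

listed_delta = sum(p3 - p1 for p1, p3 in FAMILIES.values())
per_image_ms = TOTAL_DELTA_MS / NUM_IMAGES
residual_ms = TOTAL_DELTA_MS - listed_delta   # change in families not listed

print(f"listed families: {listed_delta:+.1f} ms")
print(f"total per image: {per_image_ms:+.3f} ms")
print(f"unlisted families (implied): {residual_ms:+.1f} ms")
```

The per-image figure of roughly 1.12 ms is in line with the +1.15 ms inference delta measured for P3 vs P1.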

Nsight Compute analysis

TensorRT 10.3 favours tilesize128x32x32 (110 registers); 10.7 prefers 256x128x64 (208 registers). P3 exposes 55 xmma kernel variants vs 37 on P1.

| Metric | P1 | P3 |
|---|---|---|
| Matched xmma duration | 62.8 µs | 41.1 µs |
| L2 hit | 67.6% | 89.2% |
| Launches (that kernel) | 5001 | 30006 |
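
The matched-kernel figures show how a faster kernel can still cost more in aggregate. The estimate below assumes the profiled duration is representative of every launch of that kernel, which Nsight Compute sampling does not guarantee; `total_ms` is a hypothetical helper.

```python
def total_ms(duration_us: float, launches: int) -> float:
    """Aggregate time in ms if every launch takes `duration_us` microseconds."""
    return duration_us * launches / 1000.0


p1_total = total_ms(62.8, 5001)     # TensorRT 10.3
p3_total = total_ms(41.1, 30006)    # TensorRT 10.7

launch_ratio = 30006 / 5001         # 6.0
time_ratio = p3_total / p1_total    # about 3.93

print(f"P1: {p1_total:.2f} ms, P3: {p3_total:.2f} ms")
print(f"launches x{launch_ratio:.1f}, aggregate time x{time_ratio:.2f}")
```

Under that assumption, a kernel that is about 35% faster per launch still accumulates roughly four times the aggregate time because it is launched six times as often.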

The pointwise kernel /model.0/act is the only per-kernel regression: on P3 it takes 113.8 µs at 42% occupancy and uses local memory, versus 76.4 µs at 94% occupancy on P1.

Full NCU tables and methodology: report (PDF) · code & data on GitHub.

Deployment recommendations

Latency-constrained deployment: INT8 with Route C on TensorRT 10.7 (P5, 7.64 ms). Validate accuracy on target data or recalibrate INT8 quantization.

Accuracy-constrained deployment: FP16 with Route C on TensorRT 10.3 (P1, mAP50-95 0.397). Re-profile end-to-end before any TensorRT upgrade.

General principle: Neither a TensorRT upgrade nor INT8 quantization should be adopted without measuring the complete inference pipeline on the intended output contract.
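
A minimal sketch of such a whole-pipeline measurement, splitting per-image time into pre-processing, inference, and post-processing and averaging over five runs as in this study. The `preprocess`, `infer`, and `postprocess` callables are placeholders for the deployment's own stages, and this is not the harness used to produce the figures above.

```python
import time
from typing import Callable, Dict, Iterable


def time_pipeline(
    images: Iterable,
    preprocess: Callable,
    infer: Callable,
    postprocess: Callable,
) -> Dict[str, float]:
    """Mean per-image wall-clock time in ms for each stage and the total."""
    stage_s = {"pre": 0.0, "infer": 0.0, "post": 0.0}
    count = 0
    for image in images:
        t0 = time.perf_counter()
        tensor = preprocess(image)
        t1 = time.perf_counter()
        # NOTE: infer() must block until the GPU result is ready; otherwise
        # asynchronous work leaks into the post-processing measurement.
        raw = infer(tensor)
        t2 = time.perf_counter()
        postprocess(raw)
        t3 = time.perf_counter()
        stage_s["pre"] += t1 - t0
        stage_s["infer"] += t2 - t1
        stage_s["post"] += t3 - t2
        count += 1
    if count == 0:
        raise ValueError("no images were timed")
    stage_ms = {name: 1000.0 * total / count for name, total in stage_s.items()}
    stage_ms["total"] = sum(stage_ms.values())
    return stage_ms


def average_runs(run_once: Callable[[], Dict[str, float]], runs: int = 5) -> Dict[str, float]:
    """Average the per-stage results of several independent runs."""
    results = [run_once() for _ in range(runs)]
    return {key: sum(r[key] for r in results) / runs for key in results[0]}
```

Timing all three stages around the same loop keeps the comparison on the output contract that will actually ship, rather than on framework-reported inference time alone.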

Full analysis: final report (PDF) · GitHub repository.

Artifacts

Full write-up, reproducibility materials, and profiling scripts.

Grade: A+ (53/60) · Proposal 5/5 · Reproducibility & Code Quality 10/10