YOLO26n TensorRT Profiling on Jetson

Summary of findings

1. TensorRT version (10.3 vs 10.7)

On the matched FP16 NMS-free compact export (P3 vs P1), TensorRT 10.7 increases end-to-end latency by +1.44 ms. Nsight Compute indicates that individual xmma kernels execute faster per launch on 10.7, but the matched kernel is launched six times as often (30,006 vs 5,001 launches). This points to a graph-scheduling regression rather than uniformly slower kernels.

On dense-output routes, engine time is effectively unchanged (+0.01–0.02 ms infer). The remaining ~0.5 ms total difference is attributable to host and runtime boundary overhead.

2. INT8 quantization

INT8 reduces TensorRT inference time by approximately 0.6 ms on matched dense pairs (e.g. P9 vs P2). However, dense post-processing (~2.8 ms) dominates the pipeline; INT8 dense (P2, 8.80 ms) remains slower than FP16 compact (P1, 7.80 ms).

INT8 achieves the lowest measured end-to-end latency only when combined with compact Route C (P5, 7.64 ms), at a measurable accuracy cost (mAP50-95: 0.397 → 0.362).

All runtime figures are five-run averages over 5,000 COCO val2017 images at batch-1. Framework-reported inference time excludes preprocess, output layout, and Python post-processing.
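A minimal sketch of how per-stage, multi-run averages of this kind could be collected. The `time_pipeline` helper and the three stage callables are hypothetical placeholders, not the harness used for the report. On a GPU, the inference stage would also need an explicit synchronisation before its end timestamp is read.

```python
import time
from statistics import mean

def time_pipeline(images, preprocess, infer, postprocess, runs=5):
    """Return mean per-image milliseconds for each stage, averaged over runs."""
    per_run = []
    for _ in range(runs):
        totals = {"pre": 0.0, "infer": 0.0, "post": 0.0}
        for image in images:
            t0 = time.perf_counter()
            x = preprocess(image)
            t1 = time.perf_counter()
            y = infer(x)  # a real GPU call must synchronise before t2 is taken
            t2 = time.perf_counter()
            postprocess(y)
            t3 = time.perf_counter()
            totals["pre"] += t1 - t0
            totals["infer"] += t2 - t1
            totals["post"] += t3 - t2
        n = len(images)
        per_run.append({k: 1000.0 * v / n for k, v in totals.items()})
    avg = {k: mean(run[k] for run in per_run) for k in ("pre", "infer", "post")}
    avg["total"] = avg["pre"] + avg["infer"] + avg["post"]
    return avg

# Stand-in stages so the sketch runs without a model or a GPU.
stats = time_pipeline(
    images=list(range(100)),
    preprocess=lambda img: img * 2,
    infer=lambda x: x + 1,
    postprocess=lambda y: [y],
)
```

Keeping the three timers separate is what makes it possible to see that a faster engine does not necessarily give a faster pipeline.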

Experimental design

Platform: Jetson Orin Nano Super, MAXN power mode
TensorRT 10.3 (container) vs 10.7 (host)
Model export: Ultralytics 8.4.39 → TensorRT engines
Control baselines: YOLOv8n dense pipelines (P6, P7, P14, P15)
P11, P13 (INT8 engine NMS): engine build failed on both TRT versions

Output routes

YOLO26 is NMS-free at the model level, but the exported engine contract still determines post-processing cost.

Route	Output contract	Selection mechanism
A — dense	`[1, 84, 8400]`	Application-side decode
B — engine NMS	Compact detections	TensorRT NMS plugin
C — NMS-free	Compact end-to-end	TopK / Gather in engine graph
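To make the Route A cost concrete, here is a rough sketch of application-side decoding. It assumes the Ultralytics-style channels-first layout in which rows 0 to 3 hold the box centre and size (cx, cy, w, h) and the remaining rows hold per-class scores. That layout, the `decode_dense` name, and the threshold values are illustrative assumptions, not details taken from the report. A production decoder would use vectorised array operations rather than Python loops.

```python
def decode_dense(output, conf_thres=0.25, max_det=300):
    """Decode one image's dense output, shaped [4 + num_classes][num_anchors].

    Returns up to max_det tuples (x1, y1, x2, y2, score, class_id), best first.
    """
    num_anchors = len(output[0])
    class_rows = output[4:]
    detections = []
    for a in range(num_anchors):
        # Pick the best-scoring class for this anchor.
        class_id = max(range(len(class_rows)), key=lambda c: class_rows[c][a])
        score = class_rows[class_id][a]
        if score < conf_thres:
            continue
        cx, cy, w, h = (output[r][a] for r in range(4))
        # Convert centre/size to corner coordinates.
        detections.append(
            (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2, score, class_id)
        )
    detections.sort(key=lambda d: d[4], reverse=True)
    return detections[:max_det]

# Tiny stand-in tensor: 2 classes, 3 anchors (the real contract is 84 x 8400).
toy = [
    [50.0, 20.0, 80.0],  # cx
    [40.0, 20.0, 60.0],  # cy
    [20.0, 10.0, 40.0],  # w
    [10.0, 10.0, 20.0],  # h
    [0.90, 0.10, 0.30],  # class 0 scores
    [0.05, 0.20, 0.70],  # class 1 scores
]
dets = decode_dense(toy)
```

Every anchor and class score is visited on the host here, whereas Routes B and C hand back an already compact set of detections.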

Route comparison

Controlled comparison: YOLO26n, FP16, TensorRT 10.3. Changing the output route moves end-to-end latency by up to 1.72 ms (P9 vs P1), more than the 0.72 ms gained by switching the dense route from FP16 to INT8 (P9 vs P2).

Route C · P1 · 7.80 ms

Route B · P12 · 8.72 ms

Route A · P9 · 9.52 ms

Route	Pipeline	Infer (ms)	Post (ms)	Total (ms)
C	P1	4.49	1.10	7.80
B	P12	5.44	1.10	8.72
A	P9	4.44	2.86	9.52

P9 matches P1 on engine time (4.44 ms vs 4.49 ms); the 1.72 ms end-to-end penalty comes from post-processing (2.86 ms vs 1.10 ms). NSYS reports 23.8 additional kernel launches per image on the dense route.

Full pipeline matrix

Five-run average · batch-1 · 640×640 · COCO val2017

ID	Model	TRT	Prec	Route	Pre (ms)	Infer (ms)	Post (ms)	Total (ms)	mAP50	mAP50-95
P1	YOLO26n	10.3	FP16	C	2.20	4.49	1.10	7.80	.551	.397
P2	YOLO26n	10.3	INT8	A	2.20	3.81	2.80	8.80	.538	.373
P3	YOLO26n	10.7	FP16	C	2.42	5.64	1.16	9.24	.551	.397
P4	YOLO26n	10.7	INT8	A	2.40	3.82	3.10	9.30	.538	.373
P5	YOLO26n	10.7	INT8	C	2.40	4.11	1.14	7.64	.517	.362
P6	YOLOv8n	10.3	FP16	A	2.20	4.19	2.86	9.26	.518	.368
P7	YOLOv8n	10.3	INT8	A	2.20	3.32	2.84	8.34	.503	.354
P8	YOLO26n	10.7	FP16	A	2.40	4.46	3.18	10.06	.562	.403
P9	YOLO26n	10.3	FP16	A	2.20	4.44	2.86	9.52	.562	.403
P10	YOLO26n	10.7	FP16	B	2.42	5.49	1.20	9.12	.439	.333
P11	YOLO26n	10.7	INT8	B	—	—	—	build failed	—	—
P12	YOLO26n	10.3	FP16	B	2.20	5.44	1.10	8.72	.439	.332
P13	YOLO26n	10.3	INT8	B	—	—	—	build failed	—	—
P14	YOLOv8n	10.7	FP16	A	2.40	4.17	3.02	9.62	.518	.368
P15	YOLOv8n	10.7	INT8	A	2.40	3.41	3.00	8.80	.497	.350
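As a sanity check on the matrix, the three stage columns can be summed and compared with the Total column. The sketch below hard-codes the rows that built successfully; `ROWS` and `residuals` are illustrative names. Residuals of a few hundredths of a millisecond would be expected if each column was averaged and rounded independently, which is an assumption about how the table was produced.

```python
# (id, pre, infer, post, total) in ms, copied from the matrix above.
ROWS = [
    ("P1", 2.20, 4.49, 1.10, 7.80),
    ("P2", 2.20, 3.81, 2.80, 8.80),
    ("P3", 2.42, 5.64, 1.16, 9.24),
    ("P4", 2.40, 3.82, 3.10, 9.30),
    ("P5", 2.40, 4.11, 1.14, 7.64),
    ("P6", 2.20, 4.19, 2.86, 9.26),
    ("P7", 2.20, 3.32, 2.84, 8.34),
    ("P8", 2.40, 4.46, 3.18, 10.06),
    ("P9", 2.20, 4.44, 2.86, 9.52),
    ("P10", 2.42, 5.49, 1.20, 9.12),
    ("P12", 2.20, 5.44, 1.10, 8.72),
    ("P14", 2.40, 4.17, 3.02, 9.62),
    ("P15", 2.40, 3.41, 3.00, 8.80),
]

def residuals(rows):
    """Total minus the sum of the three stage columns, per pipeline."""
    return {
        pid: round(total - (pre + infer + post), 2)
        for pid, pre, infer, post, total in rows
    }

res = residuals(ROWS)
worst = max(res, key=lambda pid: abs(res[pid]))
print(worst, res[worst])
```

No row is off by more than 0.03 ms, so the stage columns account for essentially all of the reported end-to-end time.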

INT8 precision analysis

INT8 consistently reduces engine inference time. On Route A (dense output), post-processing limits end-to-end benefit.

Configuration	Infer (ms)	Post (ms)	Total (ms)	Observation
P9 FP16 dense (TRT 10.3)	4.44	2.86	9.52	Dense baseline
P2 INT8 dense (TRT 10.3)	3.81	2.80	8.80	Lower infer; post-dominated
P1 FP16 Route C (TRT 10.3)	4.49	1.10	7.80	Route outperforms INT8 dense
P5 INT8 Route C (TRT 10.7)	4.11	1.14	7.64	Lowest measured latency

Accuracy on Route C: mAP50-95 decreases from 0.397 (P1, FP16) to 0.362 (P5, INT8).
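The latency and accuracy sides of the INT8 trade-off can be put on one line per pair. This is a small worked calculation using only figures from the pipeline matrix; the `tradeoff` helper is an illustrative name.

```python
# (total ms, mAP50-95) taken from the pipeline matrix.
PIPELINES = {
    "P1": (7.80, 0.397),  # FP16, Route C, TRT 10.3
    "P2": (8.80, 0.373),  # INT8, Route A, TRT 10.3
    "P3": (9.24, 0.397),  # FP16, Route C, TRT 10.7
    "P5": (7.64, 0.362),  # INT8, Route C, TRT 10.7
    "P9": (9.52, 0.403),  # FP16, Route A, TRT 10.3
}

def tradeoff(fp16_id, int8_id, table=PIPELINES):
    """Latency saved (ms) and relative mAP50-95 loss (%) going FP16 -> INT8."""
    fp16_ms, fp16_map = table[fp16_id]
    int8_ms, int8_map = table[int8_id]
    saved_ms = round(fp16_ms - int8_ms, 2)
    map_loss_pct = round(100.0 * (fp16_map - int8_map) / fp16_map, 1)
    return saved_ms, map_loss_pct

dense = tradeoff("P9", "P2")     # matched on Route A, TRT 10.3
route_c = tradeoff("P3", "P5")   # matched on Route C, TRT 10.7
headline = tradeoff("P1", "P5")  # the cross-version pair quoted above
```

On these numbers, moving from P1 to P5 saves 0.16 ms end to end for a relative mAP50-95 loss of about 8.8%. The larger 1.60 ms gain on the version-matched pair (P3 to P5) partly reflects the FP16 regression on TensorRT 10.7.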

TensorRT version comparison

Matched-pair analysis across TensorRT 10.3 and 10.7. A single engine-level regression appears on Route C FP16; other pairs show boundary overhead.

Pair	Δ infer (ms)	Δ total (ms)	Interpretation
P3−P1 (C FP16)	+1.15	+1.44	Engine scheduling regression
P4−P2 (A INT8)	+0.01	+0.50	Boundary overhead
P8−P9 (A FP16)	+0.02	+0.54	Boundary overhead
P10−P12 (B FP16)	+0.05	+0.40	Boundary overhead
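The table's two delta columns can be reproduced from the pipeline matrix, and subtracting one from the other gives the part of each gap that lies outside the engine. The sketch below is a worked calculation under that simple decomposition; `TIMES`, `PAIRS`, and `split_delta` are illustrative names.

```python
# (infer ms, total ms) per pipeline, from the pipeline matrix.
TIMES = {
    "P1": (4.49, 7.80), "P3": (5.64, 9.24),
    "P2": (3.81, 8.80), "P4": (3.82, 9.30),
    "P9": (4.44, 9.52), "P8": (4.46, 10.06),
    "P12": (5.44, 8.72), "P10": (5.49, 9.12),
}
# Each pair is (TRT 10.7 pipeline, TRT 10.3 pipeline).
PAIRS = [("P3", "P1"), ("P4", "P2"), ("P8", "P9"), ("P10", "P12")]

def split_delta(new_id, old_id, times=TIMES):
    """Return (delta infer, delta total, delta outside the engine) in ms."""
    d_infer = round(times[new_id][0] - times[old_id][0], 2)
    d_total = round(times[new_id][1] - times[old_id][1], 2)
    return d_infer, d_total, round(d_total - d_infer, 2)

deltas = {f"{new}-{old}": split_delta(new, old) for new, old in PAIRS}
for pair, (d_infer, d_total, d_other) in deltas.items():
    print(f"{pair}: infer {d_infer:+.2f}  total {d_total:+.2f}  other {d_other:+.2f}")
```

Read this way, about 0.29 ms of the P3−P1 gap also sits outside the engine, the same order as the 0.35 to 0.52 ms seen on the other three pairs.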

NSYS kernel-family analysis (P3 vs P1)

Aggregated over 5,000 images. Total kernel time increases by 5,605 ms on P3 relative to P1.

Family	P1 (ms)	P3 (ms)	Δ (ms)
xmma	11431.7	15651.5	+4219.8
pointwise	1438.5	2604.5	+1166.0
CUTENSOR	2281.0	2829.6	+548.6

Device-to-device copies: P1 ~0 MiB vs P3 3,907 MiB across 5,008 calls.
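Dividing the aggregate NSYS figures by the 5,000 images relates them to the per-image latencies reported elsewhere. The following is a worked calculation on the numbers above only; the variable names are illustrative.

```python
IMAGES = 5000
FAMILIES_MS = {  # aggregate kernel time in ms: (P1, P3)
    "xmma": (11431.7, 15651.5),
    "pointwise": (1438.5, 2604.5),
    "CUTENSOR": (2281.0, 2829.6),
}
TOTAL_DELTA_MS = 5605.0  # reported increase across all kernels
D2D_MIB, D2D_CALLS = 3907.0, 5008  # device-to-device copies on P3

# Extra kernel time per image attributed to each listed family.
per_image_ms = {
    name: round((p3 - p1) / IMAGES, 3) for name, (p1, p3) in FAMILIES_MS.items()
}
# Sum of the three listed deltas, for comparison with the reported total.
listed_delta_ms = round(sum(p3 - p1 for p1, p3 in FAMILIES_MS.values()), 1)
total_per_image_ms = round(TOTAL_DELTA_MS / IMAGES, 3)
d2d_mib_per_call = round(D2D_MIB / D2D_CALLS, 2)
```

The reported total works out to about 1.12 ms of extra kernel time per image, close to the +1.15 ms infer delta for P3−P1. The three listed families sum to +5,934.4 ms, which is more than the reported +5,605 ms, so kernel families not shown in this table presumably decreased; the table itself does not say which.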

Nsight Compute analysis

TensorRT 10.3 favours tilesize128x32x32 (110 registers); 10.7 prefers 256x128x64 (208 registers). P3 exposes 55 xmma kernel variants vs 37 on P1.

Metric	P1	P3
Matched xmma duration	62.8 µs	41.1 µs
L2 hit	67.6%	89.2%
Launches (that kernel)	5001	30006

The pointwise kernel /model.0/act is the only per-kernel regression: on P3 it takes 113.8 µs at 42% occupancy and uses local memory, against 76.4 µs at 94% occupancy on P1.
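The matched xmma row shows why a faster kernel can still cost more overall: aggregate time is duration multiplied by launch count. The sketch below assumes the profiled duration is representative of every launch of that kernel, which the table does not state; `aggregate_ms` is an illustrative helper.

```python
def aggregate_ms(duration_us, launches):
    """Total time in ms if every launch takes the profiled duration."""
    return duration_us * launches / 1000.0

p1_ms = aggregate_ms(62.8, 5001)   # TensorRT 10.3
p3_ms = aggregate_ms(41.1, 30006)  # TensorRT 10.7

launch_ratio = 30006 / 5001
time_ratio = p3_ms / p1_ms
print(f"P1 {p1_ms:.1f} ms, P3 {p3_ms:.1f} ms, "
      f"launches x{launch_ratio:.1f}, time x{time_ratio:.2f}")
```

Under that assumption, each launch on P3 is about a third shorter, yet six times as many launches make the kernel's aggregate time nearly four times larger.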

Full NCU tables and methodology: report (PDF) · code & data on GitHub.

Deployment recommendations

Latency-constrained deployment: INT8 with Route C on TensorRT 10.7 (P5, 7.64 ms). Validate accuracy on target data or recalibrate INT8 quantization.

Accuracy-constrained deployment: FP16 with Route C on TensorRT 10.3 (P1, mAP50-95 0.397). Re-profile end-to-end before any TensorRT upgrade.

General principle: Neither a TensorRT version change nor INT8 quantization should be adopted without measuring the complete inference pipeline on the intended output contract.
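The two recommendations can be cross-checked by asking which pipelines are Pareto-optimal on total latency and mAP50-95, meaning no other pipeline is at least as good on both and strictly better on one. This is a small sketch over the matrix values; it treats the YOLOv8n controls as candidates too, and `pareto_front` is an illustrative helper.

```python
# (id, total ms, mAP50-95) for every pipeline that built, from the matrix.
RESULTS = [
    ("P1", 7.80, 0.397), ("P2", 8.80, 0.373), ("P3", 9.24, 0.397),
    ("P4", 9.30, 0.373), ("P5", 7.64, 0.362), ("P6", 9.26, 0.368),
    ("P7", 8.34, 0.354), ("P8", 10.06, 0.403), ("P9", 9.52, 0.403),
    ("P10", 9.12, 0.333), ("P12", 8.72, 0.332), ("P14", 9.62, 0.368),
    ("P15", 8.80, 0.350),
]

def pareto_front(results):
    """IDs of pipelines not dominated on (lower latency, higher accuracy)."""
    front = []
    for pid, ms, acc in results:
        dominated = any(
            o_ms <= ms and o_acc >= acc and (o_ms < ms or o_acc > acc)
            for _, o_ms, o_acc in results
        )
        if not dominated:
            front.append(pid)
    return front

front = pareto_front(RESULTS)
print(front)
```

The front contains P5 and P1, matching the latency-constrained and accuracy-constrained picks, plus P9, which has the highest mAP50-95 (0.403) at 9.52 ms.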

Full analysis: final report (PDF) · GitHub repository.

Artifacts

Full write-up, reproducibility materials, and profiling scripts.

Report

COMP6211L final report (PDF, 9 pages)

Code

GitHub repository — benchmarks, engines, NSYS/NCU logs

Grade: A+ (53/60) · Proposal 5/5 · Reproducibility & Code Quality 10/10