ANALYZER READY · PyTorch 2.5 · CUDA 12.4 · A100 SXM4 · last scan 4.2s
Memory Layout Advisor · Fournex

Stop losing GPU throughput to bad memory layout.

Many slowdowns aren't compute-bound — they're non-contiguous tensors, layout churn, and missed NHWC fast-paths. Fournex detects, explains, and safely tests the fixes.

~22% BW overhead
38× copies / step
3 quick-win fixes
fournex — analyze
$ fournex analyze --model resnet50_train --trace ./profile.json
Analyzing 1,428 ops across 12 forward passes...
Detected 4 layout issues (2 HIGH · 2 MEDIUM)
38 contiguous() calls/step (~22% bandwidth overhead)
Top fix: move .contiguous() before loop entry → saves ~19% BW
ops scanned 1,428 · forward passes 12 · analysis time 4.2s · model resnet50_train
Diagnostics · v1.0

Six memory layout problems. Every one kills GPU throughput.

We inspect tensor shapes, strides, contiguity, operator traces, and profiler bandwidth events to identify layout decisions that silently cap your effective GPU bandwidth.

1,428
OPS INSPECTED
38×
COPIES / STEP
~22%
BW OVERHEAD
3
QUICK WINS
HIGH · +19% BW
Hidden copies before hot ops

Non-contiguous tensor detection

Identifies tensors passed to matmul, convolution, and attention ops whose strides are non-contiguous, causing PyTorch to silently insert a .contiguous() copy and inflate memory traffic.

BW IMPACT · 78%
FREQ · 38×/step
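
For illustration, a minimal PyTorch sketch of the layout properties this check inspects — the tensors and shapes here are ours, not Fournex output:

import torch

def report_layout(name, t):
    # Strides reveal whether elements are laid out densely in memory;
    # non-contiguous inputs force many kernels to copy before computing.
    print(f"{name}: shape={tuple(t.shape)} stride={t.stride()} "
          f"contiguous={t.is_contiguous()}")

x = torch.randn(64, 128, 512)
xt = x.transpose(1, 2)     # a view: strides change, no data moves yet

report_layout("x", x)      # contiguous=True
report_layout("xt", xt)    # contiguous=False
# Kernels without a strided fast-path materialise a contiguous copy
# of xt here — the hidden memory traffic this check attributes.
y = xt @ torch.randn(128, 64)
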
HIGH · +12% BW
Repeated layout conversion overhead

Contiguous call churn

Counts .contiguous() calls across the operator trace and flags call sites that could be eliminated by stabilising layout once earlier in the forward pass.

BW IMPACT · 58%
FREQ · 22×/step
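
The fix is usually a one-line hoist. A hypothetical before/after of the pattern this check flags:

import torch

x = torch.randn(64, 512, 128).transpose(1, 2)   # non-contiguous view
w = torch.randn(512, 512)

# Churn: a fresh contiguous copy is materialised on every iteration.
for _ in range(100):
    out = x.contiguous() @ w

# Fix: stabilise the layout once, upstream of the loop.
x = x.contiguous()
for _ in range(100):
    out = x @ w
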
MED · −8% copy
Excessive view-only reshaping

Permute / transpose storm

Detects high-frequency permute() and transpose() chains that change logical layout without materialising new tensors, and identifies where a single upstream change removes the whole chain.

BW IMPACT · 42%
FREQ · 12×/step
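
A toy example of the pattern — a redundant view chain that collapses to a single upstream op:

import torch

x = torch.randn(8, 128, 768)

# A view-only chain: no copies yet, but each op reshuffles strides,
# and the eventual consumer pays for the non-contiguous result.
y = x.permute(0, 2, 1).transpose(0, 1).permute(1, 0, 2)
print(y.shape, y.is_contiguous())   # torch.Size([8, 768, 128]) False

# The whole chain collapses to one upstream op that realises the
# target layout once.
y = x.permute(0, 2, 1).contiguous()
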
MED · +18% tput
Suboptimal layout for CNN workloads

NHWC / channels-last opportunity

Compares input strides against cuDNN fast-paths. Flags convolution and pooling operators where switching to memory_format=torch.channels_last could unlock native NHWC kernels.

BW IMPACT · 65%
FREQ · CNN ops
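
The conversion itself is the standard PyTorch channels-last recipe; a minimal sketch with a stock torchvision model:

import torch
import torchvision

model = torchvision.models.resnet50().cuda()
x = torch.randn(32, 3, 224, 224, device="cuda")

# Shapes stay NCHW; only the strides change, which lets cuDNN
# dispatch native NHWC convolution and pooling kernels.
model = model.to(memory_format=torch.channels_last)
x = x.to(memory_format=torch.channels_last)

with torch.no_grad():
    out = model(x)
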
MED · Low AI
Low arithmetic intensity, high BW pressure

Memory-bound kernel identification

Cross-references profiler bandwidth events against the theoretical roofline to isolate kernels that are memory-bound due to poor spatial locality or fragmented allocation patterns.

BW IMPACT · 50%
FREQ · 3 kernels
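
The underlying test is a classic roofline comparison. A sketch with illustrative A100 figures — the real analysis uses measured per-kernel FLOP and byte counts from the profiler:

# Illustrative A100 SXM4 figures; precision and SKU change the peaks.
PEAK_FLOPS = 312e12          # FP16 tensor-core peak, FLOP/s
PEAK_BYTES = 2.0e12          # HBM2e bandwidth, bytes/s
MACHINE_BALANCE = PEAK_FLOPS / PEAK_BYTES   # ~156 FLOP/byte

def is_memory_bound(flops, bytes_moved):
    # Arithmetic intensity below the machine balance means the kernel
    # hits the bandwidth ceiling before the compute ceiling.
    return flops / bytes_moved < MACHINE_BALANCE

print(is_memory_bound(flops=1e9, bytes_moved=4e8))  # True: AI = 2.5
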
HIGH · recompile
Graph breaks from dynamic strides

torch.compile / CUDA Graph stability

Detects tensors whose strides change between steps — a common cause of recompilation in torch.compile and broken CUDA Graph capture — and recommends where to pin layout.

BW IMPACT · 85%
FREQ · each step
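
A hypothetical example of the failure mode and the one-line pin:

import torch
import torch.nn as nn

model = torch.compile(nn.Linear(512, 512))

def step(x):
    # Inputs whose strides differ step to step fail torch.compile's
    # guards and trigger recompilation (and break CUDA Graph capture);
    # .contiguous() pins the layout at the compile boundary.
    return model(x.contiguous())

for i in range(4):
    if i % 2:
        x = torch.randn(512, 64).t()   # same shape, different strides
    else:
        x = torch.randn(64, 512)
    step(x)
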
Sample output

From operator trace to ranked fix list.

Every finding maps a specific tensor, call site, or operator sequence to a concrete fix — with expected bandwidth savings, effort score, and a risk tier before you touch a line of code.

  • Call-site–level attribution, not just op-level summaries
  • Ranked by expected memory traffic reduction
  • Effort scored: one-liner fixes surface first
  • Safety tier: each fix tagged LOW / MED / HIGH risk
layout-advisor / resnet50_train · 4 findings
Layout copies: 38/step · BW overhead: ~22% · Quick wins: 3
# · FINDING / FIX · IMPACT · RISK
01
Non-contiguous tensor passed to matmul at step 142
Move .contiguous() before loop entry, not inside loop body
−19% BW · Low
02
CNN workload using NCHW — cuDNN NHWC path available
model.to(memory_format=torch.channels_last)
+18% tput · Medium
03
12× permute() / transpose() chain before attention projection
Store QKV in target layout at projection output
−8% copy · Low
04
Dynamic strides causing torch.compile graph breaks
Pin layout with .contiguous() before compile boundary
eliminates recompile · Low
Memory layout analysis complete · 4.2s · resnet50_train
Roadmap

Advisor first. Autopilot when it's earned.

Changing memory layout is high-leverage but high-risk. We stage the graduation from diagnosis, to safe controlled trials, to full autopilot — with validation gates at every step.

01 · MVP · Shipping now

Layout Diagnostics

  • Non-contiguous tensor detection in hot paths
  • Repeated .contiguous() call flagging
  • Stride pattern analysis for common ops
  • Channels-last opportunity scan
02 · V1 · In progress

Fix Recommendations

  • Suggest channels_last for CNN workloads
  • Layout stabilisation points upstream of hot ops
  • Ranked by expected bandwidth savings
  • Effort and compatibility scoring
03 · V2 · Early access

Controlled Layout Trials

  • Safe A/B layout experiments with guardrails
  • Throughput + output equivalence validation
  • Memory delta and kernel selection comparison
  • Rollback on divergence or regression
04 · Autopilot · On the roadmap

Memory Layout Autopilot

  • Policy-driven layout optimisation
  • Continuous adaptation across training runs
  • Custom op and distributed-safe guardrails
  • Auditable change log per layout decision
Safety model

Layout changes can break things. We know that.

Changing tensor layout can affect numerics, kernel selection, memory use, distributed behaviour, and graph compilation. Every autopilot-mode change is validated against all of these before promotion.

Validation pipeline
Numerics (tolerance check) → Kernel IDs (dispatch verify) → Graph compile (stride stability) → Distributed (collective guards) → Promote (safe to apply)

Numerics

Layout changes can alter operator dispatch and precision paths. Every recommendation is validated against output tolerance before promotion.
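
A minimal sketch of such a tolerance gate, assuming a 4-D NCHW input and a channels-last candidate that shares the baseline's weights — tolerances are illustrative:

import torch

def outputs_within_tolerance(baseline, candidate, x,
                             rtol=1e-4, atol=1e-5):
    with torch.no_grad():
        ref = baseline(x)
        out = candidate(x.to(memory_format=torch.channels_last))
    # Different kernels mean bit-exact equality is not expected;
    # gate promotion on tolerance instead.
    return torch.allclose(ref, out, rtol=rtol, atol=atol)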

Kernel selection

Different layouts can route to different CUDA/cuDNN kernels. Fournex captures kernel IDs before and after to detect unexpected dispatch changes.
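
One way to take that snapshot with the public torch.profiler API — a sketch of the idea, not Fournex internals, assuming a CUDA run:

import torch
from torch.profiler import profile, ProfilerActivity

def event_names(model, x):
    with profile(activities=[ProfilerActivity.CUDA]) as prof:
        with torch.no_grad():
            model(x)
    # key_averages() aggregates events by name; the set includes CUDA
    # kernel names, so diffing two runs exposes dispatch changes.
    return {evt.key for evt in prof.key_averages()}

# before = event_names(model, x)
# model = model.to(memory_format=torch.channels_last)
# after = event_names(model, x.to(memory_format=torch.channels_last))
# print(sorted(after - before))   # kernels new to the NHWC run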

Custom ops & distributed

Custom CUDA kernels and distributed collectives may have layout expectations. The advisor flags these and conservatively skips unsafe changes.

Graph compilation

torch.compile and CUDA Graphs are sensitive to stride changes. Layout trials run inside the same compilation mode as your baseline to catch breaks early.

The advisor is conservative by design. We skip any layout change we can't validate end-to-end. Recommendations are separated from benchmarked trials — you choose when to escalate from advice to action.

Read the safety model
Memory Layout Advisor · Early access

Find out where your model is fighting its own memory layout.

Run a free layout diagnostic on your PyTorch training loop. See non-contiguous tensors, layout conversion overhead, and channels-last opportunities — in minutes.

PyTorch + NVIDIA · No code changes to onboard · Apache 2.0 license