ANALYZER READY · PyTorch 2.5 · CUDA 12.4 · A100 SXM4 · last scan 4.2s
Memory Layout Advisor · Fournex

Stop losing GPU throughput to bad memory layout.

Many slowdowns aren't compute-bound — they're non-contiguous tensors, layout churn, and missed NHWC fast-paths. Fournex detects, explains, and safely tests the fixes.

~22% BW overhead
38× copies / step
3 quick-win fixes
fournex — analyze
$ fournex analyze --model resnet50_train --trace ./profile.json
Analyzing 1,428 ops across 12 forward passes...
Detected 4 layout issues (2 HIGH · 2 MEDIUM)
38 contiguous() calls/step (~22% bandwidth overhead)
Top fix: move .contiguous() before loop entry → saves ~19% BW
ops scanned 1,428 · forward passes 12 · analysis time 4.2s · model resnet50_train
Diagnostics · v1.0

Six memory layout problems. Every one kills GPU throughput.

We inspect tensor shapes, strides, contiguity, operator traces, and profiler bandwidth events to identify layout decisions that silently cap your effective GPU bandwidth.

1,428
OPS INSPECTED
38×
COPIES / STEP
~22%
BW OVERHEAD
3
QUICK WINS
HIGH · +19% BW
Hidden copies before hot ops

Non-contiguous tensor detection

Identifies tensors passed to matmul, convolution, and attention ops whose strides are non-contiguous, causing PyTorch to silently insert a .contiguous() copy and inflate memory traffic.

BW IMPACT · 78%
FREQ · 38×/step
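
For illustration, a minimal PyTorch sketch of the layout properties this check inspects — the tensors and shapes here are ours, not Fournex output:

import torch

def report_layout(name, t):
    # Strides reveal whether elements are laid out densely in memory;
    # non-contiguous inputs force many kernels to copy before computing.
    print(f"{name}: shape={tuple(t.shape)} stride={t.stride()} "
          f"contiguous={t.is_contiguous()}")

x = torch.randn(64, 128, 512)
xt = x.transpose(1, 2)     # a view: strides change, no data moves yet

report_layout("x", x)      # contiguous=True
report_layout("xt", xt)    # contiguous=False
# Kernels without a strided fast-path materialise a contiguous copy
# of xt here — the hidden memory traffic this check attributes.
y = xt @ torch.randn(128, 64)
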
HIGH · +12% BW
Repeated layout conversion overhead

Contiguous call churn

Counts .contiguous() calls across the operator trace and flags call sites that could be eliminated by stabilising layout once earlier in the forward pass.

BW IMPACT · 58%
FREQ · 22×/step
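
The fix is usually a one-line hoist. A hypothetical before/after of the pattern this check flags:

import torch

x = torch.randn(64, 512, 128).transpose(1, 2)   # non-contiguous view
w = torch.randn(512, 512)

# Churn: a fresh contiguous copy is materialised on every iteration.
for _ in range(100):
    out = x.contiguous() @ w

# Fix: stabilise the layout once, upstream of the loop.
x = x.contiguous()
for _ in range(100):
    out = x @ w
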
MED · −8% copy
Excessive view-only reshaping

Permute / transpose storm

Detects high-frequency permute() and transpose() chains that change logical layout without materialising new tensors, and identifies where a single upstream change removes the whole chain.

BW IMPACT · 42%
FREQ · 12×/step
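
A toy example of the pattern — a redundant view chain that collapses to a single upstream op:

import torch

x = torch.randn(8, 128, 768)

# A view-only chain: no copies yet, but each op reshuffles strides,
# and the eventual consumer pays for the non-contiguous result.
y = x.permute(0, 2, 1).transpose(0, 1).permute(1, 0, 2)
print(y.shape, y.is_contiguous())   # torch.Size([8, 768, 128]) False

# The whole chain collapses to one upstream op that realises the
# target layout once.
y = x.permute(0, 2, 1).contiguous()
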
MED · +18% tput
Suboptimal layout for CNN workloads

NHWC / channels-last opportunity

Compares input strides against cuDNN fast-paths. Flags convolution and pooling operators where switching to memory_format=torch.channels_last could unlock native NHWC kernels.

BW IMPACT · 65%
FREQ · CNN ops
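
The conversion itself is the standard PyTorch channels-last recipe; a minimal sketch with a stock torchvision model:

import torch
import torchvision

model = torchvision.models.resnet50().cuda()
x = torch.randn(32, 3, 224, 224, device="cuda")

# Shapes stay NCHW; only the strides change, which lets cuDNN
# dispatch native NHWC convolution and pooling kernels.
model = model.to(memory_format=torch.channels_last)
x = x.to(memory_format=torch.channels_last)

with torch.no_grad():
    out = model(x)
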
MED · Low AI
Low arithmetic intensity, high BW pressure

Memory-bound kernel identification

Cross-references profiler bandwidth events against the theoretical roofline to isolate kernels that are memory-bound due to poor spatial locality or fragmented allocation patterns.

BW IMPACT · 50%
FREQ · 3 kernels
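
The underlying test is a classic roofline comparison. A sketch with illustrative A100 figures — the real analysis uses measured per-kernel FLOP and byte counts from the profiler:

# Illustrative A100 SXM4 figures; precision and SKU change the peaks.
PEAK_FLOPS = 312e12          # FP16 tensor-core peak, FLOP/s
PEAK_BYTES = 2.0e12          # HBM2e bandwidth, bytes/s
MACHINE_BALANCE = PEAK_FLOPS / PEAK_BYTES   # ~156 FLOP/byte

def is_memory_bound(flops, bytes_moved):
    # Arithmetic intensity below the machine balance means the kernel
    # hits the bandwidth ceiling before the compute ceiling.
    return flops / bytes_moved < MACHINE_BALANCE

print(is_memory_bound(flops=1e9, bytes_moved=4e8))  # True: AI = 2.5
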
HIGH · recompile
Graph breaks from dynamic strides

torch.compile / CUDA Graph stability

Detects tensors whose strides change between steps — a common cause of recompilation in torch.compile and broken CUDA Graph capture — and recommends where to pin layout.

BW IMPACT · 85%
FREQ · each step
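
A hypothetical example of the failure mode and the one-line pin:

import torch
import torch.nn as nn

model = torch.compile(nn.Linear(512, 512))

def step(x):
    # Inputs whose strides differ step to step fail torch.compile's
    # guards and trigger recompilation (and break CUDA Graph capture);
    # .contiguous() pins the layout at the compile boundary.
    return model(x.contiguous())

for i in range(4):
    if i % 2:
        x = torch.randn(512, 64).t()   # same shape, different strides
    else:
        x = torch.randn(64, 512)
    step(x)
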
Sample output

From operator trace to ranked fix list.

Every finding maps a specific tensor, call site, or operator sequence to a concrete fix — with expected bandwidth savings, effort score, and a risk tier before you touch a line of code.

  • Call-site–level attribution, not just op-level summaries
  • Ranked by expected memory traffic reduction
  • Effort scored: one-liner fixes surface first
  • Safety tier: each fix tagged LOW / MED / HIGH risk
layout-advisor / resnet50_train · 4 findings
Layout copies: 38/step · BW overhead: ~22% · Quick wins: 3
# · FINDING / FIX · IMPACT · RISK
01
Non-contiguous tensor passed to matmul at step 142
Move .contiguous() before loop entry, not inside loop body
−19% BW · Low
02
CNN workload using NCHW — cuDNN NHWC path available
model.to(memory_format=torch.channels_last)
+18% tput · Medium
03
12× permute() / transpose() chain before attention projection
Store QKV in target layout at projection output
−8% copy · Low
04
Dynamic strides causing torch.compile graph breaks
Pin layout with .contiguous() before compile boundary
eliminates recompile · Low
Memory layout analysis complete · 4.2s · resnet50_train
Roadmap

Advisor first. Autopilot when it's earned.

Changing memory layout is high-leverage but high-risk. We stage the graduation from diagnosis, to safe controlled trials, to full autopilot — with validation gates at every step.

01 · MVP · Shipping now

Layout Diagnostics

  • Non-contiguous tensor detection in hot paths
  • Repeated .contiguous() call flagging
  • Stride pattern analysis for common ops
  • Channels-last opportunity scan
02 · V1 · In progress

Fix Recommendations

  • Suggest channels_last for CNN workloads
  • Layout stabilisation points upstream of hot ops
  • Ranked by expected bandwidth savings
  • Effort and compatibility scoring
03 · V2 · Early access

Controlled Layout Trials

  • Safe A/B layout experiments with guardrails
  • Throughput + output equivalence validation
  • Memory delta and kernel selection comparison
  • Rollback on divergence or regression
04 · Autopilot · On the roadmap

Memory Layout Autopilot

  • Policy-driven layout optimisation
  • Continuous adaptation across training runs
  • Custom op and distributed-safe guardrails
  • Auditable change log per layout decision
Safety model

Layout changes can break things. We know that.

Changing tensor layout can affect numerics, kernel selection, memory use, distributed behaviour, and graph compilation. Every autopilot-mode change is validated against all of these before promotion.

Validation pipeline
Numerics (tolerance check) → Kernel IDs (dispatch verify) → Graph compile (stride stability) → Distributed (collective guards) → Promote (safe to apply)

Numerics

Layout changes can alter operator dispatch and precision paths. Every recommendation is validated against output tolerance before promotion.
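
A minimal sketch of such a tolerance gate, assuming a 4-D NCHW input and a channels-last candidate that shares the baseline's weights — tolerances are illustrative:

import torch

def outputs_within_tolerance(baseline, candidate, x,
                             rtol=1e-4, atol=1e-5):
    with torch.no_grad():
        ref = baseline(x)
        out = candidate(x.to(memory_format=torch.channels_last))
    # Different kernels mean bit-exact equality is not expected;
    # gate promotion on tolerance instead.
    return torch.allclose(ref, out, rtol=rtol, atol=atol)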

Kernel selection

Different layouts can route to different CUDA/cuDNN kernels. Fournex captures kernel IDs before and after to detect unexpected dispatch changes.
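
One way to take that snapshot with the public torch.profiler API — a sketch of the idea, not Fournex internals, assuming a CUDA run:

import torch
from torch.profiler import profile, ProfilerActivity

def event_names(model, x):
    with profile(activities=[ProfilerActivity.CUDA]) as prof:
        with torch.no_grad():
            model(x)
    # key_averages() aggregates events by name; the set includes CUDA
    # kernel names, so diffing two runs exposes dispatch changes.
    return {evt.key for evt in prof.key_averages()}

# before = event_names(model, x)
# model = model.to(memory_format=torch.channels_last)
# after = event_names(model, x.to(memory_format=torch.channels_last))
# print(sorted(after - before))   # kernels new to the NHWC run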

Custom ops & distributed

Custom CUDA kernels and distributed collectives may have layout expectations. The advisor flags these and conservatively skips unsafe changes.

Graph compilation

torch.compile and CUDA Graphs are sensitive to stride changes. Layout trials run inside the same compilation mode as your baseline to catch breaks early.

The advisor is conservative by design. We skip any layout change we can't validate end-to-end. Recommendations are separated from benchmarked trials — you choose when to escalate from advice to action.

Read the safety model
Memory Layout Advisor · Early access

Find out where your model is fighting its own memory layout.

Run a free layout diagnostic on your PyTorch training loop. See non-contiguous tensors, layout conversion overhead, and channels-last opportunities — in minutes.

PyTorch + NVIDIA · No code changes to onboard · Apache 2.0 license