Many slowdowns aren't compute-bound — they're caused by non-contiguous tensors, layout churn, and missed NHWC fast-paths. Fournex detects these issues, explains them, and safely tests the fixes.
We inspect tensor shapes, strides, contiguity, operator traces, and profiler bandwidth events to identify layout decisions that silently cap your effective GPU bandwidth.
Identifies tensors passed to matmul, Conv, and attention whose strides are non-contiguous, causing PyTorch to silently insert a .contiguous() copy and inflate memory traffic.
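For instance, a minimal sketch using only standard PyTorch APIs shows how a transpose produces a non-contiguous view whose strides force a hidden copy downstream:

```python
import torch

# A transpose swaps strides without moving data: the result is a
# non-contiguous view, and ops like matmul may copy it internally.
x = torch.randn(64, 128)
xt = x.t()                       # shape (128, 64), strides swapped

print(x.is_contiguous())         # True
print(xt.is_contiguous())        # False
print(xt.stride())               # (1, 128) — column-major order

# Materialising the layout once, up front, makes the copy explicit:
xt_fixed = xt.contiguous()
print(xt_fixed.stride())         # (64, 1) — row-major again
```

Passing `xt` straight into a matmul works, but the copy it triggers is invisible in your code — it only shows up as extra memory traffic.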
Counts .contiguous() calls across the operator trace and flags call sites that could be eliminated by stabilising layout once earlier in the forward pass.
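A rough stand-in for trace-level counting — a hypothetical helper, not a Fournex API — can be sketched by patching `torch.Tensor.contiguous`. Note this only sees Python-level calls, unlike a full operator trace:

```python
import torch
from contextlib import contextmanager

@contextmanager
def count_contiguous():
    """Count explicit .contiguous() calls while the context is active.
    (Hypothetical helper; C++-internal copies are not visible here.)"""
    counts = {"n": 0}
    original = torch.Tensor.contiguous
    def counted(self, *args, **kwargs):
        counts["n"] += 1
        return original(self, *args, **kwargs)
    torch.Tensor.contiguous = counted
    try:
        yield counts
    finally:
        torch.Tensor.contiguous = original

with count_contiguous() as c:
    x = torch.randn(8, 8).t()        # non-contiguous view
    for _ in range(3):
        _ = x.contiguous() @ x       # layout re-fixed on every step
print(c["n"])  # 3 — stabilising layout once would reduce this to 1
```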
Detects high-frequency permute() and transpose() chains that change logical layout without materialising tensors, and identifies where a single change upstream removes the whole chain.
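Because permutations compose, a chain of layout ops often collapses to one — or to nothing. A small sketch in plain PyTorch:

```python
import torch

x = torch.randn(2, 3, 4, 5)

# Round-trip chain: two permutes that cancel out entirely.
y = x.permute(0, 2, 3, 1).permute(0, 3, 1, 2)
print(y.stride() == x.stride())   # True — the chain was a no-op

# A permute followed by a transpose collapses to a single transpose.
a = x.permute(0, 2, 3, 1).transpose(1, 3)   # two layout ops
b = x.transpose(2, 3)                       # one equivalent op
print(torch.equal(a, b))                    # True — same values
print(a.stride() == b.stride())             # True — same layout
```

Each view is cheap on its own, but long chains confuse downstream kernel selection and hide the one upstream change that would fix the layout for good.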
Compares input strides against cuDNN fast-paths. Flags convolution and pooling operators where switching to memory_format=torch.channels_last could unlock native NHWC kernels.
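What channels-last actually changes is the strides, not the logical shape — a minimal illustration with standard PyTorch:

```python
import torch

x = torch.randn(8, 3, 32, 32)      # NCHW batch, default layout
print(x.stride())                  # (3072, 1024, 32, 1)

# channels_last keeps the logical NCHW shape but stores NHWC in
# memory — the layout cuDNN's native NHWC conv kernels expect.
x_cl = x.to(memory_format=torch.channels_last)
print(x_cl.shape == x.shape)       # True — same logical shape
print(x_cl.stride())               # (3072, 1, 96, 3)
print(x_cl.is_contiguous(memory_format=torch.channels_last))  # True
```

Your model code is unchanged; indexing still reads as NCHW. Only the physical ordering — and therefore the kernel dispatch — differs.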
Cross-references profiler bandwidth events against the theoretical roofline to isolate kernels that are memory-bound due to poor spatial locality or fragmented allocation patterns.
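The roofline comparison reduces to one ratio: a kernel is memory-bound when its arithmetic intensity (FLOPs per byte moved) falls below the machine balance point. A back-of-envelope sketch — the peak numbers are illustrative, not from any specific GPU:

```python
# Illustrative peaks (not a real device's datasheet values).
peak_flops = 312e12              # 312 TFLOP/s tensor-core peak
peak_bw = 1.6e12                 # 1.6 TB/s HBM bandwidth

# FLOPs per byte required to saturate compute rather than memory:
balance = peak_flops / peak_bw
print(balance)                   # 195.0

# Elementwise add of two fp16 tensors: 1 FLOP per element,
# 2 reads + 1 write of 2 bytes each = 6 bytes per element.
intensity = 1 / 6
print(intensity < balance)       # True — firmly memory-bound
```

At an intensity three orders of magnitude below the balance point, no amount of compute tuning helps; only moving fewer bytes — better locality, fused ops, stable layout — does.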
Detects tensors whose strides change between steps — a common cause of recompilation in torch.compile and broken CUDA Graph capture — and recommends where to pin layout.
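The core check can be sketched as a guard that records each tensor's (shape, strides) signature and flags any drift between steps — `check_stride_stability` is a hypothetical helper for illustration, not a Fournex API:

```python
import torch

def check_stride_stability(seen, name, t):
    """Return True if this tensor's (shape, strides) signature matches
    what was recorded under `name` on a previous step.
    (Hypothetical guard sketching one stability check.)"""
    key = (tuple(t.shape), t.stride())
    prev = seen.setdefault(name, key)
    return prev == key

seen = {}

# Step 1: a plain contiguous activation.
x0 = torch.randn(4, 8)
print(check_stride_stability(seen, "x", x0))   # True — baseline recorded

# Step 2: same shape, but strides changed (a transposed view) —
# exactly the drift that forces torch.compile to recompile.
x1 = torch.randn(8, 4).t()
print(check_stride_stability(seen, "x", x1))   # False — layout drifted
```

Pinning the layout at the point where the signature first diverges keeps the compiled graph and any CUDA Graph capture stable.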
Every finding maps a specific tensor, call site, or operator sequence to a concrete fix — with expected bandwidth savings, effort score, and a risk tier before you touch a line of code.
Changing memory layout is high-leverage but high-risk. We guide the graduation from diagnosis, to safe controlled trials, to full autopilot — with validation gates at every step.
Changing tensor layout can affect numerics, kernel selection, memory use, distributed behaviour, and graph compilation. Every autopilot-mode change is validated against all of these before promotion.
Layout changes can alter operator dispatch and precision paths. Every recommendation is validated against output tolerance before promotion.
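A minimal sketch of such a numerics gate, using standard PyTorch on CPU: run the same module on the baseline and candidate layouts and compare outputs within tolerance.

```python
import torch

# Baseline vs. candidate layout for the same conv — the logical
# result should agree within float tolerance even if a different
# kernel path is dispatched for the channels_last input.
conv = torch.nn.Conv2d(3, 8, kernel_size=3, padding=1)
x = torch.randn(2, 3, 16, 16)

baseline = conv(x)
candidate = conv(x.to(memory_format=torch.channels_last))

print(candidate.shape == baseline.shape)
print(torch.allclose(baseline, candidate, rtol=1e-4, atol=1e-5))
```

Tolerances here are illustrative; a production gate would also compare gradients and run the check over representative batches, not a single input.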
Different layouts can route to different CUDA/cuDNN kernels. Fournex captures kernel IDs before and after to detect unexpected dispatch changes.
Custom CUDA kernels and distributed collectives may have layout expectations. The advisor flags these and conservatively skips unsafe changes.
torch.compile and CUDA Graphs are sensitive to stride changes. Layout trials run inside the same compilation mode as your baseline to catch breaks early.
The advisor is conservative by design. We skip any layout change we can't validate end-to-end. Recommendations are separated from benchmarked trials — you choose when to escalate from advice to action.
Read the safety model
Run a free layout diagnostic on your PyTorch training loop. See non-contiguous tensors, layout conversion overhead, and channels-last opportunities — in minutes.