Uncoalesced global loads
Catch memory access patterns that spray global-load transactions. Get coalescing actions and a re-profile command to confirm the fix.
Run frx init to check your machine, detect the GPU model, and get the right command for CUDA kernels or PyTorch training. Add --explain to write the LLM brief in the same run.
Opportunity-ranked kernels, framework tax signals, and architecture-aware thresholds for H100, A100, L4, RTX 4090, and RTX 5060.
Trusted by teams shipping the world's most compute-intensive workloads
Fournex turns NCU counters, PTX structure, and runtime traces into the mistake that matters: the threshold crossed, the kernel worth fixing first, and the validation command to prove the change.
Separate L1-only locality problems from broader memory pressure. Prioritize shared-memory tiling or working-set reduction based on evidence.
Identify L2-only pressure without confusing it with L1 misses. Fournex points engineers toward reducing the active working set first.
Use measured occupancy and launch-resource data to spot register pressure that blocks residency, then reduce live ranges or split kernels.
Surface kernels that keep schedulers waiting even when occupancy looks acceptable. Separate ILP, block-size, and warp-eligibility issues.
Connect runtime traces to practical fixes for dataloaders, pinned memory, transfer pressure, and hidden synchronization points.
Rank kernels by runtime share, roofline region, MFU gap, and severity so the highest-impact work rises above small noisy offenders.
Surface framework overhead that is not explained by input, copy, or sync stalls, with inferred graph-capture and fusion opportunities clearly marked.
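To make the uncoalesced-load case concrete, here is a minimal Python sketch of why strided access sprays transactions. It is a simplified model, not Fournex's detection rule: it assumes 32 threads per warp, 4-byte elements, 32-byte memory sectors, and a sector-aligned base address. The function name `sectors_touched` is hypothetical.

```python
# Simplified model: count the 32-byte sectors touched by one warp-wide load.
# Assumptions (not taken from Fournex): 32 threads per warp, 4-byte elements,
# 32-byte sectors, and a base address aligned to a sector boundary.
WARP_SIZE = 32
SECTOR_BYTES = 32


def sectors_touched(stride_elems: int, elem_bytes: int = 4, base: int = 0) -> int:
    """Count distinct sectors hit when lane i reads element i * stride_elems."""
    sectors = set()
    for lane in range(WARP_SIZE):
        first = base + lane * stride_elems * elem_bytes
        last = first + elem_bytes - 1
        # An element may straddle a sector boundary, so record every sector it covers.
        sectors.update(range(first // SECTOR_BYTES, last // SECTOR_BYTES + 1))
    return len(sectors)


coalesced = sectors_touched(stride_elems=1)  # adjacent lanes read adjacent floats
strided = sectors_touched(stride_elems=8)    # every lane lands in its own sector
print(f"stride 1: {coalesced} sectors, stride 8: {strided} sectors")
```

Under these assumptions the strided pattern moves eight times as many sectors for the same 128 bytes of useful data, which is the kind of gap a coalescing fix should close on a re-profile.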
Every finding comes with the exact signal that triggered it, the threshold it crossed, confidence level, actions to take, validation steps, and caveats before you change code.
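As an illustration of what such a finding could carry, the sketch below models the fields listed above as a small Python record. The class name, field names, metric name, and numbers are all hypothetical; this is not Fournex's actual report schema.

```python
from dataclasses import dataclass, field


@dataclass
class Finding:
    """Hypothetical record mirroring the fields a finding is described as having."""
    signal: str                 # metric that triggered the finding
    measured: float             # observed value of that metric
    threshold: float            # limit the metric is compared against
    confidence: str             # e.g. "high", "medium", "low"
    actions: list[str] = field(default_factory=list)
    validation: list[str] = field(default_factory=list)
    caveats: list[str] = field(default_factory=list)

    def crossed(self) -> bool:
        """True when the measured value exceeds the threshold."""
        return self.measured > self.threshold


# Illustrative values only; the metric name and numbers are made up.
example = Finding(
    signal="sectors_per_load_request",
    measured=16.0,
    threshold=8.0,
    confidence="high",
    actions=["Make adjacent threads read adjacent addresses"],
    validation=["Re-profile with the same command and compare the metric"],
    caveats=["Threshold may differ by GPU architecture"],
)
print(example.signal, "crossed threshold:", example.crossed())
```

Keeping the measured value next to the threshold makes it easy to check the same record again after a change.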
We don't ask engineers to trust blind automation. The product starts with evidence, then moves through ranked recommendations, before/after validation, benchmark proof, and guarded automation.
Start with a concrete diagnosis, not a wall of counters. Keep the full evidence trail, from kernel opportunity ranking to benchmark proof, when you need to defend the change in review.
frx init checks Python, PyTorch, CUDA, Nsight Compute, and the active GPU model.
Each recommendation names the metric, threshold, rule, and validation step.
Separate L1 thrashing, L2 thrashing, bandwidth pressure, and uncoalesced loads.
Tie low occupancy to registers, shared memory, block size, or scheduler pressure.
Compare source, PTX, NCU, and bench evidence before trusting a change.
Rank kernels by runtime share, roofline region, MFU gap, and severity.
Spot launch fragmentation and inferred graph-capture or fusion opportunities.
Use --explain on profile or collect to write the measured prompt in the same workflow.
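The occupancy item in the list above can be illustrated with a rough resident-block estimate. This is a simplified sketch, not Fournex's analysis: the default limits are assumed A100-like values (65,536 registers, 2,048 threads, 32 blocks, and 164 KB of shared memory per SM), it ignores register and shared-memory allocation granularity, and the function name `estimate_occupancy` is hypothetical. Check the CUDA documentation for your GPU's real limits.

```python
def estimate_occupancy(threads_per_block, regs_per_thread, smem_per_block=0,
                       regs_per_sm=65536, smem_per_sm=164 * 1024,
                       max_threads_per_sm=2048, max_blocks_per_sm=32):
    """Return (occupancy fraction, limiting resource) for one SM.

    Simplified: ignores allocation granularity, so real tools may report
    slightly lower numbers.
    """
    limits = {
        # Blocks that fit given the register file.
        "registers": regs_per_sm // (regs_per_thread * threads_per_block),
        # Blocks that fit given shared memory (no limit if the kernel uses none).
        "shared_memory": (smem_per_sm // smem_per_block
                          if smem_per_block else max_blocks_per_sm),
        # Blocks that fit given the per-SM thread cap.
        "block_size": max_threads_per_sm // threads_per_block,
        # Hardware cap on resident blocks.
        "block_slots": max_blocks_per_sm,
    }
    limiter = min(limits, key=limits.get)
    resident_threads = limits[limiter] * threads_per_block
    return resident_threads / max_threads_per_sm, limiter


occ, limiter = estimate_occupancy(threads_per_block=256, regs_per_thread=128)
print(f"occupancy ~ {occ:.2f}, limited by {limiter}")
```

In this example only two 256-thread blocks fit in the register file, so registers cap residency well before the thread limit does. That is the situation where reducing live ranges or splitting the kernel is worth trying.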
Let Fournex check your environment, recognize the GPU model, and route you to the right workflow. For NCU reports and PyTorch training runs, add --explain to generate the LLM-ready brief without running a second analysis command.
pip install fournex
frx init
frx init --patch train.py
frx collect --explain -- python train.py
frx ncu-command -- ./my_binary
frx profile --explain -- python train.py
frx profile --ncu ncu_report.csv --explain
frx profile --ptx kernel.ptx
frx compare baseline.cu optimized.cu
frx bench bad.cu good.cu

Fournex keeps the path small: capture the evidence, name the mistake, apply the recommendation, and rerun the same command to verify the change.
guided first run
GPU model detection
summary, prompt, evidence
threshold evidence
Engineers do not need to interpret every counter by hand. The report explains why a metric is bad, which rule fired, what to change, and how to confirm the result.
We're not another rules engine. Every analyzed workload expands the mapping from trace patterns to validated fixes, turning usage into defensibility.
Better data → better policies → better outcomes → more workloads. That loop is the product.