Fournex — CLI Reference
Developer documentation for the frx CLI. Covers installation, all CLI subcommands, bundle layout, and how the analysis pipeline works.
What's new
- frx init now checks Python, PyTorch, CUDA, GPU model, and Nsight Compute, then recommends the right workflow.
- frx init --patch train.py can insert the training instrumentation snippet after a confirmation prompt.
- frx profile --ncu report.csv --explain prints the normal report and writes frx_llm_prompt.txt in one command.
- frx collect --explain -- python train.py generates the training LLM brief from the newly created run directory.
- GPU model detection now maps names like NVIDIA H100 80GB HBM3, NVIDIA A100-SXM4-80GB, and NVIDIA RTX 5060 to architecture profiles automatically.
Installation
Install the package from the repo root. Python 3.11+ is required. PyTorch is optional; the CLI works without it for bundle and static-file analysis. It is only needed when the SDK instruments a live training run.
pip install fournex
This registers frx as the short executable and fournex as the long executable through the existing package entry points. Start with:
frx init
Quickstart
Start with frx init. It checks the local environment, detects the GPU model when possible, finds likely training scripts, and prints the next command for either CUDA kernel profiling or PyTorch training telemetry.
Run frx init to check Python, PyTorch, CUDA, GPU model, and ncu before collecting evidence.
Use frx profile for NCU/PTX kernel work or frx collect for PyTorch training telemetry.
Add --explain to profile or collect to write frx_summary.txt, frx_llm_prompt.txt, and frx_evidence.json.
# First run: checks tools and recommends the right workflow
frx init
# Training path: collect telemetry and generate the LLM brief
frx collect --explain -- python train.py
# NCU path: analyze an existing report and generate the LLM brief
frx profile --ncu ncu_report.csv --explain
# Live profile with explain output
frx profile --explain -- python train.py
# No GPU available: inspect PTX statically
frx profile --ptx kernel.ptx
# Analyze locally (run-<id> is printed by collect)
frx analyze runs/run-<id>
# Upload by dragging runs/run-<id>.zip onto fournex.com/analyze
# Optional: let autopilot sweep configs and find the fastest safe candidate
frx tune --safe --max-trials 12 -- python train.py
Command index
The frx CLI now has 11 subcommands. First-time setup starts with init. Most kernel work starts with profile, analyze, compare, explain, or bench; collect remains the path for full application run bundles.
| Command | Purpose |
|---|---|
| init | Guided first-run setup; check dependencies, detect GPU model, and suggest or patch instrumentation |
| collect | Wrap and run a PyTorch workload; produce a run bundle |
| analyze | Load a bundle, CUDA file, PTX file, or NCU CSV and print a diagnosis |
| doctor | Check runtime dependencies |
| smoke-test | Synthetic end-to-end pipeline validation |
| ncu-command | Print an NCU command for Fournex metric presets |
| profile | Run NCU and print a full bottleneck + recommendation report |
| tune | Safe autopilot: sweep configs and recommend the fastest |
| compare | Diff two CUDA source files or evidence files; report what improved/regressed |
| explain | Generate an LLM-ready optimization brief from an NCU CSV or training run directory |
| bench | Compile and wall-clock benchmark two .cu kernels side-by-side |
init
Guided first-run onboarding. It checks the local environment, detects the active GPU model through the architecture-profile matcher, finds likely training scripts, and prints the command that gets you to frx_llm_prompt.txt.
frx init
frx init --patch train.py
Checks performed
| Check | What it reports |
|---|---|
| Python | Interpreter version and basic runtime availability |
| PyTorch | Installed version when importable |
| CUDA | CUDA availability and GPU name when PyTorch can see it |
| GPU model | Recognized architecture key such as h100, a100, or rtx5060 |
| ncu | Whether Nsight Compute is on PATH for kernel profiling |
| Training scripts | Likely files such as train.py, train_*.py, main.py, or run.py |
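As a rough illustration of the GPU model check, the sketch below maps a reported device name to an architecture key. The pattern table and the function name match_gpu_model are hypothetical and are not the actual Fournex matcher; only the three example names and keys come from this page.

```python
import re
from typing import Optional

# Hypothetical pattern table: architecture key -> regex over the lowercased name.
# Only the three keys mentioned in this reference are included.
_GPU_PATTERNS = [
    ("h100", re.compile(r"\bh100\b")),
    ("a100", re.compile(r"\ba100\b")),
    ("rtx5060", re.compile(r"\brtx\s*5060\b")),
]


def match_gpu_model(device_name: str) -> Optional[str]:
    """Return an architecture key for a device name, or None if unrecognized."""
    name = device_name.lower()
    for key, pattern in _GPU_PATTERNS:
        if pattern.search(name):
            return key
    return None


for reported in ("NVIDIA H100 80GB HBM3", "NVIDIA A100-SXM4-80GB", "NVIDIA RTX 5060"):
    print(reported, "->", match_gpu_model(reported))
```

When a name is not recognized, passing --gpu-model explicitly to profile or analyze is the documented override.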
Patch mode
frx init --patch FILE inserts the recommended training instrumentation after the last top-level import, but only after a confirmation prompt. If the script already imports and initializes Fournex, it is skipped cleanly.
import fournex as frx
frx.init(job_name="my-run")
for step, batch in enumerate(dataloader):
    with frx.step_context(step=step, batch=batch, model=model):
        # your existing training step here
        pass
profile
The simplest command for engineers working on CUDA kernels. It accepts a workload to run, an existing Nsight Compute CSV, or a PTX file, then prints the bottleneck and the next fix to try.
frx profile --explain -- python train.py
frx profile --ncu report.csv --explain
frx profile --ptx kernel.ptx
Options:
--preset memory|tensor|occupancy|stalls|full
--out save.csv
--kernel-name FILTER
--launch-skip N
--launch-count N
--gpu-model NAME Override automatic GPU model detection
--explain Also write frx_summary.txt, frx_llm_prompt.txt, frx_evidence.json
--json
When --explain is set, profile reuses the in-memory analysis result and writes the same explain files as frx explain. No second analysis pass is needed.
Choose the mode
| Situation | Command | Use this when |
|---|---|---|
| Run NCU now | frx profile --explain -- python train.py | You are on the GPU machine and ncu is on PATH. |
| Analyze a CSV | frx profile --ncu report.csv --explain | A teammate, CI job, or previous run already produced the NCU export. |
| Analyze PTX | frx profile --ptx kernel.ptx | You want static risk signals without running on the target GPU. |
Report layout
| Section | What to look at |
|---|---|
| VERDICT | Primary bottleneck, secondary bottlenecks, and confidence note. |
| MEASURED METRICS | Every metric with [!!], [ !], [ok], or [--] plus the threshold that explains the status. |
| BOTTLENECKS DETECTED | All detected bottlenecks with numeric score and ASCII score bar. |
| RECOMMENDATIONS | Every ranked fix with why, actions, validation steps, caveats, tier, score, and triggering rule. |
| NEXT STEPS | The exact re-run command and the top priority fix name. |
Presets
Use full when you do not know the bottleneck yet. Use a narrower preset when NCU overhead matters or you are iterating on one class of issue.
frx profile --preset full --explain -- python train.py
frx profile --preset memory --kernel-name "my_kernel" --launch-count 5 -- python train.py
frx profile --preset occupancy --gpu-model H100 --out ncu_occupancy.csv -- python train.py
collect
Runs a workload subprocess, samples GPU metrics in the background, imports profiler artifacts, runs the analysis pipeline, and writes a self-contained run bundle.
frx collect [OPTIONS] -- COMMAND [ARGS...]
Options:
--name NAME Human-readable job name (default: frx-run)
--out DIR Root output directory (default: runs)
--run-id ID Override auto-generated run ID
--artifact-dir DIR Import artifacts from DIR after the workload exits.
May be repeated. Default: ./frx-job-run
--no-profiler-import Skip importing profiler_trace.json from artifact dirs
--sample-interval-ms N nvidia-smi polling interval in ms (default: 1000)
--config FILE Optional run_config.yaml to merge into bundle config
--explain Generate frx_summary.txt, frx_llm_prompt.txt, and frx_evidence.json
--no-zip Skip creating the zip archive
Use --explain when you want the training workflow to end at the LLM handoff. After the run bundle is created and analyzed, Fournex generates the brief directly from that new run directory.
Artifact directory gotcha
If your workload writes profiler_trace.json somewhere other than frx-job-run/, pass that directory with --artifact-dir. Otherwise the trace exists on disk but will not be copied into the run bundle.
# Workload writes gpu-job-run-tiny-kernels/profiler_trace.json
frx collect \
  --name tiny-kernel-launch-overhead \
  --out runs \
  --sample-interval-ms 100 \
  --artifact-dir gpu-job-run-tiny-kernels \
  --explain \
  -- python tiny_kernel_launch_overhead.py --output-dir gpu-job-run-tiny-kernels
What it does
- Writes run_config.yaml and injects env vars into the workload process so the SDK auto-persists events to raw/trace.jsonl.
- Starts a background thread that polls nvidia-smi at --sample-interval-ms into gpu_metrics.csv.
- Runs the workload. Stdout and stderr are tee'd to optional_logs.txt.
- After the workload exits, copies artifacts from --artifact-dir into the bundle (marked [imported] in the summary).
- Runs the analysis pipeline over raw/trace.jsonl (or the imported profiler bundle if no SDK trace exists) and writes derived/summary.json.
- Writes metadata.json, manifest.json, and zips the bundle.
- When --explain is set, writes frx_summary.txt, frx_llm_prompt.txt, and frx_evidence.json.
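The background sampling step in the list above can be pictured with a minimal sketch like the following. It is not the collector's actual code: the function names, CSV column names, and the choice to sample only GPU 0 are assumptions. The nvidia-smi flags used (--query-gpu, --format=csv,noheader,nounits, --id) are standard ones.

```python
import csv
import subprocess
import threading
import time
from typing import Callable, List


def query_nvidia_smi() -> List[str]:
    """Return one sample for GPU 0: [utilization %, memory used in MiB]."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=utilization.gpu,memory.used",
         "--format=csv,noheader,nounits", "--id=0"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [field.strip() for field in out.strip().split(",")]


def start_poller(csv_path: str, interval_ms: int, stop_event: threading.Event,
                 sample_fn: Callable[[], List[str]] = query_nvidia_smi) -> threading.Thread:
    """Append one timestamped sample per interval until stop_event is set."""
    def loop() -> None:
        with open(csv_path, "w", newline="") as fh:
            writer = csv.writer(fh)
            # Column names are illustrative, not the real gpu_metrics.csv schema.
            writer.writerow(["timestamp", "gpu_util_pct", "memory_used_mib"])
            while not stop_event.is_set():
                writer.writerow([f"{time.time():.3f}", *sample_fn()])
                fh.flush()
                # wait() doubles as the sleep and returns early on stop.
                stop_event.wait(interval_ms / 1000.0)

    thread = threading.Thread(target=loop, daemon=True)
    thread.start()
    return thread
```

A caller would create a threading.Event, start the poller before launching the workload, then set the event and join the thread once the workload exits. Passing a custom sample_fn makes the loop testable on machines without a GPU.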
Environment variables injected into the workload
| Variable | Value |
|---|---|
| FRX_RUN_ID | Generated run ID |
| FRX_JOB_NAME | --name value |
| FRX_OUTPUT_DIR | Absolute path to the run directory |
| FRX_RAW_TRACE_PATH | raw/trace.jsonl absolute path |
| FRX_DERIVED_SUMMARY_PATH | derived/summary.json absolute path |
| FRX_AUTO_PERSIST | 1 |
| FRX_SAMPLE_INTERVAL_MS | --sample-interval-ms value |
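For workloads that want to read these variables directly rather than through the SDK, a minimal sketch might look like this. The helper name read_frx_env and the returned dict keys are hypothetical; the variable names and the frx-run and 1000 ms defaults come from the tables and option list above.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def read_frx_env(environ: Optional[Mapping[str, str]] = None) -> dict:
    """Collect the FRX_* values injected by frx collect into a plain dict."""
    env = os.environ if environ is None else environ
    output_dir = env.get("FRX_OUTPUT_DIR")
    return {
        "run_id": env.get("FRX_RUN_ID"),
        "job_name": env.get("FRX_JOB_NAME", "frx-run"),
        "output_dir": Path(output_dir) if output_dir else None,
        "raw_trace_path": env.get("FRX_RAW_TRACE_PATH"),
        "auto_persist": env.get("FRX_AUTO_PERSIST") == "1",
        "sample_interval_ms": int(env.get("FRX_SAMPLE_INTERVAL_MS", "1000")),
    }


print(read_frx_env({"FRX_RUN_ID": "run-a1b2c3d4e5f6", "FRX_AUTO_PERSIST": "1"}))
```

Outside of frx collect none of the variables are set, so the helper falls back to defaults and reports auto_persist as False.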
Example output
frx collect completed
Run bundle: runs/run-a1b2c3d4e5f6
Zip bundle: runs/run-a1b2c3d4e5f6.zip
Captured (10 files):
  metadata.json
  manifest.json
  run_config.yaml
  gpu_metrics.csv
  optional_logs.txt
  raw/trace.jsonl
  derived/summary.json
  profiler/profiler_trace.json [imported]
  frx_summary.txt
  frx_llm_prompt.txt
  frx_evidence.json
analyze
Unified local analysis for run bundles, PTX, CUDA source, and Nsight Compute CSV exports. No GPU or PyTorch is required for static files, NCU CSV ingestion, or bundle analysis.
frx analyze [TARGET] [OPTIONS]
Arguments:
TARGET Optional path to a run directory, run zip, or one evidence file:
.ptx, .cu, .cuh, .cuda, .csv
Options:
--scope SCOPE run | steady_state | auto
Applies to run bundle analysis.
--before PATH Auto-detect one baseline evidence file.
--after PATH Auto-detect one optimized evidence file.
--before-source PATH Baseline CUDA source file.
--before-ptx PATH Baseline PTX file.
--before-ncu PATH Baseline Nsight Compute CSV.
--after-source PATH Optimized CUDA source file.
--after-ptx PATH Optimized PTX file.
--after-ncu PATH Optimized Nsight Compute CSV.
--before-label LABEL Label for the before side.
--after-label LABEL Label for the after side.
--baseline PATH Deprecated alias for --before.
--optimized PATH Deprecated alias for --after.
--json Output a stable JSON wrapper.
--output-json Alias for JSON output.
--gpu-model NAME Calibrate CUDA static launch advice to a GPU model.
Examples:
frx analyze runs/run-abc123
frx analyze runs/run-abc123.zip
frx analyze kernel.cu
frx analyze kernel.cu --gpu-model RTX4090
frx analyze kernel.ptx
frx analyze profile.csv
frx analyze kernel.cu --output-json
frx analyze --before before.csv --after after.csv
frx analyze --before before.ptx --after after.ptx
frx analyze --before before.csv --after after.csv --json
frx analyze --before-source before.cu --before-ptx before.ptx --before-ncu before.csv \
--after-source after.cu --after-ptx after.ptx --after-ncu after.csv \
  --before-label baseline --after-label optimized
Supported inputs
frx analyze auto-detects code and profiler files directly. Use this when you want a quick local answer without collecting a full run bundle.
| Input | Mode | Analyzer |
|---|---|---|
| Run directory | run_bundle | Existing collected run bundle analysis |
| .zip | run_bundle | Existing zip bundle analysis |
| .ptx | ptx | analyze_ptx_text |
| .cu / .cuh / .cuda | cuda_source | inspect_cuda_source |
| .csv | ncu | analyze_ncu_csv_text |
| Text containing .entry or .version | ptx | PTX content fallback when extension is unclear |
frx analyze kernel.cu
frx analyze kernel.cu --gpu-model RTX4090
frx analyze kernel.ptx
frx analyze profile.csv
frx analyze kernel.cu --output-json
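The routing in the supported-inputs table can be approximated with a short sketch. The function detect_mode and its dictionary are hypothetical; only the extensions, mode names, and the .entry/.version content fallback are taken from the table.

```python
from pathlib import Path
from typing import Optional

# Extension -> mode, as listed in the supported-inputs table.
_EXTENSION_MODES = {
    ".zip": "run_bundle",
    ".ptx": "ptx",
    ".cu": "cuda_source",
    ".cuh": "cuda_source",
    ".cuda": "cuda_source",
    ".csv": "ncu",
}


def detect_mode(path: str, text: Optional[str] = None) -> Optional[str]:
    """Guess the analysis mode for a target path; None means unsupported."""
    target = Path(path)
    if target.is_dir():
        return "run_bundle"
    mode = _EXTENSION_MODES.get(target.suffix.lower())
    if mode is not None:
        return mode
    # Content fallback: PTX text carries .entry or .version directives.
    if text and (".entry" in text or ".version" in text):
        return "ptx"
    return None


print(detect_mode("kernel.cu"))
print(detect_mode("dump.txt", text=".version 8.0\n.entry my_kernel()"))
```

In this sketch a None result corresponds to the unsupported-input case, which the real CLI reports with a nonzero exit code.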
Comparison routing
For new CUDA before/after reviews, prefer frx compare baseline.cu optimized.cu. The older --before and --after flags still auto-detect one evidence file per side and remain useful for CSV-only or PTX-only comparisons. --baseline and --optimized remain deprecated aliases.
Human reports
| Mode | Default terminal report |
|---|---|
| Single analysis | File type, diagnostic confidence, primary bottleneck, secondary bottlenecks, top recommendations, key evidence |
| CUDA source | Kernel count, launch count, findings by severity, launch advisor highlights |
| PTX | Kernel count, target, register pressure, spills, global/shared memory mix, FP64 usage, recommendations |
| NCU | Kernel count, measured bottleneck rank, DRAM/cache/issue/occupancy/stall evidence, recommendations |
| Comparison | Winner, score delta, dimensions won, resolved findings, new regressions, most important metric deltas |
JSON wrappers
--json or --output-json prints the raw result object under a stable mode wrapper for every supported path. Single-file analysis and comparison modes use the same contract.
{"mode": "ptx", "result": ...}
{"mode": "cuda_source", "result": ...}
{"mode": "ncu", "result": ...}
{"mode": "run_bundle", "result": ...}
{"mode": "comparison", "result": ...}
Run bundle data source priority
analyze picks the best available data source in this order:
- derived/summary.json: pre-analyzed, preferred
- raw/trace.jsonl: re-analyzed on the fly
- profiler/profiler_trace.json + gpu_metrics.csv: imported and analyzed
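The priority order above amounts to a first-match lookup. The sketch below shows one way to express it; pick_data_source and its source labels are hypothetical names, and only the three relative paths come from the bundle layout.

```python
from pathlib import Path
from typing import Optional, Tuple


def pick_data_source(run_dir: Path) -> Optional[Tuple[str, Path]]:
    """Return (label, path) for the best available source, or None if empty."""
    candidates = [
        ("derived_summary", run_dir / "derived" / "summary.json"),
        ("raw_trace", run_dir / "raw" / "trace.jsonl"),
        ("profiler_import", run_dir / "profiler" / "profiler_trace.json"),
    ]
    for label, path in candidates:
        if path.is_file():
            return label, path
    return None
```

Calling pick_data_source(Path("runs/run-abc123")) on a freshly collected bundle would be expected to return the derived summary, since collect writes it before zipping.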
Example output
--------------------------------------------------------
GPU Autopilot - Run Analysis
Run : run-a1b2c3d4e5f6
Scope: steady_state (28 steps)
--------------------------------------------------------
VERDICT
Primary Bottleneck : input_bound
Internal Signal : underutilized_gpu (symptom)
Confidence : high (0.88)
Reason : input_bound leads the ranking and matches the dominant stall summary.
EVIDENCE
- Average DataLoader wait fraction is 0.825.
- Run summary dominant stall type is input_bound.
PERFORMANCE SNAPSHOT
Avg GPU Utilization : 1.3%
Avg Memory Util : 12.0%
Peak Memory Pressure: 0.14
Avg Step Time : 207.000 ms
Throughput : 4.8 steps/sec
Dominant Stall : input_bound
TOP RECOMMENDATIONS (3 of 5)
1. [HIGH] Increase DataLoader num_workers
Effort: low | Risk: low | Score: 0.84
DataLoader wait is the dominant stall ...
When underutilized_gpu is the internal top signal but a stall type (e.g. input_bound) is also present, the verdict displays the root cause. The raw internal signal is shown on the Internal Signal line.
Launch-bound traces and near-zero GPU samples
For tiny-kernel workloads, nvidia-smi sampling can report near-zero GPU utilization even when the profiler captured many CUDA kernels. Treat that as bursty GPU activity rather than proof that no GPU work ran. The launch-bound report uses profiler evidence such as kernel_count_per_step, median_cuda_kernel_duration_us, small_kernel_fraction, and stable shapes when available.
VERDICT
Primary Bottleneck : launch_bound
Confidence : medium (0.65)
EVIDENCE
- Profiler saw about 840.0 CUDA kernels per step with median duration 4.200 us.
- GPU utilization sampling stayed low, which is expected for bursty tiny-kernel workloads.
- Shapes were stable, so compile or CUDA graph mitigations are viable.
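To make the profiler evidence concrete, here is a minimal sketch of how the three launch-bound signals could be derived from a flat list of kernel durations. The function launch_signals is hypothetical, and the 10 us small-kernel cutoff is an assumption borrowed from the tiny-kernel example later in this reference, not a documented analyzer constant.

```python
from statistics import median
from typing import Dict, List


def launch_signals(kernel_durations_us: List[float], step_count: int,
                   small_kernel_us: float = 10.0) -> Dict[str, float]:
    """Summarize kernel durations into the signals named in the report."""
    if not kernel_durations_us or step_count <= 0:
        raise ValueError("need at least one kernel and one step")
    small = sum(1 for d in kernel_durations_us if d < small_kernel_us)
    return {
        "kernel_count_per_step": len(kernel_durations_us) / step_count,
        "median_cuda_kernel_duration_us": median(kernel_durations_us),
        "small_kernel_fraction": small / len(kernel_durations_us),
    }


# Four kernels over two steps: three tiny ones and one longer kernel.
print(launch_signals([4.0, 4.0, 5.0, 50.0], step_count=2))
```

A high kernel count per step combined with a small median duration is the pattern the launch_bound verdict describes, regardless of what utilization sampling reports.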
Zip bundle troubleshooting
If frx analyze cannot read a zip bundle, verify the archive contains at least one of derived/summary.json, raw/trace.jsonl, or profiler artifacts. Zip members with path traversal sequences are rejected automatically.
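Before uploading or sharing a bundle, the two conditions above can be checked locally with the standard zipfile module. This is a sketch under assumptions: check_bundle_zip and is_unsafe_member are hypothetical helpers, and the profiler artifact is assumed to be profiler/profiler_trace.json as in the bundle layout.

```python
import zipfile
from pathlib import PurePosixPath
from typing import List

_ANALYZABLE = (
    "derived/summary.json",
    "raw/trace.jsonl",
    "profiler/profiler_trace.json",
)


def is_unsafe_member(name: str) -> bool:
    """True for absolute paths or paths containing a '..' component."""
    path = PurePosixPath(name.replace("\\", "/"))
    return path.is_absolute() or ".." in path.parts


def check_bundle_zip(zip_file) -> List[str]:
    """Return analyzable members; raise ValueError on traversal or empty bundles."""
    with zipfile.ZipFile(zip_file) as zf:
        names = zf.namelist()
    unsafe = [n for n in names if is_unsafe_member(n)]
    if unsafe:
        raise ValueError(f"path traversal in zip members: {unsafe}")
    # Accept both run-<id>/ prefixed and root-layout archives.
    found = [n for n in names
             if any(n == s or n.endswith("/" + s) for s in _ANALYZABLE)]
    if not found:
        raise ValueError("no summary, trace, or profiler artifact in archive")
    return found
```

check_bundle_zip accepts a path or a file-like object, so it can be pointed at runs/run-<id>.zip directly.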
Input errors
A missing path exits nonzero with a clear path error. Unsupported file types also exit nonzero and suggest the supported inputs: run directories, run zip bundles, PTX, CUDA source, and Nsight Compute CSV.
doctor
Checks that all runtime dependencies are present and configured. Exits with code 0 if all checks pass, 1 if any [FAIL] lines appear.
frx doctor
Checks performed
| Check | What it verifies |
|---|---|
| Python | Python version (always passes) |
| torch | PyTorch importable; reports version |
| CUDA available | torch.cuda.is_available(), GPU name and count |
| nvidia-smi | nvidia-smi on PATH (required for gpu_metrics.csv) |
| fournex.profiler | SDK profiler module importable |
| fournex.analysis | Analysis pipeline importable |
frx doctor
[OK] Python 3.12.3
[OK] torch 2.3.0+cu121
[OK] CUDA available NVIDIA A100 x1
[OK] nvidia-smi /usr/bin/nvidia-smi
[OK] fournex.profiler importable
[OK] fournex.analysis importable
All checks passed.
smoke-test
Writes a synthetic input-bound Chrome-format profiler trace, runs the full collect + analysis pipeline end-to-end in a temp directory, and verifies the bundle and diagnosis output. Useful for CI and confirming the install is working.
frx smoke-test
Checks performed:
- Run directory and subdirs exist (raw/, derived/, profiler/)
- derived/summary.json was generated
- manifest.json is present
- Zip bundle was created
- Diagnosis produced primary_bottleneck == input_bound
- At least one recommendation was generated
frx smoke-test
Running smoke test ...
[PASS] create run directory
[PASS] write synthetic profiler trace
[PASS] generate derived/summary.json
[PASS] manifest.json present
[PASS] zip bundle created
[PASS] primary_bottleneck == input_bound
[PASS] recommendations present
[PASS] no unexpected warnings
All smoke-test checks passed.
Tiny-kernel workload
Use this pattern to intentionally stress the launch_bound detector. The workload creates many short CUDA kernels per step so the profiler captures the overhead signal clearly. Run it with frx collect and expect the analyzer to diagnose launch_bound.
Purpose
This is a synthetic stress workload for validating the optimizer pipeline. It is not representative of a production training script — the inefficiencies are intentional.
Patterns that trigger kernel launch overhead
| Pattern | Why it produces many small kernels |
|---|---|
| Repeated elementwise ops in a Python loop | Each op (add, mul, relu …) dispatches a separate CUDA kernel with no fusion |
| Many small matrix multiplications | Each matmul is an independent kernel launch with minimal compute per launch |
| Python loop over tensors | The Python loop serializes dispatches and prevents the JIT from fusing across iterations |
| Small batch size | Reduces arithmetic intensity so each kernel finishes in microseconds, amplifying launch overhead relative to compute |
Example test script
"""tiny_kernel_launch_overhead.py — intentionally launch-bound workload."""
import os
import torch
from fournex.profiler import profile_kernels
OUTPUT_DIR = os.environ.get("FRX_OUTPUT_DIR", "gpu-job-run-tiny-kernels")
STEPS = 20
BATCH = 4  # intentionally tiny
DIM = 64  # intentionally tiny
device = torch.device("cuda")
with profile_kernels(output_dir=OUTPUT_DIR, steps=STEPS):
    for step in range(STEPS):
        x = torch.randn(BATCH, DIM, device=device)
        # Pattern 1: repeated elementwise ops (no fusion)
        for _ in range(50):
            x = x + 0.01
            x = x * 1.001
            x = torch.relu(x)
        # Pattern 2: many small matmuls in a Python loop
        for _ in range(20):
            w = torch.randn(DIM, DIM, device=device)
            x = x @ w
        # Pattern 3: elementwise chain without compile
        x = x.sin().cos().exp().log1p()
        torch.cuda.synchronize()  # flush before next step
Collect command
frx collect \
  --name tiny-kernel-launch-overhead \
  --out runs \
  --sample-interval-ms 100 \
  --artifact-dir gpu-job-run-tiny-kernels \
  -- python tiny_kernel_launch_overhead.py
Expected diagnosis
VERDICT
Primary Bottleneck : launch_bound
Confidence : high (0.85)
Reason : Many short CUDA kernels detected; kernel launch overhead
dominates compute time.
EVIDENCE
- Profiler saw ~840 CUDA kernels per step with median duration 3-6 us.
- Small kernel fraction > 80% (kernels under 10 us).
- Shapes were stable across steps — compile and CUDA graph mitigations viable.
- GPU utilization sampling near-zero (expected for bursty tiny-kernel workloads).
TOP RECOMMENDATIONS
1. [HIGH] Use torch.compile (reduce-overhead mode)
Fuses elementwise chains and reduces dispatch count.
2. [HIGH] Fuse operations manually or use torch.vmap / torch.func
Combine the inner loops into batched ops.
3. [MEDIUM] Increase batch size
More work per kernel raises arithmetic intensity.
4. [MEDIUM] Consider CUDA Graphs
Captures the static kernel sequence and replays without Python overhead.
Applying the mitigations
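import torch
DIM = 64  # same value as the test script
device = torch.device("cuda")
# Mitigation 1: torch.compile (fuses elementwise ops automatically)
@torch.compile(mode="reduce-overhead")
def forward(x):
    for _ in range(50):
        x = x + 0.01
        x = x * 1.001
        x = torch.relu(x)
    return x
# Mitigation 2: collapse the matmul chain into one matmul per step
# Instead of: for _ in range(20): x = x @ w
# Assumes the 20 weights stay fixed across steps, so their product can be
# computed once up front. An einsum such as "bd,nde->be" would sum over the
# weights instead of chaining them, so it is not an equivalent replacement.
weights = torch.randn(20, DIM, DIM, device=device)
combined = torch.linalg.multi_dot(list(weights))  # one-time (DIM, DIM) product
x = torch.randn(4, DIM, device=device)
x = x @ combined  # one kernel per step, not 20
# Mitigation 3: increase batch size to amortize launch cost
BATCH = 256  # was 4
Verifying the fix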
Re-run frx collect after applying mitigations. The diagnosis should shift away from launch_bound — kernel count per step will drop and median kernel duration will increase as operations are fused. Use frx tune --bottleneck launch_bound --no-safe to let the runner benchmark torch.compile and CUDA Graphs automatically.
tune
Runs the experiment runner: captures a baseline, focuses candidate configs from a bottleneck diagnosis, validates safety before each trial, measures an explicit benchmark window, rejects quality regressions, and writes reproducible artifacts. The command remains recommendation-only; it does not rewrite your training config.
frx tune [OPTIONS] -- COMMAND [ARGS...]
Options:
--name NAME Job name for output directories (default: frx-tune)
--out DIR Root output directory (default: runs)
--max-trials N Max candidate configs to try (default: 12)
--safe Tier-0 only: dataloader knobs (default)
--no-safe Also try Tier-1: batch size and mixed precision
--time-budget-s N Kill trial after N seconds (default: 60)
--warmup-steps N Steps to skip before measuring (default: 5)
--measure-steps N Steps to include in measurement (default: 20)
--repeat-count N Repeats per baseline and candidate (default: 1)
--no-race Disable quick candidate screening
--race-promote-count N Candidates promoted from race to full benchmark (default: 3)
--race-warmup-steps N Warmup steps for quick screening (default: 1)
--race-measure-steps N Measurement steps for quick screening (default: 5)
--bottleneck LABEL Focus candidates manually
--min-speedup FLOAT Minimum improvement to recommend (default: 0.08 = 8%)
--allow-risky-actions Allow high-risk candidates
--no-quality-checks Do not require quality checks for precision changes
--max-final-loss-regression FLOAT
--max-loss-divergence FLOAT
--output-abs-tolerance FLOAT
--allow-nonfinite-loss
--sample-interval-ms N GPU sampling interval (default: 1000)
Safety tiers
| Tier | Actions | Flag | Guardrails |
|---|---|---|---|
| 0 — Safe | num_workers, pin_memory, prefetch_factor, persistent_workers | --safe (default) | Exit code, step count, throughput not zero |
| 1 — Validated | batch_size, AMP fp16/bf16 | --no-safe | Same as Tier 0 + memory ratio < 90%, step time regression < 10% |
| 2 — Risky | distributed tuning, custom kernels | Not yet implemented | Requires explicit user approval |
Current implementation adds allocator candidates in the safe tier and runtime candidates such as torch.compile and CUDA Graphs in the validated tier when their preconditions pass.
Staged search order
Candidates are generated in stages so the trial budget is spent efficiently — no brute-force grid across all knob combinations.
Screen   race pass    short benchmark all candidates, then promote top N
Stage 1  dataloader   num_workers × pin_memory grid + prefetch_factor variants
Stage 2  batch size   1.25×, 1.5×, 2× baseline (--no-safe required)
Stage 3  precision    bf16 (Ampere+), fp16 (--no-safe required)
Race-stage trials are screening signals only. The final winner must still come from a full benchmark and pass the normal guard, quality, and noise checks.
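The screen-then-promote idea can be sketched in a few lines. This is an illustration rather than the runner's implementation: race_then_promote and the quick_benchmark callback are hypothetical, and the default of 3 mirrors --race-promote-count.

```python
from typing import Callable, List


def race_then_promote(candidates: List[str],
                      quick_benchmark: Callable[[str], float],
                      promote_count: int = 3) -> List[str]:
    """Rank candidates by a short benchmark and keep the top promote_count."""
    scores = {name: quick_benchmark(name) for name in candidates}
    ranked = sorted(candidates, key=lambda name: scores[name], reverse=True)
    return ranked[:promote_count]


# Fake race throughputs (steps/sec) standing in for real short benchmarks.
fake_race = {"dl:nw=0": 4.86, "dl:nw=2": 5.35, "dl:nw=4": 5.73, "dl:nw=8": 5.67}
print(race_then_promote(list(fake_race), fake_race.__getitem__, promote_count=3))
```

Only the promoted names go on to the full benchmark window, where the guard, quality, and noise checks decide the winner.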
Recommendations vs. tune trials
Recommendations are diagnosis-driven fix cards. They are ranked by signal strength, expected impact, effort, and risk, but they are not proof that a change already improved your workload. Tune trials are executable config candidates that the runner actually benchmarks.
| Surface | Source | Use it for |
|---|---|---|
| Recommendation | Diagnosis + rule catalog | Prioritizing what to inspect or test next |
| Race trial | Short benchmark window | Screening candidates before full measurement |
| Full tune trial | Full benchmark window + guardrails | Choosing the recommendation-only winner |
Diagnosis-focused candidates
The runner now focuses candidates from the baseline diagnosis when it can read one from derived/summary.json. Use --bottleneck to override that focus manually.
| Bottleneck | Candidate family |
|---|---|
| input_bound | DataLoader workers, pin_memory, prefetch_factor |
| copy_bound | Pinned-memory-focused DataLoader candidates |
| launch_bound | torch.compile and CUDA Graphs when --no-safe is enabled |
| memory_pressure | CUDA allocator settings, then mixed precision when --no-safe is enabled |
| underutilized_gpu | Batch size, mixed precision, then runtime candidates when --no-safe is enabled |
Pre-run safety validation
Unsafe candidates are skipped before execution. They still get a trial directory with config.yaml, metrics.json, and stderr.log explaining the rejection reason.
| Check | Rejects when |
|---|---|
| Risk policy | Candidate is high risk and --allow-risky-actions is not set |
| Batch size | Memory headroom is below the safe threshold |
| Precision | CUDA is unavailable, bf16 is unsupported, or quality checks are required |
| CUDA Graphs | Shapes appear dynamic or CUDA is unavailable |
| torch.compile | Compile is marked unsupported or dynamic behavior is incompatible |
Benchmark window
Each trial writes an explicit benchmark_window.json. Metrics prefer measurement_window when per-step data is available, then fall back to steady_state and full run metrics.
benchmark_window.json
{
"warmup_steps": 5,
"measurement_steps": 20,
"repeat_count": 1,
"timeout_s": 60
}
Env vars injected per trial
Each trial subprocess receives the standard FRX_* collect vars plus these tune-specific ones. The workload reads them to configure itself — see the SDK integration section for how to wire them up.
| Variable | Set by | Purpose |
|---|---|---|
| FRX_TUNE_WARMUP_STEPS | tune runner | Steps to skip before measurement; workload should exit early |
| FRX_TUNE_MEASURE_STEPS | tune runner | Measurement steps requested |
| FRX_TUNE_MAX_STEPS | tune runner | Total steps (warmup + measure); workload exits at this count |
| FRX_TUNE_REPEAT_COUNT | tune runner | Repeat count used for noise-aware comparison |
| FRX_NUM_WORKERS | dataloader tuner | DataLoader num_workers value to use |
| FRX_PIN_MEMORY | dataloader tuner | 'true' or 'false' |
| FRX_PREFETCH_FACTOR | dataloader tuner | DataLoader prefetch_factor value |
| FRX_PERSISTENT_WORKERS | dataloader tuner | 'true' or 'false' |
| FRX_BATCH_SIZE | batch size tuner | Absolute batch size to use (Tier 1) |
| FRX_AMP_DTYPE | mixed precision tuner | 'bfloat16' or 'float16' (Tier 1) |
| FRX_TORCH_COMPILE | runtime tuner | Enable torch.compile when supported |
| FRX_TORCH_COMPILE_MODE | runtime tuner | Compile mode such as reduce-overhead |
| FRX_CUDA_GRAPHS | runtime tuner | try_if_static_shapes |
| PYTORCH_CUDA_ALLOC_CONF | memory tuner | CUDA allocator configuration |
Trial artifacts
runs/
  tune-<id>/
    baseline/
      config.yaml
      benchmark_window.json
      metrics.json
      stdout.log
      stderr.log
      derived/summary.json
      raw/trace.jsonl
    race/
      <candidate-id>/
        config.yaml
        benchmark_window.json
        metrics.json
        stdout.log
        stderr.log
    <candidate-id>/
      config.yaml
      benchmark_window.json
      metrics.json
      stdout.log
      stderr.log
    autopilot_report.json
    report.md
Workload integration
The workload reads the injected env vars and applies them. The minimal pattern for dataloader tuning:
import os
from torch.utils.data import DataLoader
num_workers = int(os.environ.get("FRX_NUM_WORKERS", "4"))
pin_memory = os.environ.get("FRX_PIN_MEMORY", "true") == "true"
prefetch_factor = int(os.environ.get("FRX_PREFETCH_FACTOR", "2"))
persistent = os.environ.get("FRX_PERSISTENT_WORKERS", "true") == "true"
max_steps = int(os.environ.get("FRX_TUNE_MAX_STEPS", "0")) or None
# dataset and batch_size come from your existing training script
loader = DataLoader(
    dataset,
    batch_size=batch_size,
    num_workers=num_workers,
    pin_memory=pin_memory,
    # prefetch_factor and persistent_workers are only valid with worker processes
    prefetch_factor=prefetch_factor if num_workers > 0 else None,
    persistent_workers=persistent and num_workers > 0,
)
for step, batch in enumerate(loader):
    if max_steps and step >= max_steps:
        break
    # ... training step ...
For AMP and batch size (--no-safe):
import os
import torch
amp_dtype_str = os.environ.get("FRX_AMP_DTYPE")  # "bfloat16" | "float16" | None
amp_dtype = getattr(torch, amp_dtype_str, None) if amp_dtype_str else None
batch_size = int(os.environ.get("FRX_BATCH_SIZE", "32"))
with torch.autocast("cuda", dtype=amp_dtype, enabled=amp_dtype is not None):
    loss = model(batch)
Quality regression gates
A faster candidate is rejected if quality metrics regress. Loss is read from step_end.payload.loss when the workload emits it, and output drift checks are used when present in the summary quality fields.
| Gate | Default | Flag |
|---|---|---|
| Final loss vs baseline | Reject if worse by more than 5% | --max-final-loss-regression |
| Trial loss divergence | Reject if final loss grows more than 50% | --max-loss-divergence |
| NaN/Inf loss | Reject | --allow-nonfinite-loss |
| Output absolute drift | Reject above 0.005 when reported | --output-abs-tolerance |
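A minimal sketch of how the four gates might combine is shown below. It is not the runner's code: the function name, the failure labels, and the reading of "loss divergence" as final trial loss versus first trial loss are assumptions, and it presumes positive loss values. The default thresholds are the ones in the table.

```python
import math
from typing import List, Optional


def quality_gate_failures(baseline_final_loss: float,
                          trial_losses: List[float],
                          output_abs_drift: Optional[float] = None,
                          max_final_loss_regression: float = 0.05,
                          max_loss_divergence: float = 0.50,
                          output_abs_tolerance: float = 0.005,
                          allow_nonfinite_loss: bool = False) -> List[str]:
    """Return the names of failed gates; an empty list means the trial passes."""
    failures = []
    finite = [x for x in trial_losses if math.isfinite(x)]
    if len(finite) != len(trial_losses) and not allow_nonfinite_loss:
        failures.append("nonfinite_loss")
    if finite:
        first, final = finite[0], finite[-1]
        if final > baseline_final_loss * (1 + max_final_loss_regression):
            failures.append("final_loss_regression")
        # Assumption: divergence compares the trial's final loss to its first.
        if final > first * (1 + max_loss_divergence):
            failures.append("loss_divergence")
    if output_abs_drift is not None and output_abs_drift > output_abs_tolerance:
        failures.append("output_drift")
    return failures


# A trial ending at loss 1.2 against a baseline of 1.0 fails the 5% gate.
print(quality_gate_failures(1.0, [2.0, 1.5, 1.2]))
```

Output drift is only checked when a value is reported, matching the "when reported" wording in the table.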
Example output
frx autopilot — starting tune run tune-3f8a12b4
Workload : python train.py
Max trials: 12 | Time budget: 60s/trial
Running baseline...
Baseline: 4.8 steps/sec (exit=0, steps=25)
Generated 8 candidates
Running quick race stage (1 warmup + 5 measure steps)...
[1/8] race: dl:nw=0,pin=T ...
[RACE] dl:nw=0,pin=T +1.2% (exit=0, steps=6)
[2/8] race: dl:nw=2,pin=T ...
[RACE] dl:nw=2,pin=T +11.4% (exit=0, steps=6)
[3/8] race: dl:nw=4,pin=T ...
[RACE] dl:nw=4,pin=T +19.3% (exit=0, steps=6)
...
Quick race promoted 3 of 8 candidates
[1/3] full: dl:nw=4,pin=T ...
[PASS] dl:nw=4,pin=T +19.3% (exit=0, steps=25)
[2/3] full: dl:nw=8,pin=T ...
[PASS] dl:nw=8,pin=T +18.1% (exit=0, steps=25)
[3/3] full: amp:fp16 ...
[FAIL] amp:fp16 +28.0% (exit=0, steps=25)
! quality regression: final loss 1.2 exceeds baseline 1 by more than 5%
Report saved: runs/tune-3f8a12b4/autopilot_report.json
Markdown report saved: runs/tune-3f8a12b4/report.md
──────────────────────────────────────────────────────────
frx autopilot — Tune Report
Job : frx-tune
Trials : 8 candidates + baseline
──────────────────────────────────────────────────────────
BASELINE
Throughput : 4.80 steps/sec
Avg step : 208.3 ms
GPU util : 1.3%
Dominant stall: input_bound
TRIAL RESULTS
dl:nw=4,pin=T +19.3% ✓
dl:nw=8,pin=T +18.1% ✓
dl:nw=4,pin=T [RACE] +19.3% promoted to full benchmark
dl:nw=8,pin=T [RACE] +18.1% promoted to full benchmark
dl:nw=2,pin=T [RACE] +11.4% screened out by quicker candidates
dl:nw=0,pin=T [RACE] +1.2% screened out by quicker candidates
WINNER
Config : dl:nw=4,pin=T
Throughput : 5.73 steps/sec (+19.3% vs baseline)
Avg step : 174.5 ms
GPU util : 4.1%
ENV VARS TO APPLY
FRX_NUM_WORKERS=4
FRX_PIN_MEMORY=true
FRX_PERSISTENT_WORKERS=true
FRX_PREFETCH_FACTOR=2
Applied: No — recommendation only
To apply: set the env vars above before launching your workload.
Promotion thresholds
A candidate is promoted only if it clears all of these. Noisy sub-threshold improvements are not recommended.
| Guard | Default | Flag |
|---|---|---|
| Minimum throughput improvement | ≥ 8% | --min-speedup |
| Peak GPU memory ratio | < 90% | — |
| Step time regression | < 10% worse than baseline | — |
| Exit code | 0 (clean exit) | — |
| Minimum steps captured | ≥ 3 | — |
| Quality gates | Loss and numerics must pass | quality flags above |
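Taken together, the guards in the table act as a conjunction. The sketch below is one hypothetical way to write that check; TrialMetrics and promotion_failures are illustrative names, and the thresholds are the defaults from the table.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TrialMetrics:
    throughput: float          # steps/sec
    avg_step_ms: float
    peak_memory_ratio: float   # 0.0 to 1.0
    exit_code: int
    steps_captured: int
    quality_passed: bool = True


def promotion_failures(baseline: TrialMetrics, trial: TrialMetrics,
                       min_speedup: float = 0.08) -> List[str]:
    """Return the guards a trial fails; an empty list means it can be promoted."""
    failures = []
    speedup = trial.throughput / baseline.throughput - 1.0
    if speedup < min_speedup:
        failures.append("min_speedup")
    if trial.peak_memory_ratio >= 0.90:
        failures.append("peak_memory_ratio")
    if trial.avg_step_ms >= baseline.avg_step_ms * 1.10:
        failures.append("step_time_regression")
    if trial.exit_code != 0:
        failures.append("exit_code")
    if trial.steps_captured < 3:
        failures.append("min_steps")
    if not trial.quality_passed:
        failures.append("quality_gates")
    return failures


baseline = TrialMetrics(4.80, 208.3, 0.14, 0, 25)
winner = TrialMetrics(5.73, 174.5, 0.14, 0, 25)
print(promotion_failures(baseline, winner))
```

The throughput and step-time figures reuse the example report above; the memory ratios are placeholders chosen for illustration.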
Current boundary
Repeated trials now use median throughput and a measured noise band. Interleaved ordering such as baseline A, trial, baseline B is still future comparator work.
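As an illustration of a median-plus-noise-band comparison, consider the sketch below. The noise band definition here, (max - min) / median over the repeats, is an assumption chosen for simplicity; the reference does not specify how the runner measures noise, and both function names are hypothetical.

```python
from statistics import median
from typing import List, Tuple


def summarize_repeats(throughputs: List[float]) -> Tuple[float, float]:
    """Return (median throughput, relative noise band) for repeated runs."""
    mid = median(throughputs)
    noise = (max(throughputs) - min(throughputs)) / mid
    return mid, noise


def beats_noise(baseline_repeats: List[float], trial_repeats: List[float]) -> bool:
    """True when the median speedup exceeds the larger of the two noise bands."""
    base_mid, base_noise = summarize_repeats(baseline_repeats)
    trial_mid, trial_noise = summarize_repeats(trial_repeats)
    speedup = trial_mid / base_mid - 1.0
    return speedup > max(base_noise, trial_noise)


print(beats_noise([4.7, 4.8, 4.9], [5.6, 5.7, 5.8]))
print(beats_noise([4.7, 4.8, 4.9], [4.8, 4.9, 5.0]))
```

With --repeat-count 1 there is no spread to measure, so a comparison like this only becomes meaningful with two or more repeats per side.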
Bundle layout
Each collect run produces one directory under --out (default runs/) and a zip of it.
runs/
  run-<id>/
    metadata.json            # Run metadata, artifact list, warnings
    manifest.json            # Included files, limited-data flag
    run_config.yaml          # Collector config + detected environment
    gpu_metrics.csv          # nvidia-smi samples (util %, memory, clocks)
    optional_logs.txt        # Combined workload stdout + stderr
    raw/
      trace.jsonl            # SDK event stream (one JSON object per line)
    derived/
      summary.json           # Pre-analyzed output, preferred by analyzer
    profiler/
      profiler_trace.json    # Chrome-format torch.profiler trace (imported)
  run-<id>.zip               # All of the above, zipped for upload
Root-layout compatibility
frx analyze also accepts zip archives that omit the run-<id>/ prefix and place derived/summary.json, raw/trace.jsonl, and profiler artifacts at the archive root. Both layouts are detected automatically.
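To make the two layouts concrete, the sketch below shows one way a caller could tell them apart from the archive's member names. `bundle_prefix` and `make_zip` are hypothetical helpers written for this example. They are not part of the frx CLI, and the real detection logic may differ.

```python
import io
import zipfile

# Artifact paths taken from the bundle layout; any one of them identifies the root.
MARKERS = ("derived/summary.json", "raw/trace.jsonl", "profiler/profiler_trace.json")


def bundle_prefix(zip_names):
    """Return "" for a root-layout archive, the "run-<id>/" style prefix for a
    prefixed archive, or None when no recognized artifact is present."""
    for name in zip_names:
        for marker in MARKERS:
            if name == marker:
                return ""
            if name.endswith("/" + marker):
                return name[: -len(marker)]
    return None


def make_zip(paths):
    """Build a small in-memory archive so the example runs without files on disk."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for path in paths:
            zf.writestr(path, "{}")
    buf.seek(0)
    return zipfile.ZipFile(buf)


prefixed = make_zip(["run-abc/metadata.json", "run-abc/derived/summary.json"])
rooted = make_zip(["metadata.json", "raw/trace.jsonl"])
print(repr(bundle_prefix(prefixed.namelist())))
print(repr(bundle_prefix(rooted.namelist())))
```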
File roles
| File | Source | Required for analysis |
|---|---|---|
| derived/summary.json | Generated by collect | Preferred — fastest path |
| raw/trace.jsonl | SDK auto-persist | Yes, if no derived summary |
| profiler/profiler_trace.json | Imported from --artifact-dir | Fallback if no SDK trace |
| gpu_metrics.csv | nvidia-smi poller | Enriches GPU util data |
| metadata.json | Generated by collect | No (informational) |
| run_config.yaml | Generated by collect | No (informational) |
| optional_logs.txt | Workload stdout/stderr | No (debugging) |
The web analyzer scores bundle files when you upload multiple files at once. derived/summary.json scores highest (120 pts) and is used automatically when present.
Analysis pipeline
The analysis pipeline is pure Python with no GPU required. It accepts the SDK event stream or events reconstructed from a Chrome-format profiler trace, and produces a structured summary dict.
from fournex.analysis import summarize_run_with_steady_state

summary = summarize_run_with_steady_state(events)
# summary["steady_state"]["diagnosis"]["user_facing_bottleneck"]
# → "input_bound"
Summary shape
| Key | Description |
|---|---|
| event_count | Total events in the input stream |
| step_count | Steps detected across the full run |
| selector | steady_state window policy (skip_first_n, last_k) |
| run | Scope object for all steps |
| steady_state | Scope object for warm-up-excluded steps |
| scope_comparison | Whether primary bottleneck changed between scopes |
Each scope object contains per_step (timing breakdown per step), run_summary (aggregated metrics), bottlenecks (scored list), and diagnosis (primary bottleneck + recommendations).
Symptom vs. root cause
underutilized_gpu often scores highest (the GPU is idle) but it is a symptom, not a cause. When a stall-type bottleneck (e.g. input_bound) is also present, the diagnosis.user_facing_bottleneck field is set to that root cause. The internal top signal is preserved in diagnosis.primary_bottleneck.
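A minimal sketch of that promotion rule, assuming the stall-type labels are `input_bound`, `copy_bound`, and `sync_bound` and that each bottleneck is a dict with `label` and `score`. The helper name `select_diagnosis` is hypothetical and is not the pipeline's actual function.

```python
# Assumed set of stall-type labels that can replace the symptom.
STALL_LABELS = {"input_bound", "copy_bound", "sync_bound"}


def select_diagnosis(bottlenecks):
    """Keep the top-scored label as primary, but surface a stall-type root
    cause to the user when the top signal is only the idle-GPU symptom."""
    ranked = sorted(bottlenecks, key=lambda b: b["score"], reverse=True)
    primary = ranked[0]["label"]
    user_facing = primary
    if primary == "underutilized_gpu":
        for candidate in ranked[1:]:
            if candidate["label"] in STALL_LABELS:
                user_facing = candidate["label"]
                break
    return {"primary_bottleneck": primary, "user_facing_bottleneck": user_facing}


diagnosis = select_diagnosis([
    {"label": "underutilized_gpu", "score": 0.91},
    {"label": "input_bound", "score": 0.78},
])
print(diagnosis)
```

With the two scored entries above, the result keeps `underutilized_gpu` as the internal signal and reports `input_bound` to the user, which is the shape shown next.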
{
  "primary_bottleneck": "underutilized_gpu",   // internal top signal
  "user_facing_bottleneck": "input_bound",     // shown to users
  ...
}
Framework Abstraction Tax
Runtime summaries can include framework_abstraction_tax when profiler windows are available. The score estimates GPU idle time that is not already explained by input, copy, or sync stalls, then scales it by launch-stream fragmentation.
| Signal | Interpretation |
|---|---|
| score | 0-100 headline value; current bands are reasoned V1 thresholds, not speedup-calibrated |
| severity | Low, medium, or high label derived from score bands |
| contributors | Drivers such as kernel launch fragmentation plus inferred graph-capture or fusion opportunities |
| inferred | true when Fournex is suggesting an opportunity rather than asserting framework state |
The key guardrail is subtraction of data-pipeline idle: an input-bound workload should not be mislabeled as framework tax just because the GPU is idle.
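The subtraction guardrail can be illustrated with a toy calculation. The formula below is an assumption made for this example (unexplained idle multiplied by a 0 to 1 fragmentation factor, scaled to 100). It is not the scoring used in framework_abstraction_tax.py, and no severity bands are implied.

```python
def toy_framework_tax(idle_frac, input_frac, copy_frac, sync_frac, fragmentation):
    """Illustrative only: score the GPU idle time that input, copy, and sync
    stalls do not explain, scaled by a 0..1 launch-fragmentation factor."""
    unexplained = max(0.0, idle_frac - input_frac - copy_frac - sync_frac)
    fragmentation = min(max(fragmentation, 0.0), 1.0)
    return round(100.0 * unexplained * fragmentation, 1)


# Input-bound run: the idle time is already explained by the data pipeline.
input_bound_score = toy_framework_tax(0.80, 0.75, 0.03, 0.02, fragmentation=0.9)

# Fragmented run: the same idle time with almost no pipeline stalls.
fragmented_score = toy_framework_tax(0.80, 0.05, 0.0, 0.0, fragmentation=0.9)

print(input_bound_score, fragmented_score)
```

Both runs have the same idle GPU fraction, but only the second one keeps a meaningful score once the explained stalls are subtracted.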
System components
Fournex has separate analyzers for runtime events, source, PTX, NCU CSVs, and before/after evidence. The recommendation layer sits above those analyzers so CLI, API, and web reports use the same rules and catalog entries.
| Component | Role |
|---|---|
| ncu_analysis.py | Ingests wide-format Nsight Compute CSV, extracts warp stall types, DRAM/L1/L2 throughput, tensor core utilization, ISU, occupancy, launch-resource metrics, and kernel attribution. Produces NcuResult with bottleneck classification, ranked recommendations, and top kernel opportunities. |
| kernel_inspector.py | Static CUDA inspector using pure-regex source analysis for memory access, thread indexing, sync hazards, data types, resource pressure, and NCU kernel duration aliases. Its source summary feeds CUDA antipattern rules and comparison scoring. |
| kernel_attribution.py | Computes per-kernel runtime share, MFU percentage, roofline region, opportunity label, and opportunity_score from NCU kernel summaries and architecture profiles. |
| ptx_analysis.py | Pure-regex PTX parser for virtual register pressure, spill detection, instruction mix, memory operations, tensor operations, and control-flow risk. |
| reconciliation.py | Merges source, PTX, NCU, and profiler signals into 6 unified diagnoses with confidence labels. Exposed through POST /reconcile. |
| framework_abstraction_tax.py | Computes a conservative framework abstraction tax score from runtime summaries by isolating GPU idle not explained by input, copy, or sync stalls and scaling it by launch-stream fragmentation. |
| recommendations/engine.py | Combines extracted signals, YAML rules, and catalog entries. It attaches current_value to validation steps and threads estimated_speedup_pct_min/max through recommendation output. |
| explain.py | Builds explain results for NCU CSVs and training run directories, then renders frx_summary.txt, frx_llm_prompt.txt, and frx_evidence.json. |
| bench.py | Runs compile_kernel, time_binary with warmup-discarded wall-clock timings, optional profile_with_ncu, and bench_compare. Produces frx_bench_v0 output. |
Data flows
These are the main execution paths behind the CLI. All three produce structured dictionaries first, then render human output for the terminal or files.
Analyze flow
frx analyze TARGET
  -> detect input type
  -> analyze bundle, CUDA source, PTX, or NCU CSV
  -> generate recommendations
  -> print diagnosis or JSON wrapper
Explain flow
frx profile --ncu report.csv --explain
  -> analyze NCU counters
  -> reuse the in-memory ncu_result
  -> build the bottleneck-specific brief
  -> write frx_summary.txt
  -> write frx_llm_prompt.txt
  -> write frx_evidence.json

frx collect --explain -- python train.py
  -> collect and analyze the new run directory
  -> build the training telemetry brief
  -> write frx_summary.txt, frx_llm_prompt.txt, frx_evidence.json
Bench flow
frx bench bad.cu good.cu
  -> compile_kernel x2 with nvcc
  -> time_binary x2 with warmup + measured runs
  -> optional profile_with_ncu x2 to collect CSV evidence
  -> bench_compare
  -> print speedup table + bottleneck diff
Architecture-Aware Scoring
GPU model selection is a scoring input, not a display hint. Fournex now detects common NVIDIA product names automatically when a GPU name is available. Pass --gpu-model only when you need to override the detected model or analyze code for a different target deployment.
# Automatic when the environment reports a known GPU name
NVIDIA RTX 5060 -> rtx5060
NVIDIA H100 80GB HBM3 -> h100
NVIDIA A100-SXM4-80GB -> a100

# Explicit override still has highest precedence
frx analyze kernel.cu --gpu-model H100
frx compare baseline.cu optimized.cu --gpu-model RTX4090
frx profile --ptx kernel.ptx --gpu-model L4
| Family | Examples | Why it matters |
|---|---|---|
| sm_75 | Turing / T4 | Lower shared-memory threshold, older tensor-core alignment expectations |
| sm_80 | Ampere / A100 | Fallback thresholds when no GPU model is detected or supplied |
| sm_86 | RTX 30xx | Ampere-family consumer GPU limits |
| sm_89 | Ada / L4 / RTX 40xx | Ada launch and resource thresholds |
| sm_90 | Hopper / H100 | Higher shared-memory tolerance, relaxed register pressure, Hopper tensor-core alignment |
| sm_100 | Blackwell datacenter | Blackwell-family placeholder for datacenter parts |
| sm_120 | Blackwell consumer / RTX 50xx | Consumer Blackwell calibration, including RTX 5060-class limits |
Product names are accepted too: T4, A100, H100, L4, RTX4090, RTX5060, and similar aliases resolve to an SM family. If no model is detected or provided, Fournex uses Ampere defaults, which may flag shared-memory usage that is fine on Hopper.
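As a rough illustration of alias resolution with the Ampere fallback, the sketch below matches whole tokens of the reported name against a small alias table. The table holds only the aliases named above. `resolve_sm_family` is a hypothetical helper, and the real detector may match names differently.

```python
import re

# Aliases named in this section; the shipped alias list is likely longer.
ALIAS_TO_SM = {
    "t4": "sm_75",
    "a100": "sm_80",
    "l4": "sm_89",
    "rtx4090": "sm_89",
    "h100": "sm_90",
    "rtx5060": "sm_120",
}
DEFAULT_SM = "sm_80"  # Ampere defaults when nothing is detected or supplied


def resolve_sm_family(gpu_name):
    """Match whole tokens (and adjacent token pairs such as "rtx" + "5060")
    so short aliases like "t4" cannot match inside unrelated words."""
    tokens = re.findall(r"[a-z0-9]+", (gpu_name or "").lower())
    candidates = set(tokens) | {a + b for a, b in zip(tokens, tokens[1:])}
    for alias, family in ALIAS_TO_SM.items():
        if alias in candidates:
            return family
    return DEFAULT_SM


for name in ("NVIDIA H100 80GB HBM3", "NVIDIA A100-SXM4-80GB", "NVIDIA RTX 5060", None):
    print(name, "->", resolve_sm_family(name))
```

Token matching is used instead of substring search because a two-character alias would otherwise match inside longer model names.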
| Signal | Architecture effect |
|---|---|
| Shared memory | Large shared-memory usage can be risky at 48 KB on Turing but acceptable near 96 KB on Hopper. |
| Register pressure | Ampere flags lower register-variable counts than Hopper, where larger kernels can be normal. |
| Tensor core alignment | WMMA paths expect 16-multiple alignment; Hopper WGMMA paths can require 64-multiple alignment. |
CUDA analysis schemas
The PyTorch SDK event IR remains the run-bundle event model. CUDA analysis adds two output schemas beside it for explain and bench workflows.
frx_explain_v0
Produced by build_explain_result(ncu_result, static_result). It is designed for human summaries, LLM prompts, and dashboard ingestion.
| Field | Description |
|---|---|
| schema | Schema name: frx_explain_v0 |
| layers_available | Which layers contributed evidence |
| primary_diagnosis | Top diagnosis selected for the brief |
| diagnoses | All diagnosis objects and confidence labels |
| key_metrics | Profiler and NCU metrics used in the explanation |
| static_findings | CUDA source findings included in the brief |
| ncu_bottlenecks | NCU bottleneck labels with score 0-1 |
| top_kernels | NCU kernels ranked by opportunity score with MFU, roofline region, and runtime share |
| roofline | NCU roofline region and MFU summary |
| occupancy_summary | NCU occupancy limiter and efficiency breakdown |
| top_recommendations | Recommendation cards with id, title, priority, tier, score, estimated_speedup_pct_min/max, why, first two actions, and validation_steps |
| missing_data | Missing layer and reason entries for evidence the user should collect next |
frx_bench_v0
Produced by bench_compare(before_src, after_src, ...). It reports compile status, timing distributions, optional NCU evidence diffs, and the computed speedup.
| Field | Description |
|---|---|
| schema | Schema name: frx_bench_v0 |
| arch | Architecture or GPU model used for compilation/profiling |
| before | Source path plus timing: median_ms, min_ms, max_ms, stdev_ms, runs, and warmup |
| after | Same timing object for the optimized side |
| speedup_x | before.median_ms / after.median_ms; values greater than 1 mean after is faster |
| ncu_diff | null or diff_ncu_runs output when optional NCU profiling is enabled |
| compile_errors | Compilation failures by side |
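The timing object and the speedup ratio can be sketched with the standard library. This assumes warmup runs are simply dropped from the front of the sample list and that `runs` counts measured runs only. Both are assumptions for illustration, and `timing_summary` is a hypothetical helper rather than the bench.py implementation.

```python
import statistics


def timing_summary(samples_ms, warmup):
    """Discard the first `warmup` samples, then summarize the measured runs."""
    measured = samples_ms[warmup:]
    return {
        "median_ms": statistics.median(measured),
        "min_ms": min(measured),
        "max_ms": max(measured),
        "stdev_ms": statistics.stdev(measured) if len(measured) > 1 else 0.0,
        "runs": len(measured),
        "warmup": warmup,
    }


def speedup_x(before, after):
    """Values greater than 1 mean the after side is faster."""
    return before["median_ms"] / after["median_ms"]


# Made-up wall-clock samples; the first run on each side is a cold start.
before = timing_summary([9.0, 4.1, 4.0, 4.2, 4.0], warmup=1)
after = timing_summary([5.0, 2.0, 2.1, 2.0, 1.9], warmup=1)
print(before["median_ms"], after["median_ms"], speedup_x(before, after))
```

Using the median keeps a single slow outlier from moving the ratio, and dropping warmup runs keeps cold-start cost out of the comparison.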
validation_step shape
{
  "metric": "NCU metric name",
  "label": "human-readable metric label",
  "direction": "decrease | increase | stable",
  "threshold_good": 50.0,
  "expected": "rises above 50%",
  "current_value": 7.0
}
current_value is a float when measured from the actual NCU run and null when that metric was not in the CSV.
Bottleneck labels
| Label | Display name | Signal |
|---|---|---|
| input_bound | Input Pipeline Starvation | DataLoader wait ≥ 20% of step time |
| copy_bound | Host-to-Device Copy Overhead | H2D copy time ≥ 15% of step time |
| sync_bound | Synchronization Overhead | Sync wait ≥ 10% of step time |
| underutilized_gpu | GPU Under-utilization | GPU utilization < 35% (symptom) |
| memory_pressure | Memory Pressure | Peak memory ratio ≥ 90% |
| shape_instability | Shape Instability | Shape volatility ratio ≥ 30% |
| launch_bound | Kernel Launch Overhead | Profiler windows with many short kernels, stable shapes, low sampled util, and no dominant input/copy/sync stall |
| memory_bandwidth_bound | Memory Bandwidth Saturation | NCU DRAM throughput is high and memory stalls dominate |
| warp_stall_memory | Memory Warp Stalls | NCU dominant warp stall is a memory stall reason |
| warp_stall_sync | Synchronization Warp Stalls | NCU dominant warp stall is barrier or wait |
| l1_cache_thrashing | L1 Cache Thrashing | NCU L1 cache hit rate is below 40% |
| l2_cache_thrashing | L2 Cache Thrashing | NCU L2 cache hit rate is below 50% without L1-heavy cache misses |
| uncoalesced_access | Uncoalesced Global Loads | NCU global load sectors per request is above 4 |
| tensor_core_underutilized | Tensor Core Under-utilization | NCU tensor pipe utilization is low while occupancy is adequate |
| occupancy_limited | Occupancy Limited | Measured achieved occupancy is below threshold |
| occupancy_limited_by_registers | Occupancy Limited by Registers | Low occupancy with registers identified as the launch resource limit |
| occupancy_limited_by_shared_memory | Occupancy Limited by Shared Memory | Low occupancy with shared memory per block limiting residency |
| occupancy_limited_by_block_size | Occupancy Limited by Block Size | Low occupancy from too few resident threads or blocks per SM |
| low_warp_scheduler_utilization | Low Warp Scheduler Utilization | Few eligible warps per scheduler cycle or low scheduler active percentage |
| low_issue_efficiency | Low Issue Efficiency | NCU issue slot utilization is below threshold |
| insufficient_telemetry | Insufficient Telemetry | No timing data and no GPU util samples |
| insufficient_ncu_data | Insufficient NCU Data | NCU CSV parsed but did not contain enough recognized metrics to classify |
Labels are stable identifiers used in summary.json, CLI output, and the web frontend. The recommendation engine maps each label to a set of ranked fix cards.
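The runtime thresholds in the table translate directly into boundary checks. The sketch below covers only the six labels with a single numeric threshold. The signal key names are hypothetical, and the real classifiers also score and rank their matches.

```python
# (label, assumed signal key, predicate) using the documented thresholds.
RUNTIME_RULES = [
    ("input_bound", "dataloader_wait_frac", lambda v: v >= 0.20),
    ("copy_bound", "h2d_copy_frac", lambda v: v >= 0.15),
    ("sync_bound", "sync_wait_frac", lambda v: v >= 0.10),
    ("underutilized_gpu", "gpu_util_frac", lambda v: v < 0.35),
    ("memory_pressure", "peak_memory_ratio", lambda v: v >= 0.90),
    ("shape_instability", "shape_volatility_ratio", lambda v: v >= 0.30),
]


def matching_labels(signals):
    """Return every label whose threshold is met; a missing signal never fires."""
    if not signals:
        return ["insufficient_telemetry"]
    return [
        label
        for label, key, predicate in RUNTIME_RULES
        if signals.get(key) is not None and predicate(signals[key])
    ]


labels = matching_labels({"dataloader_wait_frac": 0.42, "gpu_util_frac": 0.013})
print(labels)
```

Treating a missing signal as "does not fire" rather than as zero matters for `underutilized_gpu`, whose condition is a less-than comparison.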
Recommendation reference
Recommendation IDs are stable keys from the rule catalog. They appear in API responses, JSON summaries, CLI recommendation cards, and withheld-recommendation explanations.
Validation commands
NCU recommendation cards now include a Validate section with the exact counter command to run, the expected direction of movement, the current value when present in the NCU CSV, and an optional target. This keeps fixes testable instead of relying on intuition.
Validate:
ncu --metrics dram__throughput.avg.pct_of_peak_sustained_elapsed,l1tex__t_sector_hit_rate.pct \
--csv ./report.csv ./your_app
<-- DRAM throughput %: was 82.0; decreases as shared memory absorbs repeated global accesses
--> L1 cache hit rate %: was 31.0; increases as shared-memory tiling reduces L1 misses (target: 60.0)
If a metric was not present in the NCU CSV, the was X prefix is omitted rather than fabricated. In JSON, each validation step includes current_value, which is numeric when known and null when the evidence was not collected.
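A short sketch of how a validation step could be rendered into one of those lines, including the rule that the "was X" prefix is dropped when current_value is null. `render_validation_line` is a hypothetical helper, and the marker used for the stable direction is an assumption because the sample output only shows decrease and increase.

```python
def render_validation_line(step):
    """Render one validation step; omit "was X" when no value was measured."""
    arrow = {"decrease": "<--", "increase": "-->"}.get(step["direction"], "---")
    parts = []
    if step.get("current_value") is not None:
        parts.append(f"was {step['current_value']}")
    parts.append(step["expected"])
    line = f"{arrow} {step['label']}: " + "; ".join(parts)
    if step.get("threshold_good") is not None:
        line += f" (target: {step['threshold_good']})"
    return line


measured = {
    "label": "L1 cache hit rate %",
    "direction": "increase",
    "expected": "increases as shared-memory tiling reduces L1 misses",
    "threshold_good": 60.0,
    "current_value": 31.0,
}
unmeasured = dict(measured, current_value=None)

print(render_validation_line(measured))
print(render_validation_line(unmeasured))
```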
PTX static-analysis recommendations
| ID | Emitted for | Guidance |
|---|---|---|
| rec_ptx_reduce_register_pressure | ptx_register_spills or ptx_register_pressure | Reduce live ranges, split large kernels, lower per-thread temporaries, or inspect compiler spill causes. |
| rec_ptx_stage_global_memory | ptx_global_memory_heavy | Stage reused global-memory data through shared memory or improve access locality when the access pattern allows it. |
| rec_ptx_reduce_fp64 | ptx_fp64_usage | Confirm FP64 is required; use FP32, TF32, FP16, or BF16 where numerical tolerance permits. |
| rec_ptx_reduce_branch_divergence | ptx_branch_divergence_risk | Reduce data-dependent branching, split divergent paths, or restructure predicates around warp-uniform work. |
PTX recommendation bundles use the ptx_static label so callers can distinguish static compiler-signature advice from profiler-measured runtime advice.
CLI/API output examples
CLI and API analysis responses share the same diagnosis-oriented fields: primary_bottleneck, ranked bottlenecks, ranked recommendations, and lower-level evidence for callers that need raw diagnostics.
CLI single-file examples
frx analyze kernel.ptx
frx analyze kernel.cu
frx analyze ncu_report.csv
frx analyze kernel.ptx --json
# {"mode": "ptx", "result": {...}}
CLI comparison examples
frx compare baseline.cu optimized.cu
frx compare baseline.cu optimized.cu --with-ptx --with-ncu
frx compare baseline.cu optimized.cu --ncu-a baseline.csv --ncu-b optimized.csv

# Older single-evidence comparison path remains available
frx analyze --before before.ptx --after after.ptx
frx analyze --before before.csv --after after.csv --json
NCU kernel attribution excerpt
{
  "kernel_attribution": {
    "has_runtime_share": true,
    "top_opportunities": [
      {
        "kernel_name": "fused_attention",
        "runtime_share_pct": 38.4,
        "mfu_pct": 14.2,
        "roofline_region": "memory_bound",
        "opportunity": "high",
        "opportunity_score": 0.847
      }
    ]
  }
}
Framework Abstraction Tax CLI block
FRAMEWORK ABSTRACTION TAX
Score : 74/100 (high)
Contributors:
- Kernel launch fragmentation
- Missing graph capture (opportunity) (inferred)
- Unfused elementwise operations (opportunity) (inferred)
Framework Abstraction Tax appears on runtime-analysis paths, not NCU-only CSV analysis. Inferred contributors are opportunities to investigate, not proof that graph capture or fusion is disabled.
PTX API response excerpt
{
  "findings": [
    {
      "kind": "register_spills_detected",
      "severity": "HIGH",
      "kernel": "matmul_kernel"
    }
  ],
  "primary_bottleneck": "ptx_register_spills",
  "bottlenecks": [
    {
      "label": "ptx_register_spills",
      "score": 0.95,
      "evidence": ["local memory spill loads/stores detected"]
    },
    {
      "label": "ptx_register_pressure",
      "score": 0.72,
      "evidence": ["virtual register count exceeds threshold"]
    }
  ],
  "recommendations": [
    {
      "id": "rec_ptx_reduce_register_pressure",
      "bundle": "ptx_static",
      "severity": "high",
      "reason": "Register spills outrank generic register pressure."
    }
  ],
  "bundles": ["ptx_static"]
}
Python helper shape
from fournex.ptx import analyze_ptx_text

result = analyze_ptx_text(ptx_text)
result["primary_bottleneck"]  # "ptx_register_spills"
result["bottlenecks"]         # ranked PTX bottleneck list
result["recommendations"]     # ranked recommendation cards
result["bundles"]             # includes "ptx_static"
SDK integration
When collect wraps your training script, it sets FRX_AUTO_PERSIST=1 and injects the output path. The SDK hooks emit events automatically if you use the provided context managers or callbacks.
PyTorch training loop
from fournex import AutopilotSession

session = AutopilotSession.from_env()  # reads FRX_* env vars
global_step = 0

for epoch in range(num_epochs):
    for batch in dataloader:
        with session.step(step_id=global_step, step_kind="train"):
            with session.dataloader_span():
                batch = next_batch()  # already inside dataloader loop
            with session.forward_span():
                loss = model(batch)
            with session.backward_span():
                loss.backward()
            with session.optimizer_span():
                optimizer.step()
        global_step += 1

session.flush()
If you already use torch.profiler, you can skip the SDK entirely and point --artifact-dir at the directory where the profiler writes its Chrome-format trace. The CLI will import and analyze it automatically.
Profiler-only workflow
# In your training script, write the profiler trace to frx-job-run/
import torch
from torch.profiler import ProfilerActivity

profiler = torch.profiler.profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    on_trace_ready=torch.profiler.tensorboard_trace_handler("frx-job-run"),
)

# Then collect and generate the LLM brief
frx collect --explain -- python train.py

# If the trace is written somewhere else, pass that directory explicitly
frx collect --artifact-dir gpu-job-run-tiny-kernels --explain -- python train.py
Note on data richness
SDK instrumentation produces the richest data: exact step boundaries, DataLoader wait times, and H2D copy spans are recorded precisely. Profiler-only mode reconstructs these from Chrome trace heuristics and may have lower confidence on some bottleneck types.
Static CUDA inspector
POST /cuda/static-inspect
Inspects one or more .cu or .cuh source files and returns per-kernel heuristic analysis without compiling or running the code. Useful for surfacing configuration risks early in the development cycle.
The antipattern library is YAML-driven: 22 rules across memory, synchronization, control flow, occupancy, and tensor-core categories. Each finding includes a stable code, severity, a plain-English message, and linked recommendations.
Static detection now follows common source patterns more closely. uncoalesced_access catches strided aliases such as int idx = tid * stride; src[idx], not only inline subscript math. sync_inside_tight_loop only fires when __syncthreads() calls are actually inside the loop body, so setup syncs before an unrelated loop stay silent.
CLI
frx analyze kernel.cu
frx analyze kernel.cu --gpu-model RTX4090
frx analyze kernel.cuh --output-json
Antipattern categories
| Category | Examples |
|---|---|
| Memory | Uncoalesced access, missing bounds guards, repeated global loads |
| Synchronization | Excessive barriers, suspicious sync placement |
| Control flow | Branch divergence risk and data-dependent paths |
| Occupancy | Sub-warp block sizes, non-warp-aligned block sizes, low theoretical occupancy |
| Tensor cores | Dimension and instruction patterns that block tensor-core use |
CUDA antipattern zoo
The demos/cuda_zoo/ directory contains paired bad/good kernels for learning and regression checks. Each bad kernel triggers a specific static rule; each good kernel clears it. Use these examples when you want to see what a finding looks like in actual CUDA source.
# Example: inspect the uncoalesced access pair
demos/cuda_zoo/01_uncoalesced/bad.cu
demos/cuda_zoo/01_uncoalesced/good.cu

# Run all four bad/good pairs without a GPU
./demos/cuda_zoo/run_zoo.ps1
Request shape
POST /cuda/static-inspect
Content-Type: application/json
{
  "gpu_model": "NVIDIA H100",
  "files": [
    { "filename": "kernel.cu", "content": "__global__ void my_kernel(...) { ... }" }
  ]
}
gpu_model is optional. When supplied, the inspector looks up hardware limits from a built-in preset table (device_limits_for_gpu) and feeds them to the existing occupancy estimator so advice is calibrated to the target device. Omit it to use generic defaults.
Response fields
| Field | Description |
|---|---|
| kernels | Detected __global__ kernel signatures |
| launch_configs | Launch parameters extracted from kernel<<<grid, block>>>() call sites |
| indexing_patterns | Thread/block indexing styles detected in kernel bodies |
| memory_tags | Memory access style tags (coalesced, strided, shared, etc.) |
| atomics | Atomic and reduction operation hints |
| shared_memory_warnings | Bank conflict risk, large static allocations, suspicious barrier patterns |
| launch_advice | Heuristic safe starting configurations for grid and block dimensions |
| occupancy | Estimated occupancy when gpu_model is supplied |
Occupancy and GPU model presets
When gpu_model is provided, the backend resolves hardware limits (max threads per block, shared memory per SM, register file size, warp size) from a preset table and passes them to the occupancy estimator. Occupancy is estimated from static metadata, not measured runtime occupancy.
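For intuition, a static occupancy estimate takes the most restrictive of several per-SM resource limits. The sketch below uses illustrative limits that are not a Fournex preset and are not tied to a specific GPU. It also assumes per-SM thread and block caps, which the preset description above does not list, and it ignores allocation granularity. `estimate_occupancy` is a hypothetical helper.

```python
# Illustrative limits only; real presets come from the backend's preset table.
EXAMPLE_LIMITS = {
    "warp_size": 32,
    "max_threads_per_sm": 2048,   # assumed per-SM cap
    "max_blocks_per_sm": 32,      # assumed per-SM cap
    "registers_per_sm": 65536,
    "shared_memory_per_sm": 100 * 1024,
}


def estimate_occupancy(threads_per_block, regs_per_thread, smem_per_block, limits):
    """Return (occupancy in 0..1, limiting resource) from static metadata."""
    warp = limits["warp_size"]
    warps_per_block = -(-threads_per_block // warp)  # ceiling division
    max_warps = limits["max_threads_per_sm"] // warp
    regs_per_block = regs_per_thread * warps_per_block * warp
    # Resident blocks per SM allowed by each resource; a zero request never limits.
    blocks_by = {
        "registers": limits["registers_per_sm"] // regs_per_block if regs_per_block else max_warps,
        "shared_memory": limits["shared_memory_per_sm"] // smem_per_block if smem_per_block else max_warps,
        "block_size": max_warps // warps_per_block,
        "block_limit": limits["max_blocks_per_sm"],
    }
    limiter = min(blocks_by, key=blocks_by.get)
    resident_warps = blocks_by[limiter] * warps_per_block
    return min(1.0, resident_warps / max_warps), limiter


occupancy, limiter = estimate_occupancy(256, 64, 16 * 1024, EXAMPLE_LIMITS)
print(occupancy, limiter)
```

With 256 threads and 64 registers per thread, each block needs 16,384 registers, so only four blocks (32 of 64 warps) fit under the example register file.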
Heuristic caveat
Static CUDA inspection is heuristic source analysis, not compiler validation. Launch suggestions are safe starting points for benchmarking, not proven optimal configurations. Always profile with frx profile to measure actual kernel behavior, or use frx collect when you need a full application run bundle.
Eval coverage
The analysis pipeline and recommendation engine are validated by a suite of deterministic evals that run against synthetic event streams and profiler traces. No live GPU is required.
| Area | Coverage |
|---|---|
| Bottleneck fixtures | Golden event-stream fixtures for each bottleneck label; verify primary_bottleneck and user_facing_bottleneck match expected values |
| Classifier boundaries | Threshold boundary tests for input, copy, sync, memory, and shape classifiers; labels must flip at documented signal thresholds |
| Confidence calibration | input_bound confidence checked at signal levels above and below the high-confidence threshold |
| Recommendation evals | Expected recommendation keys must be present; negative checks confirm absent recommendations when signal is below threshold |
| launch_bound evals | Kernel count, median duration, and small-kernel fraction combinations that should and should not trigger launch_bound |
| ExperimentRunner speedup | Deterministic synthetic summaries confirm winner selection, noise filtering, and quality gate rejection |
All assertions are data-driven: changing a classifier threshold or recommendation rule automatically updates which evals pass. Run pytest backend/tests/python -q to execute the full suite.
Nsight Compute Workflow
frx profile
frx ncu-command
Use frx profile for the normal path: it runs NCU when needed, analyzes the result, and prints the fix list in one command. Use frx ncu-command only when you want to copy the raw Nsight Compute command into CI or a separate shell.
Preset list
frx ncu-command --list
frx ncu-command --list --json
| Preset | Use it for | Primary evidence |
|---|---|---|
| memory | DRAM bandwidth, cache hit rates, uncoalesced loads, and memory-related stalls | DRAM throughput, L1/L2 hit rates, global load sectors/request, memory throttle and scoreboard stalls |
| tensor | Tensor core utilization with basic occupancy and issue context | Tensor pipe activity, achieved occupancy, issue slot utilization |
| occupancy | Launch resource limits that explain low achieved occupancy | Warps active, block size, registers per thread, static and dynamic shared memory |
| stalls | Warp stall reasons across memory, synchronization, and scheduler pressure | Memory throttle, long scoreboard, barrier, wait, dispatch, not-selected, eligible-warps-per-scheduler, scheduler-active, and issue-active metrics |
| full | Broad capture when you do not know the bottleneck yet | Union of memory, tensor, occupancy, and stalls presets, including scheduler metrics |
Recommended capture path
# Simple path: run, analyze, and generate the LLM brief in one command
frx profile --preset full --explain -- python train.py

# Analyze an existing NCU export and generate the LLM brief
frx profile --ncu ncu_report.csv --explain

# Manual path: generate the NCU command, then analyze the CSV
frx ncu-command full --output ncu_report.csv -- python train.py

# Focus on memory evidence
frx ncu-command memory --output ncu_memory.csv -- python train.py

# Limit capture to a kernel name and a small launch window
frx ncu-command stalls \
  --kernel-name "my_kernel" \
  --launch-skip 10 \
  --launch-count 5 \
  --output ncu_stalls.csv \
  -- python train.py

# Then analyze the CSV locally
frx profile --ncu ncu_report.csv --explain
frx analyze ncu_report.csv
frx analyze ncu_report.csv --json
frx profile --ncu ncu_report.csv and frx analyze ncu_report.csv both validate the CSV before reporting. Missing metrics produce partial-evidence diagnostics instead of zero-filled results, and malformed CSVs exit nonzero with a clear parsing error.
WSL2 and WDDM caveat
Nsight Compute hardware counters can be restricted under WSL2 or Windows WDDM driver mode. If NCU reports missing or inaccessible performance counters, run on a Linux host or a Windows setup with supported counter access before treating the missing metrics as workload behavior.
NCU analysis
POST /ncu/analyze
Ingests Nsight Compute (NCU) CSV exports and returns structured hardware bottleneck analysis. Both long-format (Metric Name, Metric Value rows) and direct-column layouts produced by different NCU export modes are accepted.
Supported metrics
| Metric group | NCU counter (example) | Used for |
|---|---|---|
| Warp stall reasons | smsp__warp_issue_stalled_* | warp_stall_memory and warp_stall_sync labels |
| DRAM throughput | dram__throughput.avg.pct_of_peak_sustained_elapsed | memory_bandwidth_bound label |
| L1 cache hit rate | l1tex__t_sector_hit_rate.pct | l1_cache_thrashing label |
| L2 cache hit rate | lts__t_sector_hit_rate.pct | l2_cache_thrashing label |
| Global load coalescing | l1tex__average_t_sectors_per_request_pipe_lsu_mem_global_op_ld | uncoalesced_access label |
| Tensor core utilization | sm__inst_executed_pipe_tensor_op_hmma.avg | tensor_core_underutilized label |
| Issue slot utilization | sm__issue_active.avg.pct_of_peak_sustained_active | low_issue_efficiency label |
| Scheduler utilization | smsp__warps_eligible.avg.per_cycle_active, smsp__warps_active.avg.pct_of_peak_sustained_active | low_warp_scheduler_utilization label in stalls and full presets |
| Occupancy / launch metadata | achieved_occupancy.pct, sm__warps_active.avg | launch_efficiency dimension in comparison scorecard |
| Kernel duration | duration, gpu__time_duration_sum | runtime_share_pct and time-weighted kernel opportunity scoring |
Recommended NCU command
frx profile --preset full --explain -- python train.py
frx profile --preset memory --explain -- python train.py
frx profile --ncu ncu_report.csv --explain
frx ncu-command full --output ncu_report.csv -- python train.py
frx ncu-command memory --output ncu_memory.csv -- python train.py
frx ncu-command occupancy --launch-count 10 --output ncu_occupancy.csv -- python train.py
NCU prefixes its CSV output with ==PROF== preamble lines. These are stripped automatically before parsing. Extra metric columns beyond the documented set are silently ignored, and missing metrics are reported as partial evidence instead of being treated as zeros. Malformed CSV input fails clearly in frx analyze ncu_report.csv.
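The preamble and missing-metric rules can be sketched with the standard csv module. The helper names and the sample CSV text are hypothetical. The point is that an absent metric comes back as None instead of 0.0, and unknown columns are simply never read.

```python
import csv
import io


def read_ncu_csv(text):
    """Drop ==PROF== preamble lines, then parse the remaining CSV into row dicts."""
    body = [ln for ln in text.splitlines() if ln.strip() and not ln.startswith("==PROF==")]
    return list(csv.DictReader(io.StringIO("\n".join(body))))


def metric_value(row, column):
    """Return a float, or None when the metric is absent or unparseable."""
    raw = (row.get(column) or "").strip()
    if not raw:
        return None
    try:
        return float(raw.replace(",", ""))
    except ValueError:
        return None


# Made-up sample in a one-kernel-per-row layout with one undocumented column.
sample = """==PROF== Connected to process 1234
==PROF== Profiling "my_kernel": 0%....50%....100% - 1 pass
"Kernel Name","dram__throughput.avg.pct_of_peak_sustained_elapsed","extra_metric"
"my_kernel","82.0","1"
"""

rows = read_ncu_csv(sample)
dram = metric_value(rows[0], "dram__throughput.avg.pct_of_peak_sustained_elapsed")
l1_hit = metric_value(rows[0], "l1tex__t_sector_hit_rate.pct")
print(dram, l1_hit)
```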
NCU 2026.x CSV format
Nsight Compute 2026.x changed the default CSV shape from tall format, one metric per row, to wide format, one kernel per row with metrics as columns. Fournex handles both automatically, including the final done line NCU writes to stdout after the CSV. If an older CSV-producing wrapper stops parsing, check the NCU version and export mode first.
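One way to handle both shapes is to pivot tall rows into the wide shape before analysis. The sketch below assumes tall rows carry ID, Kernel Name, Metric Name, and Metric Value columns. `to_wide` is a hypothetical helper, not the parser in ncu_analysis.py.

```python
def to_wide(rows):
    """Pivot tall rows (one metric per row) into one dict per kernel launch.
    Rows that are already wide are returned unchanged."""
    if not rows or "Metric Name" not in rows[0]:
        return rows
    kernels = {}
    for row in rows:
        key = (row.get("ID"), row.get("Kernel Name"))
        wide = kernels.setdefault(key, {"Kernel Name": row.get("Kernel Name")})
        wide[row["Metric Name"]] = row.get("Metric Value")
    return list(kernels.values())


tall = [
    {"ID": "0", "Kernel Name": "my_kernel",
     "Metric Name": "l1tex__t_sector_hit_rate.pct", "Metric Value": "31.0"},
    {"ID": "0", "Kernel Name": "my_kernel",
     "Metric Name": "lts__t_sector_hit_rate.pct", "Metric Value": "62.5"},
]
wide = [{"Kernel Name": "my_kernel", "l1tex__t_sector_hit_rate.pct": "31.0"}]

print(to_wide(tall))
print(to_wide(wide))
```

Keying on the launch ID as well as the kernel name keeps repeated launches of the same kernel as separate rows.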
Kernel attribution
When NCU includes per-kernel duration, Fournex adds kernel_attribution to the NCU result. The top opportunities list ranks kernels by the work most likely to matter, not just by the worst-looking individual metric.
result["kernel_attribution"]["top_opportunities"]
# [
#   {
#     "kernel_name": "fused_attention",
#     "runtime_share_pct": 38.4,
#     "mfu_pct": 14.2,
#     "roofline_region": "memory_bound",
#     "opportunity": "high",
#     "opportunity_score": 0.847
#   }
# ]
| Field | Meaning |
|---|---|
| opportunity_score | runtime_share x severity x mfu_gap when duration is available; severity-only fallback otherwise |
| has_runtime_share | true when NCU provided Duration and scores are time-weighted |
| runtime_share_pct | Percent of total measured kernel time attributed to this kernel |
| mfu_pct | Model FLOP utilization estimate for compute opportunity context |
| roofline_region | Coarse region such as memory_bound or compute_bound |
| opportunity | Human label such as high, medium, low, or none |
Memory-bound kernels use high severity, low-MFU compute kernels retain meaningful opportunity, and well-utilized kernels score near zero. If has_runtime_share is false, do not compare scores across kernels as a time-weighted priority list.
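The ranking idea can be sketched as a product of the three factors named in the field table. The severity weights below are invented for illustration, and this toy version does not reproduce the 0.847 in the sample excerpt, so the shipped scoring evidently normalizes differently.

```python
# Invented severity weights per roofline region; not the values Fournex uses.
ASSUMED_SEVERITY = {"memory_bound": 1.0, "compute_bound": 0.6}


def toy_opportunity_score(roofline_region, mfu_pct, runtime_share_pct=None):
    """runtime_share x severity x mfu_gap when duration is known,
    otherwise a severity-only fallback."""
    severity = ASSUMED_SEVERITY.get(roofline_region, 0.5)
    if runtime_share_pct is None:
        return severity
    mfu_gap = 1.0 - mfu_pct / 100.0
    return (runtime_share_pct / 100.0) * severity * mfu_gap


hot_kernel = toy_opportunity_score("memory_bound", 14.2, runtime_share_pct=38.4)
tiny_kernel = toy_opportunity_score("memory_bound", 5.0, runtime_share_pct=2.0)
tuned_kernel = toy_opportunity_score("compute_bound", 95.0, runtime_share_pct=50.0)
print(hot_kernel, tiny_kernel, tuned_kernel)
```

The tiny kernel has worse per-kernel metrics than the hot one but ranks lower because it accounts for little runtime, which is the reason time weighting is needed before scores are compared.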
Request
POST /ncu/analyze
Content-Type: application/json
{
  "csv_text": "...",            // raw NCU CSV text
  "gpu_model": "NVIDIA H100"    // optional; passed to occupancy estimator
}
Detected bottleneck labels
| Label | Signal |
|---|---|
| memory_bandwidth_bound | DRAM throughput utilization exceeds saturation threshold |
| warp_stall_memory | Memory stalls dominate the warp stall breakdown |
| warp_stall_sync | Sync stalls dominate the warp stall breakdown |
| l1_cache_thrashing | L1 hit rate below 40% |
| l2_cache_thrashing | L2 hit rate below 50% when the L1-heavy cache-miss signal is absent |
| uncoalesced_access | Global load sectors per request above 4 |
| tensor_core_underutilized | Tensor core utilization low when FP16/BF16 work is present |
| occupancy_limited | Measured achieved occupancy is below threshold |
| occupancy_limited_by_registers | Low occupancy with registers identified as the limiting launch resource |
| occupancy_limited_by_shared_memory | Low occupancy with shared memory per block limiting residency |
| occupancy_limited_by_block_size | Low occupancy from too few resident threads or blocks per SM |
| low_warp_scheduler_utilization | Few eligible warps per scheduler cycle or low scheduler active percentage |
| low_issue_efficiency | Issue slot utilization (IPC proxy) below threshold |
| insufficient_ncu_data | Not enough metrics in the CSV to classify |
Warp stall breakdown
NCU exports per-reason stall metrics (e.g., smsp__warp_issue_stalled_*). The analyzer sums all stall categories, identifies the dominant reason, and maps it to warp_stall_memory or warp_stall_sync. The full percentage breakdown is returned in the response for all detected stall categories.
memory_stall_fraction is magnitude-based: a kernel where 5% of warp cycles are stalled on memory reports 0.05, not 1.0. This lets thresholds be set in natural units rather than normalized fractions.
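A minimal sketch of that flow, assuming the input is a mapping from stall reason to percent of warp cycles stalled. The grouping of reasons into memory and sync sets, the function name, and the output keys are assumptions for illustration; the real analyzer may group NCU stall reasons differently.

```python
# Assumed grouping of NCU stall reasons into the two labels.
MEMORY_REASONS = {"long_scoreboard", "lg_throttle", "mio_throttle"}
SYNC_REASONS = {"barrier", "membar"}


def summarize_stalls(stall_pct: dict[str, float]) -> dict:
    """stall_pct maps a stall reason to the percent of warp cycles stalled on it."""
    total = sum(stall_pct.values())
    if total <= 0:
        return {"dominant_reason": None, "label": None,
                "memory_stall_fraction": 0.0, "breakdown_pct": {}}
    dominant = max(stall_pct, key=stall_pct.get)
    if dominant in MEMORY_REASONS:
        label = "warp_stall_memory"
    elif dominant in SYNC_REASONS:
        label = "warp_stall_sync"
    else:
        label = None  # dominant reason maps to neither label
    memory_pct = sum(v for k, v in stall_pct.items() if k in MEMORY_REASONS)
    return {
        "dominant_reason": dominant,
        "label": label,
        # Magnitude-based: 5% of warp cycles stalled on memory gives 0.05.
        "memory_stall_fraction": memory_pct / 100.0,
        "breakdown_pct": dict(stall_pct),
    }
```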
Memory access diagnosis
Cache and coalescing issues are reported separately so the first fix is clearer. l1_cache_thrashing fires when L1 hit rate is below 40%. l2_cache_thrashing fires when L2 hit rate is below 50% and the L1-heavy miss signal is absent. uncoalesced_access fires when global loads average more than 4 sectors per request.
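The three thresholds can be sketched as a small classifier. The function and parameter names are hypothetical, and the "L1-heavy miss signal" is interpreted here as "l1_cache_thrashing already fired", which is an assumption about the real rule.

```python
from typing import Optional


def diagnose_memory_access(l1_hit_pct: Optional[float] = None,
                           l2_hit_pct: Optional[float] = None,
                           sectors_per_request: Optional[float] = None) -> list[str]:
    """Return the memory-access labels that fire for one kernel."""
    labels = []
    l1_thrashing = l1_hit_pct is not None and l1_hit_pct < 40.0
    if l1_thrashing:
        labels.append("l1_cache_thrashing")
    # Assumption: L2 thrashing is suppressed when the L1 signal already fired.
    if l2_hit_pct is not None and l2_hit_pct < 50.0 and not l1_thrashing:
        labels.append("l2_cache_thrashing")
    if sectors_per_request is not None and sectors_per_request > 4.0:
        labels.append("uncoalesced_access")
    return labels
```

Missing metrics are passed as `None` and never trigger a label, so a sparse CSV does not produce false positives.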
Occupancy diagnosis
Occupancy bottlenecks use measured achieved occupancy first when NCU provides it. Launch-resource metrics then explain why occupancy is low, mapping register pressure, shared memory per block, or insufficient resident threads/blocks to the specific occupancy_limited_by_* labels. If measured occupancy is absent, Fournex falls back to launch-resource estimates when those inputs are available.
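The measured-first ordering can be sketched like this. The 50% threshold, the `limiter` strings, and the function name are assumptions; only the label names come from the table above.

```python
from typing import Optional

LIMITER_LABELS = {
    "registers": "occupancy_limited_by_registers",
    "shared_memory": "occupancy_limited_by_shared_memory",
    "block_size": "occupancy_limited_by_block_size",
}


def occupancy_label(achieved_pct: Optional[float],
                    estimated_pct: Optional[float] = None,
                    limiter: Optional[str] = None,
                    threshold_pct: float = 50.0) -> Optional[str]:
    """Pick an occupancy label; threshold_pct is a hypothetical cutoff."""
    # Prefer the measured value; use the launch-resource estimate only as a fallback.
    occupancy = achieved_pct if achieved_pct is not None else estimated_pct
    if occupancy is None or occupancy >= threshold_pct:
        return None
    # Without a known limiter, fall back to the generic label.
    return LIMITER_LABELS.get(limiter, "occupancy_limited")
```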
Multi-kernel aggregation
Real NCU exports often contain partial metrics across multiple kernels. Aggregation ignores missing metrics instead of treating them as zeros, so a kernel without a counter does not dilute the average for that counter.
| Case | Behavior |
|---|---|
| Missing metric on a kernel | Excluded from that metric average |
| L2-only kernel | Counts as having NCU data |
| Warp stall counters present on some kernels | Stall fractions average only over kernels with stall data |
| Summary field | kernels_with_warp_stall_data reports the stall-sampled kernel count |
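A short sketch of averaging that skips missing metrics instead of counting them as zero. The metric key names and helper names are hypothetical; only `kernels_with_warp_stall_data` comes from the table above.

```python
from typing import Optional


def average_metric(kernels: list[dict], metric: str) -> Optional[float]:
    """Average a metric over the kernels that actually report it."""
    values = [k[metric] for k in kernels if k.get(metric) is not None]
    return sum(values) / len(values) if values else None


def summarize(kernels: list[dict]) -> dict:
    return {
        "l1_hit_pct": average_metric(kernels, "l1_hit_pct"),
        "l2_hit_pct": average_metric(kernels, "l2_hit_pct"),
        "memory_stall_fraction": average_metric(kernels, "memory_stall_fraction"),
        # Count only kernels that carried stall counters.
        "kernels_with_warp_stall_data": sum(
            1 for k in kernels if k.get("memory_stall_fraction") is not None
        ),
    }
```

With two kernels where only one reports `l1_hit_pct`, the average equals that kernel's value rather than half of it.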
Recommendation integration
NCU signals are wired into the existing recommendation engine. Rules cover DRAM saturation, L1 and L2 cache pressure, uncoalesced global loads, tensor core underuse, occupancy limits, and warp efficiency. Recommendations follow the same ranked card format as other bottleneck types and appear alongside CLI-based recommendations when both data sources are present.
PTX analysis
POST /ptx/analyze
Pure static analysis of PTX, NVIDIA's virtual ISA and the intermediate representation between CUDA C++ and machine code. No GPU, driver, or NVIDIA tooling is required beyond the PTX text itself.
analyze_ptx_text() now emits actionable diagnosis fields in addition to raw findings: bottlenecks, primary_bottleneck, secondary_bottlenecks, recommendations, and bundles.
Request
POST /ptx/analyze
Content-Type: application/json
{ "ptx_text": "..." } // raw PTX from nvcc --ptx or cuobjdump -ptx
Per-kernel analysis fields
| Field | Description |
|---|---|
| register_count | Virtual register count; 64-bit types counted as 2 |
| spills_detected | True if .local memory depot usage is present |
| spill_loads | Number of .local load instructions |
| spill_stores | Number of .local store instructions |
| instruction_mix | Counts per category across 17 instruction types |
| branch_count | Conditional branch instruction count |
| has_back_edge | Whether a back-edge (a branch to an earlier label, which indicates a loop) was detected |
| capability_flags | Detected capabilities: fp64, tensor_cores, atomics, sfu |
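To make register_count and the spill fields concrete, here is a rough sketch that reads `.reg` declarations and `.local` accesses from PTX text with regular expressions. It is not the analyzer's parser: the regexes, the set of types treated as 64-bit, and the function names are assumptions, and real PTX has forms this sketch does not handle.

```python
import re

# Matches declarations such as ".reg .b64 %rd<8>;"
REG_DECL = re.compile(r"\.reg\s+\.(\w+)\s+%\w+<(\d+)>\s*;")
# Assumed set of 64-bit register types, each counted as 2.
WIDE_TYPES = {"b64", "u64", "s64", "f64"}


def count_virtual_registers(ptx_text: str) -> int:
    total = 0
    for reg_type, count in REG_DECL.findall(ptx_text):
        weight = 2 if reg_type in WIDE_TYPES else 1
        total += weight * int(count)
    return total


def spill_counts(ptx_text: str) -> tuple[int, int]:
    """Return (spill_loads, spill_stores) from .local accesses."""
    loads = len(re.findall(r"\bld\.local\b", ptx_text))
    stores = len(re.findall(r"\bst\.local\b", ptx_text))
    return loads, stores
```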
Instruction categories
| Category | Covers |
|---|---|
| fp32 / fp64 / fp16 | Float arithmetic by precision |
| integer | Integer arithmetic |
| global_load / global_store | Global memory accesses |
| shared_load / shared_store | Shared memory accesses |
| local_load / local_store | Local memory (spill region) accesses |
| tensor | wmma / mma tensor core instructions |
| sfu | sin, cos, rcp, sqrt approximations |
| atomic | Atomic read-modify-write operations |
| control_flow | branch, call, return |
| conversion / comparison | Type conversions and predicate comparisons |
| other | All remaining instructions |
PTX bottleneck labels
| Label | Signal | Typical recommendation |
|---|---|---|
| ptx_register_spills | Local-memory spill loads or stores detected | rec_ptx_reduce_register_pressure |
| ptx_register_pressure | Virtual register count exceeds pressure threshold | rec_ptx_reduce_register_pressure |
| ptx_global_memory_heavy | Global memory instructions dominate the static mix | rec_ptx_stage_global_memory |
| ptx_fp64_usage | FP64 instructions detected | rec_ptx_reduce_fp64 |
| ptx_branch_divergence_risk | High conditional branch density or branch-heavy control flow | rec_ptx_reduce_branch_divergence |
Findings
| Finding | Severity | Trigger |
|---|---|---|
| register_spills_detected | HIGH | .local depot present (load or store count > 0) |
| very_high_register_count | HIGH | Virtual register count >= 64 |
| high_register_count | MEDIUM | Virtual register count >= 48 |
| fp64_detected | MEDIUM | fp64 instructions present |
| high_global_memory_ratio | MEDIUM | global_load + global_store > 30% of all instructions |
| high_branch_count | MEDIUM | Branch count exceeds threshold |
| special_function_ops | LOW | SFU instructions present |
| no_shared_memory_usage | LOW | Zero shared memory load or store instructions |
| tensor_ops_detected | LOW | wmma / mma instructions present |
| atomics_detected | LOW | Atomic instructions present |
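The register and global-memory triggers from the table can be sketched as below. The function names are hypothetical, and the sketch assumes only the most severe register finding is emitted, which the table does not state.

```python
from typing import Optional


def register_finding(register_count: int) -> Optional[tuple[str, str]]:
    """Return (finding, severity) for the virtual register count, if any."""
    if register_count >= 64:
        return ("very_high_register_count", "HIGH")
    if register_count >= 48:
        return ("high_register_count", "MEDIUM")
    return None


def global_memory_finding(instruction_mix: dict[str, int]) -> Optional[tuple[str, str]]:
    """Fire when global loads plus stores exceed 30% of all instructions."""
    total = sum(instruction_mix.values())
    if total == 0:
        return None
    global_ops = instruction_mix.get("global_load", 0) + instruction_mix.get("global_store", 0)
    if global_ops / total > 0.30:
        return ("high_global_memory_ratio", "MEDIUM")
    return None
```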
Virtual vs. physical registers
PTX register counts are virtual and pre-register-allocation. The CUDA compiler assigns physical registers during PTX-to-SASS compilation and may use fewer than the virtual count suggests. When NCU data is also available, hardware-measured register and occupancy values take precedence over PTX estimates in the comparison scorecard.
Implementation comparison
frx compare
POST /compare
frx compare compares two CUDA source files across source, PTX, and NCU layers and produces a scored diff with a verdict. Start source-only, then add PTX or NCU evidence when you need stronger conclusions.
CLI
frx compare baseline.cu optimized.cu
frx compare baseline.cu optimized.cu --with-ptx --with-ncu
frx compare baseline.cu optimized.cu --ncu-a baseline.csv --ncu-b optimized.csv
frx compare baseline.cu optimized.cu --gpu-model H100
frx compare baseline.cu optimized.cu --build-flags "-DBUILD_EXECUTABLE"
frx compare baseline.cu optimized.cu --json
Output sections
| Section | Meaning |
|---|---|
| Winner | Verdict and scores when enough data is available |
| Resolved in B | Finding codes that disappeared in the optimized implementation |
| Regressions in B | New findings that appeared and should block blind promotion |
| Improved / Regressed | Scorecard dimensions where B wins or A still wins |
| Root causes in A | Cross-layer diagnoses from reconcile_evidence() with confidence labels |
| Still unknown | Specific missing evidence, scaled to the layers you did not provide |
| Upgrade hints | The exact next flag to add, such as --with-ptx or --with-ncu |
Evidence escalation
--with-ptx compiles both sides with nvcc -ptx and unlocks register-efficiency evidence. --with-ncu compiles runnable executables and captures NCU evidence for DRAM, cache, tensor-core, occupancy, and stall dimensions. Use --ncu-a and --ncu-b when you already have CSVs and do not want the compare command to run profiling.
Missing evidence
After root-cause diagnoses, frx compare prints a Missing evidence section for low-confidence findings. It tells you which metrics would confirm the diagnosis, how confidence would upgrade if collected, and the narrow NCU command to run.
-- Missing evidence --
inefficient_global_memory_access
confidence: low-medium -> medium-high if confirmed
collect: l1tex__average_t_sectors_per_request_pipe_lsu_mem_global_op_ld
ncu --metrics l1tex__average_t_sectors_per_request_pipe_lsu_mem_global_op_ld \
--csv ./your_app > missing_memory.csv
Fallback:
ncu --set full --csv ./your_app > full.csv
The loop is: static analysis suspects a problem, Missing evidence tells you exactly what to collect, you collect it, then rerun frx compare --ncu-a baseline.csv --ncu-b optimized.csv to move the diagnosis toward confirmed evidence.
API and integration callers can use what_evidence_is_missing(static=..., ptx=..., ncu=...) for the same JSON-friendly data.
Request
POST /compare
Content-Type: application/json
{
"a": {
"label": "baseline",
"cuda_source": "...", // optional
"ptx_text": "...", // optional
"ncu_csv": "..." // optional
},
"b": {
"label": "optimized",
"cuda_source": "...",
"ptx_text": "...",
"ncu_csv": "..."
}
}
Response structure
| Field | Description |
|---|---|
| static_diff | Structural differences: indexing patterns, shared memory use, atomics, bank conflict risk, findings resolved or introduced |
| ptx_diff | Register count delta, spill resolution or introduction, per-category instruction mix deltas, findings that changed |
| ncu_diff | Hardware metric deltas: DRAM throughput, L1/L2 hit rates, issue slot utilization, occupancy, warp stall shifts |
| scorecard | Four efficiency dimensions scored 0-1 for each side |
| verdict | overall_winner (a / b / tie), weighted scores, per-dimension winners |
| tradeoffs | Warnings where one dimension improves while another regresses, such as lower register pressure with worse achieved occupancy |
Scorecard dimensions
| Dimension | Weight | Based on |
|---|---|---|
| register_efficiency | 20% | Virtual register count and spill penalty; NCU physical count overrides when available |
| memory_efficiency | 30% | DRAM throughput utilization, L1/L2 cache hit rates, shared memory presence |
| compute_efficiency | 30% | Issue slot utilization (IPC proxy), tensor core usage, dominant warp stall reason |
| launch_efficiency | 20% | Theoretical occupancy vs. achieved, launch configuration quality from static analysis |
Scoring notes
When spills are detected, register_efficiency applies a penalty multiplier: reg_score = raw_score * (1 - SPILL_PENALTY). When NCU data is provided for a side, hardware-measured register and occupancy values override PTX-derived estimates for that side's register_efficiency and launch_efficiency dimensions. A side with NCU data is therefore scored on measured evidence while a PTX-only side is scored on static estimates.
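A sketch of how the weights and the spill penalty could combine into a verdict. The weights come from the scorecard table; the `SPILL_PENALTY` value, the tie margin, and the function names are assumptions chosen for illustration.

```python
WEIGHTS = {
    "register_efficiency": 0.20,
    "memory_efficiency": 0.30,
    "compute_efficiency": 0.30,
    "launch_efficiency": 0.20,
}
SPILL_PENALTY = 0.5  # hypothetical value; the real constant is not documented here


def register_score(raw_score: float, spills_detected: bool) -> float:
    # reg_score = raw_score * (1 - SPILL_PENALTY) when spills are present
    return raw_score * (1 - SPILL_PENALTY) if spills_detected else raw_score


def weighted_score(scores: dict[str, float]) -> float:
    return sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)


def overall_winner(scores_a: dict[str, float], scores_b: dict[str, float],
                   tie_margin: float = 0.01) -> str:
    """Return 'a', 'b', or 'tie'; tie_margin is an assumed tolerance."""
    diff = weighted_score(scores_b) - weighted_score(scores_a)
    if abs(diff) <= tie_margin:
        return "tie"
    return "b" if diff > 0 else "a"
```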
Tradeoff warnings
Before/after comparisons include tradeoffs when evidence moves in conflicting directions. Current warnings cover cases such as register pressure or spills improving while achieved occupancy regresses, occupancy improving while register pressure regresses, and large occupancy shifts that should be checked against wall-clock time.
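The first two warning cases can be sketched as a comparison of register and occupancy deltas. The function name, parameters, and message wording are hypothetical and the real checks may apply thresholds this sketch omits.

```python
def tradeoff_warnings(regs_a: int, regs_b: int,
                      occupancy_a: float, occupancy_b: float) -> list[str]:
    """Flag conflicting movement between register count and achieved occupancy."""
    warnings = []
    if regs_b < regs_a and occupancy_b < occupancy_a:
        warnings.append("register pressure improved but achieved occupancy regressed")
    if occupancy_b > occupancy_a and regs_b > regs_a:
        warnings.append("achieved occupancy improved but register pressure regressed")
    return warnings
```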
Sample verdict
{
"overall_winner": "b",
"score_a": 0.61,
"score_b": 0.78,
"dimensions": {
"register_efficiency": "b",
"memory_efficiency": "b",
"compute_efficiency": "b",
"launch_efficiency": "tie"
}
}
Explain
frx explain
frx explain turns an NCU CSV or PyTorch training run directory into a compact optimization brief. Use it when you already have an artifact and want to generate the brief later. For a one-command workflow, prefer frx profile --explain or frx collect --explain.
# NCU CSV with optional CUDA source
frx explain report.csv
frx explain report.csv --source kernel.cu
frx explain report.csv --out explain_out
# Existing PyTorch training telemetry run
frx explain runs/my-run --out explain_out
# Preferred when you are profiling or collecting now
frx profile --ncu report.csv --explain
frx collect --explain -- python train.py
NCU prompts include secondary issues, ranked top kernels for multi-kernel workloads, roofline region plus MFU, and the active occupancy limiter when present. Training prompts are bottleneck-specific and include measured runtime evidence.
Outputs
| File | Purpose |
|---|---|
| frx_summary.txt | Human summary of the primary diagnosis, key evidence, and top fixes |
| frx_llm_prompt.txt | LLM-ready optimization brief with EXPECTED IMPROVEMENT, validation targets, and re-profile command |
| frx_evidence.json | Machine-readable frx_explain_v0 result for dashboards or custom tooling |
Pipeline
frx explain report.csv
  -> auto-detect directory vs .csv
  -> analyze NCU counters or training telemetry
  -> build the bottleneck-specific result
  -> render frx_summary.txt
  -> render frx_llm_prompt.txt
  -> render frx_evidence.json
frx profile --ncu report.csv --explain
  -> analyze once
  -> reuse ncu_result
  -> render the same three files
Bench
frx bench
frx bench compiles two CUDA source files and wall-clock benchmarks them side by side. It is meant for quick before/after checks after you implement a kernel change.
frx bench bad.cu good.cu
frx bench bad.cu good.cu --with-ncu
frx bench bad.cu good.cu --gpu-model RTX5060
frx bench bad.cu good.cu --runs 20 --warmup 5
Pipeline
frx bench bad.cu good.cu
  -> compile_kernel x2 with nvcc
  -> time_binary x2 with warmup + measured runs
  -> optional profile_with_ncu x2
  -> bench_compare
  -> print speedup table + bottleneck diff
Output
| Field | Meaning |
|---|---|
| before / after timing | median_ms, min_ms, max_ms, stdev_ms, runs, and warmup |
| speedup_x | before median divided by after median; greater than 1 means after is faster |
| ncu_diff | Optional bottleneck and metric diff when --with-ncu is enabled |
| compile_errors | Compilation failures captured per side |
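The timing fields and speedup_x can be sketched from raw per-run samples. The helper names and the input shape (a flat list of milliseconds with warmup runs first) are assumptions; the field names follow the table above.

```python
import statistics


def summarize_timing(samples_ms: list[float], warmup: int) -> dict:
    """Drop the first `warmup` samples, then summarize the measured runs."""
    measured = samples_ms[warmup:]
    return {
        "median_ms": statistics.median(measured),
        "min_ms": min(measured),
        "max_ms": max(measured),
        # stdev needs at least two samples
        "stdev_ms": statistics.stdev(measured) if len(measured) > 1 else 0.0,
        "runs": len(measured),
        "warmup": warmup,
    }


def speedup_x(before: dict, after: dict) -> float:
    # Greater than 1 means the "after" binary is faster.
    return before["median_ms"] / after["median_ms"]
```

Using the median rather than the mean keeps one slow outlier run from distorting the reported speedup.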
Timing requirement
frx bench uses wall-clock process timing. The benchmark binary must call cudaDeviceSynchronize() before exit or the host process may finish before GPU work is complete.
Evidence reconciliation
POST /reconcile
Reconciliation merges signals from any combination of source, PTX, NCU, and profiler layers into unified cross-layer diagnoses. It is useful for dashboards, IDE integrations, or review tools that want one confidence-ranked explanation instead of separate layer-specific findings.
POST /reconcile
Content-Type: application/json
{
"static": { "...": "result from CUDA source inspection" },
"ptx": { "...": "result from PTX analysis" },
"ncu": { "...": "result from NCU CSV analysis" },
"profiler": { "...": "result from runtime profiling" }
}
Response shape
| Field | Description |
|---|---|
| diagnoses | Unified cross-layer diagnoses with confidence labels such as medium, high, or confirmed |
| unreconciled | Layer-specific findings that did not yet have enough supporting evidence from other layers |
| layers_available | Which of source, PTX, NCU, and profiler evidence were present in the request |
Confidence rises as independent layers agree. A source-only signal may be medium confidence; source plus PTX or NCU can move to high; strong multi-layer agreement can become confirmed.
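One way to picture the escalation is a count of independent layers that support the same diagnosis. The cutoffs below (one layer is medium, two is high, three or more is confirmed) and the function name are assumptions for illustration; the real reconciler may weigh layers differently.

```python
from typing import Optional

KNOWN_LAYERS = {"static", "ptx", "ncu", "profiler"}


def confidence_label(agreeing_layers: set[str]) -> Optional[str]:
    """Map the number of agreeing evidence layers to a confidence label."""
    count = len(agreeing_layers & KNOWN_LAYERS)
    if count >= 3:
        return "confirmed"
    if count == 2:
        return "high"
    if count == 1:
        return "medium"
    return None  # nothing supports the diagnosis
```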
Known gaps
These are current constraints to account for when designing a profiling workflow or interpreting results.
| Gap | Impact |
|---|---|
| frx bench uses wall-clock timing | The benchmark binary must call cudaDeviceSynchronize() before exit for accurate GPU timing. |
| NCU profiling on Windows WDDM | Hardware-counter profiling often requires administrator privileges or switching supported GPUs to TCC mode. |
| RTX 5060 / sm_120 PC sampling | PC-sampling metrics such as smsp__pcsamplingdata_* are not available on RTX 5060-class sm_120 devices. |
| No Roofline or MFU in bench v0 | Bench reports wall-clock speedup and optional bottleneck diffs, but not Roofline placement or model FLOP utilization. |
| Framework Abstraction Tax V1 calibration | Severity bands are reasoned thresholds and are not yet validated against measured speedups. |
| Framework tax graph/fusion state | Graph capture and fusion contributors are inferred opportunities, not asserted facts from explicit framework state detection. |
| Framework tax on NCU-only paths | NCU-only analysis returns no framework_abstraction_tax because it needs runtime profiler windows. |
Validation
This section lists what Fournex has confirmed in automated testing. CLI coverage now includes frx init, frx profile, direct frx analyze file inputs, frx compare, frx explain, frx bench, reconciliation, GPU model detection, NCU validation cards, and YAML-driven CUDA rules. Offline checks run without a live GPU unless a live NCU or bench path is explicitly requested.
| Check | Result |
|---|---|
| Package integrity | v0.1.4 wheel includes recommendations/catalog.yaml and recommendations/rules.yaml, confirmed after the pyproject.toml package-data fix in that release |
| Clean install | frx smoke-test passes from a fresh pip install with no manual venv patching |
| Init command | frx init covers dependency checks, GPU model detection, training-script discovery, patch prompting, and already-instrumented script skips |
| Profile command | frx profile covers live NCU execution, existing --ncu CSV analysis, --ptx static analysis, presets, JSON output, kernel filtering, launch-window options, and --explain output |
| Collect explain workflow | frx collect --explain generates the LLM brief from the newly created run directory after collection completes |
| CLI single-input detection | frx analyze auto-detects .cu/.cuh, .ptx, Nsight Compute .csv, run directories, and zip bundles |
| Compare CLI | frx compare covers source-only comparison, --with-ptx, --with-ncu, pre-existing --ncu-a/--ncu-b CSVs, --build-flags, and --json output |
| Explain CLI | frx explain auto-detects NCU CSVs and training run directories, then emits frx_summary.txt, a bottleneck-specific frx_llm_prompt.txt, and frx_evidence.json |
| Bench CLI | frx bench compiles two .cu kernels, discards warmup runs, reports speedup_x, and can add optional NCU bottleneck diffs |
| Evidence reconciliation | 33 tests cover all 6 cross-layer diagnoses, confidence escalation, unreconciled tracking, false-positive guards, and the /reconcile API stub |
| Architecture-aware scoring | GPU model calibration covers sm_86 through sm_120, including Blackwell consumer sm_120, and product-name detection maps known NVIDIA GPU names to model keys |
| CUDA antipattern rules | 22 YAML-driven rules across memory, synchronization, control flow, occupancy, and tensor-core categories replace hardcoded static-detection conditions |
| CLI output and errors | Human output includes primary bottleneck and top recommendation; JSON output parses and includes mode; bad paths and unsupported file types exit nonzero with clear messages |
| NCU memory diagnostics | Regression tests cover L1-only thrashing, L2-only thrashing, sectors-per-request CSV parsing and score calibration, coalesced-kernel false-positive prevention, and uncoalesced recommendation routing |
| Recommendation validation | NCU recommendations thread validation_steps through the engine, attach current_value when measured, include estimated speedup ranges, and render exact Validate commands |
| FastAPI optional dependency | Comparison endpoint tests skip cleanly when httpx is not installed |
| NCU / PTX / comparison golden | Offline golden fixture passes: memory-bound spilling baseline identified correctly, optimized implementation ranked higher across all four efficiency dimensions |
| Classifier and eval suite | 131 tests passing: bottleneck fixtures, classifier boundary tests, confidence calibration, recommendation evals, launch_bound combinations, and ExperimentRunner speedup |
Scope of Fournex today
Fournex is a diagnostic and recommendation tool, not an automatic CUDA optimizer. It identifies bottlenecks, surfaces evidence, and ranks fix candidates, but it does not rewrite kernels or apply changes automatically. Real speedup must be confirmed by re-profiling after code changes. Treat recommendations as informed starting points, not proofs.