Fournex — CLI Reference

Developer documentation for the frx CLI. Covers installation, all CLI subcommands, bundle layout, and how the analysis pipeline works.

v0.2.7

What's new

  • frx init now checks Python, PyTorch, CUDA, GPU model, and Nsight Compute, then recommends the right workflow.
  • frx init --patch train.py can insert the training instrumentation snippet after a confirmation prompt.
  • frx profile --ncu report.csv --explain prints the normal report and writes frx_llm_prompt.txt in one command.
  • frx collect --explain -- python train.py generates the training LLM brief from the newly created run directory.
  • GPU model detection now maps names like NVIDIA H100 80GB HBM3, NVIDIA A100-SXM4-80GB, and NVIDIA RTX 5060 to architecture profiles automatically.

Installation

Install the package from the repo root. Python 3.11+ is required. PyTorch is optional; the CLI works without it for bundle and static-file analysis. It is only needed when the SDK instruments a live training run.

pip install fournex

This registers frx as the short executable and fournex as the long executable through the existing package entry points. Start with:

frx init

Quickstart

Start with frx init. It checks the local environment, detects the GPU model when possible, finds likely training scripts, and prints the next command for either CUDA kernel profiling or PyTorch training telemetry.

1. Initialize

Run frx init to check Python, PyTorch, CUDA, GPU model, and ncu before collecting evidence.

2. Collect evidence

Use frx profile for NCU/PTX kernel work or frx collect for PyTorch training telemetry.

3. Generate the brief

Add --explain to profile or collect to write frx_summary.txt, frx_llm_prompt.txt, and frx_evidence.json.

# First run: checks tools and recommends the right workflow
frx init

# Training path: collect telemetry and generate the LLM brief
frx collect --explain -- python train.py

# NCU path: analyze an existing report and generate the LLM brief
frx profile --ncu ncu_report.csv --explain

# Live profile with explain output
frx profile --explain -- python train.py

# No GPU available: inspect PTX statically
frx profile --ptx kernel.ptx

# Analyze locally (run-<id> is printed by collect)
frx analyze runs/run-<id>

# Upload by dragging runs/run-<id>.zip onto fournex.com/analyze

# Optional: let autopilot sweep configs and find the fastest safe candidate
frx tune --safe --max-trials 12 -- python train.py

Command index

The frx CLI now has 11 subcommands. First-time setup starts with init. Most kernel work starts with profile, analyze, compare, explain, or bench; collect remains the path for full application run bundles.

| Command | Purpose |
| --- | --- |
| init | Guided first-run setup; check dependencies, detect GPU model, and suggest or patch instrumentation |
| collect | Wrap and run a PyTorch workload; produce a run bundle |
| analyze | Load a bundle, CUDA file, PTX file, or NCU CSV and print a diagnosis |
| doctor | Check runtime dependencies |
| smoke-test | Synthetic end-to-end pipeline validation |
| ncu-command | Print an NCU command for Fournex metric presets |
| profile | Run NCU and print a full bottleneck + recommendation report |
| tune | Safe autopilot: sweep configs and recommend the fastest |
| compare | Diff two CUDA source files or evidence files; report what improved/regressed |
| explain | Generate an LLM-ready optimization brief from an NCU CSV or training run directory |
| bench | Compile and wall-clock benchmark two .cu kernels side-by-side |

init

frx init

Guided first-run onboarding. It checks the local environment, detects the active GPU model through the architecture-profile matcher, finds likely training scripts, and prints the command that gets you to frx_llm_prompt.txt.

frx init
frx init --patch train.py

Checks performed

| Check | What it reports |
| --- | --- |
| Python | Interpreter version and basic runtime availability |
| PyTorch | Installed version when importable |
| CUDA | CUDA availability and GPU name when PyTorch can see it |
| GPU model | Recognized architecture key such as h100, a100, or rtx5060 |
| ncu | Whether Nsight Compute is on PATH for kernel profiling |
| Training scripts | Likely files such as train.py, train_*.py, main.py, or run.py |
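
The GPU model check maps a full device name to a short architecture key. A minimal sketch of how such a mapping could work, assuming simple pattern rules: the function name guess_arch_key and the pattern table are hypothetical and are not the matcher Fournex ships.

```python
import re
from typing import Optional

# Hypothetical pattern table for illustration only; the real matcher's
# rules are not documented on this page.
_ARCH_PATTERNS = [
    (re.compile(r"\bH100\b", re.IGNORECASE), "h100"),
    (re.compile(r"\bA100\b", re.IGNORECASE), "a100"),
    (re.compile(r"\bRTX\s*5060\b", re.IGNORECASE), "rtx5060"),
]


def guess_arch_key(gpu_name: str) -> Optional[str]:
    """Return the first architecture key whose pattern matches the device name."""
    for pattern, key in _ARCH_PATTERNS:
        if pattern.search(gpu_name):
            return key
    return None  # unrecognized device; a caller could fall back to --gpu-model


print(guess_arch_key("NVIDIA A100-SXM4-80GB"))
```

When detection fails or picks the wrong profile, --gpu-model NAME on frx profile or frx analyze overrides it.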

Patch mode

frx init --patch FILE inserts the recommended training instrumentation after the last top-level import, but only after a confirmation prompt. If the script already imports and initializes Fournex, it is skipped cleanly.

import fournex as frx
frx.init(job_name="my-run")

for step, batch in enumerate(dataloader):
    with frx.step_context(step=step, batch=batch, model=model):
        # your existing training step here
        pass
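
Patch mode places the snippet after the last top-level import. One way to find that position is to parse the script with Python's ast module, as sketched below. The helper names are hypothetical, and this is an illustration of the idea rather than the code frx init runs.

```python
import ast


def last_top_level_import_line(source: str) -> int:
    """Return the 1-based last line of the final module-level import, or 0 if none."""
    tree = ast.parse(source)
    last = 0
    for node in tree.body:  # module-level statements only; nested imports are ignored
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            last = max(last, node.end_lineno or node.lineno)
    return last


def insert_after_imports(source: str, snippet: str) -> str:
    """Insert snippet on the line after the last top-level import."""
    lines = source.splitlines()
    at = last_top_level_import_line(source)
    return "\n".join(lines[:at] + [snippet] + lines[at:]) + "\n"


example = "import os\nimport sys\n\ndef main():\n    import json\n    return 1\n"
print(insert_after_imports(example, "import fournex as frx"))
```

Using end_lineno keeps multi-line imports (parenthesized from ... import lists) intact, because the snippet lands after the closing parenthesis.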

profile

frx profile

The simplest command for engineers working on CUDA kernels. It accepts a workload to run, an existing Nsight Compute CSV, or a PTX file, then prints the bottleneck and the next fix to try.

frx profile --explain -- python train.py
frx profile --ncu report.csv --explain
frx profile --ptx kernel.ptx

Options:
  --preset memory|tensor|occupancy|stalls|full
  --out save.csv
  --kernel-name FILTER
  --launch-skip N
  --launch-count N
  --gpu-model NAME    Override automatic GPU model detection
  --explain           Also write frx_summary.txt, frx_llm_prompt.txt, frx_evidence.json
  --json

When --explain is set, profile reuses the in-memory analysis result and writes the same explain files as frx explain. No second analysis pass is needed.

Choose the mode

| Situation | Command | Use this when |
| --- | --- | --- |
| Run NCU now | frx profile --explain -- python train.py | You are on the GPU machine and ncu is on PATH. |
| Analyze a CSV | frx profile --ncu report.csv --explain | A teammate, CI job, or previous run already produced the NCU export. |
| Analyze PTX | frx profile --ptx kernel.ptx | You want static risk signals without running on the target GPU. |

Report layout

| Section | What to look at |
| --- | --- |
| VERDICT | Primary bottleneck, secondary bottlenecks, and confidence note. |
| MEASURED METRICS | Every metric with [!!], [ !], [ok], or [--] plus the threshold that explains the status. |
| BOTTLENECKS DETECTED | All detected bottlenecks with numeric score and ASCII score bar. |
| RECOMMENDATIONS | Every ranked fix with why, actions, validation steps, caveats, tier, score, and triggering rule. |
| NEXT STEPS | The exact re-run command and the top priority fix name. |

Presets

Use full when you do not know the bottleneck yet. Use a narrower preset when NCU overhead matters or you are iterating on one class of issue.

frx profile --preset full --explain -- python train.py
frx profile --preset memory --kernel-name "my_kernel" --launch-count 5 -- python train.py
frx profile --preset occupancy --gpu-model H100 --out ncu_occupancy.csv -- python train.py

collect

frx collect

Runs a workload subprocess, samples GPU metrics in the background, imports profiler artifacts, runs the analysis pipeline, and writes a self-contained run bundle.

frx collect [OPTIONS] -- COMMAND [ARGS...]

Options:
  --name NAME              Human-readable job name (default: frx-run)
  --out DIR                Root output directory (default: runs)
  --run-id ID              Override auto-generated run ID
  --artifact-dir DIR       Import artifacts from DIR after the workload exits.
                           May be repeated. Default: ./frx-job-run
  --no-profiler-import     Skip importing profiler_trace.json from artifact dirs
  --sample-interval-ms N   nvidia-smi polling interval in ms (default: 1000)
  --config FILE            Optional run_config.yaml to merge into bundle config
  --explain                Generate frx_summary.txt, frx_llm_prompt.txt, and frx_evidence.json
  --no-zip                 Skip creating the zip archive

Use --explain when you want the training workflow to end at the LLM handoff. After the run bundle is created and analyzed, Fournex generates the brief directly from that new run directory.

Artifact directory gotcha

If your workload writes profiler_trace.json somewhere other than frx-job-run/, pass that directory with --artifact-dir. Otherwise the trace exists on disk but will not be copied into the run bundle.

# Workload writes gpu-job-run-tiny-kernels/profiler_trace.json
frx collect \
  --name tiny-kernel-launch-overhead \
  --out runs \
  --sample-interval-ms 100 \
  --artifact-dir gpu-job-run-tiny-kernels \
  --explain \
  -- python tiny_kernel_launch_overhead.py --output-dir gpu-job-run-tiny-kernels

What it does

  1. Writes run_config.yaml and injects env vars into the workload process so the SDK auto-persists events to raw/trace.jsonl.
  2. Starts a background thread that polls nvidia-smi at --sample-interval-ms into gpu_metrics.csv.
  3. Runs the workload. Stdout and stderr are tee'd to optional_logs.txt.
  4. After the workload exits, copies artifacts from --artifact-dir into the bundle (marked [imported] in the summary).
  5. Runs the analysis pipeline over raw/trace.jsonl (or the imported profiler bundle if no SDK trace exists) and writes derived/summary.json.
  6. Writes metadata.json, manifest.json, and zips the bundle.
  7. When --explain is set, writes frx_summary.txt, frx_llm_prompt.txt, and frx_evidence.json.
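
Step 2 above describes a background thread that polls nvidia-smi at a fixed interval. The sketch below shows that pattern in isolation. The function names and the two-column CSV layout are assumptions made for illustration; the actual gpu_metrics.csv schema is not specified here. The nvidia-smi query flags used are standard ones.

```python
import csv
import subprocess
import threading
import time
from typing import Callable, List


def query_nvidia_smi() -> List[str]:
    """Read one utilization/memory sample per GPU from nvidia-smi."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=utilization.gpu,memory.used",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [line.strip() for line in out.splitlines() if line.strip()]


def sample_in_background(path: str, interval_ms: int, stop: threading.Event,
                         read_sample: Callable[[], List[str]] = query_nvidia_smi
                         ) -> threading.Thread:
    """Append samples to a CSV until stop is set; returns the started thread."""
    def loop() -> None:
        with open(path, "w", newline="") as fh:
            writer = csv.writer(fh)
            writer.writerow(["timestamp", "sample"])  # assumed layout
            while not stop.is_set():
                for row in read_sample():
                    writer.writerow([time.time(), row])
                fh.flush()
                stop.wait(interval_ms / 1000.0)  # wakes early when stop is set

    thread = threading.Thread(target=loop, daemon=True)
    thread.start()
    return thread
```

Waiting on the stop event instead of calling time.sleep lets the sampler shut down promptly when the workload exits, even with a long interval.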

Environment variables injected into the workload

| Variable | Value |
| --- | --- |
| FRX_RUN_ID | Generated run ID |
| FRX_JOB_NAME | --name value |
| FRX_OUTPUT_DIR | Absolute path to the run directory |
| FRX_RAW_TRACE_PATH | raw/trace.jsonl absolute path |
| FRX_DERIVED_SUMMARY_PATH | derived/summary.json absolute path |
| FRX_AUTO_PERSIST | 1 |
| FRX_SAMPLE_INTERVAL_MS | --sample-interval-ms value |
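
A workload can read these variables to learn whether it is running under frx collect. The sketch below gathers them into one object. CollectContext and read_collect_context are hypothetical helper names, not part of the Fournex SDK, and the fallback of 1000 ms mirrors the documented --sample-interval-ms default.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class CollectContext:
    run_id: Optional[str]
    job_name: Optional[str]
    output_dir: Optional[str]
    raw_trace_path: Optional[str]
    derived_summary_path: Optional[str]
    auto_persist: bool
    sample_interval_ms: int


def read_collect_context(env: Mapping[str, str] = os.environ) -> CollectContext:
    """Collect the FRX_* variables; missing values become None or defaults."""
    return CollectContext(
        run_id=env.get("FRX_RUN_ID"),
        job_name=env.get("FRX_JOB_NAME"),
        output_dir=env.get("FRX_OUTPUT_DIR"),
        raw_trace_path=env.get("FRX_RAW_TRACE_PATH"),
        derived_summary_path=env.get("FRX_DERIVED_SUMMARY_PATH"),
        auto_persist=env.get("FRX_AUTO_PERSIST") == "1",
        sample_interval_ms=int(env.get("FRX_SAMPLE_INTERVAL_MS", "1000")),
    )


ctx = read_collect_context()
print("running under frx collect:", ctx.run_id is not None)
```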

Example output

frx collect completed
Run bundle: runs/run-a1b2c3d4e5f6
Zip bundle: runs/run-a1b2c3d4e5f6.zip

Captured (11 files):
  metadata.json
  manifest.json
  run_config.yaml
  gpu_metrics.csv
  optional_logs.txt
  raw/trace.jsonl
  derived/summary.json
  profiler/profiler_trace.json  [imported]
  frx_summary.txt
  frx_llm_prompt.txt
  frx_evidence.json

analyze

frx analyze

Unified local analysis for run bundles, PTX, CUDA source, and Nsight Compute CSV exports. No GPU or PyTorch is required for static files, NCU CSV ingestion, or bundle analysis.

frx analyze [TARGET] [OPTIONS]

Arguments:
  TARGET     Optional path to a run directory, run zip, or one evidence file:
             .ptx, .cu, .cuh, .cuda, .csv

Options:
  --scope SCOPE               run | steady_state | auto
                              Applies to run bundle analysis.
  --before PATH               Auto-detect one baseline evidence file.
  --after PATH                Auto-detect one optimized evidence file.
  --before-source PATH        Baseline CUDA source file.
  --before-ptx PATH           Baseline PTX file.
  --before-ncu PATH           Baseline Nsight Compute CSV.
  --after-source PATH         Optimized CUDA source file.
  --after-ptx PATH            Optimized PTX file.
  --after-ncu PATH            Optimized Nsight Compute CSV.
  --before-label LABEL        Label for the before side.
  --after-label LABEL         Label for the after side.
  --baseline PATH             Deprecated alias for --before.
  --optimized PATH            Deprecated alias for --after.
  --json                      Output a stable JSON wrapper.
  --output-json               Alias for JSON output.
  --gpu-model NAME            Calibrate CUDA static launch advice to a GPU model.

Examples:
  frx analyze runs/run-abc123
  frx analyze runs/run-abc123.zip
  frx analyze kernel.cu
  frx analyze kernel.cu --gpu-model RTX4090
  frx analyze kernel.ptx
  frx analyze profile.csv
  frx analyze kernel.cu --output-json
  frx analyze --before before.csv --after after.csv
  frx analyze --before before.ptx --after after.ptx
  frx analyze --before before.csv --after after.csv --json
  frx analyze --before-source before.cu --before-ptx before.ptx --before-ncu before.csv \
              --after-source after.cu --after-ptx after.ptx --after-ncu after.csv \
              --before-label baseline --after-label optimized

Supported inputs

frx analyze auto-detects code and profiler files directly. Use this when you want a quick local answer without collecting a full run bundle.

| Input | Mode | Analyzer |
| --- | --- | --- |
| Run directory | run_bundle | Existing collected run bundle analysis |
| .zip | run_bundle | Existing zip bundle analysis |
| .ptx | ptx | analyze_ptx_text |
| .cu / .cuh / .cuda | cuda_source | inspect_cuda_source |
| .csv | ncu | analyze_ncu_csv_text |
| Text containing .entry or .version | ptx | PTX content fallback when extension is unclear |

frx analyze kernel.cu
frx analyze kernel.cu --gpu-model RTX4090
frx analyze kernel.ptx
frx analyze profile.csv
frx analyze kernel.cu --output-json

Comparison routing

For new CUDA before/after reviews, prefer frx compare baseline.cu optimized.cu. The older --before and --after flags still auto-detect one evidence file per side and remain useful for CSV-only or PTX-only comparisons. --baseline and --optimized remain deprecated aliases.

Human reports

| Mode | Default terminal report |
| --- | --- |
| Single analysis | File type, diagnostic confidence, primary bottleneck, secondary bottlenecks, top recommendations, key evidence |
| CUDA source | Kernel count, launch count, findings by severity, launch advisor highlights |
| PTX | Kernel count, target, register pressure, spills, global/shared memory mix, FP64 usage, recommendations |
| NCU | Kernel count, measured bottleneck rank, DRAM/cache/issue/occupancy/stall evidence, recommendations |
| Comparison | Winner, score delta, dimensions won, resolved findings, new regressions, most important metric deltas |

JSON wrappers

--json or --output-json prints the raw result object under a stable mode wrapper for every supported path. Single-file analysis and comparison modes use the same contract.

{"mode": "ptx", "result": ...}
{"mode": "cuda_source", "result": ...}
{"mode": "ncu", "result": ...}
{"mode": "run_bundle", "result": ...}
{"mode": "comparison", "result": ...}

Run bundle data source priority

analyze picks the best available data source in this order:

  1. derived/summary.json — pre-analyzed, preferred
  2. raw/trace.jsonl — re-analyzed on the fly
  3. profiler/profiler_trace.json + gpu_metrics.csv — imported and analyzed
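
The priority order above can be expressed as a simple first-match lookup. This is a sketch of the rule, not the analyzer's code: pick_data_source and its return labels are hypothetical, and it checks only for the profiler trace in the third case because the page does not say whether gpu_metrics.csv is mandatory.

```python
from pathlib import Path
from typing import Optional


def pick_data_source(run_dir) -> Optional[str]:
    """Return a label for the highest-priority data source present in run_dir."""
    root = Path(run_dir)
    if (root / "derived" / "summary.json").is_file():
        return "derived_summary"   # pre-analyzed, preferred
    if (root / "raw" / "trace.jsonl").is_file():
        return "raw_trace"         # re-analyzed on the fly
    if (root / "profiler" / "profiler_trace.json").is_file():
        return "profiler_import"   # imported profiler artifacts
    return None                    # nothing analyzable found


print(pick_data_source("runs/run-abc123"))
```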

Example output

--------------------------------------------------------
  GPU Autopilot - Run Analysis
  Run  : run-a1b2c3d4e5f6
  Scope: steady_state  (28 steps)
--------------------------------------------------------

VERDICT
  Primary Bottleneck : input_bound
  Internal Signal    : underutilized_gpu (symptom)
  Confidence         : high (0.88)
  Reason             : input_bound leads the ranking and matches the dominant stall summary.

EVIDENCE
  - Average DataLoader wait fraction is 0.825.
  - Run summary dominant stall type is input_bound.

PERFORMANCE SNAPSHOT
  Avg GPU Utilization : 1.3%
  Avg Memory Util     : 12.0%
  Peak Memory Pressure: 0.14
  Avg Step Time       : 207.000 ms
  Throughput          : 4.8 steps/sec
  Dominant Stall      : input_bound

TOP RECOMMENDATIONS (3 of 5)

  1. [HIGH] Increase DataLoader num_workers
     Effort: low  |  Risk: low  |  Score: 0.84
     DataLoader wait is the dominant stall ...

When underutilized_gpu is the internal top signal but a stall type (e.g. input_bound) is also present, the verdict displays the root cause. The raw internal signal is shown on the Internal Signal line.

Launch-bound traces and near-zero GPU samples

For tiny-kernel workloads, nvidia-smi sampling can report near-zero GPU utilization even when the profiler captured many CUDA kernels. Treat that as bursty GPU activity rather than proof that no GPU work ran. The launch-bound report uses profiler evidence such as kernel_count_per_step, median_cuda_kernel_duration_us, small_kernel_fraction, and stable shapes when available.

VERDICT
  Primary Bottleneck : launch_bound
  Confidence         : medium (0.65)

EVIDENCE
  - Profiler saw about 840.0 CUDA kernels per step with median duration 4.200 us.
  - GPU utilization sampling stayed low, which is expected for bursty tiny-kernel workloads.
  - Shapes were stable, so compile or CUDA graph mitigations are viable.
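
The profiler-side signals named above can be derived from a list of per-kernel durations. The sketch below shows one plausible computation. The function name is hypothetical, and the 10 us cutoff for a "small" kernel is an assumption taken from the tiny-kernel example later on this page, not a documented analyzer constant.

```python
from statistics import median
from typing import Dict, Sequence


def launch_signals(kernel_durations_us: Sequence[float], steps: int,
                   small_threshold_us: float = 10.0) -> Dict[str, float]:
    """Summarize kernel durations into launch-overhead indicators."""
    if not kernel_durations_us or steps <= 0:
        raise ValueError("need at least one kernel and one step")
    small = sum(1 for d in kernel_durations_us if d < small_threshold_us)
    return {
        "kernel_count_per_step": len(kernel_durations_us) / steps,
        "median_cuda_kernel_duration_us": median(kernel_durations_us),
        "small_kernel_fraction": small / len(kernel_durations_us),
    }


print(launch_signals([4.0, 4.2, 5.0, 50.0], steps=2))
```

A high kernel count per step combined with a small median duration is what distinguishes a launch-bound run from an idle GPU, which is why these values carry more weight than the sampled utilization.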

Zip bundle troubleshooting

If frx analyze cannot read a zip bundle, verify the archive contains at least one of derived/summary.json, raw/trace.jsonl, or profiler artifacts. Zip members with path traversal sequences are rejected automatically.
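
Both conditions can be checked before handing an archive to frx analyze. The following sketch uses only the standard zipfile module. check_bundle_zip is a hypothetical helper, and matching on path suffixes assumes the bundle files may sit under a run-<id>/ prefix inside the archive.

```python
import zipfile
from pathlib import PurePosixPath
from typing import Dict, List, Union


def check_bundle_zip(path_or_file) -> Dict[str, Union[bool, List[str]]]:
    """Report unsafe member names and whether any analyzable file is present."""
    with zipfile.ZipFile(path_or_file) as zf:
        names = zf.namelist()
    unsafe = [
        n for n in names
        if n.startswith("/") or ".." in PurePosixPath(n).parts
    ]
    has_data = any(
        n.endswith(("derived/summary.json", "raw/trace.jsonl")) or "profiler/" in n
        for n in names
    )
    return {"unsafe_members": unsafe, "has_analyzable_data": has_data}
```

Call it with a path such as runs/run-<id>.zip; an empty unsafe_members list and has_analyzable_data set to True suggest the archive has the layout the analyzer expects.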

Input errors

A missing path exits nonzero with a clear path error. Unsupported file types also exit nonzero and suggest the supported inputs: run directories, run zip bundles, PTX, CUDA source, and Nsight Compute CSV.

doctor

frx doctor

Checks that all runtime dependencies are present and configured. Exits with code 0 if all checks pass, 1 if any [FAIL] lines appear.

frx doctor

Checks performed

| Check | What it verifies |
| --- | --- |
| Python | Python version (always passes) |
| torch | PyTorch importable; reports version |
| CUDA available | torch.cuda.is_available(), GPU name and count |
| nvidia-smi | nvidia-smi on PATH (required for gpu_metrics.csv) |
| fournex.profiler | SDK profiler module importable |
| fournex.analysis | Analysis pipeline importable |

frx doctor

  [OK]    Python                               3.12.3
  [OK]    torch                                2.3.0+cu121
  [OK]    CUDA available                       NVIDIA A100 x1
  [OK]    nvidia-smi                           /usr/bin/nvidia-smi
  [OK]    fournex.profiler         importable
  [OK]    fournex.analysis         importable

All checks passed.

smoke-test

frx smoke-test

Writes a synthetic input-bound Chrome-format profiler trace, runs the full collect + analysis pipeline end-to-end in a temp directory, and verifies the bundle and diagnosis output. Useful for CI and confirming the install is working.

frx smoke-test

Checks performed:

  • Run directory and subdirs exist (raw/, derived/, profiler/)
  • derived/summary.json was generated
  • manifest.json is present
  • Zip bundle was created
  • Diagnosis produced primary_bottleneck == input_bound
  • At least one recommendation was generated

frx smoke-test

Running smoke test ...

  [PASS]  create run directory
  [PASS]  write synthetic profiler trace
  [PASS]  generate derived/summary.json
  [PASS]  manifest.json present
  [PASS]  zip bundle created
  [PASS]  primary_bottleneck == input_bound
  [PASS]  recommendations present
  [PASS]  no unexpected warnings

All smoke-test checks passed.

Tiny-kernel workload

Testing

Use this pattern to intentionally stress the launch_bound detector. The workload creates many short CUDA kernels per step so the profiler captures the overhead signal clearly. Run it with frx collect and expect the analyzer to diagnose launch_bound.

Purpose

This is a synthetic stress workload for validating the optimizer pipeline. It is not representative of a production training script — the inefficiencies are intentional.

Patterns that trigger kernel launch overhead

| Pattern | Why it produces many small kernels |
| --- | --- |
| Repeated elementwise ops in a Python loop | Each op (add, mul, relu …) dispatches a separate CUDA kernel with no fusion |
| Many small matrix multiplications | Each matmul is an independent kernel launch with minimal compute per launch |
| Python loop over tensors | The Python loop serializes dispatches and prevents the JIT from fusing across iterations |
| Small batch size | Reduces arithmetic intensity so each kernel finishes in microseconds, amplifying launch overhead relative to compute |

Example test script

"""tiny_kernel_launch_overhead.py — intentionally launch-bound workload."""
import os, torch
from fournex.profiler import profile_kernels

OUTPUT_DIR = os.environ.get("FRX_OUTPUT_DIR", "gpu-job-run-tiny-kernels")
STEPS      = 20
BATCH      = 4          # intentionally tiny
DIM        = 64         # intentionally tiny

device = torch.device("cuda")

with profile_kernels(output_dir=OUTPUT_DIR, steps=STEPS):
    for step in range(STEPS):
        x = torch.randn(BATCH, DIM, device=device)

        # Pattern 1 — repeated elementwise ops (no fusion)
        for _ in range(50):
            x = x + 0.01
            x = x * 1.001
            x = torch.relu(x)

        # Pattern 2 — many small matmuls in a Python loop
        for _ in range(20):
            w = torch.randn(DIM, DIM, device=device)
            x = x @ w

        # Pattern 3 — elementwise chain without compile
        x = x.sin().cos().exp().log1p()

        torch.cuda.synchronize()   # flush before next step

Collect command

frx collect \
  --name tiny-kernel-launch-overhead \
  --out runs \
  --sample-interval-ms 100 \
  --artifact-dir gpu-job-run-tiny-kernels \
  -- python tiny_kernel_launch_overhead.py

Expected diagnosis

VERDICT
  Primary Bottleneck : launch_bound
  Confidence         : high (0.85)
  Reason             : Many short CUDA kernels detected; kernel launch overhead
                       dominates compute time.

EVIDENCE
  - Profiler saw ~840 CUDA kernels per step with median duration 3-6 us.
  - Small kernel fraction > 80% (kernels under 10 us).
  - Shapes were stable across steps — compile and CUDA graph mitigations viable.
  - GPU utilization sampling near-zero (expected for bursty tiny-kernel workloads).

TOP RECOMMENDATIONS
  1. [HIGH]  Use torch.compile (reduce-overhead mode)
             Fuses elementwise chains and reduces dispatch count.
  2. [HIGH]  Fuse operations manually or use torch.vmap / torch.func
             Combine the inner loops into batched ops.
  3. [MEDIUM] Increase batch size
             More work per kernel raises arithmetic intensity.
  4. [MEDIUM] Consider CUDA Graphs
             Captures the static kernel sequence and replays without Python overhead.

Applying the mitigations

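import torch

DIM    = 64
BATCH  = 256   # Mitigation 3: was 4; a larger batch amortizes launch cost
device = torch.device("cuda")

# Mitigation 1: torch.compile (fuses elementwise ops automatically)
@torch.compile(mode="reduce-overhead")
def forward(x):
    for _ in range(50):
        x = x + 0.01
        x = x * 1.001
        x = torch.relu(x)
    return x

# Mitigation 2: batch independent matmuls into a single op.
# Note: this sums 20 independent projections of x. It is NOT numerically
# equivalent to the chained loop (for _ in range(20): x = x @ w), where each
# matmul depends on the previous result. For a fixed chain, precompute the
# weight product once (for example with torch.linalg.multi_dot) instead.
weights = torch.randn(20, DIM, DIM, device=device)
x = forward(torch.randn(BATCH, DIM, device=device))
x = torch.einsum("bd,nde->be", x, weights)   # one kernel, not 20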

Verifying the fix

Re-run frx collect after applying mitigations. The diagnosis should shift away from launch_bound — kernel count per step will drop and median kernel duration will increase as operations are fused. Use frx tune --bottleneck launch_bound --no-safe to let the runner benchmark torch.compile and CUDA Graphs automatically.

tune

frx tune

Autopilot

Runs the experiment runner: captures a baseline, focuses candidate configs from a bottleneck diagnosis, validates safety before each trial, measures an explicit benchmark window, rejects quality regressions, and writes reproducible artifacts. The command remains recommendation-only; it does not rewrite your training config.

frx tune [OPTIONS] -- COMMAND [ARGS...]

Options:
  --name NAME              Job name for output directories (default: frx-tune)
  --out DIR                Root output directory (default: runs)
  --max-trials N           Max candidate configs to try (default: 12)
  --safe                   Tier-0 only: dataloader knobs (default)
  --no-safe                Also try Tier-1: batch size and mixed precision
  --time-budget-s N        Kill trial after N seconds (default: 60)
  --warmup-steps N         Steps to skip before measuring (default: 5)
  --measure-steps N        Steps to include in measurement (default: 20)
  --repeat-count N         Repeats per baseline and candidate (default: 1)
  --no-race                Disable quick candidate screening
  --race-promote-count N   Candidates promoted from race to full benchmark (default: 3)
  --race-warmup-steps N    Warmup steps for quick screening (default: 1)
  --race-measure-steps N   Measurement steps for quick screening (default: 5)
  --bottleneck LABEL       Focus candidates manually
  --min-speedup FLOAT      Minimum improvement to recommend (default: 0.08 = 8%)
  --allow-risky-actions    Allow high-risk candidates
  --no-quality-checks      Do not require quality checks for precision changes
  --max-final-loss-regression FLOAT
  --max-loss-divergence FLOAT
  --output-abs-tolerance FLOAT
  --allow-nonfinite-loss
  --sample-interval-ms N   GPU sampling interval (default: 1000)

Safety tiers

| Tier | Actions | Flag | Guardrails |
| --- | --- | --- | --- |
| 0 (Safe) | num_workers, pin_memory, prefetch_factor, persistent_workers | --safe (default) | Exit code, step count, throughput not zero |
| 1 (Validated) | batch_size, AMP fp16/bf16 | --no-safe | Same as Tier 0 + memory ratio < 90%, step time regression < 10% |
| 2 (Risky) | distributed tuning, custom kernels | Not yet implemented | Requires explicit user approval |

Current implementation adds allocator candidates in the safe tier and runtime candidates such as torch.compile and CUDA Graphs in the validated tier when their preconditions pass.

Staged search order

Candidates are generated in stages so the trial budget is spent efficiently — no brute-force grid across all knob combinations.

Screen   race pass    short benchmark of all candidates, then promote top N
Stage 1  dataloader   num_workers × pin_memory grid + prefetch_factor variants
Stage 2  batch size   1.25×, 1.5×, 2× baseline  (--no-safe required)
Stage 3  precision    bf16 (Ampere+), fp16        (--no-safe required)

Race-stage trials are screening signals only. The final winner must still come from a full benchmark and pass the normal guard, quality, and noise checks.
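
The race stage amounts to ranking candidates on a short benchmark and keeping the top few for full measurement. A minimal sketch of that selection, assuming throughput in steps/sec is the ranking key; promote_from_race is a hypothetical name and the runner's real ranking may consider more than throughput.

```python
from typing import Dict, List


def promote_from_race(race_throughput: Dict[str, float],
                      promote_count: int = 3) -> List[str]:
    """Return the names of the promote_count fastest race candidates."""
    ranked = sorted(race_throughput.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, _ in ranked[:promote_count]]


race = {"dl:nw=0": 4.86, "dl:nw=2": 5.35, "dl:nw=4": 5.73, "dl:nw=8": 5.67}
print(promote_from_race(race))
```

The promoted names only earn a full benchmark; none of them is a winner until it clears the guard, quality, and noise checks.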

Recommendations vs. tune trials

Recommendations are diagnosis-driven fix cards. They are ranked by signal strength, expected impact, effort, and risk, but they are not proof that a change already improved your workload. Tune trials are executable config candidates that the runner actually benchmarks.

| Surface | Source | Use it for |
| --- | --- | --- |
| Recommendation | Diagnosis + rule catalog | Prioritizing what to inspect or test next |
| Race trial | Short benchmark window | Screening candidates before full measurement |
| Full tune trial | Full benchmark window + guardrails | Choosing the recommendation-only winner |

Diagnosis-focused candidates

The runner now focuses candidates from the baseline diagnosis when it can read one from derived/summary.json. Use --bottleneck to override that focus manually.

| Bottleneck | Candidate family |
| --- | --- |
| input_bound | DataLoader workers, pin_memory, prefetch_factor |
| copy_bound | Pinned-memory-focused DataLoader candidates |
| launch_bound | torch.compile and CUDA Graphs when --no-safe is enabled |
| memory_pressure | CUDA allocator settings, then mixed precision when --no-safe is enabled |
| underutilized_gpu | Batch size, mixed precision, then runtime candidates when --no-safe is enabled |

Pre-run safety validation

Unsafe candidates are skipped before execution. They still get a trial directory with config.yaml, metrics.json, and stderr.log explaining the rejection reason.

| Check | Rejects when |
| --- | --- |
| Risk policy | Candidate is high risk and --allow-risky-actions is not set |
| Batch size | Memory headroom is below the safe threshold |
| Precision | CUDA is unavailable, bf16 is unsupported, or quality checks are required |
| CUDA Graphs | Shapes appear dynamic or CUDA is unavailable |
| torch.compile | Compile is marked unsupported or dynamic behavior is incompatible |

Benchmark window

Each trial writes an explicit benchmark_window.json. Metrics prefer measurement_window when per-step data is available, then fall back to steady_state and full run metrics.

benchmark_window.json
{
  "warmup_steps": 5,
  "measurement_steps": 20,
  "repeat_count": 1,
  "timeout_s": 60
}
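
The window translates into a slice over per-step timings: skip the warmup steps, then measure the next block. The sketch below computes throughput that way. measured_throughput is a hypothetical helper and assumes step durations in seconds; how the runner aggregates metrics internally is not specified here.

```python
from typing import Sequence


def measured_throughput(step_times_s: Sequence[float],
                        warmup_steps: int = 5,
                        measurement_steps: int = 20) -> float:
    """Steps per second over the measurement window, excluding warmup."""
    window = step_times_s[warmup_steps:warmup_steps + measurement_steps]
    if not window:
        raise ValueError("run ended before the measurement window started")
    return len(window) / sum(window)


# Five slow warmup steps followed by twenty steady steps of 0.25 s each.
times = [1.0] * 5 + [0.25] * 20
print(measured_throughput(times))
```

Excluding warmup matters because the first steps often include one-time costs such as worker startup or compilation, which would otherwise understate steady-state throughput.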

Env vars injected per trial

Each trial subprocess receives the standard FRX_* collect vars plus these tune-specific ones. The workload reads them to configure itself — see the SDK integration section for how to wire them up.

| Variable | Set by | Purpose |
| --- | --- | --- |
| FRX_TUNE_WARMUP_STEPS | tune runner | Steps to skip before measurement; workload should exit early |
| FRX_TUNE_MEASURE_STEPS | tune runner | Measurement steps requested |
| FRX_TUNE_MAX_STEPS | tune runner | Total steps (warmup + measure); workload exits at this count |
| FRX_TUNE_REPEAT_COUNT | tune runner | Repeat count used for noise-aware comparison |
| FRX_NUM_WORKERS | dataloader tuner | DataLoader num_workers value to use |
| FRX_PIN_MEMORY | dataloader tuner | 'true' or 'false' |
| FRX_PREFETCH_FACTOR | dataloader tuner | DataLoader prefetch_factor value |
| FRX_PERSISTENT_WORKERS | dataloader tuner | 'true' or 'false' |
| FRX_BATCH_SIZE | batch size tuner | Absolute batch size to use (Tier 1) |
| FRX_AMP_DTYPE | mixed precision tuner | 'bfloat16' or 'float16' (Tier 1) |
| FRX_TORCH_COMPILE | runtime tuner | Enable torch.compile when supported |
| FRX_TORCH_COMPILE_MODE | runtime tuner | Compile mode such as reduce-overhead |
| FRX_CUDA_GRAPHS | runtime tuner | try_if_static_shapes |
| PYTORCH_CUDA_ALLOC_CONF | memory tuner | CUDA allocator configuration |

Trial artifacts

runs/
  tune-<id>/
    baseline/
      config.yaml
      benchmark_window.json
      metrics.json
      stdout.log
      stderr.log
      derived/summary.json
      raw/trace.jsonl
    race/
      <candidate-id>/
        config.yaml
        benchmark_window.json
        metrics.json
        stdout.log
        stderr.log
    <candidate-id>/
      config.yaml
      benchmark_window.json
      metrics.json
      stdout.log
      stderr.log
    autopilot_report.json
    report.md

Workload integration

The workload reads the injected env vars and applies them. The minimal pattern for dataloader tuning:

import os

num_workers     = int(os.environ.get("FRX_NUM_WORKERS", "4"))
pin_memory      = os.environ.get("FRX_PIN_MEMORY", "true") == "true"
prefetch_factor = int(os.environ.get("FRX_PREFETCH_FACTOR", "2"))
persistent      = os.environ.get("FRX_PERSISTENT_WORKERS", "true") == "true"
max_steps       = int(os.environ.get("FRX_TUNE_MAX_STEPS", "0")) or None

loader = DataLoader(
    dataset,
    batch_size=batch_size,
    num_workers=num_workers,
    pin_memory=pin_memory,
    prefetch_factor=prefetch_factor if num_workers > 0 else None,
    persistent_workers=persistent and num_workers > 0,
)

for step, batch in enumerate(loader):
    if max_steps and step >= max_steps:
        break
    # ... training step ...

For AMP and batch size (--no-safe):

import torch, os

amp_dtype_str = os.environ.get("FRX_AMP_DTYPE")          # "bfloat16" | "float16" | None
amp_dtype     = getattr(torch, amp_dtype_str, None) if amp_dtype_str else None
batch_size    = int(os.environ.get("FRX_BATCH_SIZE", "32"))

with torch.autocast("cuda", dtype=amp_dtype, enabled=amp_dtype is not None):
    loss = model(batch)

Quality regression gates

A faster candidate is rejected if quality metrics regress. Loss is read from step_end.payload.loss when the workload emits it, and output drift checks are used when present in the summary quality fields.

| Gate | Default | Flag |
| --- | --- | --- |
| Final loss vs baseline | Reject if worse by more than 5% | --max-final-loss-regression |
| Trial loss divergence | Reject if final loss grows more than 50% | --max-loss-divergence |
| NaN/Inf loss | Reject | --allow-nonfinite-loss |
| Output absolute drift | Reject above 0.005 when reported | --output-abs-tolerance |
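
The loss gates can be read as a small predicate over three numbers. The sketch below encodes the documented defaults. quality_gate is a hypothetical function; it assumes positive loss values and interprets "grows more than 50%" as growth relative to the trial's own first recorded loss, which is an assumption the page does not confirm.

```python
import math
from typing import List


def quality_gate(baseline_final_loss: float,
                 trial_first_loss: float,
                 trial_final_loss: float,
                 max_final_loss_regression: float = 0.05,
                 max_loss_divergence: float = 0.50,
                 allow_nonfinite_loss: bool = False) -> List[str]:
    """Return the list of rejection reasons; an empty list means the trial passes."""
    if not math.isfinite(trial_final_loss):
        return [] if allow_nonfinite_loss else ["non-finite loss"]
    reasons = []
    if trial_final_loss > baseline_final_loss * (1 + max_final_loss_regression):
        reasons.append("final loss regression")
    if trial_final_loss > trial_first_loss * (1 + max_loss_divergence):
        reasons.append("loss divergence")
    return reasons


# A final loss of 1.2 against a baseline of 1.0 exceeds the 5% allowance.
print(quality_gate(1.0, 2.0, 1.2))
```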

Example output

frx autopilot — starting tune run tune-3f8a12b4
Workload : python train.py
Max trials: 12  |  Time budget: 60s/trial

Running baseline...
  Baseline: 4.8 steps/sec  (exit=0, steps=25)

Generated 8 candidates

Running quick race stage (1 warmup + 5 measure steps)...
  [1/8] race: dl:nw=0,pin=T ...
  [RACE]  dl:nw=0,pin=T                        +1.2%  (exit=0, steps=6)
  [2/8] race: dl:nw=2,pin=T ...
  [RACE]  dl:nw=2,pin=T                        +11.4% (exit=0, steps=6)
  [3/8] race: dl:nw=4,pin=T ...
  [RACE]  dl:nw=4,pin=T                        +19.3% (exit=0, steps=6)
  ...
Quick race promoted 3 of 8 candidates

  [1/3] full: dl:nw=4,pin=T ...
  [PASS]  dl:nw=4,pin=T                        +19.3% (exit=0, steps=25)
  [2/3] full: dl:nw=8,pin=T ...
  [PASS]  dl:nw=8,pin=T                        +18.1% (exit=0, steps=25)
  [3/3] full: amp:fp16 ...
  [FAIL]  amp:fp16                              +28.0% (exit=0, steps=25)
       ! quality regression: final loss 1.2 exceeds baseline 1 by more than 5%

Report saved: runs/tune-3f8a12b4/autopilot_report.json
Markdown report saved: runs/tune-3f8a12b4/report.md

──────────────────────────────────────────────────────────
  frx autopilot — Tune Report
  Job    : frx-tune
  Trials : 8 candidates + baseline
──────────────────────────────────────────────────────────

BASELINE
  Throughput   : 4.80 steps/sec
  Avg step     : 208.3 ms
  GPU util     : 1.3%
  Dominant stall: input_bound

TRIAL RESULTS
  dl:nw=4,pin=T                        +19.3% ✓
  dl:nw=8,pin=T                        +18.1% ✓
  dl:nw=4,pin=T                        [RACE] +19.3%  promoted to full benchmark
  dl:nw=8,pin=T                        [RACE] +18.1%  promoted to full benchmark
  dl:nw=2,pin=T                        [RACE] +11.4%  screened out by quicker candidates
  dl:nw=0,pin=T                        [RACE] +1.2%   screened out by quicker candidates

WINNER
  Config       : dl:nw=4,pin=T
  Throughput   : 5.73 steps/sec  (+19.3% vs baseline)
  Avg step     : 174.5 ms
  GPU util     : 4.1%

ENV VARS TO APPLY
  FRX_NUM_WORKERS=4
  FRX_PIN_MEMORY=true
  FRX_PERSISTENT_WORKERS=true
  FRX_PREFETCH_FACTOR=2

Applied: No — recommendation only
To apply: set the env vars above before launching your workload.

Promotion thresholds

A candidate is promoted only if it clears all of these. Noisy sub-threshold improvements are not recommended.

| Guard | Default | Flag |
| --- | --- | --- |
| Minimum throughput improvement | ≥ 8% | --min-speedup |
| Peak GPU memory ratio | < 90% | |
| Step time regression | < 10% worse than baseline | |
| Exit code | 0 (clean exit) | |
| Minimum steps captured | ≥ 3 | |
| Quality gates | Loss and numerics must pass | quality flags above |
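
As an illustration of how these guards combine, here is a hedged sketch that checks every guard and reports which ones failed. The dictionary keys (steps_per_sec, avg_step_ms, peak_memory_ratio, exit_code, steps, quality_ok) are hypothetical names chosen for this example, not fields confirmed by the report schema.

```python
def promotion_failures(baseline, trial, min_speedup=0.08):
    """Return the names of the guards a trial fails; empty means it can be promoted."""
    failures = []
    speedup = trial["steps_per_sec"] / baseline["steps_per_sec"] - 1.0
    if speedup < min_speedup:
        failures.append("speedup below threshold")
    if trial["peak_memory_ratio"] >= 0.90:
        failures.append("peak memory ratio")
    if trial["avg_step_ms"] >= baseline["avg_step_ms"] * 1.10:
        failures.append("step time regression")
    if trial["exit_code"] != 0:
        failures.append("non-zero exit code")
    if trial["steps"] < 3:
        failures.append("too few steps captured")
    if not trial["quality_ok"]:
        failures.append("quality gates failed")
    return failures

# Throughput and step time come from the example report; the rest are placeholders.
baseline = {"steps_per_sec": 4.80, "avg_step_ms": 208.3}
winner = {"steps_per_sec": 5.73, "avg_step_ms": 174.5, "peak_memory_ratio": 0.41,
          "exit_code": 0, "steps": 25, "quality_ok": True}
print(promotion_failures(baseline, winner))
```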

Current boundary

Repeated trials now use median throughput and a measured noise band. Interleaved ordering such as baseline A, trial, baseline B is still future comparator work.

Bundle layout

Each collect run produces one directory under --out (default runs/) and a zip of it.

runs/
  run-<id>/
    metadata.json            # Run metadata, artifact list, warnings
    manifest.json            # Included files, limited-data flag
    run_config.yaml          # Collector config + detected environment
    gpu_metrics.csv          # nvidia-smi samples (util %, memory, clocks)
    optional_logs.txt        # Combined workload stdout + stderr
    raw/
      trace.jsonl            # SDK event stream (one JSON object per line)
    derived/
      summary.json           # Pre-analyzed output — preferred by analyzer
    profiler/
      profiler_trace.json    # Chrome-format torch.profiler trace (imported)
  run-<id>.zip               # All of the above, zipped for upload

Root-layout compatibility

frx analyze also accepts zip archives that omit the run-<id>/ prefix and place derived/summary.json, raw/trace.jsonl, and profiler artifacts at the archive root. Both layouts are detected automatically.
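
One way to picture the layout detection is to look for a known bundle file either at the archive root or one directory down. This is a sketch under that assumption, using only the standard library; find_bundle_prefix is a hypothetical helper, not part of the frx API.

```python
import io
import zipfile

# Files that identify a bundle, from the layout above.
BUNDLE_FILES = ("derived/summary.json", "raw/trace.jsonl", "profiler/profiler_trace.json")

def find_bundle_prefix(names):
    """Return "" for the root layout, "run-<id>/" style prefix otherwise, or None."""
    for name in names:
        for known in BUNDLE_FILES:
            if name == known:
                return ""
            if name.endswith("/" + known):
                prefix = name[: -len(known)]
                if prefix.count("/") == 1:  # exactly one leading directory
                    return prefix
    return None

# Build a tiny in-memory archive to exercise the prefixed layout.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("run-abc/derived/summary.json", "{}")
with zipfile.ZipFile(buf) as zf:
    print(repr(find_bundle_prefix(zf.namelist())))
```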

File roles

| File | Source | Required for analysis |
| --- | --- | --- |
| derived/summary.json | Generated by collect | Preferred (fastest path) |
| raw/trace.jsonl | SDK auto-persist | Yes, if no derived summary |
| profiler/profiler_trace.json | Imported from --artifact-dir | Fallback if no SDK trace |
| gpu_metrics.csv | nvidia-smi poller | Enriches GPU util data |
| metadata.json | Generated by collect | No (informational) |
| run_config.yaml | Generated by collect | No (informational) |
| optional_logs.txt | Workload stdout/stderr | No (debugging) |

The web analyzer scores bundle files when you upload multiple files at once. derived/summary.json scores highest (120 pts) and is used automatically when present.

Analysis pipeline

The analysis pipeline is pure Python with no GPU required. It accepts the SDK event stream or events reconstructed from a Chrome-format profiler trace, and produces a structured summary dict.

from fournex.analysis import summarize_run_with_steady_state

summary = summarize_run_with_steady_state(events)
# summary["steady_state"]["diagnosis"]["user_facing_bottleneck"]
# → "input_bound"

Summary shape

| Key | Description |
| --- | --- |
| event_count | Total events in the input stream |
| step_count | Steps detected across the full run |
| selector | steady_state window policy (skip_first_n, last_k) |
| run | Scope object for all steps |
| steady_state | Scope object for warm-up-excluded steps |
| scope_comparison | Whether primary bottleneck changed between scopes |

Each scope object contains per_step (timing breakdown per step), run_summary (aggregated metrics), bottlenecks (scored list), and diagnosis (primary bottleneck + recommendations).

Symptom vs. root cause

underutilized_gpu often scores highest (the GPU is idle) but it is a symptom, not a cause. When a stall-type bottleneck (e.g. input_bound) is also present, the diagnosis.user_facing_bottleneck field is set to that root cause. The internal top signal is preserved in diagnosis.primary_bottleneck.

{
  "primary_bottleneck":     "underutilized_gpu",   // internal top signal
  "user_facing_bottleneck": "input_bound",          // shown to users
  ...
}
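
The symptom-versus-cause rule can be sketched as follows. This is an illustration only: the set of labels treated as stall-type root causes is an assumption, and build_diagnosis is a hypothetical name.

```python
# Assumption: these labels count as stall-type root causes.
STALL_LABELS = ("input_bound", "copy_bound", "sync_bound")

def build_diagnosis(bottlenecks):
    """bottlenecks is a list of {"label": str, "score": float} dicts."""
    if not bottlenecks:
        return {"primary_bottleneck": None, "user_facing_bottleneck": None}
    ranked = sorted(bottlenecks, key=lambda b: b["score"], reverse=True)
    primary = ranked[0]["label"]
    user_facing = primary
    if primary == "underutilized_gpu":
        # Prefer the highest-scoring stall as the cause shown to users.
        for b in ranked:
            if b["label"] in STALL_LABELS:
                user_facing = b["label"]
                break
    return {"primary_bottleneck": primary, "user_facing_bottleneck": user_facing}

print(build_diagnosis([
    {"label": "underutilized_gpu", "score": 0.9},
    {"label": "input_bound", "score": 0.7},
]))
```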

Framework Abstraction Tax

Runtime summaries can include framework_abstraction_tax when profiler windows are available. The score estimates GPU idle time that is not already explained by input, copy, or sync stalls, then scales it by launch-stream fragmentation.

| Signal | Interpretation |
| --- | --- |
| score | 0-100 headline value; current bands are reasoned V1 thresholds, not speedup-calibrated |
| severity | Low, medium, or high label derived from score bands |
| contributors | Drivers such as kernel launch fragmentation plus inferred graph-capture or fusion opportunities |
| inferred | true when Fournex is suggesting an opportunity rather than asserting framework state |

The key guardrail is subtraction of data-pipeline idle: an input-bound workload should not be mislabeled as framework tax just because the GPU is idle.

System components

Fournex has separate analyzers for runtime events, source, PTX, NCU CSVs, and before/after evidence. The recommendation layer sits above those analyzers so CLI, API, and web reports use the same rules and catalog entries.

| Component | Role |
| --- | --- |
| ncu_analysis.py | Ingests wide-format Nsight Compute CSV, extracts warp stall types, DRAM/L1/L2 throughput, tensor core utilization, ISU, occupancy, launch-resource metrics, and kernel attribution. Produces NcuResult with bottleneck classification, ranked recommendations, and top kernel opportunities. |
| kernel_inspector.py | Static CUDA inspector using pure-regex source analysis for memory access, thread indexing, sync hazards, data types, resource pressure, and NCU kernel duration aliases. Its source summary feeds CUDA antipattern rules and comparison scoring. |
| kernel_attribution.py | Computes per-kernel runtime share, MFU percentage, roofline region, opportunity label, and opportunity_score from NCU kernel summaries and architecture profiles. |
| ptx_analysis.py | Pure-regex PTX parser for virtual register pressure, spill detection, instruction mix, memory operations, tensor operations, and control-flow risk. |
| reconciliation.py | Merges source, PTX, NCU, and profiler signals into 6 unified diagnoses with confidence labels. Exposed through POST /reconcile. |
| framework_abstraction_tax.py | Computes a conservative framework abstraction tax score from runtime summaries by isolating GPU idle not explained by input, copy, or sync stalls and scaling it by launch-stream fragmentation. |
| recommendations/engine.py | Combines extracted signals, YAML rules, and catalog entries. It attaches current_value to validation steps and threads estimated_speedup_pct_min/max through recommendation output. |
| explain.py | Builds explain results for NCU CSVs and training run directories, then renders frx_summary.txt, frx_llm_prompt.txt, and frx_evidence.json. |
| bench.py | Runs compile_kernel, time_binary with warmup-discarded wall-clock timings, optional profile_with_ncu, and bench_compare. Produces frx_bench_v0 output. |

Data flows

These are the main execution paths behind the CLI. All three produce structured dictionaries first, then render human output for the terminal or files.

Analyze flow

frx analyze TARGET
  -> detect input type
  -> analyze bundle, CUDA source, PTX, or NCU CSV
  -> generate recommendations
  -> print diagnosis or JSON wrapper

Explain flow

frx profile --ncu report.csv --explain
  -> analyze NCU counters
  -> reuse the in-memory ncu_result
  -> build the bottleneck-specific brief
  -> write frx_summary.txt
  -> write frx_llm_prompt.txt
  -> write frx_evidence.json

frx collect --explain -- python train.py
  -> collect and analyze the new run directory
  -> build the training telemetry brief
  -> write frx_summary.txt, frx_llm_prompt.txt, frx_evidence.json

Bench flow

frx bench bad.cu good.cu
  -> compile_kernel x2 with nvcc
  -> time_binary x2 with warmup + measured runs
  -> optional profile_with_ncu x2 to collect CSV evidence
  -> bench_compare
  -> print speedup table + bottleneck diff

Architecture-Aware Scoring

GPU model selection is a scoring input, not a display hint. Fournex now detects common NVIDIA product names automatically when a GPU name is available. Pass --gpu-model only when you need to override the detected model or analyze code for a different target deployment.

# Automatic when the environment reports a known GPU name
NVIDIA RTX 5060 -> rtx5060
NVIDIA H100 80GB HBM3 -> h100
NVIDIA A100-SXM4-80GB -> a100

# Explicit override still has highest precedence
frx analyze kernel.cu --gpu-model H100
frx compare baseline.cu optimized.cu --gpu-model RTX4090
frx profile --ptx kernel.ptx --gpu-model L4

| Family | Examples | Why it matters |
| --- | --- | --- |
| sm_75 | Turing / T4 | Lower shared-memory threshold, older tensor-core alignment expectations |
| sm_80 | Ampere / A100 | Fallback thresholds when no GPU model is detected or supplied |
| sm_86 | RTX 30xx | Ampere-family consumer GPU limits |
| sm_89 | Ada / L4 / RTX 40xx | Ada launch and resource thresholds |
| sm_90 | Hopper / H100 | Higher shared-memory tolerance, relaxed register pressure, Hopper tensor-core alignment |
| sm_100 | Blackwell datacenter | Blackwell-family placeholder for datacenter parts |
| sm_120 | Blackwell consumer / RTX 50xx | Consumer Blackwell calibration, including RTX 5060-class limits |

Product names are accepted too: T4, A100, H100, L4, RTX4090, RTX5060, and similar aliases resolve to an SM family. If no model is detected or provided, Fournex uses Ampere defaults, which may flag shared-memory usage that is fine on Hopper.
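
The name-to-family resolution described above might look roughly like the sketch below. The alias table covers only the product names mentioned in this section, and the tokenizing approach is an assumption made for illustration; resolve_gpu_model and sm_family are hypothetical helper names.

```python
import re

# Product alias -> SM family, limited to names that appear in this section.
SM_FAMILY = {
    "t4": "sm_75", "a100": "sm_80", "l4": "sm_89",
    "rtx4090": "sm_89", "h100": "sm_90", "rtx5060": "sm_120",
}
DEFAULT_FAMILY = "sm_80"  # Ampere defaults when nothing is detected or supplied

def resolve_gpu_model(name):
    """Map a reported GPU name such as "NVIDIA RTX 5060" to a known alias, or None."""
    tokens = re.findall(r"[a-z0-9]+", (name or "").lower())
    # Also try adjacent tokens joined, so "RTX 5060" matches "rtx5060".
    candidates = tokens + [a + b for a, b in zip(tokens, tokens[1:])]
    for candidate in candidates:
        if candidate in SM_FAMILY:
            return candidate
    return None

def sm_family(detected_name=None, override=None):
    """An explicit override wins over the detected name; otherwise fall back to Ampere."""
    alias = resolve_gpu_model(override) or resolve_gpu_model(detected_name)
    return SM_FAMILY.get(alias, DEFAULT_FAMILY)

print(resolve_gpu_model("NVIDIA H100 80GB HBM3"), sm_family("NVIDIA RTX 5060"))
```

Matching whole tokens rather than substrings avoids accidental hits inside longer product strings.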

| Signal | Architecture effect |
| --- | --- |
| Shared memory | Large shared-memory usage can be risky at 48 KB on Turing but acceptable near 96 KB on Hopper. |
| Register pressure | Ampere flags lower register-variable counts than Hopper, where larger kernels can be normal. |
| Tensor core alignment | WMMA paths expect 16-multiple alignment; Hopper WGMMA paths can require 64-multiple alignment. |

CUDA analysis schemas

The PyTorch SDK event IR remains the run-bundle event model. CUDA analysis adds two output schemas beside it for explain and bench workflows.

frx_explain_v0

Produced by build_explain_result(ncu_result, static_result). It is designed for human summaries, LLM prompts, and dashboard ingestion.

| Field | Description |
| --- | --- |
| schema | Schema name: frx_explain_v0 |
| layers_available | Which layers contributed evidence |
| primary_diagnosis | Top diagnosis selected for the brief |
| diagnoses | All diagnosis objects and confidence labels |
| key_metrics | Profiler and NCU metrics used in the explanation |
| static_findings | CUDA source findings included in the brief |
| ncu_bottlenecks | NCU bottleneck labels with score 0-1 |
| top_kernels | NCU kernels ranked by opportunity score with MFU, roofline region, and runtime share |
| roofline | NCU roofline region and MFU summary |
| occupancy_summary | NCU occupancy limiter and efficiency breakdown |
| top_recommendations | Recommendation cards with id, title, priority, tier, score, estimated_speedup_pct_min/max, why, first two actions, and validation_steps |
| missing_data | Missing layer and reason entries for evidence the user should collect next |

frx_bench_v0

Produced by bench_compare(before_src, after_src, ...). It reports compile status, timing distributions, optional NCU evidence diffs, and the computed speedup.

| Field | Description |
| --- | --- |
| schema | Schema name: frx_bench_v0 |
| arch | Architecture or GPU model used for compilation/profiling |
| before | Source path plus timing: median_ms, min_ms, max_ms, stdev_ms, runs, and warmup |
| after | Same timing object for the optimized side |
| speedup_x | before.median_ms / after.median_ms; values greater than 1 mean after is faster |
| ncu_diff | null or diff_ncu_runs output when optional NCU profiling is enabled |
| compile_errors | Compilation failures by side |
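
To make the timing fields concrete, here is a small sketch that builds a timing object from raw wall-clock samples and derives speedup_x from the two medians. The helper names are hypothetical, and discarding the first warmup samples is an assumption about how warmup is applied.

```python
import statistics

def timing_summary(samples_ms, warmup):
    """Summarize measured runs after discarding the first `warmup` samples."""
    measured = samples_ms[warmup:]
    return {
        "median_ms": statistics.median(measured),
        "min_ms": min(measured),
        "max_ms": max(measured),
        "stdev_ms": statistics.stdev(measured) if len(measured) > 1 else 0.0,
        "runs": len(measured),
        "warmup": warmup,
    }

def speedup_x(before, after):
    """Values greater than 1 mean the `after` side is faster."""
    return before["median_ms"] / after["median_ms"]

before = timing_summary([9.0, 4.0, 4.4, 3.8], warmup=1)
after = timing_summary([5.0, 2.0, 2.1, 1.9], warmup=1)
print(speedup_x(before, after))
```

Using the median rather than the mean keeps a single slow run from dominating the comparison.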

validation_step shape

{
  "metric": "NCU metric name",
  "label": "human-readable metric label",
  "direction": "decrease | increase | stable",
  "threshold_good": 50.0,
  "expected": "rises above 50%",
  "current_value": 7.0
}

current_value is a float when measured from the actual NCU run and null when that metric was not in the CSV.

Bottleneck labels

| Label | Display name | Signal |
| --- | --- | --- |
| input_bound | Input Pipeline Starvation | DataLoader wait ≥ 20% of step time |
| copy_bound | Host-to-Device Copy Overhead | H2D copy time ≥ 15% of step time |
| sync_bound | Synchronization Overhead | Sync wait ≥ 10% of step time |
| underutilized_gpu | GPU Under-utilization | GPU utilization < 35% (symptom) |
| memory_pressure | Memory Pressure | Peak memory ratio ≥ 90% |
| shape_instability | Shape Instability | Shape volatility ratio ≥ 30% |
| launch_bound | Kernel Launch Overhead | Profiler windows with many short kernels, stable shapes, low sampled util, and no dominant input/copy/sync stall |
| memory_bandwidth_bound | Memory Bandwidth Saturation | NCU DRAM throughput is high and memory stalls dominate |
| warp_stall_memory | Memory Warp Stalls | NCU dominant warp stall is a memory stall reason |
| warp_stall_sync | Synchronization Warp Stalls | NCU dominant warp stall is barrier or wait |
| l1_cache_thrashing | L1 Cache Thrashing | NCU L1 cache hit rate is below 40% |
| l2_cache_thrashing | L2 Cache Thrashing | NCU L2 cache hit rate is below 50% without L1-heavy cache misses |
| uncoalesced_access | Uncoalesced Global Loads | NCU global load sectors per request is above 4 |
| tensor_core_underutilized | Tensor Core Under-utilization | NCU tensor pipe utilization is low while occupancy is adequate |
| occupancy_limited | Occupancy Limited | Measured achieved occupancy is below threshold |
| occupancy_limited_by_registers | Occupancy Limited by Registers | Low occupancy with registers identified as the launch resource limit |
| occupancy_limited_by_shared_memory | Occupancy Limited by Shared Memory | Low occupancy with shared memory per block limiting residency |
| occupancy_limited_by_block_size | Occupancy Limited by Block Size | Low occupancy from too few resident threads or blocks per SM |
| low_warp_scheduler_utilization | Low Warp Scheduler Utilization | Few eligible warps per scheduler cycle or low scheduler active percentage |
| low_issue_efficiency | Low Issue Efficiency | NCU issue slot utilization is below threshold |
| insufficient_telemetry | Insufficient Telemetry | No timing data and no GPU util samples |
| insufficient_ncu_data | Insufficient NCU Data | NCU CSV parsed but did not contain enough recognized metrics to classify |

Labels are stable identifiers used in summary.json, CLI output, and the web frontend. The recommendation engine maps each label to a set of ranked fix cards.
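
The numeric thresholds for the runtime labels can be expressed as a simple lookup. The sketch below only applies the documented cutoffs; the metric key names are hypothetical, and the real classifiers also score and rank labels, which this example does not attempt.

```python
# (label, metric key, comparison, threshold) using the cutoffs from the table above.
# Metric keys are hypothetical names; all values are fractions between 0 and 1.
RUNTIME_RULES = [
    ("input_bound", "dataloader_wait_frac", ">=", 0.20),
    ("copy_bound", "h2d_copy_frac", ">=", 0.15),
    ("sync_bound", "sync_wait_frac", ">=", 0.10),
    ("underutilized_gpu", "gpu_util_frac", "<", 0.35),
    ("memory_pressure", "peak_memory_ratio", ">=", 0.90),
    ("shape_instability", "shape_volatility_ratio", ">=", 0.30),
]

def runtime_labels(metrics):
    """Return every label whose threshold is crossed; missing metrics are skipped."""
    labels = []
    for label, key, op, threshold in RUNTIME_RULES:
        value = metrics.get(key)
        if value is None:
            continue
        if (op == ">=" and value >= threshold) or (op == "<" and value < threshold):
            labels.append(label)
    return labels

print(runtime_labels({"dataloader_wait_frac": 0.45, "gpu_util_frac": 0.013}))
```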

Recommendation reference

Recommendation IDs are stable keys from the rule catalog. They appear in API responses, JSON summaries, CLI recommendation cards, and withheld-recommendation explanations.

Validation commands

NCU recommendation cards now include a Validate section with the exact counter command to run, the expected direction of movement, the current value when present in the NCU CSV, and an optional target. This keeps fixes testable instead of relying on intuition.

Validate:
  ncu --metrics dram__throughput.avg.pct_of_peak_sustained_elapsed,l1tex__t_sector_hit_rate.pct \
      --csv ./report.csv ./your_app
  <-- DRAM throughput %: was 82.0; decreases as shared memory absorbs repeated global accesses
  --> L1 cache hit rate %: was 31.0; increases as shared-memory tiling reduces L1 misses (target: 60.0)

If a metric was not present in the NCU CSV, the was X prefix is omitted rather than fabricated. In JSON, each validation step includes current_value, which is numeric when known and null when the evidence was not collected.
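
A hedged sketch of how a validation step could be rendered into the CLI line format shown above. The arrow mapping and the use of threshold_good as the printed target are inferred from the sample output and are assumptions, as is the "---" marker for the stable direction.

```python
ARROWS = {"decrease": "<--", "increase": "-->"}

def render_validation_step(step):
    """Render one validation step; the "was X" prefix is omitted when the value is unknown."""
    arrow = ARROWS.get(step["direction"], "---")  # "---" for stable is an assumption
    parts = []
    if step.get("current_value") is not None:
        parts.append(f"was {step['current_value']}")
    parts.append(step["expected"])
    line = f"{arrow} {step['label']}: " + "; ".join(parts)
    if step.get("threshold_good") is not None:
        line += f" (target: {step['threshold_good']})"
    return line

step = {
    "metric": "l1tex__t_sector_hit_rate.pct",
    "label": "L1 cache hit rate %",
    "direction": "increase",
    "threshold_good": 60.0,
    "expected": "increases as shared-memory tiling reduces L1 misses",
    "current_value": 31.0,
}
print(render_validation_step(step))
```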

PTX static-analysis recommendations

| ID | Emitted for | Guidance |
| --- | --- | --- |
| rec_ptx_reduce_register_pressure | ptx_register_spills or ptx_register_pressure | Reduce live ranges, split large kernels, lower per-thread temporaries, or inspect compiler spill causes. |
| rec_ptx_stage_global_memory | ptx_global_memory_heavy | Stage reused global-memory data through shared memory or improve access locality when the access pattern allows it. |
| rec_ptx_reduce_fp64 | ptx_fp64_usage | Confirm FP64 is required; use FP32, TF32, FP16, or BF16 where numerical tolerance permits. |
| rec_ptx_reduce_branch_divergence | ptx_branch_divergence_risk | Reduce data-dependent branching, split divergent paths, or restructure predicates around warp-uniform work. |

PTX recommendation bundles use the ptx_static label so callers can distinguish static compiler-signature advice from profiler-measured runtime advice.

CLI/API output examples

CLI and API analysis responses share the same diagnosis-oriented fields: primary_bottleneck, ranked bottlenecks, ranked recommendations, and lower-level evidence for callers that need raw diagnostics.

CLI single-file examples

frx analyze kernel.ptx
frx analyze kernel.cu
frx analyze ncu_report.csv

frx analyze kernel.ptx --json
# {"mode": "ptx", "result": {...}}

CLI comparison examples

frx compare baseline.cu optimized.cu
frx compare baseline.cu optimized.cu --with-ptx --with-ncu
frx compare baseline.cu optimized.cu --ncu-a baseline.csv --ncu-b optimized.csv

# Older single-evidence comparison path remains available
frx analyze --before before.ptx --after after.ptx
frx analyze --before before.csv --after after.csv --json

NCU kernel attribution excerpt

{
  "kernel_attribution": {
    "has_runtime_share": true,
    "top_opportunities": [
      {
        "kernel_name": "fused_attention",
        "runtime_share_pct": 38.4,
        "mfu_pct": 14.2,
        "roofline_region": "memory_bound",
        "opportunity": "high",
        "opportunity_score": 0.847
      }
    ]
  }
}

Framework Abstraction Tax CLI block

FRAMEWORK ABSTRACTION TAX
  Score              : 74/100 (high)
  Contributors:
   - Kernel launch fragmentation
   - Missing graph capture (opportunity) (inferred)
   - Unfused elementwise operations (opportunity) (inferred)

Framework Abstraction Tax appears on runtime-analysis paths, not NCU-only CSV analysis. Inferred contributors are opportunities to investigate, not proof that graph capture or fusion is disabled.

PTX API response excerpt

{
  "findings": [
    {
      "kind": "register_spills_detected",
      "severity": "HIGH",
      "kernel": "matmul_kernel"
    }
  ],
  "primary_bottleneck": "ptx_register_spills",
  "bottlenecks": [
    {
      "label": "ptx_register_spills",
      "score": 0.95,
      "evidence": ["local memory spill loads/stores detected"]
    },
    {
      "label": "ptx_register_pressure",
      "score": 0.72,
      "evidence": ["virtual register count exceeds threshold"]
    }
  ],
  "recommendations": [
    {
      "id": "rec_ptx_reduce_register_pressure",
      "bundle": "ptx_static",
      "severity": "high",
      "reason": "Register spills outrank generic register pressure."
    }
  ],
  "bundles": ["ptx_static"]
}

Python helper shape

from fournex.ptx import analyze_ptx_text

result = analyze_ptx_text(ptx_text)

result["primary_bottleneck"]       # "ptx_register_spills"
result["bottlenecks"]              # ranked PTX bottleneck list
result["recommendations"]          # ranked recommendation cards
result["bundles"]                  # includes "ptx_static"

SDK integration

When collect wraps your training script, it sets FRX_AUTO_PERSIST=1 and injects the output path. The SDK hooks emit events automatically if you use the provided context managers or callbacks.

PyTorch training loop

from fournex import AutopilotSession

# model, optimizer, dataloader, and num_epochs come from your training script.
session = AutopilotSession.from_env()   # reads FRX_* env vars
global_step = 0

for epoch in range(num_epochs):
    batches = iter(dataloader)
    for _ in range(len(dataloader)):
        with session.step(step_id=global_step, step_kind="train"):
            with session.dataloader_span():
                batch = next(batches)   # fetch inside the span so the DataLoader wait is timed
            with session.forward_span():
                loss = model(batch)
            with session.backward_span():
                optimizer.zero_grad()
                loss.backward()
            with session.optimizer_span():
                optimizer.step()
        global_step += 1

session.flush()

If you already use torch.profiler, you can skip the SDK entirely and point --artifact-dir at the directory where the profiler writes its Chrome-format trace. The CLI will import and analyze it automatically.

Profiler-only workflow

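# In your training script, write the profiler trace to frx-job-run/
import torch
from torch.profiler import ProfilerActivity

profiler = torch.profiler.profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    on_trace_ready=torch.profiler.tensorboard_trace_handler("frx-job-run"),
)
# Wrap the training loop in `with profiler:` so the trace is written on exit.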

# Then collect and generate the LLM brief
frx collect --explain -- python train.py

# If the trace is written somewhere else, pass that directory explicitly
frx collect --artifact-dir gpu-job-run-tiny-kernels --explain -- python train.py

Note on data richness

SDK instrumentation produces the richest data: exact step boundaries, DataLoader wait times, and H2D copy spans are recorded precisely. Profiler-only mode reconstructs these from Chrome trace heuristics and may have lower confidence on some bottleneck types.

Static CUDA inspector

POST /cuda/static-inspect

Inspects one or more .cu or .cuh source files and returns per-kernel heuristic analysis without compiling or running the code. Useful for surfacing configuration risks early in the development cycle.

The antipattern library is YAML-driven: 22 rules across memory, synchronization, control flow, occupancy, and tensor-core categories. Each finding includes a stable code, severity, a plain-English message, and linked recommendations.

Static detection now follows common source patterns more closely. uncoalesced_access catches strided aliases such as int idx = tid * stride; src[idx], not only inline subscript math. sync_inside_tight_loop only fires when __syncthreads() calls are actually inside the loop body, so setup syncs before an unrelated loop stay silent.

CLI

frx analyze kernel.cu
frx analyze kernel.cu --gpu-model RTX4090
frx analyze kernel.cuh --output-json

Antipattern categories

| Category | Examples |
| --- | --- |
| Memory | Uncoalesced access, missing bounds guards, repeated global loads |
| Synchronization | Excessive barriers, suspicious sync placement |
| Control flow | Branch divergence risk and data-dependent paths |
| Occupancy | Sub-warp block sizes, non-warp-aligned block sizes, low theoretical occupancy |
| Tensor cores | Dimension and instruction patterns that block tensor-core use |

CUDA antipattern zoo

The demos/cuda_zoo/ directory contains paired bad/good kernels for learning and regression checks. Each bad kernel triggers a specific static rule; each good kernel clears it. Use these examples when you want to see what a finding looks like in actual CUDA source.

# Example: inspect the uncoalesced access pair
demos/cuda_zoo/01_uncoalesced/bad.cu
demos/cuda_zoo/01_uncoalesced/good.cu

# Run all four bad/good pairs without a GPU
./demos/cuda_zoo/run_zoo.ps1

Request shape

POST /cuda/static-inspect
Content-Type: application/json

{
  "gpu_model": "NVIDIA H100",
  "files": [
    { "filename": "kernel.cu", "content": "__global__ void my_kernel(...) { ... }" }
  ]
}

gpu_model is optional. When supplied, the inspector looks up hardware limits from a built-in preset table (device_limits_for_gpu) and feeds them to the existing occupancy estimator so advice is calibrated to the target device. Omit it to use generic defaults.

Response fields

| Field | Description |
| --- | --- |
| kernels | Detected __global__ kernel signatures |
| launch_configs | Launch parameters extracted from kernel<<<grid, block>>>() call sites |
| indexing_patterns | Thread/block indexing styles detected in kernel bodies |
| memory_tags | Memory access style tags (coalesced, strided, shared, etc.) |
| atomics | Atomic and reduction operation hints |
| shared_memory_warnings | Bank conflict risk, large static allocations, suspicious barrier patterns |
| launch_advice | Heuristic safe starting configurations for grid and block dimensions |
| occupancy | Estimated occupancy when gpu_model is supplied |

Occupancy and GPU model presets

When gpu_model is provided, the backend resolves hardware limits (max threads per block, shared memory per SM, register file size, warp size) from a preset table and passes them to the occupancy estimator. Occupancy is estimated from static metadata, not measured runtime occupancy.
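
For intuition, a static occupancy estimate of this kind usually takes the minimum number of resident blocks allowed by each resource. The sketch below is a textbook-style approximation, not the Fournex estimator: the limit keys and the numbers in EXAMPLE_LIMITS are illustrative placeholders rather than values from the preset table, and register allocation granularity is ignored.

```python
def estimate_occupancy(threads_per_block, regs_per_thread, smem_per_block, limits):
    """Estimate theoretical occupancy and name the limiting resource."""
    warp = limits["warp_size"]
    warps_per_block = -(-threads_per_block // warp)  # ceiling division
    max_warps = limits["max_threads_per_sm"] // warp
    blocks_by = {"warps": max_warps // warps_per_block,
                 "blocks": limits["max_blocks_per_sm"]}
    if regs_per_thread:
        regs_per_block = regs_per_thread * warps_per_block * warp
        blocks_by["registers"] = limits["registers_per_sm"] // regs_per_block
    if smem_per_block:
        blocks_by["shared_memory"] = limits["shared_mem_per_sm"] // smem_per_block
    limiter = min(blocks_by, key=blocks_by.get)  # first minimum wins on ties
    resident_blocks = blocks_by[limiter]
    return {"occupancy": resident_blocks * warps_per_block / max_warps,
            "limiter": limiter}

# Illustrative limits only; real presets depend on the GPU model.
EXAMPLE_LIMITS = {"warp_size": 32, "max_threads_per_sm": 2048,
                  "max_blocks_per_sm": 32, "registers_per_sm": 65536,
                  "shared_mem_per_sm": 98304}
print(estimate_occupancy(256, 64, 0, EXAMPLE_LIMITS))
```

Because the inputs are static metadata, the result is an upper bound to compare against measured achieved occupancy, not a replacement for it.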

Heuristic caveat

Static CUDA inspection is heuristic source analysis, not compiler validation. Launch suggestions are safe starting points for benchmarking, not proven optimal configurations. Always profile with frx profile to measure actual kernel behavior, or use frx collect when you need a full application run bundle.

Eval coverage

The analysis pipeline and recommendation engine are validated by a suite of deterministic evals that run against synthetic event streams and profiler traces. No live GPU is required.

| Area | Coverage |
| --- | --- |
| Bottleneck fixtures | Golden event-stream fixtures for each bottleneck label; verify primary_bottleneck and user_facing_bottleneck match expected values |
| Classifier boundaries | Threshold boundary tests for input, copy, sync, memory, and shape classifiers; labels must flip at documented signal thresholds |
| Confidence calibration | input_bound confidence checked at signal levels above and below the high-confidence threshold |
| Recommendation evals | Expected recommendation keys must be present; negative checks confirm absent recommendations when signal is below threshold |
| launch_bound evals | Kernel count, median duration, and small-kernel fraction combinations that should and should not trigger launch_bound |
| ExperimentRunner speedup | Deterministic synthetic summaries confirm winner selection, noise filtering, and quality gate rejection |

All assertions are data-driven: changing a classifier threshold or recommendation rule automatically updates which evals pass. Run pytest backend/tests/python -q to execute the full suite.

Nsight Compute Workflow

frx profile · frx ncu-command

Use frx profile for the normal path: it runs NCU when needed, analyzes the result, and prints the fix list in one command. Use frx ncu-command only when you want to copy the raw Nsight Compute command into CI or a separate shell.

Preset list

frx ncu-command --list
frx ncu-command --list --json

| Preset | Use it for | Primary evidence |
| --- | --- | --- |
| memory | DRAM bandwidth, cache hit rates, uncoalesced loads, and memory-related stalls | DRAM throughput, L1/L2 hit rates, global load sectors/request, memory throttle and scoreboard stalls |
| tensor | Tensor core utilization with basic occupancy and issue context | Tensor pipe activity, achieved occupancy, issue slot utilization |
| occupancy | Launch resource limits that explain low achieved occupancy | Warps active, block size, registers per thread, static and dynamic shared memory |
| stalls | Warp stall reasons across memory, synchronization, and scheduler pressure | Memory throttle, long scoreboard, barrier, wait, dispatch, not-selected, eligible-warps-per-scheduler, scheduler-active, and issue-active metrics |
| full | Broad capture when you do not know the bottleneck yet | Union of memory, tensor, occupancy, and stalls presets, including scheduler metrics |

Recommended capture path

# Simple path: run, analyze, and generate the LLM brief in one command
frx profile --preset full --explain -- python train.py

# Analyze an existing NCU export and generate the LLM brief
frx profile --ncu ncu_report.csv --explain

# Manual path: generate the NCU command, then analyze the CSV
frx ncu-command full --output ncu_report.csv -- python train.py

# Focus on memory evidence
frx ncu-command memory --output ncu_memory.csv -- python train.py

# Limit capture to a kernel name and a small launch window
frx ncu-command stalls \
  --kernel-name "my_kernel" \
  --launch-skip 10 \
  --launch-count 5 \
  --output ncu_stalls.csv \
  -- python train.py

# Then analyze the CSV locally
frx profile --ncu ncu_report.csv --explain
frx analyze ncu_report.csv
frx analyze ncu_report.csv --json

frx profile --ncu ncu_report.csv and frx analyze ncu_report.csv both validate the CSV before reporting. Missing metrics produce partial-evidence diagnostics instead of zero-filled results, and malformed CSVs exit nonzero with a clear parsing error.

WSL2 and WDDM caveat

Nsight Compute hardware counters can be restricted under WSL2 or Windows WDDM driver mode. If NCU reports missing or inaccessible performance counters, run on a Linux host or a Windows setup with supported counter access before treating the missing metrics as workload behavior.

NCU analysis

POST /ncu/analyze

Ingests Nsight Compute (NCU) CSV exports and returns structured hardware bottleneck analysis. Both long-format (Metric Name, Metric Value rows) and direct-column layouts produced by different NCU export modes are accepted.

Supported metrics

| Metric group | NCU counter (example) | Used for |
| --- | --- | --- |
| Warp stall reasons | smsp__warp_issue_stalled_* | warp_stall_memory and warp_stall_sync labels |
| DRAM throughput | dram__throughput.avg.pct_of_peak_sustained_elapsed | memory_bandwidth_bound label |
| L1 cache hit rate | l1tex__t_sector_hit_rate.pct | l1_cache_thrashing label |
| L2 cache hit rate | lts__t_sector_hit_rate.pct | l2_cache_thrashing label |
| Global load coalescing | l1tex__average_t_sectors_per_request_pipe_lsu_mem_global_op_ld | uncoalesced_access label |
| Tensor core utilization | sm__inst_executed_pipe_tensor_op_hmma.avg | tensor_core_underutilized label |
| Issue slot utilization | sm__issue_active.avg.pct_of_peak_sustained_active | low_issue_efficiency label |
| Scheduler utilization | smsp__warps_eligible.avg.per_cycle_active, smsp__warps_active.avg.pct_of_peak_sustained_active | low_warp_scheduler_utilization label in stalls and full presets |
| Occupancy / launch metadata | achieved_occupancy.pct, sm__warps_active.avg | launch_efficiency dimension in comparison scorecard |
| Kernel duration | duration, gpu__time_duration_sum | runtime_share_pct and time-weighted kernel opportunity scoring |

Recommended NCU command

frx profile --preset full --explain -- python train.py
frx profile --preset memory --explain -- python train.py
frx profile --ncu ncu_report.csv --explain

frx ncu-command full --output ncu_report.csv -- python train.py
frx ncu-command memory --output ncu_memory.csv -- python train.py
frx ncu-command occupancy --launch-count 10 --output ncu_occupancy.csv -- python train.py

NCU prefixes its CSV output with ==PROF== preamble lines. These are stripped automatically before parsing. Extra metric columns beyond the documented set are silently ignored, and missing metrics are reported as partial evidence instead of being treated as zeros. Malformed CSV input fails clearly in frx analyze ncu_report.csv.
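
If you pre-process NCU output yourself, the preamble stripping can be approximated with the standard library. This is a minimal sketch rather than the Fournex parser: parse_ncu_csv is a hypothetical helper, and the sample text uses placeholder preamble wording.

```python
import csv
import io

def parse_ncu_csv(text):
    """Drop ==PROF== preamble lines and blank lines, then parse the remaining CSV."""
    lines = [ln for ln in text.splitlines()
             if ln.strip() and not ln.startswith("==PROF==")]
    rows = list(csv.DictReader(io.StringIO("\n".join(lines))))
    if not rows:
        raise ValueError("no CSV rows found after stripping the NCU preamble")
    return rows

sample = (
    "==PROF== (preamble text)\n"
    '"Kernel Name","Metric Name","Metric Value"\n'
    '"my_kernel","dram__throughput.avg.pct_of_peak_sustained_elapsed","82.0"\n'
)
rows = parse_ncu_csv(sample)
print(rows[0]["Metric Name"], rows[0]["Metric Value"])
```

Raising on an empty result mirrors the documented behavior of failing clearly instead of reporting zero-filled metrics.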

NCU 2026.x CSV format

Nsight Compute 2026.x changed the default CSV shape from tall format, one metric per row, to wide format, one kernel per row with metrics as columns. Fournex handles both automatically, including the final done line NCU writes to stdout after the CSV. If an older CSV-producing wrapper stops parsing, check the NCU version and export mode first.
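
The difference between the two shapes is easiest to see by normalizing tall rows into wide rows. The sketch below assumes tall exports carry "Metric Name" and "Metric Value" columns, as described earlier in this section, and groups rows by the "ID" and "Kernel Name" columns; to_wide is a hypothetical helper.

```python
def to_wide(rows):
    """Pivot tall rows (one metric per row) into one dict per kernel launch."""
    if not rows or not {"Metric Name", "Metric Value"} <= set(rows[0]):
        return rows  # already wide, or empty
    kernels = {}
    for row in rows:
        key = (row.get("ID"), row.get("Kernel Name"))
        kernel = kernels.setdefault(
            key, {"ID": row.get("ID"), "Kernel Name": row.get("Kernel Name")}
        )
        kernel[row["Metric Name"]] = row["Metric Value"]
    return list(kernels.values())

tall = [
    {"ID": "0", "Kernel Name": "my_kernel", "Metric Name": "l1tex__t_sector_hit_rate.pct", "Metric Value": "31.0"},
    {"ID": "0", "Kernel Name": "my_kernel", "Metric Name": "lts__t_sector_hit_rate.pct", "Metric Value": "45.0"},
]
print(to_wide(tall))
```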

Kernel attribution

When NCU includes per-kernel duration, Fournex adds kernel_attribution to the NCU result. The top opportunities list ranks kernels by the work most likely to matter, not just by the worst-looking individual metric.

result["kernel_attribution"]["top_opportunities"]
# [
#   {
#     "kernel_name": "fused_attention",
#     "runtime_share_pct": 38.4,
#     "mfu_pct": 14.2,
#     "roofline_region": "memory_bound",
#     "opportunity": "high",
#     "opportunity_score": 0.847
#   }
# ]

| Field | Meaning |
| --- | --- |
| opportunity_score | runtime_share x severity x mfu_gap when duration is available; severity-only fallback otherwise |
| has_runtime_share | true when NCU provided Duration and scores are time-weighted |
| runtime_share_pct | Percent of total measured kernel time attributed to this kernel |
| mfu_pct | Model FLOP utilization estimate for compute opportunity context |
| roofline_region | Coarse region such as memory_bound or compute_bound |
| opportunity | Human label such as high, medium, low, or none |

Memory-bound kernels use high severity, low-MFU compute kernels retain meaningful opportunity, and well-utilized kernels score near zero. If has_runtime_share is false, do not compare scores across kernels as a time-weighted priority list.
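
The shape of the score can be illustrated with a plain product of the three factors. This sketch is not the Fournex formula: the severity input and the definition of mfu_gap as 1 - mfu_pct/100 are assumptions, and this unnormalized product will not reproduce the 0.847 shown in the excerpt, so any scaling the real implementation applies is not captured here.

```python
def opportunity_score(severity, mfu_pct, runtime_share_pct=None):
    """Unnormalized sketch: runtime_share x severity x mfu_gap, or severity alone."""
    if runtime_share_pct is None:
        return severity  # severity-only fallback when duration is unavailable
    mfu_gap = 1.0 - mfu_pct / 100.0        # assumption: gap to full utilization
    return (runtime_share_pct / 100.0) * severity * mfu_gap

# A kernel with a large runtime share and low MFU outranks a small, efficient one.
big_slow = opportunity_score(severity=1.0, mfu_pct=14.2, runtime_share_pct=38.4)
small_fast = opportunity_score(severity=1.0, mfu_pct=85.0, runtime_share_pct=5.0)
print(big_slow > small_fast)
```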

Request

POST /ncu/analyze
Content-Type: application/json

{
  "csv_text":  "...",           // raw NCU CSV text
  "gpu_model": "NVIDIA H100"   // optional — passed to occupancy estimator
}

Detected bottleneck labels

Label | Signal
memory_bandwidth_bound | DRAM throughput utilization exceeds saturation threshold
warp_stall_memory | Memory stalls dominate the warp stall breakdown
warp_stall_sync | Sync stalls dominate the warp stall breakdown
l1_cache_thrashing | L1 hit rate below 40%
l2_cache_thrashing | L2 hit rate below 50% when the L1-heavy cache-miss signal is absent
uncoalesced_access | Global load sectors per request above 4
tensor_core_underutilized | Tensor core utilization low when FP16/BF16 work is present
occupancy_limited | Measured achieved occupancy is below threshold
occupancy_limited_by_registers | Low occupancy with registers identified as the limiting launch resource
occupancy_limited_by_shared_memory | Low occupancy with shared memory per block limiting residency
occupancy_limited_by_block_size | Low occupancy from too few resident threads or blocks per SM
low_warp_scheduler_utilization | Few eligible warps per scheduler cycle or low scheduler active percentage
low_issue_efficiency | Issue slot utilization (IPC proxy) below threshold
insufficient_ncu_data | Not enough metrics in the CSV to classify

Warp stall breakdown

NCU exports per-reason stall metrics (e.g., smsp__warp_issue_stalled_*). The analyzer sums all stall categories, identifies the dominant reason, and maps it to warp_stall_memory or warp_stall_sync. The full percentage breakdown is returned in the response for all detected stall categories.

memory_stall_fraction is magnitude-based: a kernel where 5% of warp cycles are stalled on memory reports 0.05, not 1.0. This lets thresholds be set in natural units rather than normalized fractions.
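
The difference between the percentage breakdown and the magnitude-based fraction is easy to see in code. In this sketch the grouping of stall reasons into memory and sync sets is an assumption, the short reason names stand in for full NCU metric names, and stall_summary is a hypothetical helper.

```python
# Assumed groupings; the analyzer's actual mapping is not listed on this page.
MEMORY_REASONS = {"long_scoreboard", "lg_throttle", "mio_throttle"}
SYNC_REASONS = {"barrier", "membar"}

def stall_summary(stall_pct: dict[str, float]) -> dict:
    """stall_pct maps a stall reason to the percent of warp cycles stalled on it."""
    total = sum(stall_pct.values())
    if total <= 0:
        return {"dominant": None, "label": None, "breakdown_pct": {}, "memory_stall_fraction": 0.0}
    dominant = max(stall_pct, key=stall_pct.get)
    if dominant in MEMORY_REASONS:
        label = "warp_stall_memory"
    elif dominant in SYNC_REASONS:
        label = "warp_stall_sync"
    else:
        label = None
    memory_pct = sum(v for k, v in stall_pct.items() if k in MEMORY_REASONS)
    return {
        "dominant": dominant,
        "label": label,
        # Share of all stalls: relative, always sums to 100.
        "breakdown_pct": {k: 100.0 * v / total for k, v in stall_pct.items()},
        # Magnitude: 5% of warp cycles stalled on memory reports 0.05.
        "memory_stall_fraction": memory_pct / 100.0,
    }

summary = stall_summary({"long_scoreboard": 5.0, "barrier": 1.0, "not_selected": 2.0})
print(summary["label"], summary["memory_stall_fraction"])
```

Here memory is the dominant reason at 62.5% of all stalls, yet memory_stall_fraction stays at 0.05 because only 5% of warp cycles were stalled on memory.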

Memory access diagnosis

Cache and coalescing issues are reported separately so the first fix is clearer. l1_cache_thrashing fires when L1 hit rate is below 40%. l2_cache_thrashing fires when L2 hit rate is below 50% and the L1-heavy miss signal is absent. uncoalesced_access fires when global loads average more than 4 sectors per request.

Occupancy diagnosis

Occupancy bottlenecks use measured achieved occupancy first when NCU provides it. Launch-resource metrics then explain why occupancy is low, mapping register pressure, shared memory per block, or insufficient resident threads/blocks to the specific occupancy_limited_by_* labels. If measured occupancy is absent, Fournex falls back to launch-resource estimates when those inputs are available.

Multi-kernel aggregation

Real NCU exports often contain partial metrics across multiple kernels. Aggregation ignores missing metrics instead of treating them as zeros, so a kernel without a counter does not dilute the average for that counter.

Case | Behavior
Missing metric on a kernel | Excluded from that metric average
L2-only kernel | Counts as having NCU data
Warp stall counters present on some kernels | Stall fractions average only over kernels with stall data
Summary field | kernels_with_warp_stall_data reports the stall-sampled kernel count

Recommendation integration

NCU signals are wired into the existing recommendation engine. Rules cover DRAM saturation, L1 and L2 cache pressure, uncoalesced global loads, tensor core underuse, occupancy limits, and warp efficiency. Recommendations follow the same ranked card format as other bottleneck types and appear alongside CLI-based recommendations when both data sources are present.

PTX analysis

POST /ptx/analyze

Pure static analysis of PTX — NVIDIA's virtual ISA, the intermediate representation between CUDA C++ and machine code. No GPU, driver, or NVIDIA tooling is required beyond the PTX text itself.

analyze_ptx_text() now emits actionable diagnosis fields in addition to raw findings: bottlenecks, primary_bottleneck, secondary_bottlenecks, recommendations, and bundles.

Request

POST /ptx/analyze
Content-Type: application/json

{ "ptx_text": "..." }   // raw PTX from nvcc --ptx or cuobjdump --ptx

Per-kernel analysis fields

Field | Description
register_count | Virtual register count; 64-bit types counted as 2
spills_detected | True if .local memory depot usage is present
spill_loads | Number of .local load instructions
spill_stores | Number of .local store instructions
instruction_mix | Counts per category across 17 instruction types
branch_count | Conditional branch instruction count
has_back_edge | Whether a loop back-edge (loop) was detected
capability_flags | Detected capabilities: fp64, tensor_cores, atomics, sfu
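
Several of these fields can be approximated with regular expressions over the PTX text. The sketch below is a rough illustration, not the analyzer's implementation: ptx_static_counts is a hypothetical helper, it handles only the .reg %name<N> declaration form, and it counts predicate registers as one each, which is an assumption.

```python
import re

REG_DECL = re.compile(r"\.reg\s+\.(\w+)\s+%\w+<(\d+)>")
WIDE_TYPES = {"b64", "u64", "s64", "f64"}  # 64-bit types count as 2

def ptx_static_counts(ptx_text: str) -> dict:
    """Count virtual registers, .local loads/stores, and predicated branches."""
    register_count = 0
    for type_name, count in REG_DECL.findall(ptx_text):
        register_count += int(count) * (2 if type_name in WIDE_TYPES else 1)
    spill_loads = len(re.findall(r"\bld\.local\b", ptx_text))
    spill_stores = len(re.findall(r"\bst\.local\b", ptx_text))
    # A conditional branch is a bra guarded by a predicate such as @%p1 or @!%p1.
    branch_count = len(re.findall(r"@!?%\w+\s+bra\b", ptx_text))
    return {
        "register_count": register_count,
        "spills_detected": spill_loads + spill_stores > 0,
        "spill_loads": spill_loads,
        "spill_stores": spill_stores,
        "branch_count": branch_count,
    }

# Hand-written fragment for illustration, not real compiler output.
sample_ptx = """
.reg .pred %p<2>;
.reg .b32 %r<5>;
.reg .b64 %rd<3>;
ld.local.u32 %r1, [%rd1];
st.local.u32 [%rd1], %r2;
@%p1 bra LOOP;
"""
print(ptx_static_counts(sample_ptx))
```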

Instruction categories

Category | Covers
fp32 / fp64 / fp16 | Float arithmetic by precision
integer | Integer arithmetic
global_load / global_store | Global memory accesses
shared_load / shared_store | Shared memory accesses
local_load / local_store | Local memory (spill region) accesses
tensor | wmma / mma tensor core instructions
sfu | sin, cos, rcp, sqrt approximations
atomic | Atomic read-modify-write operations
control_flow | branch, call, return
conversion / comparison | Type conversions and predicate comparisons
other | All remaining instructions

PTX bottleneck labels

Label | Signal | Typical recommendation
ptx_register_spills | Local-memory spill loads or stores detected | rec_ptx_reduce_register_pressure
ptx_register_pressure | Virtual register count exceeds pressure threshold | rec_ptx_reduce_register_pressure
ptx_global_memory_heavy | Global memory instructions dominate the static mix | rec_ptx_stage_global_memory
ptx_fp64_usage | FP64 instructions detected | rec_ptx_reduce_fp64
ptx_branch_divergence_risk | High conditional branch density or branch-heavy control flow | rec_ptx_reduce_branch_divergence

Findings

Finding | Severity | Trigger
register_spills_detected | HIGH | .local depot present (load or store count > 0)
very_high_register_count | HIGH | Virtual register count >= 64
high_register_count | MEDIUM | Virtual register count >= 48
fp64_detected | MEDIUM | fp64 instructions present
high_global_memory_ratio | MEDIUM | global_load + global_store > 30% of all instructions
high_branch_count | MEDIUM | Branch count exceeds threshold
special_function_ops | LOW | SFU instructions present
no_shared_memory_usage | LOW | Zero shared memory load or store instructions
tensor_ops_detected | LOW | wmma / mma instructions present
atomics_detected | LOW | Atomic instructions present
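
The numeric triggers in this table can be written as a small rule function. The sketch covers only the findings with stated thresholds and assumes very_high_register_count replaces high_register_count instead of firing alongside it; ptx_findings is a hypothetical helper.

```python
def ptx_findings(register_count: int, instruction_mix: dict[str, int]) -> list[tuple[str, str]]:
    """Return (finding, severity) pairs for the thresholds stated in the table."""
    findings = []
    # Assumption: only the more severe register finding is reported.
    if register_count >= 64:
        findings.append(("very_high_register_count", "HIGH"))
    elif register_count >= 48:
        findings.append(("high_register_count", "MEDIUM"))
    if instruction_mix.get("fp64", 0) > 0:
        findings.append(("fp64_detected", "MEDIUM"))
    total = sum(instruction_mix.values())
    global_ops = instruction_mix.get("global_load", 0) + instruction_mix.get("global_store", 0)
    if total > 0 and global_ops / total > 0.30:
        findings.append(("high_global_memory_ratio", "MEDIUM"))
    return findings

# Hypothetical instruction mix: 40 of 100 instructions touch global memory.
mix = {"fp32": 50, "integer": 10, "global_load": 30, "global_store": 10}
print(ptx_findings(52, mix))
```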

Virtual vs. physical registers

PTX register counts are virtual and pre-register-allocation. The CUDA compiler assigns physical registers during PTX-to-SASS compilation and may use fewer than the virtual count suggests. When NCU data is also available, hardware-measured register and occupancy values take precedence over PTX estimates in the comparison scorecard.

Implementation comparison

frx compare
POST /compare

frx compare compares two CUDA source files across source, PTX, and NCU layers and produces a scored diff with a verdict. Start source-only, then add PTX or NCU evidence when you need stronger conclusions.

CLI

frx compare baseline.cu optimized.cu
frx compare baseline.cu optimized.cu --with-ptx --with-ncu
frx compare baseline.cu optimized.cu --ncu-a baseline.csv --ncu-b optimized.csv
frx compare baseline.cu optimized.cu --gpu-model H100
frx compare baseline.cu optimized.cu --build-flags "-DBUILD_EXECUTABLE"
frx compare baseline.cu optimized.cu --json

Output sections

Section | Meaning
Winner | Verdict and scores when enough data is available
Resolved in B | Finding codes that disappeared in the optimized implementation
Regressions in B | New findings that appeared and should block blind promotion
Improved / Regressed | Scorecard dimensions where B wins or A still wins
Root causes in A | Cross-layer diagnoses from reconcile_evidence() with confidence labels
Still unknown | Specific missing evidence, scaled to the layers you did not provide
Upgrade hints | The exact next flag to add, such as --with-ptx or --with-ncu

Evidence escalation

--with-ptx compiles both sides with nvcc -ptx and unlocks register-efficiency evidence. --with-ncu compiles runnable executables and captures NCU evidence for DRAM, cache, tensor-core, occupancy, and stall dimensions. Use --ncu-a and --ncu-b when you already have CSVs and do not want the compare command to run profiling.

Missing evidence

After root-cause diagnoses, frx compare prints a Missing evidence section for low-confidence findings. It tells you which metrics would confirm the diagnosis, how confidence would upgrade if collected, and the narrow NCU command to run.

-- Missing evidence --
inefficient_global_memory_access
  confidence: low-medium -> medium-high if confirmed
  collect: l1tex__average_t_sectors_per_request_pipe_lsu_mem_global_op_ld
  ncu --metrics l1tex__average_t_sectors_per_request_pipe_lsu_mem_global_op_ld \
      --csv ./your_app > missing_memory.csv

Fallback:
  ncu --set full --csv ./your_app > full.csv

The loop is: static analysis suspects a problem, Missing evidence tells you exactly what to collect, you collect it, then rerun frx compare --ncu-a baseline.csv --ncu-b optimized.csv to move the diagnosis toward confirmed evidence.

API and integration callers can use what_evidence_is_missing(static=..., ptx=..., ncu=...) for the same JSON-friendly data.

Request

POST /compare
Content-Type: application/json

{
  "a": {
    "label":       "baseline",
    "cuda_source": "...",    // optional
    "ptx_text":    "...",    // optional
    "ncu_csv":     "..."     // optional
  },
  "b": {
    "label":       "optimized",
    "cuda_source": "...",
    "ptx_text":    "...",
    "ncu_csv":     "..."
  }
}

Response structure

Field | Description
static_diff | Structural differences: indexing patterns, shared memory use, atomics, bank conflict risk, findings resolved or introduced
ptx_diff | Register count delta, spill resolution or introduction, per-category instruction mix deltas, findings that changed
ncu_diff | Hardware metric deltas: DRAM throughput, L1/L2 hit rates, issue slot utilization, occupancy, warp stall shifts
scorecard | Four efficiency dimensions scored 0-1 for each side
verdict | overall_winner (a / b / tie), weighted scores, per-dimension winners
tradeoffs | Warnings where one dimension improves while another regresses, such as lower register pressure with worse achieved occupancy

Scorecard dimensions

Dimension | Weight | Based on
register_efficiency | 20% | Virtual register count and spill penalty; NCU physical count overrides when available
memory_efficiency | 30% | DRAM throughput utilization, L1/L2 cache hit rates, shared memory presence
compute_efficiency | 30% | Issue slot utilization (IPC proxy), tensor core usage, dominant warp stall reason
launch_efficiency | 20% | Theoretical occupancy vs. achieved, launch configuration quality from static analysis

Scoring notes

When spills are detected, register_efficiency applies a penalty multiplier: reg_score = raw_score * (1 - SPILL_PENALTY). When NCU data is provided for a side, hardware-measured register and occupancy values override PTX-derived estimates for that side's register_efficiency and launch_efficiency dimensions. A side with NCU data is therefore scored on measured evidence while a PTX-only side is scored on static estimates.
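
The weights and the spill penalty combine as in the sketch below. It assumes the overall score is a plain weighted sum of the four dimensions, and SPILL_PENALTY is set to an illustrative value because this page does not state it; register_score and weighted_score are hypothetical helpers.

```python
WEIGHTS = {
    "register_efficiency": 0.20,
    "memory_efficiency": 0.30,
    "compute_efficiency": 0.30,
    "launch_efficiency": 0.20,
}
SPILL_PENALTY = 0.5  # hypothetical value; not documented on this page

def register_score(raw_score: float, spills_detected: bool) -> float:
    """reg_score = raw_score * (1 - SPILL_PENALTY) when spills are present."""
    return raw_score * (1 - SPILL_PENALTY) if spills_detected else raw_score

def weighted_score(dimensions: dict[str, float]) -> float:
    """Assumed aggregation: weighted sum of the four 0-1 dimension scores."""
    return sum(WEIGHTS[name] * dimensions[name] for name in WEIGHTS)

# Hypothetical dimension scores for one side of a comparison.
side = {
    "register_efficiency": register_score(1.0, spills_detected=True),
    "memory_efficiency": 1.0,
    "compute_efficiency": 0.5,
    "launch_efficiency": 1.0,
}
print(round(weighted_score(side), 3))
```

With these inputs a spill costs the side 0.10 of its overall score, since register_efficiency carries a 20% weight.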

Tradeoff warnings

Before/after comparisons include tradeoffs when evidence moves in conflicting directions. Current warnings cover cases such as register pressure or spills improving while achieved occupancy regresses, occupancy improving while register pressure regresses, and large occupancy shifts that should be checked against wall-clock time.

Sample verdict

{
  "overall_winner": "b",
  "score_a": 0.61,
  "score_b": 0.78,
  "dimensions": {
    "register_efficiency": "b",
    "memory_efficiency":   "b",
    "compute_efficiency":  "b",
    "launch_efficiency":   "tie"
  }
}

Explain

frx explain

frx explain turns an NCU CSV or PyTorch training run directory into a compact optimization brief. Use it when you already have an artifact and want to generate the brief later. For a one-command workflow, prefer frx profile --explain or frx collect --explain.

# NCU CSV with optional CUDA source
frx explain report.csv
frx explain report.csv --source kernel.cu
frx explain report.csv --out explain_out

# Existing PyTorch training telemetry run
frx explain runs/my-run --out explain_out

# Preferred when you are profiling or collecting now
frx profile --ncu report.csv --explain
frx collect --explain -- python train.py

NCU prompts include secondary issues, ranked top kernels for multi-kernel workloads, roofline region plus MFU, and the active occupancy limiter when present. Training prompts are bottleneck-specific and include measured runtime evidence.

Outputs

File | Purpose
frx_summary.txt | Human summary of the primary diagnosis, key evidence, and top fixes
frx_llm_prompt.txt | LLM-ready optimization brief with EXPECTED IMPROVEMENT, validation targets, and re-profile command
frx_evidence.json | Machine-readable frx_explain_v0 result for dashboards or custom tooling

Pipeline

frx explain report.csv
  -> auto-detect directory vs .csv
  -> analyze NCU counters or training telemetry
  -> build the bottleneck-specific result
  -> render frx_summary.txt
  -> render frx_llm_prompt.txt
  -> render frx_evidence.json

frx profile --ncu report.csv --explain
  -> analyze once
  -> reuse ncu_result
  -> render the same three files

Bench

frx bench

frx bench compiles two CUDA source files and wall-clock benchmarks them side by side. It is meant for quick before/after checks after you implement a kernel change.

frx bench bad.cu good.cu
frx bench bad.cu good.cu --with-ncu
frx bench bad.cu good.cu --gpu-model RTX5060
frx bench bad.cu good.cu --runs 20 --warmup 5

Pipeline

frx bench bad.cu good.cu
  -> compile_kernel x2 with nvcc
  -> time_binary x2 with warmup + measured runs
  -> optional profile_with_ncu x2
  -> bench_compare
  -> print speedup table + bottleneck diff

Output

Field | Meaning
before / after timing | median_ms, min_ms, max_ms, stdev_ms, runs, and warmup
speedup_x | before median divided by after median; greater than 1 means after is faster
ncu_diff | Optional bottleneck and metric diff when --with-ncu is enabled
compile_errors | Compilation failures captured per side
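
The timing fields and speedup_x follow directly from the per-run samples. The sketch below assumes warmup runs are the first entries of each sample list; timing_summary and speedup_x are hypothetical helpers and the timings are invented.

```python
import statistics

def timing_summary(samples_ms: list[float], warmup: int) -> dict:
    """Discard warmup runs, then summarize the measured runs."""
    measured = samples_ms[warmup:]
    return {
        "median_ms": statistics.median(measured),
        "min_ms": min(measured),
        "max_ms": max(measured),
        "stdev_ms": statistics.stdev(measured) if len(measured) > 1 else 0.0,
        "runs": len(measured),
        "warmup": warmup,
    }

def speedup_x(before: dict, after: dict) -> float:
    """Before median divided by after median; greater than 1 means after is faster."""
    return before["median_ms"] / after["median_ms"]

# Invented timings: the first run of each side is a slow warmup.
before = timing_summary([30.0, 12.0, 10.0, 11.0], warmup=1)
after = timing_summary([20.0, 5.0, 6.0, 5.5], warmup=1)
print(before["median_ms"], after["median_ms"], speedup_x(before, after))
```

Using the median keeps one slow outlier run from skewing the reported speedup.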

Timing requirement

frx bench uses wall-clock process timing. The benchmark binary must call cudaDeviceSynchronize() before exit or the host process may finish before GPU work is complete.

Evidence reconciliation

POST /reconcile

Reconciliation merges signals from any combination of source, PTX, NCU, and profiler layers into unified cross-layer diagnoses. It is useful for dashboards, IDE integrations, or review tools that want one confidence-ranked explanation instead of separate layer-specific findings.

POST /reconcile
Content-Type: application/json

{
  "static":   { "...": "result from CUDA source inspection" },
  "ptx":      { "...": "result from PTX analysis" },
  "ncu":      { "...": "result from NCU CSV analysis" },
  "profiler": { "...": "result from runtime profiling" }
}

Response shape

Field | Description
diagnoses | Unified cross-layer diagnoses with confidence labels such as medium, high, or confirmed
unreconciled | Layer-specific findings that did not yet have enough supporting evidence from other layers
layers_available | Which of source, PTX, NCU, and profiler evidence were present in the request

Confidence rises as independent layers agree. A source-only signal may be medium confidence; source plus PTX or NCU can move to high; strong multi-layer agreement can become confirmed.

Known gaps

These are current constraints to account for when designing a profiling workflow or interpreting results.

Gap | Impact
frx bench uses wall-clock timing | The benchmark binary must call cudaDeviceSynchronize() before exit for accurate GPU timing.
NCU profiling on Windows WDDM | Hardware-counter profiling often requires administrator privileges or switching supported GPUs to TCC mode.
RTX 5060 / sm_120 PC sampling | PC-sampling metrics such as smsp__pcsamplingdata_* are not available on RTX 5060-class sm_120 devices.
No Roofline or MFU in bench v0 | Bench reports wall-clock speedup and optional bottleneck diffs, but not Roofline placement or model FLOP utilization.
Framework Abstraction Tax V1 calibration | Severity bands are reasoned thresholds and are not yet validated against measured speedups.
Framework tax graph/fusion state | Graph capture and fusion contributors are inferred opportunities, not asserted facts from explicit framework state detection.
Framework tax on NCU-only paths | NCU-only analysis returns no framework_abstraction_tax because it needs runtime profiler windows.

Validation

What Fournex has confirmed in automated testing. CLI coverage now includes frx init, frx profile, direct frx analyze file inputs, frx compare, frx explain, frx bench, reconciliation, GPU model detection, NCU validation cards, and YAML-driven CUDA rules. Offline checks run without a live GPU unless a live NCU or bench path is explicitly requested.

Check | Result
Package integrity | v0.1.4 wheel includes recommendations/catalog.yaml and recommendations/rules.yaml, confirmed after the pyproject.toml package-data fix in v0.1.4
Clean install | frx smoke-test passes from a fresh pip install with no manual venv patching
Init command | frx init covers dependency checks, GPU model detection, training-script discovery, patch prompting, and already-instrumented script skips
Profile command | frx profile covers live NCU execution, existing --ncu CSV analysis, --ptx static analysis, presets, JSON output, kernel filtering, launch-window options, and --explain output
Collect explain workflow | frx collect --explain generates the LLM brief from the newly created run directory after collection completes
CLI single-input detection | frx analyze auto-detects .cu/.cuh, .ptx, Nsight Compute .csv, run directories, and zip bundles
Compare CLI | frx compare covers source-only comparison, --with-ptx, --with-ncu, pre-existing --ncu-a/--ncu-b CSVs, --build-flags, and --json output
Explain CLI | frx explain auto-detects NCU CSVs and training run directories, then emits frx_summary.txt, a bottleneck-specific frx_llm_prompt.txt, and frx_evidence.json
Bench CLI | frx bench compiles two .cu kernels, discards warmup runs, reports speedup_x, and can add optional NCU bottleneck diffs
Evidence reconciliation | 33 tests cover all 6 cross-layer diagnoses, confidence escalation, unreconciled tracking, false-positive guards, and the /reconcile API stub
Architecture-aware scoring | GPU model calibration covers sm_86 through sm_120, including Blackwell consumer sm_120, and product-name detection maps known NVIDIA GPU names to model keys
CUDA antipattern rules | 22 YAML-driven rules across memory, synchronization, control flow, occupancy, and tensor-core categories replace hardcoded static-detection conditions
CLI output and errors | Human output includes primary bottleneck and top recommendation; JSON output parses and includes mode; bad paths and unsupported file types exit nonzero with clear messages
NCU memory diagnostics | Regression tests cover L1-only thrashing, L2-only thrashing, sectors-per-request CSV parsing and score calibration, coalesced-kernel false-positive prevention, and uncoalesced recommendation routing
Recommendation validation | NCU recommendations thread validation_steps through the engine, attach current_value when measured, include estimated speedup ranges, and render exact Validate commands
FastAPI optional dependency | Comparison endpoint tests skip cleanly when httpx is not installed
NCU / PTX / comparison golden | Offline golden fixture passes: memory-bound spilling baseline identified correctly, optimized implementation ranked higher across all four efficiency dimensions
Classifier and eval suite | 131 tests passing: bottleneck fixtures, classifier boundary tests, confidence calibration, recommendation evals, launch_bound combinations, and ExperimentRunner speedup

Scope of Fournex today

Fournex is a diagnostic and recommendation tool, not an automatic CUDA optimizer. It identifies bottlenecks, surfaces evidence, and ranks fix candidates — but it does not rewrite kernels or apply changes automatically. Real speedup must be confirmed by re-profiling after code changes. Treat recommendations as informed starting points, not proofs.