CLI Quick Start

From zero to profiling in two minutes.

Install frx, then choose the evidence path that matches your workload. Profile CUDA kernels with NCU counters, or collect runtime telemetry from a PyTorch training run. Both paths can produce an LLM-ready brief.

Fournex provides lightweight Python runtime telemetry by default, with an optional native CUPTI/NVML backend (BUILD_NATIVE=1) for deeper kernel tracing. For exact hardware-counter claims, feed it an Nsight Compute CSV.

Prerequisites

Python 3.11+ is required. PyTorch is optional - the CLI works without it for static analysis. For live kernel profiling you need ncu (Nsight Compute) on the PATH of the machine where you collect data.

Python >= 3.11
NVIDIA GPU + CUDA driver (for live profiling)
Nsight Compute - ncu on PATH (for live profiling)
PyTorch (only when the SDK instruments a live training run)
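
As a rough sketch, the list above can be checked before installing anything. The helper names version_ge and preflight are illustrative and not part of frx; once the package is installed, frx doctor is the check this guide relies on.

```bash
#!/usr/bin/env bash
# Preflight sketch for the prerequisites listed above.
# version_ge and preflight are hypothetical helper names, not frx commands.

# Succeeds when dotted version $1 is >= dotted version $2 (major.minor only).
version_ge() {
  local have_major have_minor want_major want_minor
  IFS=. read -r have_major have_minor _ <<< "$1"
  IFS=. read -r want_major want_minor _ <<< "$2"
  (( have_major > want_major )) && return 0
  (( have_major == want_major && ${have_minor:-0} >= ${want_minor:-0} ))
}

# Prints one status line per prerequisite; never fails hard.
preflight() {
  local py_version
  if command -v python3 >/dev/null 2>&1; then
    py_version=$(python3 -c 'import sys; print("%d.%d" % sys.version_info[:2])')
    if version_ge "$py_version" 3.11; then
      echo "ok: Python $py_version"
    else
      echo "missing: Python >= 3.11 (found $py_version)"
    fi
  else
    echo "missing: python3 not on PATH"
  fi
  if command -v ncu >/dev/null 2>&1; then
    echo "ok: ncu on PATH"
  else
    echo "note: ncu not on PATH (only needed for live kernel profiling)"
  fi
}
```

Source the file and run preflight. A missing ncu is reported as a note rather than an error because static analysis and CSV analysis do not need it.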

01 Install

Install from PyPI. This registers both frx and fournex as executables.

bash

pip install fournex

Verify the install:

bash

frx doctor

02 Choose your evidence path

Use training telemetry when the bottleneck may live above an individual kernel. Use NCU when you need hardware-counter evidence and kernel-level attribution.

Fournex's value is the pipeline that turns telemetry, NCU counters, PTX, and source into reconciled explanations and ranked, validated recommendations. It is the layer above profilers, not a replacement for Nsight.

PyTorch training telemetry

bash

# 1. Collect a named training run
frx collect --name my-run -- python train.py

# 2. Analyze the training run
frx analyze runs/my-run

# 3. Create the LLM-ready optimization brief
frx explain runs/my-run --out ./brief/
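
The three steps above can be chained so that a failed collection never produces a stale brief. This is a minimal sketch: run_training_brief is a hypothetical wrapper name, and it assumes, as the commands above suggest, that frx collect --name NAME writes to runs/NAME.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper around the three commands shown above.
# Usage: run_training_brief my-run python train.py
run_training_brief() {
  local name=$1; shift
  # Stop at the first failing step instead of analyzing old data.
  frx collect --name "$name" -- "$@" || return 1
  frx analyze "runs/$name" || return 1
  frx explain "runs/$name" --out ./brief/
}
```

The function only reuses the flags shown in this guide; it adds no options of its own.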

CUDA kernel analysis with NCU

NCU reads hardware counters. Fournex analyzes the CSV and turns those counters into bottlenecks, evidence, ranked kernels, and fixes. These are separate steps because NCU often needs admin or root privileges to access GPU performance counters.

bash

# 1. Print the exact NCU command Fournex expects
frx ncu-command -- ./my_binary

# 2. Run that NCU command with the privileges your system requires
sudo $(frx ncu-command -- ./my_binary) > profile.csv

# 3. Analyze the CSV with Fournex
frx profile --ncu profile.csv --gpu-model H100

# Optional: override peak FLOPs / memory bandwidth for roofline and MFU math
frx profile --ncu profile.csv --gpu-model H100 --arch-profile arch.yaml

# Optional: inspect CUDA source for the GPU you deploy on
frx analyze kernel.cu --gpu-model H100

# Optional: no GPU available, inspect PTX statically
frx profile --ptx kernel.ptx --gpu-model H100

Use frx ncu-command first

If you run NCU manually with the wrong metrics, Fournex can only produce partial results and DATA NOTES warnings. frx ncu-command prints the metric set Fournex expects.

Windows and WSL2 caveat

Consumer NVIDIA GPUs in default Windows WDDM mode do not expose hardware performance counters to NCU running inside WSL2. Run NCU from a native Windows terminal with Nsight Compute installed, or collect on a true Linux machine.

The CSV is portable

Collect profile.csv on a server, cluster, or teammate's machine. Then analyze it on your laptop with frx profile --ncu profile.csv. No GPU is required at analysis time.
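
Because collection and analysis are separate, the collection half can live in a small script on the GPU machine. The sketch below wraps the same sudo $(frx ncu-command ...) pattern used earlier; collect_ncu_csv is a hypothetical helper name, and the empty-file check is an added safeguard, not documented frx behavior.

```bash
#!/usr/bin/env bash
# Hypothetical helper for the GPU machine.
# Usage: collect_ncu_csv profile.csv ./my_binary [args...]
collect_ncu_csv() {
  local out=$1; shift
  # Word splitting of the printed command is intentional, as in the one-liner above.
  sudo $(frx ncu-command -- "$@") > "$out" || return 1
  # Refuse to hand over a file that received no output.
  if [ ! -s "$out" ]; then
    echo "collect_ncu_csv: $out is empty" >&2
    return 1
  fi
}
```

After copying profile.csv to your laptop with whatever transfer tool you normally use, run frx profile --ncu profile.csv --gpu-model H100 there.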

Training run directories are explainable

frx explain runs/my-run accepts existing run directories with derived/summary.json, raw/trace.jsonl, or profiler/*.json and writes the same three brief files as the CSV path.

Pass the target GPU model

--gpu-model changes thresholds, not just labels. Without it, Fournex uses Ampere defaults. Pass H100, L4, RTX4090, T4, or the SM family when analyzing code for that GPU.
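
If you are unsure which value to pass, a rough sketch like the following can map the name reported by nvidia-smi on the target machine to one of the model strings listed above. guess_gpu_model and detect_gpu_model are hypothetical helpers; the name patterns are examples that can vary by driver, and only the four models named in this guide are covered.

```bash
#!/usr/bin/env bash
# Hypothetical helpers; frx does not ship these.

# Map a GPU product name to a --gpu-model value named in this guide.
guess_gpu_model() {
  case "$1" in
    *H100*)        echo H100 ;;
    *"RTX 4090"*)  echo RTX4090 ;;
    *" T4")        echo T4 ;;
    *" L4")        echo L4 ;;
    *)             return 1 ;;   # unknown: choose the model or SM family by hand
  esac
}

# Ask the driver for the first GPU's name, then map it.
detect_gpu_model() {
  command -v nvidia-smi >/dev/null 2>&1 || return 1
  local name
  name=$(nvidia-smi --query-gpu=name --format=csv,noheader | head -n 1) || return 1
  guess_gpu_model "$name"
}
```

Run detect_gpu_model on the machine you deploy on, not necessarily the one where you analyze, since the thresholds should match the deployment GPU.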

Override hardware specs when needed

--arch-profile <path.yaml> is available on analyze, profile, compare, and explain. Use it for custom hardware, pre-production SKUs, or correcting peak TFLOP/s and memory-bandwidth assumptions used by roofline and MFU calculations.

Bad paths fail cleanly

A typo in --arch-profile or --config now prints frx: <message> and exits with code 1 instead of dumping a Python traceback.

NCU 2026.x CSV works

Fournex handles older tall CSVs and newer wide CSVs automatically, including the final done line that NCU writes after the report.

The report sections you'll see:

VERDICT: Primary bottleneck and confidence.

MEASURED METRICS: Every metric annotated with [!!], [ !], [ok], or [--].

BOTTLENECKS DETECTED: Ranked bottlenecks with numeric scores.

FRAMEWORK ABSTRACTION TAX: Runtime overhead score when profiler telemetry is available.

RECOMMENDATIONS: Ordered fixes with why, actions, and validation steps.

NEXT STEPS: The exact re-run command to validate the top fix.

Recommendation cards include a validation plan with the current measured value when that metric was present in the NCU CSV:

text

Validate:
  ncu --metrics smsp__pipe_tensor_op_hmma_cycles_active.avg.pct_of_peak_sustained_active --csv ./your_app
  --> Tensor core utilization %: was 7.0; rises above 50% (target: 50.0)

If the CSV did not include that metric, Fournex omits the was X hint instead of inventing a value.
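
For quick scripting, the measured and target values can be pulled out of a validation line like the one above. This is a sketch that assumes the exact "was X; ... (target: Y)" wording shown in the sample; parse_validation_hint is a hypothetical helper, and frx_evidence.json is likely the sturdier source for tools.

```bash
#!/usr/bin/env bash
# Hypothetical helper: reads validation lines on stdin and prints
# "<was> <target>" for each line that carries both values.
# Lines without a "was X" hint produce no output.
parse_validation_hint() {
  sed -n -E 's/.*was ([0-9.]+);.*\(target: ([0-9.]+)\).*/\1 \2/p'
}
```

Piping the sample line through parse_validation_hint prints 7.0 50.0.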

If profiler telemetry is available, the report may also include Framework Abstraction Tax. This is a runtime overhead signal, not an NCU-only hardware counter:

text

FRAMEWORK ABSTRACTION TAX
  Score              : 74/100 (high)
  Contributors:
   - Kernel launch fragmentation
   - Missing graph capture (opportunity) (inferred)
   - Unfused elementwise operations (opportunity) (inferred)

inferred means the change would likely help; it does not mean Fournex proved graph capture or fusion is disabled.

Minimal arch.yaml for custom roofline/MFU specs:

yaml

profiles:
  h100:
    peak_fp32_tflops: 60.0
    peak_memory_bw_gbps: 3900.0
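
To see why these two numbers matter, here is the textbook roofline and MFU arithmetic as a small sketch. It assumes peak_memory_bw_gbps means gigabytes per second; Fournex's internal calculation may differ, and ridge_point and mfu_percent are hypothetical helper names.

```bash
#!/usr/bin/env bash
# Textbook roofline/MFU arithmetic, not Fournex's internal code.

# ridge_point PEAK_TFLOPS PEAK_BW_GBPS
# Arithmetic intensity (FLOP/byte) above which a kernel is compute-bound.
ridge_point() {
  awk -v tflops="$1" -v gbps="$2" \
    'BEGIN { printf "%.2f\n", (tflops * 1e12) / (gbps * 1e9) }'
}

# mfu_percent ACHIEVED_TFLOPS PEAK_TFLOPS
# Share of peak throughput actually achieved, in percent.
mfu_percent() {
  awk -v got="$1" -v peak="$2" 'BEGIN { printf "%.1f\n", 100 * got / peak }'
}
```

With the values above, ridge_point 60.0 3900.0 prints 15.38, so a kernel doing fewer than about 15 FLOPs per byte moved would sit on the memory-bound side of the roofline. Changing either peak in arch.yaml moves that boundary, which is why wrong specs can flip a verdict.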

03 After the first report

Once Fournex names the bottleneck and the first fix, use the same evidence loop to compare the implementation, benchmark the result, and create a text brief your LLM can use to make the code changes.

bash

# Compare source and evidence before trusting the change
frx compare baseline.cu optimized.cu

# Benchmark two CUDA kernels side by side
frx bench bad.cu good.cu

# Run a built-in reproducible proof case
frx case-study run uncoalesced_global_loads --emit-readme

# Create a shareable optimization brief from an NCU CSV
frx explain profile.csv --out ./brief/

# Or create the brief from a collected PyTorch training run
frx explain runs/my-run --out ./brief/

# Give this generated file to your LLM
# frx_llm_prompt.txt

Compare before benchmark

frx compare tells you what improved, what regressed, and what evidence is still missing before you rely on a speedup number.

Bench with synchronization

frx bench uses wall-clock timing, so your benchmark binary should call cudaDeviceSynchronize() before exit.

Reproduce a known case

frx case-study runs a bad-to-good kernel pair, validates the expected bottleneck resolution, checks for regressions, and can emit a README-backed proof bundle.

Use explain for handoff

frx explain accepts an NCU CSV or training run directory and writes frx_summary.txt for humans, frx_llm_prompt.txt for your LLM to edit the code, and frx_evidence.json for tools or dashboards.
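
Before handing the brief to an LLM or a dashboard, a script can confirm that all three files were written. A minimal sketch, where check_brief is a hypothetical helper and ./brief matches the --out directory used in the examples above:

```bash
#!/usr/bin/env bash
# Hypothetical helper: verify the three brief files exist and are non-empty.
# Usage: check_brief [dir]   (defaults to ./brief)
check_brief() {
  local dir=${1:-./brief} missing=0 f
  for f in frx_summary.txt frx_llm_prompt.txt frx_evidence.json; do
    if [ ! -s "$dir/$f" ]; then
      echo "missing or empty: $dir/$f" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

For example, frx explain profile.csv --out ./brief/ && check_brief ./brief succeeds only when both the command and the file check pass.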

All commands

The full CLI surface. Use what your workflow needs.

frx profile: Profile a workload

Analyze an Nsight Compute CSV, PTX file, or collected evidence and get a bottleneck verdict with ranked fixes.

# Get the exact NCU command Fournex expects
frx ncu-command -- ./my_binary

# Collect with NCU, then analyze the portable CSV
sudo $(frx ncu-command -- ./my_binary) > profile.csv
frx profile --ncu profile.csv --gpu-model H100

# Custom hardware specs for roofline/MFU math
frx profile --ncu profile.csv --gpu-model H100 --arch-profile arch.yaml

# Analyze source for the target GPU without running it
frx analyze kernel.cu --gpu-model H100

# No GPU available - static PTX analysis
frx profile --ptx kernel.ptx --gpu-model H100

frx collect: Collect a run bundle

Sample GPU metrics, import profiler artifacts, and write a self-contained bundle you can analyze locally or upload.

frx collect --name my-run -- python train.py

frx analyze: Analyze a bundle

Run the full analysis pipeline against a collected run bundle, or analyze a CUDA source file for a target GPU.

frx analyze runs/my-run
frx analyze kernel.cu --gpu-model H100 --arch-profile arch.yaml

# Or drag runs/my-run.zip onto fournex.com/analyze

frx compare: Compare a fix

Diff a baseline and optimized CUDA kernel across source, PTX, and NCU evidence.

frx compare baseline.cu optimized.cu

# Add stronger evidence when available
frx compare baseline.cu optimized.cu --with-ptx --with-ncu
frx compare baseline.cu optimized.cu --ncu-a baseline.csv --ncu-b optimized.csv

# Optional: override architecture specs used by roofline/MFU scoring
frx compare baseline.cu optimized.cu --arch-profile arch.yaml

frx bench: Benchmark kernels

Compile and wall-clock benchmark two .cu kernels side by side, with warmup discarded.

frx bench bad.cu good.cu

# Optional: add NCU bottleneck diffs
frx bench bad.cu good.cu --with-ncu

frx case-study: Run a proof bundle

Run a built-in bad-to-good kernel pair, validate the detected bottleneck is resolved, and emit a shareable proof bundle.

frx case-study list
frx case-study run uncoalesced_global_loads --emit-readme

frx explain: Create an optimization brief

Generate a human summary, LLM-ready prompt, and machine-readable evidence file from an NCU CSV or training run directory.

# NCU kernel path
frx explain profile.csv --out ./brief/
frx explain profile.csv --arch-profile arch.yaml --out ./brief/

# PyTorch training telemetry path
frx explain runs/my-run --out ./brief/

# Give frx_llm_prompt.txt to your LLM to make the code changes

frx tune: Auto-tune configs

Let the tune runner sweep compiler and runtime configs and find the fastest safe candidate.

frx tune --safe --max-trials 12 -- python train.py

frx doctor: Check your environment

Verify that ncu, CUDA drivers, and the CLI itself are wired up correctly.

frx doctor