CLI Quick Start

From zero to profiling in two minutes.

Install frx, collect Nsight Compute (NCU) data with the right counters, then analyze the CSV anywhere Python runs. NCU collects the data; Fournex explains what to fix.

Prerequisites

Python 3.11+ is required. PyTorch is optional; the CLI works without it for static analysis. For live kernel profiling you need ncu (Nsight Compute) on the PATH of the machine where you collect data.

  • Python >= 3.11
  • NVIDIA GPU + CUDA driver (for live profiling)
  • Nsight Compute - ncu on PATH (for live profiling)
  • PyTorch (only when the SDK instruments a live training run)
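
Before frx is installed, the prerequisites above can be checked by hand. The following is a minimal sketch, not part of Fournex: the function names have_cmd, python_version_ok, and preflight are made up for illustration, and the presence of nvidia-smi is only used as a rough hint that an NVIDIA driver is installed.

```bash
# Succeeds if the named command is on PATH.
have_cmd() {
  command -v "$1" >/dev/null 2>&1
}

# Succeeds if a version string such as "3.12" or "3.11.4" is at least 3.11.
python_version_ok() {
  pv_major="${1%%.*}"
  pv_rest="${1#*.}"
  pv_minor="${pv_rest%%.*}"
  [ "$pv_major" -gt 3 ] || { [ "$pv_major" -eq 3 ] && [ "$pv_minor" -ge 11 ]; }
}

# Prints one line per prerequisite (hypothetical helper, not an frx command).
preflight() {
  if have_cmd python3; then
    pf_ver="$(python3 -c 'import sys; print("%d.%d" % sys.version_info[:2])')"
    if python_version_ok "$pf_ver"; then
      echo "python $pf_ver: ok"
    else
      echo "python $pf_ver: too old, need >= 3.11"
    fi
  else
    echo "python3: not found"
  fi
  # The two checks below only matter on the machine that collects data.
  if have_cmd ncu; then
    echo "ncu: found"
  else
    echo "ncu: not on PATH (live profiling unavailable here)"
  fi
  if have_cmd nvidia-smi; then
    echo "nvidia-smi: found"
  else
    echo "nvidia-smi: not found (no NVIDIA driver visible)"
  fi
}
```

Running preflight prints one status line per check. Once frx is installed, frx doctor is the supported way to verify the environment.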

01 Install

Install from PyPI. This registers both frx and fournex as executables.

bash
pip install fournex

Verify the install:

bash
frx doctor

02 Collect with NCU, analyze with Fournex

NCU reads hardware counters. Fournex analyzes the CSV and turns those counters into bottlenecks, evidence, and ranked fixes. These are two steps because NCU often needs admin or root privileges to access GPU performance counters. Fournex cannot safely hide that privilege boundary, but it can give you the exact NCU command to run.

bash
# 1. Print the exact NCU command Fournex expects
frx ncu-command -- ./my_binary

# 2. Run that NCU command with the privileges your system requires
sudo $(frx ncu-command -- ./my_binary) > profile.csv

# 3. Analyze the CSV with Fournex
frx profile --ncu profile.csv --gpu-model H100

# Optional: override peak FLOPs / memory bandwidth for roofline and MFU math
frx profile --ncu profile.csv --gpu-model H100 --arch-profile arch.yaml

# Optional: inspect CUDA source for the GPU you deploy on
frx analyze kernel.cu --gpu-model H100

# Optional: no GPU available, inspect PTX statically
frx profile --ptx kernel.ptx --gpu-model H100
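
Steps 1 to 3 above can be wrapped in one shell function that stops early when collection fails. This is a sketch under assumptions: collect_and_analyze and the FRX and SUDO variables are illustrative names, not Fournex features, and the command printed by frx ncu-command is assumed to survive plain word splitting (no arguments containing spaces).

```bash
FRX="${FRX:-frx}"      # override to point at a different frx executable
SUDO="${SUDO-sudo}"    # set SUDO="" where counters are readable without root

# Usage: collect_and_analyze ./my_binary H100 [profile.csv]
collect_and_analyze() {
  ca_binary="$1"
  ca_gpu="$2"
  ca_csv="${3:-profile.csv}"

  # Step 1: ask Fournex for the NCU command line it expects.
  ca_ncu_cmd="$("$FRX" ncu-command -- "$ca_binary")" || return 1

  # Step 2: run that command with elevated privileges.
  # The unquoted expansions are split into words on purpose.
  $SUDO $ca_ncu_cmd > "$ca_csv" || return 1

  # Guard: an empty CSV means collection produced no data.
  if [ ! -s "$ca_csv" ]; then
    echo "collect_and_analyze: $ca_csv is empty, skipping analysis" >&2
    return 1
  fi

  # Step 3: analyze the CSV for the target GPU.
  "$FRX" profile --ncu "$ca_csv" --gpu-model "$ca_gpu"
}
```

The redirect to the CSV is performed by your own shell, not by sudo, so the file ends up owned by your user and can be analyzed afterwards without elevated privileges.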

Use frx ncu-command first

If you run NCU manually with the wrong metrics, Fournex can only produce partial results and DATA NOTES warnings. frx ncu-command prints the metric set Fournex expects.

Windows and WSL2 caveat

Consumer NVIDIA GPUs in default Windows WDDM mode do not expose GPU hardware counters to NCU running inside WSL2. Run NCU from a native Windows terminal with Nsight Compute installed, or collect on a native Linux machine.
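
A collection script can warn early when it is started inside WSL. The sketch below relies on a common heuristic: WSL kernels report the string "microsoft" in /proc/version. The function names is_wsl and warn_if_wsl are illustrative, and the check says nothing about which GPU or driver mode is in use.

```bash
# Succeeds if the kernel version file (default /proc/version) looks like WSL.
is_wsl() {
  grep -qi 'microsoft' "${1:-/proc/version}" 2>/dev/null
}

# Prints a warning on stderr when running under WSL; silent otherwise.
warn_if_wsl() {
  if is_wsl "$@"; then
    echo "WSL detected: collect with NCU from native Windows or native Linux" >&2
  fi
}
```

Calling warn_if_wsl before the NCU step costs nothing on native Linux, where it stays silent.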

The CSV is portable

Collect profile.csv on a server, cluster, or teammate's machine. Then analyze it on your laptop with frx profile --ncu profile.csv. No GPU is required at analysis time.

Pass the target GPU model

--gpu-model changes thresholds, not just labels. Without it, Fournex uses Ampere defaults. Pass H100, L4, RTX4090, T4, or the SM family when analyzing code for that GPU.
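
Because the thresholds depend on the model, it can be useful to analyze one source file once per deployment target. The loop below is a sketch: analyze_for_targets and the FRX variable are illustrative names, and it assumes frx analyze prints its report to standard output.

```bash
FRX="${FRX:-frx}"   # override to point at a different frx executable

# Usage: analyze_for_targets kernel.cu H100 L4 T4
analyze_for_targets() {
  at_source="$1"
  shift
  for at_model in "$@"; do
    echo "== $at_source on $at_model =="
    # Stop at the first target that fails to analyze.
    "$FRX" analyze "$at_source" --gpu-model "$at_model" || return 1
  done
}
```

Each report is preceded by a header line naming the target, so the combined output can be scanned or diffed per GPU.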

Override hardware specs when needed

--arch-profile <path.yaml> is available on analyze, profile, compare, and explain. Use it for custom hardware, pre-production SKUs, or correcting peak TFLOP/s and memory-bandwidth assumptions used by roofline and MFU calculations.

Bad paths fail cleanly

A mistyped path passed to --arch-profile or --config prints frx: <message> and exits with code 1 instead of dumping a Python traceback.
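
In a script or CI job, that exit code can be checked directly. This is a sketch with illustrative names (profile_or_explain_failure, FRX); note that failures other than a bad path may also exit with code 1, so the hint it prints is only a first guess.

```bash
FRX="${FRX:-frx}"   # override to point at a different frx executable

# Runs "frx profile" with the given arguments and keeps its exit status.
profile_or_explain_failure() {
  "$FRX" profile "$@" && return 0
  pf_status=$?
  if [ "$pf_status" -eq 1 ]; then
    echo "frx exited with code 1: check the --arch-profile and --config paths" >&2
  else
    echo "frx exited with unexpected code $pf_status" >&2
  fi
  return "$pf_status"
}
```

For example, profile_or_explain_failure --ncu profile.csv --arch-profile arch.yaml returns the same status as frx itself, so a CI step fails exactly when the analysis does.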

NCU 2026.x CSV works

Fournex handles older tall CSVs and newer wide CSVs automatically, including the final done line that NCU writes after the report.

The report sections you'll see:

VERDICT: Primary bottleneck and confidence.
MEASURED METRICS: Every metric annotated with [!!], [ !], [ok], or [--].
BOTTLENECKS DETECTED: Ranked bottlenecks with numeric scores.
FRAMEWORK ABSTRACTION TAX: Runtime overhead score when profiler telemetry is available.
RECOMMENDATIONS: Ordered fixes with why, actions, and validation steps.
NEXT STEPS: The exact re-run command to validate the top fix.

Recommendation cards include a validation plan with the current measured value when that metric was present in the NCU CSV:

text
Validate:
  ncu --metrics smsp__pipe_tensor_op_hmma_cycles_active.avg.pct_of_peak_sustained_active --csv ./your_app
  --> Tensor core utilization %: was 7.0; rises above 50% (target: 50.0)

If the CSV did not include that metric, Fournex omits the was X hint instead of inventing a value.

If profiler telemetry is available, the report may also include Framework Abstraction Tax. This is a runtime overhead signal, not an NCU-only hardware counter:

text
FRAMEWORK ABSTRACTION TAX
  Score              : 74/100 (high)
  Contributors:
   - Kernel launch fragmentation
   - Missing graph capture (opportunity) (inferred)
   - Unfused elementwise operations (opportunity) (inferred)

inferred means the change would likely help; it does not mean Fournex proved graph capture or fusion is disabled.

Minimal arch.yaml for custom roofline/MFU specs:

yaml
profiles:
  h100:
    peak_fp32_tflops: 60.0
    peak_memory_bw_gbps: 3900.0

03 After the first report

Once Fournex names the bottleneck and the first fix, use the same evidence loop to compare the implementation, benchmark the result, and create a text brief your LLM can use to make the code changes.

bash
# Compare source and evidence before trusting the change
frx compare baseline.cu optimized.cu

# Benchmark two CUDA kernels side by side
frx bench bad.cu good.cu

# Create a shareable optimization brief from profiler output
frx explain --ncu profile.csv

# Give this generated file to your LLM
# frx_llm_prompt.txt

Compare before benchmark

frx compare tells you what improved, what regressed, and what evidence is still missing before you rely on a speedup number.

Bench with synchronization

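frx bench uses wall-clock timing, so your benchmark binary should call cudaDeviceSynchronize() before exit. CUDA kernel launches are asynchronous; without the synchronization, the host process can reach its exit before the GPU work being timed has finished.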

Use explain for handoff

frx explain writes frx_summary.txt for humans, frx_llm_prompt.txt for your LLM to edit the code, and frx_evidence.json for tools or dashboards.
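
Before handing the brief to someone else, a script can confirm that all three files exist. The helper below is a sketch: missing_explain_outputs is an illustrative name, and it assumes frx explain writes its files into the directory you pass (the current directory by default), which the text above does not state.

```bash
# Prints the name of each expected frx explain output that is missing or empty.
# Usage: missing_explain_outputs [directory]
missing_explain_outputs() {
  me_dir="${1:-.}"
  for me_name in frx_summary.txt frx_llm_prompt.txt frx_evidence.json; do
    [ -s "$me_dir/$me_name" ] || echo "$me_name"
  done
}
```

An empty result means all three files are present and non-empty.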

All commands

The full CLI surface. Use what your workflow needs.

frx profile: Profile a workload

Analyze an Nsight Compute CSV, PTX file, or collected evidence and get a bottleneck verdict with ranked fixes.

# Get the exact NCU command Fournex expects
frx ncu-command -- ./my_binary

# Collect with NCU, then analyze the portable CSV
sudo $(frx ncu-command -- ./my_binary) > profile.csv
frx profile --ncu profile.csv --gpu-model H100

# Custom hardware specs for roofline/MFU math
frx profile --ncu profile.csv --gpu-model H100 --arch-profile arch.yaml

# Analyze source for the target GPU without running it
frx analyze kernel.cu --gpu-model H100

# No GPU available - static PTX analysis
frx profile --ptx kernel.ptx --gpu-model H100

frx collect: Collect a run bundle

Sample GPU metrics, import profiler artifacts, and write a self-contained bundle you can analyze locally or upload.

frx collect -- python train.py

frx analyze: Analyze a bundle

Run the full analysis pipeline against a collected run bundle, or against a CUDA source file for a target GPU.

frx analyze runs/run-<id>
frx analyze kernel.cu --gpu-model H100 --arch-profile arch.yaml

# Or drag runs/run-<id>.zip onto fournex.com/analyze

frx compare: Compare a fix

Diff a baseline and optimized CUDA kernel across source, PTX, and NCU evidence.

frx compare baseline.cu optimized.cu

# Add stronger evidence when available
frx compare baseline.cu optimized.cu --with-ptx --with-ncu
frx compare baseline.cu optimized.cu --ncu-a baseline.csv --ncu-b optimized.csv

# Optional: override architecture specs used by roofline/MFU scoring
frx compare baseline.cu optimized.cu --arch-profile arch.yaml

frx bench: Benchmark kernels

Compile and wall-clock benchmark two .cu kernels side by side, with warmup discarded.

frx bench bad.cu good.cu

# Optional: add NCU bottleneck diffs
frx bench bad.cu good.cu --with-ncu

frx explain: Create an optimization brief

Generate a human summary, LLM-ready prompt, and machine-readable evidence file from profiler output.

frx explain --ncu profile.csv
frx explain --ncu profile.csv --arch-profile arch.yaml

# Give frx_llm_prompt.txt to your LLM to make the code changes

frx tune: Auto-tune configs

Let autopilot sweep compiler and runtime configs and find the fastest safe candidate.

frx tune --safe --max-trials 12 -- python train.py

frx doctor: Check your environment

Verify that ncu, CUDA drivers, and the CLI itself are wired up correctly.

frx doctor