FRX PROFILE READY · command: frx profile · evidence: NCU + PTX · output: next fix
One-command profiler

Find GPU mistakes.
Fix them with proof.

Run frx profile on a live workload, an existing NCU CSV, or a PTX file. Fournex names the bottleneck, shows the measured threshold it crossed, and ranks the fix to try first.

To first report: 1 cmd
Evidence sources: NCU + PTX
Time to first fix: <1 min

Works with

PyTorch · JAX · TensorFlow · NVIDIA
frx profile -- python train.py
ANALYZED · PRIMARY: uncoalesced · SECTORS/REQ: 9.7 · CONFIDENCE: high
Measured evidence · threshold: high > 4 · load → L1 → L2 → SM issue
Recommended next actions:
  Top fix: Coalesce loads
  Next: Reduce working set
  Check: Tune occupancy
  Ready: Validate rerun
[Chart: CUDA streams hotspot, 12.4 ms · ENG (gemm_fwd, attn_fwd, ln_fwd, gemm_bwd, attn_bwd), MEM (H2D, ckpt, D2H, adam), ELW (relu, add, gelu_bwd, bias_bwd) · 0–12 ms, Fwd/Bwd/Opt phases]
[Chart: SM occupancy × time, SM0–SM7, peak 98% · idle/active/peak across Idle/Forward/Backward/Opt]
Fournex report · why, actions, validation steps, caveats

Trusted by teams shipping the world's most compute-intensive workloads

AI21 Labs
Perplexity
character.ai
Midjourney
Cohere
Anyscale
What we detect

Complex GPU mistakes. Named, ranked, and backed by evidence.

Fournex turns NCU counters, PTX structure, and runtime traces into the mistake that matters: the threshold crossed, the kernel it came from, and the fix to try first.

9.7 sectors/request, high > 4

Uncoalesced global loads

Catch memory access patterns that spray global-load transactions. Get coalescing actions and a re-profile command to confirm the fix.

View sample trace
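The sectors-per-request rule described above can be sketched as a small check. This is an illustrative sketch, not Fournex's implementation; the raw counter values are made up, and in NCU the underlying metric is in the `l1tex__average_t_sectors_per_request` family for global loads.

```python
# Illustrative sketch of the sectors-per-request rule, not Fournex's
# implementation. Counter values are hypothetical; NCU reports the ratio
# directly via its l1tex__average_t_sectors_per_request metrics.

SECTORS_PER_REQ_HIGH = 4.0  # a coalesced 32-bit warp load touches ~4 32-byte sectors

def classify_coalescing(sectors: int, requests: int) -> tuple[float, str]:
    """Return (sectors/request, severity) for global loads."""
    ratio = sectors / requests
    severity = "high" if ratio > SECTORS_PER_REQ_HIGH else "ok"
    return ratio, severity

ratio, severity = classify_coalescing(sectors=9_700, requests=1_000)
print(f"sectors/request {ratio:.1f}, {severity}")  # sectors/request 9.7, high
```

A ratio well above 4 means each warp-level load request is spraying many memory sectors, which is exactly the pattern a coalescing fix collapses.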
L1 hit rate below 40%

L1 cache thrashing

Separate L1-only locality problems from broader memory pressure. Prioritize shared-memory tiling or working-set reduction based on evidence.

View sample trace
L2 hit rate below 50%

L2 cache thrashing

Identify L2-only pressure without confusing it with L1 misses. Fournex points engineers toward reducing the active working set first.

View sample trace
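The L1/L2 separation the two cards describe amounts to a small rule table. A minimal sketch, using the thresholds quoted above (L1 healthy at 40%, L2 at 50%); the verdict names are assumptions, not Fournex's actual rule identifiers.

```python
# Hedged sketch of separating L1-only from L2-only pressure.
# Thresholds are the ones quoted on this page; verdict names are illustrative.

def memory_verdict(l1_hit: float, l2_hit: float) -> str:
    l1_bad = l1_hit < 0.40
    l2_bad = l2_hit < 0.50
    if l1_bad and l2_bad:
        return "bandwidth_pressure"  # both levels miss: broader memory pressure
    if l1_bad:
        return "l1_thrashing"        # L1-only locality problem: consider shared-memory tiling
    if l2_bad:
        return "l2_thrashing"        # L2-only pressure: shrink the active working set first
    return "healthy"

print(memory_verdict(l1_hit=0.31, l2_hit=0.71))  # l1_thrashing
```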
Low occupancy, register cap

Occupancy limited by registers

Use measured occupancy and launch-resource data to spot register pressure that blocks residency, then reduce live ranges or split kernels.

View sample trace
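As a rough model of how register pressure caps residency: the register file is shared by all resident warps, so per-thread register count directly bounds how many warps fit. The sizes below (a 64K-register file, 48 warp slots per SM) are architecture-dependent assumptions, not universal constants.

```python
# Back-of-envelope model of register-limited occupancy. The per-SM limits
# below are assumptions; they vary by NVIDIA architecture.

REGS_PER_SM = 65_536    # register file size per SM (assumed)
MAX_WARPS_PER_SM = 48   # warp slots per SM (assumed)

def warps_limited_by_registers(regs_per_thread: int) -> int:
    """How many warps fit in the register file at this per-thread count."""
    regs_per_warp = regs_per_thread * 32  # 32 threads per warp
    return min(MAX_WARPS_PER_SM, REGS_PER_SM // regs_per_warp)

def occupancy(regs_per_thread: int) -> float:
    return warps_limited_by_registers(regs_per_thread) / MAX_WARPS_PER_SM

# At 128 regs/thread only 16 warps fit; halving register use doubles residency.
print(occupancy(128), occupancy(64))
```

This is why "reduce live ranges or split kernels" is the fix: both lower regs_per_thread, which raises the warp count the register file admits.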
Too few eligible warps

Low scheduler utilization

Surface kernels that keep schedulers waiting even when occupancy looks acceptable. Separate ILP, block-size, and warp-eligibility issues.

View sample trace
GPU idle, CPU or copy bound

Host and input stalls

Connect runtime traces to practical fixes for dataloaders, pinned memory, transfer pressure, and hidden synchronization points.

View sample trace
Evidence to action

From profiler output to ranked fix list. Under a minute.

Every finding comes with the exact signal that triggered it, the threshold it crossed, confidence level, actions to take, validation steps, and caveats before you change code.

  • Measured metrics with [!!], [ !], [ok], and threshold hints
  • NCU, PTX, and runtime trace signals routed to one report
  • Recommendations include why, actions, validation, and risks
  • Next steps include the exact re-run command
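The marker scheme in the first bullet could be encoded as a threshold table like the sketch below. Thresholds mirror the ones quoted elsewhere on this page; the rule table itself is an illustration, not the shipped rule set.

```python
# Illustrative sketch of the [!!]/[ !]/[ok] marker scheme; thresholds are
# the ones quoted on this page, and the table is an assumption.

RULES = {
    # metric: (threshold, direction, marker shown when the threshold is crossed)
    "sectors_per_request": (4.00, "above", "[!!]"),
    "l1_hit_rate":         (0.40, "below", "[!!]"),
    "l2_hit_rate":         (0.50, "below", "[ !]"),
}

def mark(metric: str, value: float) -> str:
    threshold, direction, marker = RULES[metric]
    crossed = value > threshold if direction == "above" else value < threshold
    return marker if crossed else "[ok]"

for metric, value in [("sectors_per_request", 9.7),
                      ("l1_hit_rate", 0.31),
                      ("l2_hit_rate", 0.71)]:
    print(f"{mark(metric, value)} {metric} = {value}")
```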
frx-profile / attention_kernel
Analyzed · Primary: uncoalesced · L1 hit rate: 31% · L2 hit rate: 71%

#   Fix / Title                                        Effort · Confidence   Evidence     Risk
01  Coalesce global loads in attention_kernel          Low · High            9.7 -> 2.1   Low
02  Reduce active working set before touching tiling   Low · High            L2-only      Low
03  Lower register pressure to recover occupancy       Medium · Medium       +18 pts      Medium
04  Remove host sync points in the training loop       Low · High            low risk     Low

Profile report with why, actions, validation, and caveats
Product evolution

Profiler today. Validation and automation next.

We don't ask engineers to trust blind automation. The product starts with evidence, then moves through ranked recommendations, benchmark validation, and guarded automation.

01 · Phase 1 · Shipping now

Profiler with opinions

  • Low-overhead trace collection
  • Deterministic bottleneck classifier
  • Normalized performance IR
02 · Phase 2 · Shipping now

Ranked recommendations

  • Map bottlenecks to concrete fixes
  • Score by impact, effort, and risk
  • Explainable, repeatable output
03 · Phase 3 · Early access

Experiment runner

  • Safe config sweeps
  • Before/after benchmark validation
  • Regression guardrails
04 · Phase 4 · On the roadmap

Policy-driven autopilot

  • Auto-apply within your guardrails
  • Continuous adaptation
  • Learned optimization policies
Capabilities

Built for engineers. Readable in the terminal.

Start with a concrete diagnosis, not a wall of counters. Keep the full evidence trail when you need to defend the change in review.

One-command reports

Run live NCU, analyze an existing CSV, or inspect PTX without a GPU.

Evidence-backed fixes

Each recommendation names the metric, threshold, rule, and validation step.

Memory diagnostics

Separate L1 thrashing, L2 thrashing, bandwidth pressure, and uncoalesced loads.

Occupancy diagnosis

Tie low occupancy to registers, shared memory, block size, or scheduler pressure.

Before/after validation

Compare source, PTX, and NCU evidence side by side before trusting a change.
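A before/after gate of this shape is one way to encode "trusting a change": re-run the same profile and require the triggering metric to both improve and drop back under its threshold. This is a sketch; the metric name and threshold are assumptions taken from the examples on this page.

```python
# Hedged sketch of a before/after validation gate, not Fournex's logic.
# Metric name and threshold are illustrative.

def fix_validated(before: dict, after: dict, metric: str, threshold: float) -> bool:
    """True only if the metric improved AND no longer crosses the threshold."""
    return after[metric] < before[metric] and after[metric] <= threshold

before = {"sectors_per_request": 9.7}
after = {"sectors_per_request": 2.1}
print(fix_validated(before, after, "sectors_per_request", threshold=4.0))  # True
```

Requiring both conditions avoids declaring victory on a change that merely nudged the metric while leaving it in the unhealthy range.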

CI-native

Use CSV and JSON modes in automation. No dashboard required for the first answer.

Simple workflow

Run one command. Read the evidence. Fix the right thing.

Use live NCU profiling when you are on the GPU box, pass an existing CSV when the report came from CI, or inspect PTX when you only have compiled output.

Install: pip install fournex
Live profile: frx profile -- python train.py
Existing NCU: frx profile --ncu ncu_report.csv
Static PTX: frx profile --ptx kernel.ptx
frx - profile.sh
$ frx profile --preset memory -- python train.py
VERDICT: uncoalesced_access, confidence=high
MEASURED: global load sectors/request 9.7 [!!] high > 4
[!!] L1 hit rate 31%, low < 40%
[ok] L2 hit rate 71%, healthy
[ !] occupancy limited by registers, secondary
Top fix: coalesce loads
Next steps: try rec_ncu_improve_coalescing, then re-run the same profile command.
Includes why, numbered actions, validation steps, and caveats.
Engineer loop

Short path from suspicion to fix. No profiler archaeology.

Fournex keeps the path small: capture the evidence, name the mistake, apply the recommendation, and rerun the same command to verify the change.

1 cmd · to first report
3 modes · live, CSV, PTX
4 sections · verdict to next steps
0 guesswork · threshold evidence

Engineers do not need to interpret every counter by hand. The report explains why a metric is bad, which rule fired, what to change, and how to confirm the result.

View the docs
Moat

A workload-performance dataset that compounds every week.

We're not another rules engine. Every analyzed workload expands the mapping from trace patterns to validated fixes — turning usage into defensibility.

  • Proprietary trace → fix → outcome dataset
  • Validated optimization deltas across hardware
  • Policies that improve as more teams onboard
  • Trust layer: every change auditable and reversible
Compounding flywheel
01 · More workloads analyzed
02 · Richer optimization traces
03 · Better policy learning
04 · Stronger recommendations
05 · More workloads onboard

Better data → better policies → better outcomes → more workloads. That loop is the product.