Open source · Apache 2.0

Stop wasting 70% of your GPU.

Profile your training and inference jobs, get the bottleneck named for you, and ship the highest-ROI fix, validated by safe experiments, not hope.

Typical GPU waste: 70%
Throughput uplift: +58%
Time to first fix: <1 min

Built for production. Works with PyTorch, JAX, TensorFlow, and more.

GPU Performance Autopilot · Continuous optimization plane
[Dashboard preview: live throughput trend at +41% uplift (high confidence), 92% utilization, 14 jobs queued, policy drift stable. Optimizations applied: kernel fusion +18%, memory layout +11%, launch config +8%, with 4 more queued. Workload profiling: SM utilization 62%, memory BW 78%, DRAM stalls 34%, 1.2M kernel launches, shown with an SM block × time step heatmap (cold → hot, peak 98%). Autopilot agent (RL) loop: Observe (profile) → Decide (RL / search) → Act (apply / suggest) → Reward (performance).]

Trusted by teams shipping the world's most compute-intensive workloads

AI21 labs
Perplexity
character.ai
Midjourney
Cohere
Anyscale
What we detect

Six GPU bottleneck families. Named, ranked, and fixable.

We skip the raw profiler dumps and go straight to diagnosis. Every bottleneck maps to a concrete, explainable fix, not another dashboard to stare at.

GPU idle while CPU workers stall

Dataloader starvation

Detect when the input pipeline can't feed the device. Tune workers, prefetching, and pinned memory before re-running.

See sample trace
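
What the fix typically looks like in PyTorch, as a minimal sketch (the dataset name and the exact worker/prefetch values are illustrative, not tool output; the right numbers depend on your CPU budget and per-sample decode cost):

from torch.utils.data import DataLoader

loader = DataLoader(
    train_dataset,            # your existing dataset (placeholder name)
    batch_size=256,
    num_workers=12,           # more CPU workers so the GPU never starves
    pin_memory=True,          # page-locked host memory speeds host-to-device copies
    prefetch_factor=4,        # batches staged ahead per worker
    persistent_workers=True,  # keep workers alive across epochs
)
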
Low SM occupancy under real batch sizes

Small-batch inefficiency

Pinpoint shapes that under-utilize kernels. Recommend batching, padding, or recompile-safe shape hints.

See sample trace
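
One common shape fix, sketched for variable-length sequence inputs (the bucket sizes and helper name are hypothetical): pad every input up to a small set of fixed lengths so kernels see a few stable, well-utilized shapes instead of many unique ones.

import torch
import torch.nn.functional as F

BUCKETS = (128, 256, 512)  # hypothetical fixed lengths; tune to your length distribution

def pad_to_bucket(seq: torch.Tensor) -> torch.Tensor:
    length = seq.shape[-1]
    # Pick the smallest bucket that fits; truncate rare outliers to the largest.
    target = next((b for b in BUCKETS if b >= length), BUCKETS[-1])
    if length > target:
        return seq[..., :target]
    return F.pad(seq, (0, target - length))
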
Frequent .item() and blocking transfers

Host-device sync overhead

Find hidden synchronization stalls — scalar reads, premature .cpu() calls, and chatty copy patterns.

See sample trace
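
A minimal sketch of the usual mitigation, assuming a standard training loop (model, loader, and optimizer names are placeholders): keep the running loss on the device and read it back occasionally, instead of calling .item() every step.

import torch

running_loss = torch.zeros((), device="cuda")  # accumulator lives on the GPU

for step, batch in enumerate(loader):
    loss = model(batch).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    running_loss += loss.detach()              # no host-device sync here
    if step % 100 == 99:
        print(f"avg loss {(running_loss / 100).item():.4f}")  # one sync per 100 steps
        running_loss.zero_()
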
FP32 where AMP is safe and faster

Mixed-precision opportunities

Identify layers and ops that can move to bf16/fp16 without loss of accuracy, ranked by expected speedup.

See sample trace
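
The core of that change in PyTorch, as a hedged sketch (assumes an Ampere-or-newer GPU; bf16 keeps fp32's dynamic range, so no gradient scaler is needed, unlike fp16):

import torch

with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    output = model(inputs)              # matmuls and convs run in bf16
    loss = criterion(output, targets)   # placeholder loss function
loss.backward()                         # backward runs outside the autocast region
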
Many tiny kernels, launch overhead dominates

Kernel launch fragmentation

Spot fusion candidates for torch.compile, CUDA graphs, or Triton — with concrete before/after projections.

See sample trace
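
A minimal sketch of the fusion path this points at, using the stock PyTorch API (the model name is a placeholder; expect slower first iterations while compilation warms up):

import torch

# "reduce-overhead" also captures CUDA graphs, amortizing per-kernel launch cost.
compiled_model = torch.compile(model, mode="reduce-overhead")

for batch in loader:
    loss = compiled_model(batch).mean()  # steady-state steps run fused kernels
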
OOM risk, swaps, cache thrashing

Memory pressure & fragmentation

Track allocator behavior, fragmentation, and layout choices that silently cap throughput.

See sample trace
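
Two real PyTorch allocator knobs that often relieve fragmentation, sketched with illustrative values (the setting must land before the process's first CUDA allocation):

import os

os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True,max_split_size_mb:256"

import torch
print(torch.cuda.memory_summary(abbreviated=True))  # confirm fragmentation vs. true OOM
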
Ranked recommendations

From trace to ranked fix list in under a minute.

Every finding comes with an estimated speedup, implementation effort, confidence level, and blast radius. No more guessing which fix matters first.

  • Explainable, rule-based classifier — no hallucinations
  • ROI-weighted ranking using real workload outcomes
  • Safe-to-apply set separated from numerics-changing set
  • One-click repro: every fix ships with a benchmark harness
trace-14c2b / resnet50_train · Analyzed
GPU active: 42% · Potential uplift: +68% · Est. monthly save: $12.4k
  • 01 · Increase dataloader workers 4 → 12, enable pinned memory
    Effort: Low · Confidence: High · Risk: Low · +28% speedup
  • 02 · Enable bf16 mixed precision on forward pass
    Effort: Low · Confidence: High · Risk: Low · +19% speedup
  • 03 · Compile hot module with torch.compile (mode="reduce-overhead")
    Effort: Medium · Confidence: Medium · Risk: Medium · +14% speedup
  • 04 · Remove per-step .item() calls from training loop
    Effort: Low · Confidence: High · Risk: Low · +6% speedup
Live recommendation stream · report_7a1.json
Product evolution

Profiler today. Autopilot tomorrow. Trust built in at every step.

We don't ship blind automation. The product walks from diagnosis to optimization to full autopilot — each phase validated by real workload outcomes before the next one turns on.

Phase 1 · Shipping now

Profiler with opinions

  • Low-overhead trace collection
  • Deterministic bottleneck classifier
  • Normalized performance IR
01 · closed loop
Phase 2 · Shipping now

Ranked recommendations

  • Map bottlenecks to concrete fixes
  • Score by ROI, effort, and risk
  • Explainable, repeatable output
02 · closed loop
Phase 3 · Early access

Experiment runner

  • Safe config sweeps
  • Before/after benchmark validation
  • Regression guardrails
03 · closed loop
Phase 4 · On the roadmap

Policy-driven autopilot

  • Auto-apply within your guardrails
  • Continuous adaptation
  • Learned optimization policies
04 · closed loop
Capabilities

Built for platform engineers operating production GPU fleets.

Telemetry fidelity, explainable ranking, and controlled rollout — the product surface is designed for technical teams, not executive dashboards.

Production profiling

Low-overhead telemetry for kernels, streams, allocators, and memory traffic.

Ranked fixes

Concrete, explainable recommendations — scored by expected ROI, effort, and risk.

Safe experiment runner

Sweep configs safely, validate every trial, and stop bad runs early.

Regression guardrails

Throughput, memory, loss divergence, and NaN checks — so trust comes built in.
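
For flavor, a minimal guardrail check of the kind described, written as an illustrative sketch rather than the product's implementation:

import torch

def check_guardrails(loss: torch.Tensor, baseline_loss: float, tolerance: float = 1.5) -> None:
    # Trip early on numeric divergence before a bad trial burns more GPU hours.
    if not torch.isfinite(loss):
        raise RuntimeError("guardrail tripped: NaN/Inf loss")
    if loss.item() > baseline_loss * tolerance:  # rare sync, once per check
        raise RuntimeError("guardrail tripped: loss diverged from baseline")
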

Before/after validation

Every applied change ships with an auditable benchmark delta and reproducibility hash.

CI-native

Run as a CLI, in CI, or as a continuous agent. No hardware changes, no vendor lock-in.

Drop in. Go.

One command. No hardware changes. No vendor lock-in.

Works as a CLI, a CI job, or a long-running agent. Bring your own cluster, keep your own training code. We just turn traces into ranked fixes.

Install: pip install fournex
Profile: fournex analyze --pid $TRAIN_PID
Validate: fournex bench --apply top-3
fournex ~ run.sh
$ fournex analyze --pid 4492
trace captured in 18.2s (low-overhead mode)
classifier ran against 6 bottleneck families
Primary bottleneck
Dataloader starvation · GPU idle 58% of step time
$ fournex bench --apply top-3
trial 1 · workers=12, pin_memory=on +28.4%
trial 2 · bf16 amp +19.1%
trial 3 · torch.compile reduce-overhead +14.6%
Final validated speedup: +61.2%
no loss divergence · no OOM · reproducible across 3 seeds
ROI

Immediate infrastructure leverage. Not a long science project.

Cut wasted GPU spend, increase throughput, and shrink the manual tuning backlog. No migrations. No new hardware. Just measurable deltas in production.

Up to 35% lower GPU cost

20–50% throughput gains

Tuning cycles cut from weeks to hours

No new hardware required for ROI

A team running a $1.2M/yr GPU training budget typically recovers $280k–$420k in annualized run-rate savings within the first quarter after onboarding, with no code changes beyond the recommended ones.
Model your savings
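
A back-of-the-envelope version of that model; every input here is an assumption to swap for your own numbers:

annual_gpu_budget = 1_200_000              # $/yr, as in the example above
recovery_low, recovery_high = 0.23, 0.35   # fractions implied by the $280k–$420k range

low = annual_gpu_budget * recovery_low
high = annual_gpu_budget * recovery_high
print(f"annualized run-rate savings: ${low:,.0f} – ${high:,.0f}")  # $276,000 – $420,000
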
Moat

A workload-performance dataset that compounds every week.

We're not another rules engine. Every analyzed workload expands the mapping from trace patterns to validated fixes — turning usage into defensibility.

Proprietary trace → fix → outcome dataset
Validated optimization deltas across hardware
Policies that improve as more teams onboard
Trust layer: every change auditable and reversible
01 · More workloads analyzed
02 · Richer optimization traces
03 · Better policy learning
04 · Stronger recommendations
05 · More workloads onboard

Better data → better policies → better outcomes → more workloads. That loop is the product.

Early access · Open source

Turn wasted GPU compute into measurable performance gains.

See where your GPU efficiency leaks today, what can be tuned automatically, and how quickly those gains can land in production. First report is free.

No code changes to onboard
Works with PyTorch + NVIDIA today
Apache 2.0 license

We'll reply within one business day with next steps.