AUTOPILOT ACTIVE · cluster prod-a100-01 · jobs 14 running · uplift +41%
Open source

Stop wasting
70% of your GPU.

Profile training and inference jobs, get the bottleneck named for you, and ship the highest-ROI fix, validated by safe experiments rather than hope.

Typical GPU waste: 70%
Throughput uplift: +58%
Time to first fix: <1 min

Works with

PyTorch · JAX · TensorFlow · NVIDIA
[Product mock: Fournex Autopilot dashboard, LIVE]
Throughput +41% · Utilization 92% · Jobs 14
Throughput trend, May 12 – Jun 9 (high confidence)
Optimizations applied: Kernel fusion +18% · Memory layout +11% · Launch config +8% · 4 more queued
Workload profiling hotspots: SM utilization 62% · Memory BW 78% · DRAM stalls 34% · Kernel launches 1.2M
SM block × time heatmap (low / mid / high)
Fournex Autopilot · policy_v4 · last action 2m ago

Trusted by teams shipping the world's most compute-intensive workloads

AI21 Labs
Perplexity
character.ai
Midjourney
Cohere
Anyscale
What we detect

Six GPU bottleneck families. Named, ranked, and fixable.

We skip the raw profiler dumps and go straight to diagnosis. Every bottleneck maps to a concrete, explainable fix, not another dashboard to stare at.

GPU idle while CPU workers stall

Dataloader starvation

Detect when the input pipeline can't feed the device. Tune workers, prefetching, and pinned memory before re-running.

View sample trace
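
For PyTorch, a fix in this family usually lands in the DataLoader itself. A minimal sketch; the worker and prefetch counts are illustrative assumptions, not a tuned recommendation:

import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset so the sketch runs; swap in your real Dataset.
dataset = TensorDataset(torch.randn(1024, 3, 32, 32), torch.randint(0, 10, (1024,)))

loader = DataLoader(
    dataset,
    batch_size=32,
    num_workers=8,            # illustrative: enough CPU workers to keep the GPU fed
    pin_memory=True,          # page-locked memory speeds host-to-device copies
    prefetch_factor=4,        # batches staged ahead per worker (needs num_workers > 0)
    persistent_workers=True,  # skip worker respawn cost between epochs
)

for xb, yb in loader:
    pass  # training step goes here
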
Low SM occupancy under real batch sizes

Small-batch inefficiency

Pinpoint shapes that under-utilize kernels. Recommend batching, padding, or recompile-safe shape hints.

View sample trace
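
One common mitigation is padding awkward inner dimensions to hardware-friendly multiples. A toy sketch; the profitable multiple depends on your GPU and dtype:

import torch
import torch.nn.functional as F

def pad_to_multiple(x: torch.Tensor, multiple: int = 8) -> torch.Tensor:
    """Pad the last dim up to a multiple for tensor-core-friendly shapes."""
    remainder = x.shape[-1] % multiple
    if remainder == 0:
        return x
    return F.pad(x, (0, multiple - remainder))

x = torch.randn(4, 37)           # awkward inner dimension
print(pad_to_multiple(x).shape)  # torch.Size([4, 40])
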
Frequent .item() and blocking transfers

Host-device sync overhead

Find hidden synchronization stalls — scalar reads, premature .cpu() calls, and chatty copy patterns.

View sample trace
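
The usual shape of the fix: accumulate metrics on-device and sync once per window instead of every step. A runnable sketch with a stand-in loss:

import torch

dev = "cuda" if torch.cuda.is_available() else "cpu"

# Anti-pattern: calling loss.item() every step forces a host-device sync.
# Fix sketch: accumulate on-device, sync once per logging window.
running = torch.zeros((), device=dev)
for step in range(100):
    loss = torch.randn((), device=dev) ** 2  # stand-in for a real loss tensor
    running += loss.detach()                 # stays on device, no sync
print(f"mean loss: {(running / 100).item():.4f}")  # one sync instead of 100
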
FP32 where AMP is safe and faster

Mixed-precision opportunities

Identify layers and ops that can move to bf16/fp16 without loss of accuracy, ranked by expected speedup.

View sample trace
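
In PyTorch this typically means wrapping the forward pass in autocast. A minimal sketch with a stand-in model; bf16 shown because it usually needs no gradient scaler:

import torch

dev = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(512, 512).to(dev)  # stand-in model
opt = torch.optim.SGD(model.parameters(), lr=1e-3)
x = torch.randn(32, 512, device=dev)

# Under autocast, matmul-heavy ops run in bf16 while reductions stay fp32.
# fp16 would typically also want a GradScaler; bf16 usually does not.
with torch.autocast(device_type=dev, dtype=torch.bfloat16):
    loss = model(x).square().mean()
loss.backward()
opt.step()
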
Many tiny kernels, launch overhead dominates

Kernel launch fragmentation

Spot fusion candidates for torch.compile, CUDA graphs, or Triton — with concrete before/after projections.

View sample trace
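
In PyTorch, the usual first lever is torch.compile with the reduce-overhead mode, which fuses small kernels and can replay steady-state iterations via CUDA graphs. A sketch with a stand-in model:

import torch

dev = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Sequential(  # stand-in: a stack of small ops
    *[torch.nn.Linear(256, 256) for _ in range(8)]
).to(dev)

compiled = torch.compile(model, mode="reduce-overhead")
out = compiled(torch.randn(64, 256, device=dev))
print(out.shape)
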
OOM risk, swaps, cache thrashing

Memory pressure & fragmentation

Track allocator behavior, fragmentation, and layout choices that silently cap throughput.

View sample trace
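
One lever in this family is the CUDA caching allocator's configuration. A sketch; whether expandable segments help is workload-dependent:

import os

# Assumption: must be set before the first CUDA allocation in the process.
# expandable_segments reduces fragmentation from variable-sized allocations.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch

if torch.cuda.is_available():
    x = torch.randn(1024, 1024, device="cuda")
    # A wide gap between "reserved" and "allocated" hints at fragmentation.
    print(torch.cuda.memory_summary(abbreviated=True))
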
Ranked recommendations

From trace to ranked fix list. Under a minute.

Every finding comes with expected impact, effort, confidence level, and blast radius. Tune mode benchmarks selected candidates before you trust the change.

  • Explainable, rule-based classifier — no hallucinations
  • Ranked by diagnosis strength, expected impact, effort, and risk (toy sketch below)
  • Recommendations separated from benchmarked tune candidates
  • Tune mode validates selected configs with guardrails
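
To make the ranking idea concrete, here is a deliberately toy sketch: the record fields and scoring formula are illustrative assumptions, not Fournex's actual schema or classifier:

from dataclasses import dataclass

@dataclass
class Finding:               # hypothetical record, not the real report schema
    title: str
    expected_uplift: float   # fractional throughput gain
    effort: int              # 1 = low, 3 = high
    confidence: float        # 0..1 diagnosis strength
    risk: int                # 1 = low, 3 = high

def score(f: Finding) -> float:
    # Toy ranking: favor confident, high-uplift, low-effort, low-risk fixes.
    return f.expected_uplift * f.confidence / (f.effort * f.risk)

findings = [
    Finding("dataloader workers 4 -> 12", 0.28, 1, 0.9, 1),
    Finding("torch.compile reduce-overhead", 0.14, 2, 0.6, 2),
]
for f in sorted(findings, key=score, reverse=True):
    print(f"{score(f):.3f}  {f.title}")
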
trace-14c2b / resnet50_train · Analyzed
GPU active 42% · Potential uplift +68% · Est. save/mo $12.4k

#    FIX / TITLE                                                  UPLIFT   RISK
01   Increase dataloader workers 4 → 12, enable pinned memory     +28%     Low
     effort: Low · confidence: High
02   Increase batch size from inferred baseline if memory allows  +14%     Low
     effort: Low · confidence: High
03   torch.compile(mode="reduce-overhead") for launch overhead    +14%     Medium
     effort: Medium · confidence: Medium
04   Reduce .item() / host sync points in the training loop       +6%      Low
     effort: Low · confidence: High

Live recommendation stream · report_7a1.json
Product evolution

Profiler today. Autopilot tomorrow. Trust built in.

We don't ship blind automation. The product walks from diagnosis to optimization to full autopilot — each phase validated by real workload outcomes before the next one turns on.

01 · Phase 1 · Shipping now

Profiler with opinions

  • Low-overhead trace collection
  • Deterministic bottleneck classifier
  • Normalized performance IR
02 · Phase 2 · Shipping now

Ranked recommendations

  • Map bottlenecks to concrete fixes
  • Score by impact, effort, and risk
  • Explainable, repeatable output
03 · Phase 3 · Early access

Experiment runner

  • Safe config sweeps
  • Before/after benchmark validation
  • Regression guardrails
04 · Phase 4 · On the roadmap

Policy-driven autopilot

  • Auto-apply within your guardrails
  • Continuous adaptation
  • Learned optimization policies
Capabilities

Built for platform engineers. Operating production GPU fleets.

Telemetry fidelity, explainable ranking, and controlled rollout — designed for technical teams, not executive dashboards.

Production profiling

Low-overhead telemetry for kernels, streams, allocators, and memory traffic.

Ranked fixes

Concrete, explainable recommendations — scored by impact, effort, and risk.

Safe experiment runner

Sweep configs safely, validate every trial, and stop bad runs early.

Regression guardrails

Throughput, memory, loss divergence, and NaN checks — trust built in.
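
A minimal sketch of what such a check can look like; the metric names and thresholds are illustrative assumptions, not the shipped policy:

import math

def passes_guardrails(baseline: dict, trial: dict,
                      max_throughput_drop: float = 0.02,
                      max_mem_growth: float = 0.05,
                      max_loss_ratio: float = 1.5) -> bool:
    # Hypothetical metric dicts: {"steps_per_sec", "peak_mem", "loss"}.
    if math.isnan(trial["loss"]) or math.isinf(trial["loss"]):
        return False  # NaN check
    if trial["loss"] > baseline["loss"] * max_loss_ratio:
        return False  # loss divergence
    if trial["steps_per_sec"] < baseline["steps_per_sec"] * (1 - max_throughput_drop):
        return False  # throughput regression
    if trial["peak_mem"] > baseline["peak_mem"] * (1 + max_mem_growth):
        return False  # memory headroom
    return True

base = {"steps_per_sec": 4.8, "peak_mem": 30e9, "loss": 2.31}
cand = {"steps_per_sec": 5.9, "peak_mem": 31e9, "loss": 2.29}
print(passes_guardrails(base, cand))  # True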

Before/after validation

Every change ships with an auditable benchmark delta and reproducibility hash.

CI-native

Run as a CLI, in CI, or as a continuous agent. No hardware changes required.

Drop in. Go.

Collect, diagnose, and race safe fixes from the CLI.

Works as a CLI, a CI job, or a long-running agent. Bring your own cluster, keep your own training code. The tuner screens cheap candidates first, then fully validates the strongest ones.

Install: pip install -e ./backend/python
Collect: frx collect -- python train.py
Tune: frx tune --no-safe --race-promote-count 3 -- python train.py
frx — tune.sh
$ frx tune --no-safe --race-promote-count 3 -- python train.py
Baseline captured: 4.80 steps/sec, batch_size=32
Generated 8 candidates from input_bound diagnosis
Quick race stage
8 short trials screened → 3 full benchmarks
[RACE] dl:nw=8,pin=T +18.1% promoted
[RACE] bs:40 +14.0% promoted
[FULL] dl:nw=8,pin=T,pf=4 +24.7%
Winner from full benchmark: +24.7%
no loss divergence · no OOM · auditable hash
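
The screen-then-promote flow above, pictured as a toy function; the trial runners and counts are stand-ins, not the frx internals:

def race(candidates, short_trial, full_benchmark, promote_count=3):
    """Cheap short trials first, then full validation of the strongest survivors."""
    # Stage 1: short, cheap trials to screen every candidate.
    screened = sorted(candidates, key=short_trial, reverse=True)
    # Stage 2: full benchmarks only for the top promote_count finalists.
    finalists = screened[:promote_count]
    return max(finalists, key=full_benchmark)

# Toy usage: uplift estimates stand in for real measured steps/sec.
estimates = {"dl:nw=8,pin=T": 0.181, "bs:40": 0.140, "compile": 0.09, "fp16": 0.05}
winner = race(list(estimates), estimates.get, estimates.get, promote_count=3)
print(winner)  # dl:nw=8,pin=T
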
ROI

Immediate infrastructure leverage. Not a long science project.

Cut wasted GPU spend, increase throughput, and shrink the manual tuning backlog. No migrations. No new hardware.

Up to 35% lower GPU cost
20–50% throughput gains
Tuning cycles: weeks → hours
No new hardware required for ROI

A team running a $1.2M/yr GPU training budget typically recovers $280k–$420k in annualized savings within the first quarter after onboarding, before any code changes beyond the recommended fixes.
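
The arithmetic behind that range, as a sketch; the recovery fractions are assumptions to replace with your own numbers:

def annualized_savings(annual_gpu_budget: float,
                       recovered_low: float = 0.23,
                       recovered_high: float = 0.35) -> tuple[float, float]:
    # Savings = budget * fraction of spend recovered by applied fixes.
    return annual_gpu_budget * recovered_low, annual_gpu_budget * recovered_high

low, high = annualized_savings(1_200_000)
print(f"${low / 1e3:.0f}k - ${high / 1e3:.0f}k per year")  # roughly the quoted range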

Model your savings
Moat

A workload-performance dataset that compounds every week.

We're not another rules engine. Every analyzed workload expands the mapping from trace patterns to validated fixes — turning usage into defensibility.

Proprietary trace → fix → outcome dataset
Validated optimization deltas across hardware
Policies that improve as more teams onboard
Trust layer: every change auditable and reversible
Compounding flywheel
01 · More workloads analyzed
02 · Richer optimization traces
03 · Better policy learning
04 · Stronger recommendations
05 · More workloads onboard

Better data → better policies → better outcomes → more workloads. That loop is the product.

Early access · Open source

Turn wasted GPU compute into measurable performance gains.

See where your GPU efficiency leaks today, what can be tuned automatically, and how quickly those gains can land in production. First report is free.

No code changes to onboard
Works with PyTorch + NVIDIA today
Apache 2.0 license
Request early access

We'll reply within one business day with next steps.