Profile your training and inference jobs, get the bottleneck diagnosed by name, and ship the highest-ROI fix, validated by safe experiments rather than hope.
Built for production. Works with PyTorch, JAX, TensorFlow, and more.
Trusted by teams shipping the world's most compute-intensive workloads
We skip the raw profiler dumps and go straight to diagnosis. Every bottleneck maps to a concrete, explainable fix, not another dashboard to stare at.
Detect when the input pipeline can't feed the device. Tune workers, prefetching, and pinned memory before re-running.
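The knobs involved are standard PyTorch `DataLoader` settings. A minimal sketch of the kind of change such a finding points to (the dataset, batch size, and worker counts here are illustrative, not recommendations):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Illustrative dataset; any map-style Dataset behaves the same way.
dataset = TensorDataset(torch.randn(10_000, 128),
                        torch.randint(0, 10, (10_000,)))

# Starved pipeline: one process, no prefetch overlap, pageable host memory.
slow = DataLoader(dataset, batch_size=256)

# Tuned pipeline: parallel workers, deeper prefetch, pinned host memory.
fast = DataLoader(
    dataset,
    batch_size=256,
    num_workers=4,          # overlap data prep with GPU compute
    prefetch_factor=4,      # batches each worker keeps ready in advance
    pin_memory=True,        # enables non_blocking=True device copies
    persistent_workers=True,  # avoid re-spawning workers every epoch
)
```

The right `num_workers` and `prefetch_factor` depend on CPU headroom and per-sample cost, which is exactly what the profile is meant to reveal.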
Pinpoint shapes that under-utilize kernels. Recommend batching, padding, or recompile-safe shape hints.
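Shape hints of this kind often reduce to padding a hot dimension up to a hardware-friendly multiple. A hypothetical sketch (`pad_to_multiple` is an illustrative helper, and the multiple-of-8 target suits tensor-core kernels, but the right value is workload-specific):

```python
import torch
import torch.nn.functional as F

def pad_to_multiple(x: torch.Tensor, dim: int, multiple: int = 8) -> torch.Tensor:
    """Pad `dim` of `x` up to the next multiple so matmul kernels see
    regular, tensor-core-friendly shapes instead of ragged ones."""
    size = x.shape[dim]
    target = -(-size // multiple) * multiple  # ceil to the next multiple
    if target == size:
        return x
    pad = [0, 0] * x.dim()
    # F.pad's pad list runs from the last dimension backwards;
    # the odd index of each pair is the right-hand padding.
    pad[2 * (x.dim() - 1 - dim) + 1] = target - size
    return F.pad(x, pad)

x = torch.randn(4, 13, 768)                      # ragged length of 13
assert pad_to_multiple(x, dim=1).shape == (4, 16, 768)
```

Whether padding beats bucketing or a recompile-safe dynamic-shape hint is a per-model trade-off.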
Find hidden synchronization stalls — scalar reads, premature .cpu() calls, and chatty copy patterns.
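These stalls usually hide in ordinary-looking lines. A hedged sketch of the pattern (names illustrative; the same logic applies whenever a CUDA device is in the loop):

```python
import torch

def train_step(model, batch, opt):
    loss = model(batch).pow(2).mean()
    loss.backward()
    opt.step()
    opt.zero_grad()
    # Stall: `return loss.item()` would force the host to block until
    # every queued kernel finishes, once per step.
    # Cheaper: keep the scalar on-device and read it back only every
    # N steps, or copy it with non_blocking=True and log asynchronously.
    return loss.detach()

model = torch.nn.Linear(8, 8)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss = train_step(model, torch.randn(4, 8), opt)
```

On CPU the two variants cost the same; on GPU the `.item()` version serializes the host against the device every iteration.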
Identify layers and ops that can move to bf16/fp16 without loss of accuracy, ranked by expected speedup.
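The mechanics on the PyTorch side are `torch.autocast`, which runs eligible ops (mainly matmuls) in the reduced precision while keeping precision-sensitive ops in fp32. A minimal sketch; `device_type` would be `"cuda"` in a real job, and `"cpu"` here only keeps the example runnable anywhere:

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(256, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 10),
)
x = torch.randn(32, 256)

# Eligible ops run in bf16 inside the context; the rest stay in fp32.
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    y = model(x)

assert y.dtype == torch.bfloat16
```

Ranking which layers tolerate the cast without accuracy loss is the part the analysis supplies; autocast itself is just the mechanism.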
Spot fusion candidates for torch.compile, CUDA graphs, or Triton — with concrete before/after projections.
Track allocator behavior, fragmentation, and layout choices that silently cap throughput.
Every finding comes with an estimated speedup, implementation effort, confidence level, and blast radius. No more guessing which fix matters first.
We don't ship blind automation. The product walks from diagnosis to optimization to full autopilot — each phase validated by real workload outcomes before the next one turns on.
Telemetry fidelity, explainable ranking, and controlled rollout — the product surface is designed for technical teams, not executive dashboards.
Low-overhead telemetry for kernels, streams, allocators, and memory traffic.
Concrete, explainable recommendations — scored by expected ROI, effort, and risk.
Sweep configs safely, validate every trial, and stop bad runs early.
Throughput, memory, loss divergence, and NaN checks — so trust is built in.
Every applied change ships with an auditable benchmark delta and reproducibility hash.
Run as a CLI, in CI, or as a continuous agent. No hardware changes, no vendor lock-in.
Works as a CLI, a CI job, or a long-running agent. Bring your own cluster, keep your own training code. We just turn traces into ranked fixes.
pip install fournex
fournex analyze --pid $TRAIN_PID
fournex bench --apply top-3

Cut wasted GPU spend, increase throughput, and shrink the manual tuning backlog. No migrations. No new hardware. Just measurable deltas in production.
lower GPU cost
throughput gains
faster tuning cycles
required for ROI
We're not another rules engine. Every analyzed workload expands the mapping from trace patterns to validated fixes — turning usage into defensibility.
Better data → better policies → better outcomes → more workloads. That loop is the product.
See where your GPU efficiency leaks today, what can be tuned automatically, and how quickly those gains can land in production. First report is free.