Uncoalesced global loads
Catch memory access patterns that spray global-load transactions. Get coalescing actions and a re-profile command to confirm the fix.
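To make the pattern concrete, here is a minimal Python sketch, not Fournex's detection logic, that counts how many 32-byte memory sectors one 32-thread warp touches when each thread loads a single 4-byte float. The function name and the stride values are illustrative assumptions.

```python
WARP_SIZE = 32      # threads per warp on NVIDIA GPUs
SECTOR_BYTES = 32   # granularity at which global memory is fetched


def sectors_touched(stride_elems: int, elem_bytes: int = 4) -> int:
    """Count the distinct 32-byte sectors hit by one warp-wide load.

    Thread t reads the element at index t * stride_elems.
    """
    addresses = (t * stride_elems * elem_bytes for t in range(WARP_SIZE))
    return len({addr // SECTOR_BYTES for addr in addresses})


# Adjacent floats: the warp's 128 bytes sit in a handful of sectors.
coalesced = sectors_touched(stride_elems=1)

# Stride of 32 floats (for example, walking a column of a 32-wide
# row-major matrix): every thread lands in its own sector.
strided = sectors_touched(stride_elems=32)

print(coalesced, strided)
```

Under these assumptions the unit-stride load touches 4 sectors and the stride-32 load touches 32, so the warp moves eight times as much data to deliver the same 128 useful bytes.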
Run frx profile on a workload, NCU CSV, or PTX file. Fournex names the bottleneck, shows which kernel is worth fixing first, and ranks the fix to try.
Opportunity-ranked kernels, framework tax signals, and architecture-aware thresholds for H100, L4, RTX 4090, A100, and T4.
Trusted by teams shipping the world's most compute-intensive workloads
Fournex turns NCU counters, PTX structure, and runtime traces into the mistake that matters: the threshold crossed, the kernel worth fixing first, and the validation command to prove the change.
Separate L1-only locality problems from broader memory pressure. Prioritize shared-memory tiling or working-set reduction based on evidence.
Identify L2-only pressure without confusing it with L1 misses. Fournex points engineers toward reducing the active working set first.
Use measured occupancy and launch-resource data to spot register pressure that blocks residency, then reduce live ranges or split kernels.
Surface kernels that keep schedulers waiting even when occupancy looks acceptable. Separate ILP, block-size, and warp-eligibility issues.
Connect runtime traces to practical fixes for dataloaders, pinned memory, transfer pressure, and hidden synchronization points.
Rank kernels by runtime share, roofline region, MFU gap, and severity so the highest-impact work rises above small noisy offenders.
Surface framework overhead that is not explained by input, copy, or sync stalls, with inferred graph-capture and fusion opportunities clearly marked.
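The register-pressure item in the list above can be illustrated with a back-of-the-envelope calculation. This is a simplified sketch, not how Fournex computes occupancy: it assumes registers are the only limiter, uses the A100 per-SM limits (65,536 registers, 64 resident warps), and ignores allocation granularity, shared memory, and block-size effects. The function name is hypothetical.

```python
REGS_PER_SM = 65536      # 32-bit registers per SM (A100 figure)
MAX_WARPS_PER_SM = 64    # resident-warp limit per SM (A100 figure)
WARP_SIZE = 32           # threads per warp


def register_limited_occupancy(regs_per_thread: int) -> float:
    """Theoretical occupancy when registers are the only limiter.

    Simplification: ignores register allocation granularity and all
    other residency limits (shared memory, block size, block count).
    """
    regs_per_warp = regs_per_thread * WARP_SIZE
    resident_warps = min(MAX_WARPS_PER_SM, REGS_PER_SM // regs_per_warp)
    return resident_warps / MAX_WARPS_PER_SM


for regs in (32, 64, 128):
    print(regs, register_limited_occupancy(regs))
```

With these numbers, a kernel using 128 registers per thread can keep at most a quarter of the SM's warp slots resident, which is why shortening live ranges or splitting the kernel can raise residency.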
Every finding comes with the exact signal that triggered it, the threshold it crossed, confidence level, actions to take, validation steps, and caveats before you change code.
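As a rough picture of what such a finding could carry, here is a hypothetical Python record. The class, its field names, the metric name, and the numbers are assumptions for illustration only, not Fournex's actual report schema; the validation command is one of the commands shown on this page.

```python
from dataclasses import dataclass, field


@dataclass
class Finding:
    """Hypothetical shape of one finding; not the real report schema."""
    rule: str                 # which rule fired
    metric: str               # signal that triggered it
    observed: float           # measured value
    threshold: float          # value the rule compares against
    confidence: str           # e.g. "low", "medium", "high"
    actions: list[str] = field(default_factory=list)
    validation: list[str] = field(default_factory=list)
    caveats: list[str] = field(default_factory=list)

    def triggered(self) -> bool:
        """True when the observed value crosses the threshold."""
        return self.observed > self.threshold

    def summary(self) -> str:
        return (f"{self.rule}: {self.metric} = {self.observed:g} "
                f"(threshold {self.threshold:g}, confidence {self.confidence})")


finding = Finding(
    rule="uncoalesced_global_loads",      # illustrative rule name
    metric="sectors_per_request",         # illustrative metric name
    observed=28.0,                        # made-up example value
    threshold=8.0,                        # made-up example threshold
    confidence="high",
    actions=["Make adjacent threads read adjacent addresses"],
    validation=["frx profile --ncu ncu_report.csv"],
    caveats=["Example values only"],
)
print(finding.summary())
```

Keeping the signal, threshold, and validation step in one record means a reviewer can check why a rule fired without reopening the raw counters.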
We don't ask engineers to trust blind automation. The product starts with evidence, then moves through ranked recommendations, before/after validation, benchmark proof, and guarded automation.
Start with a concrete diagnosis, not a wall of counters. Keep the full evidence trail, from kernel opportunity ranking to benchmark proof, when you need to defend the change in review.
Run live NCU, analyze an existing CSV, or inspect PTX without a GPU.
Each recommendation names the metric, threshold, rule, and validation step.
Separate L1 thrashing, L2 thrashing, bandwidth pressure, and uncoalesced loads.
Tie low occupancy to registers, shared memory, block size, or scheduler pressure.
Compare source, PTX, NCU, and bench evidence before trusting a change.
Rank kernels by runtime share, roofline region, MFU gap, and severity.
Spot launch fragmentation and inferred graph-capture or fusion opportunities.
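One way to picture the kernel ranking named in the list above is a single opportunity score per kernel. The sketch below is an assumption-laden illustration, not Fournex's scoring: the severity weights, the multiplicative formula, and the kernel names and numbers are all invented, and the roofline region is left out for brevity.

```python
from dataclasses import dataclass

# Assumed weights; a real tool would calibrate these.
SEVERITY_WEIGHT = {"low": 1.0, "medium": 2.0, "high": 3.0}


@dataclass
class KernelStats:
    name: str
    runtime_share: float   # fraction of total GPU time, 0..1
    mfu_gap: float         # 1 - achieved/peak utilization, 0..1
    severity: str          # "low", "medium", or "high"


def opportunity(k: KernelStats) -> float:
    """Hypothetical score: bigger share, gap, and severity rank higher."""
    return k.runtime_share * k.mfu_gap * SEVERITY_WEIGHT[k.severity]


def rank(kernels: list[KernelStats]) -> list[KernelStats]:
    """Sort kernels from highest to lowest opportunity score."""
    return sorted(kernels, key=opportunity, reverse=True)


kernels = [
    KernelStats("gemm_main", runtime_share=0.60, mfu_gap=0.20, severity="medium"),
    KernelStats("gather_rows", runtime_share=0.25, mfu_gap=0.70, severity="high"),
    KernelStats("tiny_reduce", runtime_share=0.02, mfu_gap=0.90, severity="high"),
]
ranked_names = [k.name for k in rank(kernels)]
print(ranked_names)
```

In this made-up example, tiny_reduce has the worst utilization gap but ranks last because it accounts for only 2% of runtime, which is the "small noisy offender" case the ranking is meant to push down.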
Turn an NCU CSV or training run directory into a measured, bottleneck-specific optimization prompt.
Profile CUDA kernels with live NCU, an existing CSV, or PTX. For PyTorch training workloads, collect runtime telemetry and turn the run directory into an LLM-ready brief. Then compare and benchmark the fix before you trust it.
pip install fournex
frx collect --name my-run -- python train.py
frx explain runs/my-run --out ./brief/
frx ncu-command -- ./my_binary
frx profile -- python train.py
frx profile --ncu ncu_report.csv
frx profile --ptx kernel.ptx
frx compare baseline.cu optimized.cu
frx bench bad.cu good.cu
frx explain report.csv --out ./brief/

Fournex keeps the path small: capture the evidence, name the mistake, apply the recommendation, and rerun the same command to verify the change.
Engineers do not need to interpret every counter by hand. The report explains why a metric is bad, which rule fired, what to change, and how to confirm the result.
We're not another rules engine. Every analyzed workload expands the mapping from trace patterns to validated fixes, turning usage into defensibility.
Better data → better policies → better outcomes → more workloads. That loop is the product.