Uncoalesced global loads
Catch memory access patterns that spray global-load transactions. Get coalescing actions and a re-profile command to confirm the fix.
Run frx profile on a workload, NCU CSV, or PTX file. Fournex names the bottleneck, shows the measured threshold it crossed, and ranks the fix to try first.
Fournex turns NCU counters, PTX structure, and runtime traces into the mistake that matters: the threshold crossed, the kernel it came from, and the fix to try first.
Separate L1-only locality problems from broader memory pressure. Prioritize shared-memory tiling or working-set reduction based on evidence.
Identify L2-only pressure without confusing it with L1 misses. Fournex points engineers toward reducing the active working set first.
Use measured occupancy and launch-resource data to spot register pressure that blocks residency, then reduce live ranges or split kernels.
Surface kernels that keep schedulers waiting even when occupancy looks acceptable. Separate ILP, block-size, and warp-eligibility issues.
Connect runtime traces to practical fixes for dataloaders, pinned memory, transfer pressure, and hidden synchronization points.
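To make the uncoalesced-load pattern concrete, here is a minimal sketch of why stride matters. It is a simplified model and not Fournex code: it assumes a 32-thread warp, 4-byte elements, 32-byte memory sectors, and an array aligned to a sector boundary, and the function name `sectors_touched` is made up for illustration.

```python
def sectors_touched(stride_elems, warp_size=32, elem_bytes=4, sector_bytes=32):
    """Count the distinct memory sectors one warp touches for a strided load.

    Thread t reads element t * stride_elems. The sizes are assumptions for
    this sketch, and the array is assumed to start on a sector boundary.
    """
    sectors = {
        (t * stride_elems * elem_bytes) // sector_bytes
        for t in range(warp_size)
    }
    return len(sectors)


# Compare a contiguous access with two strided ones.
for stride in (1, 2, 32):
    print(f"stride {stride:>2}: {sectors_touched(stride)} sectors")
```

Under these assumptions a contiguous load touches 4 sectors, a stride of 2 touches 8, and a stride of 32 touches 32, one per thread. The same 128 bytes of useful data cost eight times the memory traffic in the worst case, which is the kind of gap a coalescing fix is meant to close.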
Every finding comes with the exact signal that triggered it, the threshold it crossed, confidence level, actions to take, validation steps, and caveats before you change code.
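As a rough illustration of what such a finding could carry, the sketch below models one threshold rule. Everything in it is hypothetical: the `Finding` fields, the `sectors_per_request` signal name, and the threshold of 8.0 are placeholders chosen for this example, not Fournex's actual schema or rule values.

```python
from dataclasses import dataclass, field


@dataclass
class Finding:
    """Hypothetical record for one diagnosis (not Fournex's real schema)."""
    signal: str        # the metric that triggered the rule
    value: float       # the measured value
    threshold: float   # the limit the value crossed
    confidence: str    # e.g. "high", "medium", "low"
    actions: list = field(default_factory=list)
    validation: list = field(default_factory=list)
    caveats: list = field(default_factory=list)


def check_sectors_per_request(value, threshold=8.0):
    """Return a Finding if the measured value exceeds the threshold, else None.

    The default threshold is an arbitrary placeholder for this sketch.
    """
    if value <= threshold:
        return None
    return Finding(
        signal="sectors_per_request",
        value=value,
        threshold=threshold,
        confidence="high",
        actions=["make consecutive threads read consecutive addresses"],
        validation=["rerun the same profile command and compare the metric"],
        caveats=["the threshold is a placeholder, not a tuned value"],
    )


finding = check_sectors_per_request(16.0)
if finding is not None:
    print(f"{finding.signal}: {finding.value} > {finding.threshold}")
```

Keeping the measured value and the threshold side by side in the record is what lets a reviewer see why the rule fired without re-reading the raw counters.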
We don't ask engineers to trust blind automation. The product starts with evidence, then moves through ranked recommendations, benchmark validation, and guarded automation.
Start with a concrete diagnosis, not a wall of counters. Keep the full evidence trail when you need to defend the change in review.
Run live NCU, analyze an existing CSV, or inspect PTX without a GPU.
Each recommendation names the metric, threshold, rule, and validation step.
Separate L1 thrashing, L2 thrashing, bandwidth pressure, and uncoalesced loads.
Tie low occupancy to registers, shared memory, block size, or scheduler pressure.
Compare source, PTX, and NCU evidence side by side before trusting a change.
Use CSV and JSON modes in automation. No dashboard required for the first answer.
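Tying low occupancy to a specific resource comes down to finding which limit caps the number of resident blocks. The sketch below is a simplified model and not Fournex's implementation: the per-SM limits are assumed defaults that vary by GPU architecture, allocation granularity is ignored, and the name `occupancy_limits` is hypothetical.

```python
def occupancy_limits(threads_per_block, regs_per_thread, smem_per_block,
                     max_threads_per_sm=2048, max_blocks_per_sm=32,
                     regs_per_sm=65536, smem_per_sm=48 * 1024):
    """Return (resident blocks per SM, occupancy, limiting resource).

    The hardware limits are assumptions for illustration; real values depend
    on the architecture. Register and shared-memory allocation granularity
    is ignored.
    """
    limits = {
        "threads": max_threads_per_sm // threads_per_block,
        "blocks": max_blocks_per_sm,
        "registers": regs_per_sm // (regs_per_thread * threads_per_block),
    }
    if smem_per_block > 0:
        limits["shared_memory"] = smem_per_sm // smem_per_block

    # The resource that allows the fewest resident blocks is the limiter.
    limiter = min(limits, key=limits.get)
    blocks = limits[limiter]
    occupancy = blocks * threads_per_block / max_threads_per_sm
    return blocks, occupancy, limiter


# A 256-thread block using 128 registers per thread and no shared memory.
print(occupancy_limits(256, 128, 0))
```

With these assumed limits, the example kernel fits only 2 blocks per SM and the limiter is registers, giving 25% occupancy. Dropping to 32 registers per thread lifts the cap to 8 blocks, at which point the thread limit becomes the binding constraint.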
Use live NCU profiling when you are on the GPU box, pass an existing CSV when the report came from CI, or inspect PTX when you only have compiled output.
pip install fournex
frx profile -- python train.py
frx profile --ncu ncu_report.csv
frx profile --ptx kernel.ptx

Fournex keeps the path small: capture the evidence, name the mistake, apply the recommendation, and rerun the same command to verify the change.
Engineers do not need to interpret every counter by hand. The report explains why a metric is bad, which rule fired, what to change, and how to confirm the result.
We're not another rules engine. Every analyzed workload expands the mapping from trace patterns to validated fixes, turning usage into defensibility.
Better data → better policies → better outcomes → more workloads. That loop is the product.