This project asks whether low-cost machine learning can help navigate chip-design trade-offs in power, performance, and area. Every OpenROAD run is expensive in wall-clock time, so the goal is to learn from a small but carefully curated set of real configurations.
The final study covers 23 real OpenROAD configurations across four ASAP7 RTL designs — AES, Ibex, JPEG, and RISC-V 32I — and reports six pre-registered findings. The story is no longer “placeholder graphs” or “early promise.” Real netlists were extracted, evaluated, and used to train a graph surrogate that now reaches MAE = 0.153.
The framework matured through five phases: bootstrap, reinforcement learning, GAT plus Bayesian optimization, EDA calibration, and a final C+ probe. That sequence revealed role separation in RL, measured surrogate quality with leave-one-design-out validation, exposed ORFS version sensitivity, and confirmed 7/7 Pareto candidates against newer toolchain baselines.
All experiments were run on CPU-only AWS infrastructure for a total compute cost of $8.35. The public-facing lesson is simple: careful methodology, exact tool-version reporting, and sanity checks matter as much as the ML model itself.
The work is organized around pre-registered findings rather than ad-hoc tuning. Real OpenROAD outputs were collected on four design families, and evaluation used leave-one-design-out cross-validation so every reported score reflects generalization to an unseen family.
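As a sketch of that protocol, with `train_surrogate` and `mae_on` as hypothetical stand-ins for the project's actual training and scoring calls:

```python
# `runs` is a list of dicts with at least a "design" key; `train_surrogate`
# and `mae_on` are illustrative names, not the project's real API.
FAMILIES = ["aes", "ibex", "jpeg", "riscv32i"]

def lodo_scores(runs, train_surrogate, mae_on):
    scores = {}
    for held_out in FAMILIES:
        train = [r for r in runs if r["design"] != held_out]
        test = [r for r in runs if r["design"] == held_out]
        model = train_surrogate(train)
        scores[held_out] = mae_on(model, test)  # score reflects an unseen family
    return scores
```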
Two model roles were separated. XGBoost GBDT served as the reinforcement learning reward model because PPO needs stable scalar feedback. The Graph Attention Network served as the Bayesian optimization surrogate because it can exploit real netlist structure when searching for Pareto improvements.
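A minimal sketch of the reward side of that split, assuming an XGBoost regressor fitted offline on measured runs; the feature shapes and training data below are synthetic placeholders, not the project's real pipeline:

```python
import numpy as np
import xgboost as xgb

# Synthetic placeholder data; the real model was fitted on measured
# OpenROAD outputs, not random numbers.
X = np.random.rand(64, 5)   # config features (shape illustrative)
y = np.random.rand(64)      # scalar PPA score per configuration
reward_model = xgb.XGBRegressor(n_estimators=200, max_depth=4)
reward_model.fit(X, y)

def reward(config_features: np.ndarray) -> float:
    """Stable scalar feedback for PPO: one tree-ensemble prediction per step."""
    return float(reward_model.predict(config_features.reshape(1, -1))[0])

print(reward(np.random.rand(5)))
```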
Surrogate quality was compared across three models: the GAT with depth-based criticality edge weights (GAT+c(e)), the GAT without them (GAT-c(e)), and a simple size-plus-clock linear regression. Bayesian optimization used BoTorch qLogNEHVI for 50 trials per family, while PPO ran for 500,000 steps over five seeds to evaluate policy learning under the revised reward pipeline.
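For readers unfamiliar with qLogNEHVI, here is a minimal single-trial sketch. Stock BoTorch GPs stand in for the paper's GAT surrogate, GP hyperparameter fitting is omitted for brevity, and the bounds, reference point, and random data are illustrative only (assumes a recent BoTorch release that ships `qLogNoisyExpectedHypervolumeImprovement`):

```python
import torch
from botorch.models import SingleTaskGP
from botorch.models.model_list_gp_regression import ModelListGP
from botorch.acquisition.multi_objective.logei import (
    qLogNoisyExpectedHypervolumeImprovement,
)
from botorch.optim import optimize_acqf

# Toy stand-in data: 16 evaluated configs, 3 knobs, 3 objectives to
# maximize (negated power, negated area, fmax).
train_X = torch.rand(16, 3, dtype=torch.double)
train_Y = torch.randn(16, 3, dtype=torch.double)

model = ModelListGP(*[SingleTaskGP(train_X, train_Y[:, i:i + 1]) for i in range(3)])
acqf = qLogNoisyExpectedHypervolumeImprovement(
    model=model,
    ref_point=[-1.0, -1.0, 0.0],  # worst-acceptable corner, illustrative
    X_baseline=train_X,
    prune_baseline=True,
)
candidate, _ = optimize_acqf(
    acqf,
    bounds=torch.stack([torch.zeros(3), torch.ones(3)]).double(),
    q=1,
    num_restarts=8,
    raw_samples=128,
)
print(candidate)  # next configuration to run through the flow
```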
Four pre-flight sanity checks were treated as first-class methodology. They caught a broken GAT predictor, a v1/v2 data provenance mix-up, an ORFS version mismatch, and an SDC time-unit bug before those errors could turn into false findings.
Yes, but only when the models, toolchain, and sanity checks each play the right role. This study turns a small real-data campaign into six pre-registered findings about reinforcement learning (RL), graph surrogates, and reproducible machine learning for electronic design automation (ML-for-EDA).
“The final result is not just a better model. It is a workflow that separates roles, locks tool versions, and checks itself before making claims.”
Video walkthrough (self-hosted file: PPA.mp4): a concise tour of the full ML-driven PPA workflow, from setup and findings to validation and reproducibility checks.
Every chip designer is trading off the same three outcomes.
How much electricity does the chip use? Lower power means less heat, lower operating cost, and better battery life.
How fast can the chip run? Higher fmax means more work done per second.
How much silicon does it occupy? Smaller area usually means more chips per wafer.
“The hard part is that improving one dimension often hurts another. Our framework tries to map that trade-off surface before spending more EDA budget.”
These are the headline results from the final paper — updated for the real-data pipeline.
A monolithic GAT reward model left PPO at an interquartile-mean (IQM) return of -17.6 after 500K steps because the RL agent saw random graphs unrelated to its actions. Replacing the reward model with GBDT lifted the IQM to +67.5 (the IQM computation is sketched after these findings).
With real netlist graphs, GAT now beats the size-plus-clock linear regression baseline by 2.5× overall. The gain is strongest on structurally rich designs like JPEG.
Depth-based criticality edge weights barely move the score. In this ASAP7/ORFS pipeline, graph topology already exposes the node-count changes driven by tighter clocks and extra buffering.
AES grows from 5→9 frontier points and JPEG from 6→9 over 50 qLogNEHVI trials. Ibex and RISC-V stay flat because the manifest set already spans their reachable trade-off space.
ORFS v3.0 and 26Q1 stay within ±10% on most non-WNS metrics, but timing is fragile: JPEG shifts by -193.4% and flips from timing met to timing violated.
Every BO candidate remains Pareto-valid against 26Q1 baselines. JPEG candidate CAND-6 strictly dominates CTRL-2 on power, area, and fmax while still meeting timing.
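For concreteness, here is a minimal sketch of the interquartile mean used in the RL scores above; the per-seed returns in the example are made up for illustration, not the paper's raw data:

```python
import numpy as np

def iqm(scores: np.ndarray) -> float:
    """Interquartile mean: average of the middle 50% of the scores,
    robust to one catastrophic or one lucky seed."""
    lo, hi = np.percentile(scores, [25, 75])
    mid = scores[(scores >= lo) & (scores <= hi)]
    return float(mid.mean())

# Hypothetical returns across five PPO seeds:
print(iqm(np.array([61.0, 70.2, 66.8, 74.1, 65.4])))
```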
Pick a design, move the clock target, and compare an illustrative prediction from the selected model with the observed configuration.
The GAT reads real netlist graphs. The linear baseline only sees coarse size and clock features.
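For the shape of the graph side, a minimal PyTorch Geometric sketch in the spirit of the paper's GAT surrogate; the layer widths, head counts, and the appended scalar clock feature are illustrative choices, not the published architecture:

```python
import torch
from torch_geometric.nn import GATConv, global_mean_pool

class PPASurrogate(torch.nn.Module):
    """Two GAT layers over the netlist graph, pooled to one embedding,
    then a linear head over [embedding, clock target] -> (power, area, fmax)."""

    def __init__(self, node_dim: int, hidden: int = 64):
        super().__init__()
        self.g1 = GATConv(node_dim, hidden, heads=4, concat=False)
        self.g2 = GATConv(hidden, hidden, heads=4, concat=False)
        self.head = torch.nn.Linear(hidden + 1, 3)

    def forward(self, x, edge_index, batch, clock_target):
        h = torch.relu(self.g1(x, edge_index))
        h = torch.relu(self.g2(h, edge_index))
        g = global_mean_pool(h, batch)  # one vector per netlist
        return self.head(torch.cat([g, clock_target], dim=-1))
```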
The updated dataset spans four ASAP7 designs, real graph extraction, reinforcement-learning diagnostics, and Bayesian search over real tool outputs. That made it possible to separate a reward-model problem from a surrogate-model problem instead of treating “ML for EDA” as one monolithic question.
The family-level picture is heterogeneous. Ibex is small and regular enough that a linear baseline stays competitive. JPEG is structurally richer, so the graph model earns a large advantage. AES and RISC-V sit between those extremes and still benefit from graph structure once the graphs are real.
| Design | Graph scale | Best hold-out model (MAE) | BO outcome |
|---|---|---|---|
| AES | ~12K–18K nodes | GAT-c(e): 0.118 | Pareto front 5→9 |
| Ibex | ~9K–11K nodes | LR: 0.071 | No expansion |
| JPEG | ~48K–62K nodes | GAT-c(e): 0.272 | Pareto front 6→9 |
| RISC-V 32I | ~15K–21K nodes | GAT-c(e): 0.112 | No expansion |
Five phases, each with a distinct job, budget, and lesson.
The opening phase stabilized the experiment harness, moved the optimizer to qLogNEHVI, and created the manifest-driven workflow used by every later stage.
Result: a reproducible base layer before spending budget on expensive EDA runs.
Entropy ablation exposed that the monolithic GAT reward model was misaligned with the agent’s actions. Role separation emerged here: GBDT for reward, GAT for BO.
Result: IQM improved from -17.6 to +67.5 once the reward model was stabilized.
This phase extracted real netlists, ran criticality ablations, and evaluated surrogate-only Bayesian optimization on each family.
Result: GAT reached MAE 0.153 and expanded the Pareto set for AES and JPEG.
A small calibration experiment compared ORFS v3.0 and 26Q1 on four designs to test whether tool drift could masquerade as ML progress.
Result: timing proved version-sensitive enough that exact ORFS reporting became mandatory.
The final probe re-ran BO candidates against 26Q1 baselines to ask the hardest practical question: do the proposed points still matter under a newer toolchain?
Result: 7/7 candidates remained Pareto-valid, and one JPEG point strictly dominated its control.
Same visual system, updated stack.
Lower error is better. The old “GBDT beat GAT” story is no longer true on real graphs.
Mean hold-out MAE: GAT+c(e) 0.156 · GAT-c(e) 0.153 · Size+clock LR 0.376.
Ibex is the exception that proves the rule: its regular structure is simple enough that linear regression stays competitive.
The final paper covers four designs and a single dominant control knob. That is enough for mechanism, not for universal claims.
The cross-version probe uses two references per family, so the confirmed Pareto sets are meaningful but still sparse.
Uncertainty estimates overstate the observed variance by roughly 6–9×, so acquisition confidence should not be taken literally (one way to compute that factor is sketched after this list).
The C+ study validates candidates against 26Q1 baselines, but there is no separate clean-room rerun that reconstructs the original v3.0 environment end to end.
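One plausible reading of that 6–9× overconfidence figure, comparing the surrogate's predicted sigma to the empirical spread of its errors; the exact procedure in the paper may differ:

```python
import numpy as np

def overconfidence(pred_sigma: np.ndarray, residuals: np.ndarray) -> float:
    """Ratio of mean predicted sigma to the empirical spread of errors;
    a value around 6-9 means the error bars are roughly 6-9x too wide."""
    return float(pred_sigma.mean() / residuals.std())
```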
The key RL finding is no longer “it improved a bit.” The key finding is that the wrong model role blocked learning entirely.
| Setup | 500K IQM | What happened |
|---|---|---|
| Monolithic GAT reward model | -17.6 | PPO optimized against graph inputs unrelated to the action semantics. |
| Entropy ablation | diagnostic | Revealed that the issue was reward feedback quality, not lack of exploration. |
| GBDT reward model + GAT for BO | +67.5 | The policy learned a coherent progression along the clock-target dimension. |
Four pre-flight checks caught four distinct bugs. In the final paper, those checks are part of the contribution because each one blocked a plausible but wrong conclusion.
The entropy ablation showed the agent was not suffering from exploration collapse. It was being fed unstable reward estimates from a broken predictor.
Counterfactual: without this check, the project would have blamed PPO instead of the reward model.
A manual provenance check caught byte-identical DEF files where two data versions were supposed to differ, exposing a dataset lineage mistake.
Counterfactual: graph experiments would have mixed mislabeled samples and corrupted the surrogate comparison.
The calibration run showed that toolchain drift could move WNS enough to flip a timing verdict even when power and area looked stable.
Counterfactual: the study might have attributed tool-version drift to ML improvement.
An implausible-value screen caught a ps-vs-ns mismatch in SDC handling before it turned into a fake performance breakthrough.
Counterfactual: the final probe could have reported impossible fmax gains as if they were genuine wins.
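A minimal version of that screen; the 10 GHz ceiling is an illustrative threshold, not the project's exact rule:

```python
def screen_fmax(fmax_mhz: float, design: str) -> None:
    """Reject physically implausible fmax values before they enter results.
    A ps-vs-ns mix-up in the SDC inflates fmax by roughly 1000x, so even a
    loose ceiling catches it."""
    if not (1.0 <= fmax_mhz <= 10_000.0):
        raise ValueError(
            f"{design}: implausible fmax {fmax_mhz:.1f} MHz; check SDC time units"
        )
```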
A small calibration experiment compared ORFS v3.0 and 26Q1 on the same four designs; the table reports relative changes from v3.0 to 26Q1, with the conventions sketched below it. Power, area, and fmax were usually stable. WNS was not.
| Design | WNS Δ | Power Δ | Area Δ | fmax Δ |
|---|---|---|---|---|
| AES | -97.0% | +9.7% | -1.2% | +1.8% |
| Ibex | -155.2% | +3.4% | -0.8% | +2.1% |
| JPEG | -193.4% | +5.8% | -2.1% | +0.9% |
| RISC-V 32I | -89.3% | +4.2% | -1.5% | +1.4% |
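The deltas follow the usual relative-change convention, and a timing verdict flips when WNS crosses zero (standard STA convention: negative WNS means the design violates timing). A small sketch, with hypothetical WNS values for illustration:

```python
def pct_delta(v3: float, v26q1: float) -> float:
    """Relative change from ORFS v3.0 to 26Q1, in percent."""
    return 100.0 * (v26q1 - v3) / abs(v3)

def timing_met(wns: float) -> bool:
    """Standard STA convention: negative WNS means timing is violated."""
    return wns >= 0.0

# A verdict flip, with made-up WNS values (ps) for illustration:
wns_v3, wns_26q1 = 12.0, -11.2
print(timing_met(wns_v3) and not timing_met(wns_26q1))  # True -> flipped
```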
The C+ probe asked whether candidates discovered under one toolchain still matter under another. The answer was yes — across the board.
| Point | Power (W) | Area (µm²) | fmax (MHz) | WNS (ps) |
|---|---|---|---|---|
| CAND-6 | 0.0683 | 6386 | 1262.75 | 0.00 |
| CTRL-2 | 0.0701 | 6389 | 1217.56 | 0.00 |
The strongest result is a JPEG encoder configuration proposed by the BO surrogate, synthesized through OpenROAD Flow Scripts 26Q1 on the ASAP7 7nm predictive PDK. It strictly dominates a control configuration on power, area, and frequency simultaneously.
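The domination claim is easy to check mechanically. A minimal sketch using the table's numbers, with power and area minimized and fmax maximized:

```python
def dominates(a, b):
    """a, b = (power, area, fmax): a dominates b if it is no worse on all
    three axes (minimize power and area, maximize fmax) and strictly
    better on at least one."""
    no_worse = a[0] <= b[0] and a[1] <= b[1] and a[2] >= b[2]
    better = a[0] < b[0] or a[1] < b[1] or a[2] > b[2]
    return no_worse and better

cand_6 = (0.0683, 6386, 1262.75)
ctrl_2 = (0.0701, 6389, 1217.56)
print(dominates(cand_6, ctrl_2))  # True: strict domination on all three axes
```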