Can a Computer Learn to Design Better Chips?

Yes — but only when the models, toolchain, and sanity checks all play the right role. This study turns a small real-data campaign into six pre-registered findings about reinforcement learning (RL), graph surrogates, and reproducible machine learning for electronic design automation (ML-for-EDA).

23 configurations · 6 findings · $8.35 total cost · 7/7 Pareto confirmed

“The final result is not just a better model. It is a workflow that separates roles, locks tool versions, and checks itself before making claims.”

Project Explainer Video

Watch a concise walkthrough of the full ML-driven PPA workflow, from setup and findings to validation and reproducibility checks.

Self-hosted video file: PPA.mp4

What Are Power, Speed, and Size?

Every chip designer is trading off the same three outcomes.

Power

How much electricity does the chip use? Lower power means less heat, lower operating cost, and better battery life.

🏎️

Performance

How fast can the chip run? Higher fmax means more work done per second.

📐

Area

How much silicon does it occupy? Smaller area usually means more chips per wafer.

“The hard part is that improving one dimension often hurts another. Our framework tries to map that trade-off surface before spending more EDA budget.”

Six Pre-Registered Findings

These are the headline results from the final paper — updated for the real-data pipeline.

Finding 1 · Role Separation
+85.1 IQM

A monolithic GAT reward model left PPO at -17.6 after 500K steps because the RL agent saw random graphs unrelated to its actions. Replacing the reward model with GBDT lifted IQM to +67.5.

Finding 2 · GAT Surrogate Quality
MAE = 0.153

With real netlist graphs, the GAT's hold-out error is now 2.5× lower overall than the size-plus-clock linear regression baseline. The gain is strongest on structurally rich designs like JPEG.

Finding 3 · Criticality Null
ΔMAE = -0.003

Depth-based criticality edge weights barely move the score. In this ASAP7/ORFS pipeline, graph topology already exposes the node-count changes driven by tighter clocks and extra buffering.

Finding 4 · Surrogate-Only BO
Pareto Growth

AES grows from 5→9 frontier points and JPEG from 6→9 over 50 qLogNEHVI trials. Ibex and RISC-V stay flat because the manifest set already spans their reachable trade-off space.

Finding 5 · Version Sensitivity
WNS Sign Flips

ORFS v3.0 and 26Q1 stay within ±10% on most non-WNS metrics, but timing is fragile: JPEG shifts by -193.4% and flips from timing met to timing violated.

Finding 6 · C+ Probe
7/7 Confirmed

Every BO candidate remains Pareto-valid against 26Q1 baselines. JPEG candidate CAND-6 strictly dominates CTRL-2 on power, area, and fmax while still meeting timing.

Interactive Explorer

Pick a design, move the clock target, and compare an illustrative prediction from the selected model with the observed configuration.

Clock-period multiplier range: 0.85× (tight) to 3.0× (relaxed)

Lower multipliers tighten the clock target, which usually forces more buffering and larger cells. Higher multipliers relax the flow.

The GAT reads real netlist graphs. The linear baseline only sees coarse size and clock features.

Example readout for AES: predicted WNS +4.1 ps, predicted power 48.6 mW, predicted area 1,654 µm². Selected family hold-out: GAT MAE for AES = 0.118. Observed configuration: WNS -13.9 ps, 49.2 mW · 1,639 µm².

What the selected model leans on

Real netlist graphs are now in the loop: PyTorch Geometric extracted graphs from roughly 9K to 62K nodes per design, and the final GAT reaches 0.153 MAE overall. The caveat is no longer “placeholder graphs” — it is when structure helps, by how much, and for which family.
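To make the graph-surrogate step concrete, here is a minimal sketch of the idea, not the project's actual code: a synthesized netlist becomes a PyTorch Geometric Data object (cells as nodes, driver-to-sink connections as edges), and a small GAT regresses PPA-style targets from it. The node features, layer sizes, and three-target head are illustrative assumptions.

```python
# Minimal sketch (not the project's exact code): a netlist as a PyG graph,
# scored by a small GAT regressor. Features and head layout are assumptions.
import torch
from torch_geometric.data import Data
from torch_geometric.nn import GATConv, global_mean_pool


def netlist_to_graph(num_cells: int, edges: list[tuple[int, int]]) -> Data:
    """Build a PyG Data object from a cell count and driver->sink connections."""
    x = torch.randn(num_cells, 8)                     # placeholder per-cell features
    edge_index = torch.tensor(edges, dtype=torch.long).t().contiguous()
    return Data(x=x, edge_index=edge_index)


class PPAGat(torch.nn.Module):
    def __init__(self, in_dim=8, hidden=64, heads=4, targets=3):
        super().__init__()
        self.g1 = GATConv(in_dim, hidden, heads=heads)
        self.g2 = GATConv(hidden * heads, hidden, heads=1)
        self.head = torch.nn.Linear(hidden, targets)  # e.g. power, area, WNS

    def forward(self, data: Data) -> torch.Tensor:
        h = torch.relu(self.g1(data.x, data.edge_index))
        h = torch.relu(self.g2(h, data.edge_index))
        batch = torch.zeros(data.num_nodes, dtype=torch.long)  # single-graph batch
        return self.head(global_mean_pool(h, batch))


graph = netlist_to_graph(5, [(0, 1), (1, 2), (2, 3), (3, 4)])
print(PPAGat()(graph))  # one 3-value prediction for the whole graph
```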

What 23 Real Configurations Changed

The updated dataset spans four ASAP7 designs, real graph extraction, reinforcement-learning diagnostics, and Bayesian search over real tool outputs. That made it possible to separate a reward-model problem from a surrogate-model problem instead of treating “ML for EDA” as one monolithic question.

The family-level picture is heterogeneous. Ibex is small and regular enough that a linear baseline stays competitive. JPEG is structurally richer, so the graph model earns a large advantage. AES and RISC-V sit between those extremes and still benefit from graph structure once the graphs are real.

Design | Graph scale | Best hold-out model | BO outcome
AES | ~12K–18K nodes | GAT-c(e): 0.118 | Pareto front 5→9
Ibex | ~9K–11K nodes | LR: 0.071 | No expansion
JPEG | ~48K–62K nodes | GAT-c(e): 0.272 | Pareto front 6→9
RISC-V 32I | ~15K–21K nodes | GAT-c(e): 0.112 | No expansion

How We Built the Final Pipeline

Five phases, each with a distinct job, budget, and lesson.

Phase 0 · Bootstrap

qLogNEHVI migration and framework setup

The opening phase stabilized the experiment harness, moved the optimizer to qLogNEHVI, and created the manifest-driven workflow used by every later stage.

$0.85 · Framework bring-up · Optimizer swap

Result: a reproducible base layer before spending budget on expensive EDA runs.
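As a concrete illustration of the manifest-driven workflow, here is a hedged sketch of what one run record could look like. The field names and values are assumptions for illustration, not the project's actual schema; the point is that design, clock multiplier, seed, and exact ORFS version are pinned together for every run.

```python
# Minimal sketch of a manifest-driven run record (field names are assumptions,
# not the project's real schema): every EDA run is traceable to one pinned
# configuration, including the exact toolchain version.
import json

manifest_entry = {
    "run_id": "aes_mult1p40_seed0",   # hypothetical identifier
    "design": "aes",
    "pdk": "asap7",
    "clock_multiplier": 1.40,         # 0.85 (tight) .. 3.0 (relaxed)
    "seed": 0,
    "orfs_version": "v3.0",           # locked and reported with every result
    "optimizer": "qLogNEHVI",
}
print(json.dumps(manifest_entry, indent=2))
```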

Phase 1 · RL Training

500K PPO steps across five seeds

Entropy ablation exposed that the monolithic GAT reward model was misaligned with the agent’s actions. Role separation emerged here: GBDT for reward, GAT for BO.

$1.20 · 500K PPO · 5 seeds

Result: IQM improved from -17.6 to +67.5 once the reward model was stabilized.
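For readers unfamiliar with the metric, the interquartile mean (IQM) reported here is the mean of the middle 50% of scores, which makes multi-seed RL comparisons less sensitive to outlier runs. A minimal sketch with made-up per-seed scores (the study's real values are quoted above):

```python
# Minimal sketch of the interquartile mean (IQM): drop the bottom and top 25%
# of scores and average the rest. The example returns are illustrative only.
import numpy as np


def iqm(scores: np.ndarray) -> float:
    lo, hi = np.percentile(scores, [25, 75])
    middle = scores[(scores >= lo) & (scores <= hi)]
    return float(middle.mean())


per_seed_returns = np.array([58.0, 71.2, 64.9, 69.3, 75.5])  # 5 seeds, made up
print(f"IQM over seeds: {iqm(per_seed_returns):.1f}")
```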

Phase 2a · GAT + BO

Real graph extraction, GAT training, 50-trial BO

This phase extracted real netlists, ran criticality ablations, and evaluated surrogate-only Bayesian optimization on each family.

$2.95 · LODO-CV · 50 trials/family

Result: GAT reached MAE 0.153 and expanded the Pareto set for AES and JPEG.
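A minimal sketch of the leave-one-design-out (LODO) protocol behind the hold-out numbers: each design family is held out once, a surrogate is fit on the remaining families, and MAE is reported per held-out family. The toy records and the least-squares stand-in model are illustrative assumptions; the study fits a GAT at this step.

```python
# Minimal sketch of LODO cross-validation. fit_surrogate/predict are stand-ins
# for the real GAT training code; the records below are toy data.
import numpy as np

records = [  # (design, feature vector, normalized PPA target) -- illustrative
    ("aes",   np.array([0.9, 1.0]), 0.42), ("aes",   np.array([1.2, 0.8]), 0.35),
    ("ibex",  np.array([0.4, 1.1]), 0.20), ("ibex",  np.array([0.5, 0.9]), 0.18),
    ("jpeg",  np.array([2.3, 1.0]), 0.71), ("jpeg",  np.array([2.6, 0.7]), 0.64),
    ("riscv", np.array([1.1, 1.0]), 0.39), ("riscv", np.array([1.3, 0.8]), 0.33),
]


def fit_surrogate(train):
    # Stand-in model: least-squares linear fit (the study trains a GAT here).
    X = np.stack([f for _, f, _ in train])
    y = np.array([t for _, _, t in train])
    coef, *_ = np.linalg.lstsq(np.c_[X, np.ones(len(X))], y, rcond=None)
    return coef


def predict(coef, feats):
    return float(np.r_[feats, 1.0] @ coef)


for held_out in sorted({d for d, _, _ in records}):
    train = [r for r in records if r[0] != held_out]
    test = [r for r in records if r[0] == held_out]
    coef = fit_surrogate(train)
    mae = np.mean([abs(predict(coef, f) - t) for _, f, t in test])
    print(f"LODO fold {held_out}: MAE = {mae:.3f}")
```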

EDA Calibration

ORFS version sensitivity probe

A small calibration experiment compared ORFS v3.0 and 26Q1 on four designs to test whether tool drift could masquerade as ML progress.

$0.65 · 4 designs · Version deltas

Result: timing proved version-sensitive enough that exact ORFS reporting became mandatory.

C+ Probe

Cross-version generalization test

The final probe re-ran BO candidates against 26Q1 baselines to ask the hardest practical question: do the proposed points still matter under a newer toolchain?

$2.70 · 7 candidates · 3 controls

Result: 7/7 candidates remained Pareto-valid, and one JPEG point strictly dominated its control.

The Tools Behind the Results

Same visual system, updated stack.

🔧
ORFS
OpenROAD Flow Scripts in both v3.0 and 26Q1. Exact versioning became one of the findings, not a footnote.
📐
ASAP7 PDK
Predictive 7nm process kit used to keep the entire study open, reproducible, and comparable across runs.
🕸️
PyTorch Geometric
Real netlist graph extraction and GAT training on design graphs ranging from roughly 9K to 62K nodes.
🌲
XGBoost GBDT
Stable RL reward model on tabular features; the role-separated choice that unlocked PPO with final MAE = 0.153 on its reward feature set.
🧠
GAT + MC Dropout
Graph surrogate for Bayesian optimization with 50 stochastic forward passes per acquisition step.
🎯
BoTorch qLogNEHVI
Multi-objective Bayesian optimization over 50 trials per family to grow the Pareto frontier where opportunity remained.
🤖
Stable-Baselines3 PPO
500,000 RL training steps across five seeds, plus entropy ablations that exposed the need for reward-model role separation.
☁️
AWS EC2
c6i.4xlarge instances with 16 vCPU, 32GB RAM, no GPUs, at roughly $0.68 per hour for the full campaign.

How Accurate Were the Surrogates?

Lower error is better. The old “GBDT beat GAT” story is no longer true on real graphs.

Mean hold-out MAE: GAT+c(e) 0.156 · GAT-c(e) 0.153 · Size+clock LR 0.376.

Ibex is the exception that proves the rule: its regular structure is simple enough that linear regression stays competitive.

Why GAT wins now

  • It finally sees the real synthesized netlist instead of a placeholder graph.
  • Large, irregular families such as JPEG benefit from structural context that coarse baselines cannot represent.
  • Leave-one-design-out evaluation shows the gain is about generalization, not memorizing one family.
  • Role separation lets the graph model focus on BO where structural inductive bias is useful.

Why criticality barely matters here

  • Adding depth-based criticality weights changes mean MAE by only -0.003.
  • Within each family, tighter clocks trigger more buffers, so node count tracks clock target almost perfectly.
  • The graph topology already exposes that structural inflation without explicit criticality weights.
  • This is a pipeline-specific null result, not a dismissal of criticality in general.

Limitations of the Study

Scale is still small

The final paper covers four designs and a single dominant control knob. That is enough for mechanism, not for universal claims.

Reference sets are thin

The cross-version probe uses two references per family, so the confirmed Pareto sets are meaningful but still sparse.

MC Dropout is miscalibrated

Uncertainty estimates overstate the observed variance by roughly 6–9×, so acquisition confidence should not be taken literally.
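A minimal sketch of how such a calibration check can be run, with a tiny MLP standing in for the GAT surrogate: dropout stays active at inference, 50 stochastic passes give a predictive spread, and that spread is compared against the actual error. The model, data, and dropout rate are illustrative assumptions.

```python
# Minimal sketch of an MC Dropout calibration check: keep dropout on, run 50
# stochastic forward passes, compare predicted spread to observed error.
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(4, 32), torch.nn.ReLU(),
    torch.nn.Dropout(p=0.2), torch.nn.Linear(32, 1),
)
x, y_true = torch.randn(16, 4), torch.randn(16, 1)

model.train()  # dropout stays active at inference time (MC Dropout)
with torch.no_grad():
    samples = torch.stack([model(x) for _ in range(50)])  # 50 stochastic passes

pred_mean, pred_std = samples.mean(0), samples.std(0)
abs_err = (pred_mean - y_true).abs()

# A ratio far above 1 means the predicted uncertainty overstates the observed
# error (the study reports roughly 6-9x overdispersion).
print("std / |error| ratio:", (pred_std.mean() / abs_err.mean()).item())
```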

No true v3.0 verification rerun

The C+ study validates candidates against 26Q1 baselines, but there is no separate clean-room rerun that reconstructs the original v3.0 environment end to end.

🏆
Headline Result
7/7 Pareto Candidates Confirmed Across Versions
The strongest public-facing result is now the C+ probe, not the early 4.6× RL story. Every Bayesian-optimization candidate survived comparison against 26Q1 baselines, and JPEG candidate CAND-6 strictly dominates CTRL-2 on power, area, and fmax while still meeting timing.

Role Separation Turned RL Around

The key RL finding is no longer “it improved a bit.” The key finding is that the wrong model role blocked learning entirely.

When PPO used a monolithic GAT reward model, it was optimizing against randomly constructed graphs with no semantic link to the chosen action. Entropy ablation exposed that mismatch. Swapping the reward model to GBDT while keeping GAT for BO produced a coherent clock-target policy and a large positive IQM.
[Animated chart: IQM over the 500K PPO training steps, role-separated GBDT-reward run versus the monolithic GAT-reward run.]
Setup | 500K IQM | What happened
Monolithic GAT reward model | -17.6 | PPO optimized against graph inputs unrelated to the action semantics.
Entropy ablation | diagnostic | Revealed that the issue was reward feedback quality, not lack of exploration.
GBDT reward model + GAT for BO | +67.5 | The policy learned a coherent progression along the clock-target dimension.
Finding 1 is a role-separation result: the same project succeeds once RL and BO stop sharing a single model that was never appropriate for both jobs.
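To make the role separation concrete, here is a heavily simplified sketch, not the project's training code: a tabular GBDT (XGBoost) supplies the reward inside the RL environment, while the GAT stays outside the loop and is reserved for BO. The toy reward data, observation layout, action encoding, and step budget are all illustrative assumptions.

```python
# Minimal sketch of role separation: GBDT reward model inside the RL loop,
# GAT kept out of it. Everything below is a toy stand-in, not the real setup.
import gymnasium as gym
import numpy as np
import xgboost as xgb
from stable_baselines3 import PPO

# Fit a small GBDT "reward model" on synthetic (clock multiplier -> score) data.
rng = np.random.default_rng(0)
feats = rng.uniform(0.85, 3.0, size=(256, 1))                    # clock multiplier
score = -np.abs(feats[:, 0] - 1.4) + rng.normal(0, 0.05, 256)    # peak near 1.4x
reward_model = xgb.XGBRegressor(n_estimators=50, max_depth=3).fit(feats, score)


class ClockTargetEnv(gym.Env):
    """Agent nudges the clock multiplier; the GBDT scores the resulting config."""

    def __init__(self):
        self.observation_space = gym.spaces.Box(0.85, 3.0, shape=(1,), dtype=np.float32)
        self.action_space = gym.spaces.Box(-0.1, 0.1, shape=(1,), dtype=np.float32)

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.mult = np.array([2.0], dtype=np.float32)
        self.steps = 0
        return self.mult.copy(), {}

    def step(self, action):
        self.mult = np.clip(self.mult + action, 0.85, 3.0).astype(np.float32)
        reward = float(reward_model.predict(self.mult.reshape(1, -1))[0])
        self.steps += 1
        return self.mult.copy(), reward, False, self.steps >= 32, {}


agent = PPO("MlpPolicy", ClockTargetEnv(), verbose=0)
agent.learn(total_timesteps=2_000)   # the study ran 500K steps across 5 seeds
```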

Sanity Checks as Methodology

Four pre-flight checks caught four distinct bugs. In the final paper, those checks are part of the contribution because each one blocked a plausible but wrong conclusion.

Check 1 · RL entropy ablation

Broken GAT predictor

The entropy ablation showed the agent was not suffering from exploration collapse. It was being fed unstable reward estimates from a broken predictor.

Counterfactual: without this check, the project would have blamed PPO instead of the reward model.

Check 2 · Data provenance spot-check

v1/v2 data confusion

A manual provenance check caught byte-identical DEF files where two data versions were supposed to differ, exposing a dataset lineage mistake.

Counterfactual: graph experiments would have mixed mislabeled samples and corrupted the surrogate comparison.
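A minimal sketch of this kind of provenance check, assuming a hypothetical directory layout: hash every DEF file in both dataset versions and flag any pair that is byte-identical when the two versions are supposed to differ.

```python
# Minimal sketch of a dataset provenance spot-check via content hashing.
# The directory layout below is a hypothetical example.
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()


def check_lineage(v1_dir: str, v2_dir: str) -> None:
    for def_v1 in Path(v1_dir).glob("*.def"):
        def_v2 = Path(v2_dir) / def_v1.name
        if def_v2.exists() and sha256_of(def_v1) == sha256_of(def_v2):
            print(f"WARNING: {def_v1.name} is byte-identical across v1 and v2 "
                  "-- the two data versions were supposed to differ.")


check_lineage("data/v1/def", "data/v2/def")   # hypothetical paths
```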

Check 3 · EDA calibration probe

ORFS version mismatch

The calibration run showed that toolchain drift could move WNS enough to flip a timing verdict even when power and area looked stable.

Counterfactual: the study might have attributed tool-version drift to ML improvement.

Check 4 · C+ implausible-value test

SDC time-unit mismatch

An implausible-value screen caught a ps-vs-ns mismatch in SDC handling before it turned into a fake performance breakthrough.

Counterfactual: the final probe could have reported impossible fmax gains as if they were genuine wins.
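A minimal sketch of an implausible-value screen of this kind: derive fmax from the clock period and reject values outside a plausible band instead of reporting them. The unit handling and thresholds are illustrative assumptions, not the project's actual screen.

```python
# Minimal sketch of a ps-vs-ns sanity screen on derived fmax. Thresholds and
# example numbers are illustrative assumptions.
def screen_fmax(clock_period: float, unit: str, plausible_mhz=(50.0, 3000.0)) -> float:
    period_ns = clock_period / 1000.0 if unit == "ps" else clock_period
    fmax_mhz = 1000.0 / period_ns
    lo, hi = plausible_mhz
    if not lo <= fmax_mhz <= hi:
        raise ValueError(
            f"Implausible fmax {fmax_mhz:.0f} MHz from period {clock_period} {unit}; "
            "check SDC time units before reporting this as a gain."
        )
    return fmax_mhz


print(screen_fmax(1.1, "ns"))      # ~909 MHz, plausible
try:
    print(screen_fmax(1.1, "ps"))  # same number misread as ps -> flagged, not reported
except ValueError as err:
    print(err)
```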

Version Sensitivity

A small calibration experiment compared ORFS v3.0 and 26Q1 on the same four designs. Power, area, and fmax were usually stable. WNS was not.

Design | WNS | Power | Area | fmax
AES | -97.0% | +9.7% | -1.2% | +1.8%
Ibex | -155.2% | +3.4% | -0.8% | +2.1%
JPEG | -193.4% | +5.8% | -2.1% | +0.9%
RISC-V 32I | -89.3% | +4.2% | -1.5% | +1.4%
Timing can change sign across ORFS versions even when the other metrics look nearly unchanged. Any ML-for-EDA pipeline using ORFS must lock and report the exact version.
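A minimal sketch of how such version deltas can be computed from two metric reports, flagging WNS sign flips explicitly. The dictionary schema is an assumption, and the absolute example numbers are made up, chosen only so the percentages roughly echo the JPEG row above.

```python
# Minimal sketch of a per-metric version comparison between two ORFS runs.
# Key names and the example report values are illustrative assumptions.
def version_deltas(v3: dict, q26: dict) -> None:
    for metric in ("wns_ps", "power_mw", "area_um2", "fmax_mhz"):
        old, new = v3[metric], q26[metric]
        pct = 100.0 * (new - old) / abs(old) if old else float("inf")
        flag = "  << sign flip" if metric == "wns_ps" and (old >= 0) != (new >= 0) else ""
        print(f"{metric:>9}: {pct:+7.1f}%{flag}")


# Made-up JPEG-style example: non-WNS metrics move a few percent, WNS flips sign.
version_deltas(
    v3={"wns_ps": 12.0, "power_mw": 66.0, "area_um2": 6520.0, "fmax_mhz": 1230.0},
    q26={"wns_ps": -11.2, "power_mw": 69.8, "area_um2": 6383.0, "fmax_mhz": 1241.0},
)
```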

Cross-Version Generalization (C+ Probe)

The C+ probe asked whether candidates discovered under one toolchain still matter under another. The answer was yes — across the board.

7/7 candidates on the Pareto front · 3 controls checked · WNS 0.00 ps, timing met for all 10 points

JPEG CAND-6 versus CTRL-2 (lower power and area, higher fmax, timing still met)

Point | Power (W) | Area (µm²) | fmax (MHz) | WNS
CAND-6 | 0.0683 | 6386 | 1262.75 | 0.00 ps
CTRL-2 | 0.0701 | 6389 | 1217.56 | 0.00 ps

The strongest result is a JPEG encoder configuration proposed by the BO surrogate, synthesized through OpenROAD Flow Scripts 26Q1 on the ASAP7 7nm predictive PDK. It strictly dominates a control configuration on power, area, and frequency simultaneously.
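A minimal sketch of the strict-dominance check behind that claim: lower power, lower area, higher fmax, and timing still met. The dictionary layout is an assumption; the values mirror the table above.

```python
# Minimal sketch of a strict Pareto dominance check for the CAND-6 vs CTRL-2
# comparison. Field names are assumptions; values mirror the table above.
def strictly_dominates(cand: dict, ctrl: dict) -> bool:
    return (
        cand["power_w"] < ctrl["power_w"]
        and cand["area_um2"] < ctrl["area_um2"]
        and cand["fmax_mhz"] > ctrl["fmax_mhz"]
        and cand["wns_ps"] >= 0.0          # timing still met
    )


cand_6 = {"power_w": 0.0683, "area_um2": 6386, "fmax_mhz": 1262.75, "wns_ps": 0.0}
ctrl_2 = {"power_w": 0.0701, "area_um2": 6389, "fmax_mhz": 1217.56, "wns_ps": 0.0}
print(strictly_dominates(cand_6, ctrl_2))   # True
```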

Oguzhan Tekin
Machine Learning and Artificial Intelligence Researcher
https://github.com/oguzhan-canada/instrumented-ml-ppa
oguzhantekin@gmail.com