New PR with up to 2.57x performance improvements

(numbers based on my benchmarks; YMMV)

I’ve opened a PR with some perf enhancements which I think will make a significant difference in speeding up some aspects of synthesis. I’ve sync’d with main and all tests appear to be passing. Wondering if a discussion is necessary here to get it merged?

Thanks!

There’s a lot going on here, and I’d guess some of those changes are based on correct and helpful observations (in fact, I’m seeing something like an 8% perf improvement on ibex and jpeg synthesis), but the way this has been put together makes the code inherently untrustworthy, and the scope is far too large.

First, you can’t lean on the tests so hard that you just point an LLM with tests in the loop at a problem you have. Yosys is a piece of infrastructure that is used in a lot of different ways, and changes to how it works need sound argumentation for their correctness based on the semantics of the general case. Even far better testing than we have would not be proof enough of correctness to allow code to be added by machines without a human who has understood the prior design. Until we have general definitions of correctness with automated proof checking, it’s rarely enough to say “tests pass, so my functional change is good”.

It’s also a legacy codebase where almost everything is subpar compared to what we know could be our best work, so extending Yosys based on how the rest of Yosys looks and works is often the wrong thing to do, too. Don’t vibe code Yosys, don’t deploy agents to work on Yosys. I think we’ll have to codify this as project policy.

Second, separate changes to separate parts of the codebase get reviewed separately, and as you can see, in the case of this PR that would be a lot of parts. That’s just a result of the high-effort, low-refinement strategy you get when LLMs are tasked with solving a problem. You can do that to see what changes may help, but you have to refine the result yourself: recognize invalid changes and remove things that don’t actually help. I think it’s your responsibility as a contributor to be the first quality filter here. When you rely on maintainers to be that filter, you’re likely throwing a lot of trash at them alongside the improvements, which is not actually helpful. Even without a close look, it’s pretty apparent to me that there are commits here that are quite meaningless, which might point to issues with the way you’re evaluating their merit: measuring noise, for example, or maybe not even verifying that a change helps in any situation. There’s a cost to complexity that LLM users don’t see, and adding something that doesn’t pay its rent is a problem.

So, third, there are parts of your patch set that I’d like to address immediately:

  • Should yosys have an agents/ directory or AGENTS.md? No, we don’t want to encourage this kind of work without a responsible human in the loop. I’m sure nobody long-term involved with the project believes that would lead to any good outcomes for software quality. Additionally, it would erase a lot of good will among existing and potential contributors, many of whom have quite valid reasons to completely disassociate from LLM-friendly projects.
  • Should it contain a perf benchmarking script? No, there is an existing separate repo that you’d know about if you had talked to us before submitting a mega PR.
  • Should opt passes avoid reprocessing things needlessly? Yes; incrementality based on dirtiness tracking is part of a larger work in progress based on long internal discussions. I’m happy to discuss alternatives, including this one, in the comments of that PR or a separate Discourse thread.
  • Reverting opt_muxtree index density would be a good idea to discuss in the comments of the PR that added it

I’m probably going to post more feedback on select parts later, since I’m curious where the perf uplift I’m seeing is actually coming from. What exactly are the scripts and sources you’re using to observe the great improvements you’re referring to? You never say what “woc3k” is, or which yosys passes you’re actually testing.

Totally fair feedback, and I understand your concern around the use of LLMs. I’ll address some of the main issues:

  • I see a mistake in my .gitignore: it was broadly ignoring the generated benchmark files. Apologies, I’m adding those back in momentarily
  • The methodology here was to focus on simpler, less risky aspects of the code that had less of a chance of changing synthesis outcomes
  • The “script” is an agent model run in a perf bisect loop with a strategy to assess risk, take a baseline, and then compare the optimization against the baseline to determine whether it’s a regression. If you don’t want to encourage this, I’m happy to remove it
  • Understandable about the scope; my suspicion about the codebase was that there was probably a mass of cheap wins that, added up, would result in a substantial overall performance improvement
  • Beyond these small improvements, parallelizing synthesis for tasks like multi-module designs appeared to be a more involved, higher-risk job, and I purposely avoided taking on that risk

Would it make more sense to cherry-pick commits into a new branch? Or do you want to abandon it?

Here’s a summary of the focus areas where modifications were made (this is an LLM summary):

  1. Pass-level early-bail audits: passes/proc/proc_{dlatch,dff,arst,init,prune,rom,memwr}.cc, passes/opt/{opt_dff,opt_muxtree,opt_reduce,opt_share}.cc. Many passes built O(N) indices (SigMap, sigusers, mux_drivers, ConstEval seed) before discovering the work-set was empty. Five-line if (selection_empty) continue; checks at the top of execute() were repeatedly worth 30%+. proc_dlatch alone was wasting ~400ms/call on FF-free combinational designs. Drove most of mec30k/mec60k’s gains. (A minimal sketch of this pattern follows the list.)

  2. opt -full convergence early-exit: passes/opt/opt.cc. When the prior iteration’s heavy passes (opt_share, opt_dff) reported no work, the next iteration’s cleanup phase couldn’t find anything new either. Tracking that and breaking out of the convergence loop one iteration earlier yielded +28% on woc3k from a single commit (140d0e5b2). Most of woc3k’s 2.57× speedup is this one change. (Sketched after the list.)

  3. opt_expr inner-loop caching: passes/opt/opt_expr.cc. The pass has its own do-while convergence loop within a single execute() call (8+ iterations on picorv32_x4). Caching TopoSort.sorted + invert_map across iterations and skipping the entire body when a validated previous call did nothing yielded −9% on picorv32_x4 across four commits (3490357bb, 18ee10839, fb2c7bb77, f6dc99416).

  4. kernel/hashlib.h foundation: kernel/hashlib.h. Affects every dict/pool lookup in the codebase; gains compound through every pass. Cached entries.data() / hashtable.data() to defeat compiler aliasing pessimism (~7% geomean), killed runtime asserts the build wasn’t compiling out (~2-3%), added strategic prefetch on chain walks, and replaced % size with a libdivide-style fastmod (~8%). Spans commits cd5b1d025, 41bf34dfa, d9f51d71a, 06ab89826, 5cba01847 + correctness fix 7ab7606ff. (A fastmod sketch follows the list.)

  5. Per-Module generation counter: kernel/rtlil.{h,cc}. Added a uint64_t generation bumped on every structural mutation (Module::add/remove/connect/new_connections, Cell::setPort/unsetPort). Direct gain ~5% on picorv32_x4 via opt_expr’s cache validation, but the real value was unlocking safe cross-execute() no-op caches that were previously blocked by Module pointer reuse after design -reset. Foundation for items 6 and 7. (Modeled together with item 6 in a sketch after the list.)

  6. Generation-counter no-op caches retrofitted to 8 passes: passes/opt/{opt_dff,opt_share,opt_muxtree,opt_reduce,opt_clean,wreduce,peepopt}.cc, passes/techmap/alumacc.cc. Pattern: a static std::map<Module*, {generation, last_did_something}> at the top of execute(); skip when both validate. Catches the convergence-check iterations where the pass would have walked the whole module just to confirm nothing changed. Spans dc2a8b534, c3fe2d359, fab4ba6bb, 1ac6071c0. (See the generation-counter sketch after the list.)

  7. SigSpec hot paths: kernel/rtlil.{h,cc}. SigSpec::read_at skips the bounds check on the iterator-internal hot path; try_repack got a cheap pre-check to skip the O(N) walk on guaranteed misses; two rvalue ctors that were silently copying were fixed to actually move. Small per-call (~ns each) but called billions of times across the synth flow. Spans e224fa9ee, a549a4cb2, d39ae49ef, e03db34ae.

  8. AST genrtlil per-cell helpers: frontends/ast/genrtlil.cc. The four helpers (uniop2rtlil, widthExtend, binop2rtlil, mux2rtlil) called set_src_attr twice per call, each time formatting the same loc_string(). Hoisted it once and reused it via a new set_src_attr(AttrObject*, const std::string&) overload. Also dropped a wasted stringstream construction in RTLIL::encode_filename’s ASCII fast path. ~3-4% on mec60k (60006 helper calls per top module). Commit 8e1ef04d5. (A sketch of the hoist follows the list.)

  9. read_verilog frontend wins: frontends/verilog/preproc.cc, frontends/ast/genrtlil.cc. Switched the preprocessor’s output_code from std::list<std::string> to a single accumulator string (76a57374a); short-circuited the directive comparison chain for non-backtick tokens (c1116cd33); skipped empty-attributes loops in genrtlil helpers (671c71f38); collapsed 3 hash lookups into 1 in AST_IDENTIFIER (e0bfefbc0). Small individually, ~5% combined on read-bound benchmarks.
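
To make these concrete, here is a minimal self-contained sketch of item 1’s early-bail pattern. The Module type, the has_processes flag, and build_indices are toy stand-ins of mine, not the actual Yosys classes:

```cpp
#include <cstdio>
#include <map>
#include <string>
#include <vector>

// Toy stand-ins for the real RTLIL types; names are illustrative only.
struct Cell { std::string type; };
struct Module {
    std::vector<Cell> cells;
    bool has_processes = false;  // work-set indicator for a proc_* style pass
};

// Placeholder for the expensive up-front index construction
// (SigMap, sigusers, mux_drivers, ConstEval seed, ...).
static std::map<std::string, int> build_indices(const Module &m) {
    std::map<std::string, int> users;
    for (const auto &c : m.cells)
        users[c.type]++;  // O(N) walk over the whole module
    return users;
}

static void execute_pass(Module &m) {
    // Early bail: if the selection has nothing this pass can act on,
    // skip index construction entirely. This is the five-line check
    // the PR adds at the top of execute().
    if (!m.has_processes) {
        std::puts("nothing to do, skipping index build");
        return;
    }
    auto idx = build_indices(m);
    std::printf("processing %zu indexed cell types\n", idx.size());
}

int main() {
    Module ff_free;  // FF-free combinational design: bails immediately
    ff_free.cells = {{"$and"}, {"$or"}};
    execute_pass(ff_free);

    Module with_procs;
    with_procs.has_processes = true;
    with_procs.cells = {{"$dff"}, {"$and"}};
    execute_pass(with_procs);
}
```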
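
For item 2, a toy model of the convergence-loop shape; run_heavy_passes and run_cleanup_passes are hypothetical stand-ins, not the real opt.cc pass sequence:

```cpp
#include <cstdio>

// Stand-ins for pass invocations; each returns whether it changed anything.
static bool run_heavy_passes(int &work_left) {  // opt_share / opt_dff analog
    if (work_left > 0) { work_left--; return true; }
    return false;
}
static bool run_cleanup_passes() { return false; }  // opt_clean analog

int main() {
    int work_left = 3;  // pretend the heavy passes find work for a few rounds
    bool heavy_did_work = true;
    for (int iter = 1; ; iter++) {
        if (!heavy_did_work) {
            // The previous iteration's heavy passes found nothing, so this
            // iteration's cleanup phase cannot expose new work either; exit
            // the convergence loop one iteration earlier than a plain
            // "repeat until nothing changed" policy would.
            std::printf("converged, skipped iteration %d\n", iter);
            break;
        }
        heavy_did_work = run_heavy_passes(work_left);
        (void)run_cleanup_passes();
        std::printf("iteration %d: heavy passes %s\n", iter,
                    heavy_did_work ? "did work" : "found nothing");
    }
}
```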
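
For item 4’s fastmod, this is the general Lemire/Kaser technique for replacing n % d with multiplications when d is fixed until the next rehash. It assumes a compiler with unsigned __int128 (GCC/Clang) and is the published algorithm, not necessarily the exact code in the PR:

```cpp
#include <cassert>
#include <cstdint>

// Precomputed data for fast "n % d" with a runtime-constant divisor d,
// after Lemire & Kaser, "Faster remainders when the divisor is a constant".
struct FastMod {
    uint64_t M;  // ceil(2^64 / d), the magic multiplier
    uint32_t d;
    explicit FastMod(uint32_t divisor)
        : M(UINT64_MAX / divisor + 1), d(divisor) { assert(divisor > 1); }
    uint32_t mod(uint32_t n) const {
        uint64_t lowbits = M * n;  // fractional part of n/d, scaled by 2^64
        return (uint32_t)(((unsigned __int128)lowbits * d) >> 64);
    }
};

int main() {
    FastMod fm(1021);  // e.g. a hash table size, fixed until the next resize
    for (uint32_t n = 0; n < 2000000; n++)
        assert(fm.mod(n) == n % 1021);
    return 0;
}
```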
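
For items 5 and 6 together, a self-contained model of the generation counter plus the per-pass no-op cache. The toy Module and the cache keying here are illustrative; the real validation logic in the PR may differ:

```cpp
#include <cstdint>
#include <cstdio>
#include <map>
#include <utility>

// Toy module: every structural mutation bumps `generation`, mirroring the
// uint64_t counter item 5 adds to RTLIL::Module.
struct Module {
    uint64_t generation = 0;
    int cell_count = 0;
    void add_cell()    { cell_count++; generation++; }
    void remove_cell() { cell_count--; generation++; }
};

// Item 6's pattern: a per-pass cache mapping Module* to the generation seen
// on the last run and whether that run changed anything. If the module has
// not mutated since a run that did nothing, the whole pass body is a no-op.
static bool run_opt_pass(Module *mod) {
    static std::map<Module *, std::pair<uint64_t, bool>> cache;
    auto it = cache.find(mod);
    if (it != cache.end() && it->second.first == mod->generation &&
            !it->second.second) {
        std::puts("validated no-op, skipping module walk");
        return false;
    }
    bool did_something = (mod->cell_count > 3);  // placeholder for real work
    if (did_something)
        mod->remove_cell();  // note: the mutation bumps generation
    cache[mod] = {mod->generation, did_something};
    return did_something;
}

int main() {
    Module m;
    for (int i = 0; i < 4; i++) m.add_cell();
    run_opt_pass(&m);  // does work (removes a cell)
    run_opt_pass(&m);  // finds nothing; caches {generation, false}
    run_opt_pass(&m);  // validated no-op: skips immediately
    m.add_cell();      // mutation invalidates the cache
    run_opt_pass(&m);  // runs again
}
```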
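
And for item 8, the hoist is a plain common-subexpression move. This sketch uses stand-ins of mine for set_src_attr and loc_string; the real overload in the PR takes an AttrObject*:

```cpp
#include <cstdio>
#include <string>

// Stand-ins for the real helpers; formatting the location string is the
// expensive part that the original code repeated on every call.
static std::string loc_string(int line) {
    char buf[32];
    std::snprintf(buf, sizeof buf, "file.v:%d", line);
    return std::string(buf);
}
struct Obj { std::string src; };
static void set_src_attr(Obj *o, const std::string &src) { o->src = src; }

// Before: each set_src_attr call formatted loc_string() itself, so setting
// the attribute on two objects meant two identical formatting passes.
// After: format once, pass the finished string down.
static void make_cell_and_wire(Obj *cell, Obj *wire, int line) {
    const std::string src = loc_string(line);  // hoisted, formatted once
    set_src_attr(cell, src);
    set_src_attr(wire, src);  // reuse instead of re-formatting
}

int main() {
    Obj cell, wire;
    make_cell_and_wire(&cell, &wire, 42);
    std::printf("%s %s\n", cell.src.c_str(), wire.src.c_str());
}
```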

A fresh clone of the branch can now run bash tests/perf/run_loop.sh my-label and reproduce the benchmark setup end-to-end.

Also, I meant to address the reversion of “opt: Remove O(n²) opt routines across the codebase for pmux” (YosysHQ/yosys PR #5809): it appeared to have introduced a performance regression, hence the revert. It was a simple conflict with the optimizations my branch had made, and can probably be re-merged by hand (want me to do this?)