Fixing `IdString` refcounting

IdString refcounting adds overhead and code bloat, and would add even more overhead if we tried to make it safe for multithreading, so it would be good to remove most or all of it.

Instead of refcounting, we could try keeping a global list of all Designs and make the opt_clean pass GC dead IdStrings. We would scan all Designs to identify live IdStrings, also treat all IdStrings generated by the ID() macro and ID:: constants as always live, and delete all the dead ones.

That runs into problems with compatibility with third-party plugins or API users that keep IdStrings in their own data structures, separate from any Design. If only ID() and ID:: constants appear there, that’s fine, but in general that might not be true. We could make more constructors such as IdString(const char *) create immortal IdStrings by default, but that doesn’t help if third-party code grabs arbitrary IdStrings out of a design and stores them somewhere. That seems like a sticking point for this approach and for the similar approach jix suggested.

For now I’ll assume we have to stay compatible with that kind of behavior. Then I think we have to continue with IdString refcounting and reduce refcount churn in other ways. An easy way to do that is to convert IdString parameters to const IdString & in many places. Additionally, we could introduce a new type for non-owned IdStrings, say IdRef, that doesn’t require an underlying IdString to refer to, and proclaim that third-party code shall not store IdRefs between passes. We would no longer treat a string’s refcount dropping to zero as a cue to delete it immediately. Instead we would use the opt_clean GC approach, keeping all strings with nonzero IdString refcounts alive. Then we could change (probably the vast majority of) IdStrings in Yosys to IdRef, including new APIs for looking up and creating IDs that return IdRef. We could keep RTLIL cell names etc. as refcounted IdStrings so we don’t have to trace them explicitly.

How does that sound?

I brought this up during the Dev JF to get more input besides what we’ve already been discussing in the other thread. In particular I was curious whether we’d be able to go with the GC-based approach right away, skipping the introduction of a separate IdRef, assuming we first make NEW_ID cheap enough that churning through those within a pass doesn’t become an issue, or that we provide a way to trigger a GC from within passes where that becomes a problem.

The important points are:

  • Everyone would be on board with the general IdRef approach. (We didn’t discuss just using more const IdString & but that has been uncontroversial and merged before, here and there.)
  • Everyone would be on board with the GC-based approach if we expect only very few things to break and have good, easy-to-apply recommendations for how to fix the fallout.
  • No one was aware of any third-party plugins storing IdStrings outside of the design between passes. For the existing flows we know about (including flows for third-party plugins) where identifiers do survive outside of the design across passes, that’s done via TCL scripts, where those identifiers are stored as plain strings, not IdStrings. There might be exceptions, but they are likely rare enough that we could require changes to separately mark those IDs as in-use.
  • There was some concern that the GC approach, without either a cheap NEW_ID or ways to trigger GC at safe points within passes, would lead to space leaks in longer-running passes. (I think we want a cheap NEW_ID in any case, so that shouldn’t prevent us from going with this.)

Afterwards it also occurred to me that we didn’t think of the Python bindings, which are currently being rewritten and which we haven’t supported as part of TabbyCAD so far. Since Python scripts are free to do whatever, they can also keep IdStrings around outside of passes. In the context of the rewrite, we had already discussed that the Python bindings would need to perform some explicit in-use marking to fix current bugs and rough edges around object lifetimes, so I’m pretty sure extending that requirement to Python-referenced IdStrings is not an issue.

With that, my suggestion would be to aim directly for the GC approach where we replace IdString with IdRef, getting rid of reference counts entirely. (We would still call it IdString, but I’ll keep using IdRef below for clarity while discussing the alternatives.) To make sure that the Python bindings, and the rare exceptional plugin we might have missed, can easily be updated to keep working, we could still have a PersistentIdString that maintains a reference count in a separate global dict<IdRef, int> persistent_ref_counts (or a shared_mutex-protected variant of that), which the explicit GC would treat as an additional GC root alongside the list of designs. Since this would only be used for the Python bindings and for storing IdStrings across pass invocations, it’s not an issue if it comes with significant overhead compared to the current IdString reference counting.

If you’d prefer to keep reference counting for now, and just want to add the minimum necessary to have IdRef as an alternative for use within passes that are updated to use multi-threading, I’m also not opposed to that, but I don’t think we should go ahead and change the vast majority of IdStrings to IdRefs if there is a good chance that we can make them equivalent when we move to the primarily GC based approach in the future.

Sounds good. I can try it.

About potential space leaks in long-running passes … if we want to keep the functionality of NEW_ID_SUFFIX in some form, i.e. any kind of metadata attached to a NEW_ID, then we’ll still have space leaks unless we GC during a pass somehow. Is that a problem?

I see NEW_ID_SUFFIX is used very little compared to NEW_ID, so we can just leave that as-is, I guess.

Yeah, it’s fine to not worry about that for now.

I implemented this in “Implement garbage collection of `IdString`s” (YosysHQ/yosys PR #5417).

I didn’t check its performance properly at first, my bad, but I’ve had a closer look now and fixed the obvious performance problems. Here’s the gist:

Just switching IdString to GC and keeping NEW_ID as-is, the “read_rtlil jpeg.rtlil; synth” test shows a small but definite speedup:

main:

```
Benchmark 1: ./yosys -dp "read_rtlil /home/roc/Downloads/jpeg.synth.il; synth"
  Time (mean ± σ):      2.802 s ±  0.007 s    [User: 2.649 s, System: 0.142 s]
  Range (min … max):    2.795 s …  2.816 s    10 runs
```

with IdString GC:

```
Benchmark 1: ./yosys -dp "read_rtlil /home/roc/Downloads/jpeg.synth.il; synth"
  Time (mean ± σ):      2.772 s ±  0.006 s    [User: 2.616 s, System: 0.146 s]
  Range (min … max):    2.764 s …  2.785 s    10 runs
```

Doing a version of what we talked about with NEW_ID, where we use one small hashtable entry per NEW_ID, got a bit complicated, because we now have two possible representations of the IdString text and NEW_ID representation is not a contiguous string. That imposes overhead on a lot of operations, especially comparing IdStrings by name. With some work I was able to reduce the overhead to a low level, but there will inevitably be some. In this test that eats up the gains and gets us back to roughly where we were:

```
Benchmark 1: ./yosys -dp "read_rtlil /home/roc/Downloads/jpeg.synth.il; synth"
  Time (mean ± σ):      2.805 s ±  0.010 s    [User: 2.651 s, System: 0.147 s]
  Range (min … max):    2.794 s …  2.826 s    10 runs
```

(Some of the slowdown is in StatPass, which populates a bunch of std::maps ordered by IdString name, indexed e.g. by cell type. It would actually be simple and much more efficient to make those maps unordered and indexed by IdString index, then sort the keys by IdString name just for printing. I could add that to the PR, but that feels like “cheating” in some sense.)

One issue I ran into when putting the PR together was deciding exactly when to GC. I decided to make opt_clean request a GC, and then actually GC when we reach the start of the next pass that’s not nested under another pass (other than a ScriptPass). I had to do something like that because it turned out that some pass(es) call opt_clean while holding IdStrings outside the design, which blows things up if we GC in there. In general GC scheduling could impact performance, but in this benchmark it doesn’t really matter: we do 12 GCs, and the total time comes to around 10 ms, which is not a problem.

So we have to decide whether to take the NEW_ID “optimization” or not. I’m happy to go either way. Or if anyone has a better idea for implementing NEW_ID that reduces its potential for space leakage, I’m happy to hear it. Arguments in favour of taking the NEW_ID optimization as-is: it reduces the space cost of NEW_ID to a single hashtable (int, const std::string *) entry (divided by the load factor), and it will likely be much more efficient when extended to create IdStrings from multiple threads. Arguments against: it adds overhead and code complexity now.

I guess there’s also a potential problem where if we merge IdString GC now without the NEW_ID representation change, merging the NEW_ID representation change later would be a performance regression, potentially blocking work on multithreading :slightly_frowning_face:… the perils of exploring the performance landscape monotonically!

Sorry to leave you hanging!

For Yosys, memory constraints are considered more severe than performance degradations. Regressions aren’t categorical dealbreakers; I sometimes file them as “requested changes” as a simple way of saying “we need to know what’s up and address it”, not as dealbreakers. If we need this either with or without NEW_ID changes that slightly regress perf, I’d accept it with those changes. Changes to memory consumption should always be checked alongside perf.

We are also lacking sharable ASIC-focused benchmarking setups, and the jpeg encoder is only one test case. In the past I’ve used jpeg and ibex from ORFS for synth results, but we need a lot more than that, and the setup isn’t in a public repo.

I see that monotonic improvements are something we want on main, because any commit should be release-ready. I’d be happy to switch to a workflow with PRs into non-monotonic story branches; do you think that would work out better?

What’s the preferred way to measure memory consumption?

I’d prefer to avoid introducing a different workflow if possible.

The results in the PR currently show only a very small runtime regression vs main, within one standard deviation. I have another PR pending that shows very significant performance improvements. How would you feel about taking the latter PR first and letting me spend a small fraction of those gains on this PR?

The memory consumption reported by Yosys should be reliable: the `End of script. Logfile hash: da39a3ee5e, CPU: user 0.00s system 0.00s, MEM: 29.76 MB peak` line it prints at exit.

> taking the latter PR first

Sounds good; I’ve just been working through my backlog chronologically.

Ok, here are the numbers reported by Yosys:

main:

```
End of script. Logfile hash: 3b1cd56f8e, CPU: user 13.69s system 0.69s, MEM: 676.72 MB peak
```

idstring-gc:

```
End of script. Logfile hash: a41e14300d, CPU: user 13.52s system 0.77s, MEM: 673.59 MB peak
```

sigspec-onechunk:

```
End of script. Logfile hash: 8410f89b41, CPU: user 12.44s system 0.73s, MEM: 668.20 MB peak
```

Which looks good…