Parallel OptMergePass implementation

OptMerge gets run a lot, is expensive on some of our large circuits, but is fairly simple, so I thought it would be a good place to start experimenting with parallelism. I got it working, and it wasn’t too hard, and I’m here to explain the approach and get feedback. The code is here.

Results

This is on a 24-CPU cloud VM with 48 hyperthreads, on a single large circuit. The “1 thread” number is actually a 1.2x speedup over the Yosys baseline (i.e. the new code is faster even when running single-threaded). The baseline runtime is about 100s, so a 20x speedup is significant. On this circuit the OptMerge pass removes about 10% of cells over 15 iterations.

The algorithm

Currently OptMerge builds a hashtable of (almost) all cells, treating cells as equal if they have the same type, parameters and input signals, and removes any cells that match cells already added to the hashtable. When we remove a cell in favour of an existing cell, the outputs are merged together, creating more removal opportunities, so we repeat the whole process until no cells are removed.

We want to do something similar in parallel. The basic approach is, for each iteration:

  • Compute and store the hashes of all relevant cells, in parallel.
  • Given N = the number of threads, partition the cells into N buckets by hash value: bucket k contains the cells whose hash value mod N = k.
  • For each bucket in parallel, build a hashtable of that bucket’s cells (using the precomputed hashes) and record the duplicates found.
  • On the main thread, process the list of duplicates to remove cells.

For efficiency we fuse the second step into the first step by having the parallel threads write the cells into buckets directly. To avoid synchronization overhead, we divide each bucket into N shards. Each thread j adds a cell to bucket k by writing to shard j of bucket k — no synchronization required. In the next phase, thread k builds the hashtable for bucket k by iterating over all shards of the bucket.
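
Roughly, the per-iteration structure looks like this. This is only a sketch with placeholder types; hash_fn and equal_fn stand in for OptMerge’s existing hash and comparison of (type, parameters, inputs), and Cell for a cell pointer.

#include <cstddef>
#include <cstdint>
#include <thread>
#include <unordered_map>
#include <utility>
#include <vector>

// `Cell` stands in for a pointer to an RTLIL cell.
using Cell = const void *;

template <typename HashFn, typename EqualFn>
std::vector<std::pair<Cell, Cell>> find_duplicates(
        const std::vector<Cell> &cells, int nthreads, HashFn hash_fn, EqualFn equal_fn)
{
    struct Entry { Cell cell; uint64_t hash; };

    // buckets[k][j] is shard j of bucket k: only thread j ever writes shard j,
    // so no synchronization is needed while scattering.
    std::vector<std::vector<std::vector<Entry>>> buckets(
        nthreads, std::vector<std::vector<Entry>>(nthreads));

    // Phase 1 (parallel): hash every cell and drop it into bucket (hash % nthreads).
    std::vector<std::thread> workers;
    for (int j = 0; j < nthreads; j++)
        workers.emplace_back([&, j] {
            for (size_t i = j; i < cells.size(); i += nthreads) {
                uint64_t h = hash_fn(cells[i]);                // read-only RTLIL access
                buckets[h % nthreads][j].push_back({cells[i], h});
            }
        });
    for (auto &t : workers) t.join();                          // barrier
    workers.clear();

    // Phase 2 (parallel): thread k builds the hashtable for bucket k only.
    std::vector<std::vector<std::pair<Cell, Cell>>> found(nthreads);
    for (int k = 0; k < nthreads; k++)
        workers.emplace_back([&, k] {
            std::unordered_multimap<uint64_t, Cell> table;
            for (auto &shard : buckets[k])
                for (auto &e : shard) {
                    bool dup = false;
                    auto range = table.equal_range(e.hash);
                    for (auto it = range.first; it != range.second; ++it)
                        if (equal_fn(e.cell, it->second)) {    // full comparison on hash hit
                            found[k].push_back({e.cell, it->second});
                            dup = true;
                            break;
                        }
                    if (!dup)
                        table.emplace(e.hash, e.cell);
                }
        });
    for (auto &t : workers) t.join();                          // barrier

    // Phase 3 (main thread): collect the duplicates; the caller sorts them into
    // the sequential order before merging outputs and removing cells.
    std::vector<std::pair<Cell, Cell>> result;
    for (auto &f : found)
        result.insert(result.end(), f.begin(), f.end());
    return result;
}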

Some nice properties:

  • Only the main thread ever writes to RTLIL, and never while other threads are running. During the parallel phases, all access to RTLIL is read-only.
  • No synchronization required except for barriers at the end of each parallel phase.
  • When processing the list of duplicates, we sort them into the exact order the current OptMerge code would have found them, guaranteeing that the parallel code produces exactly the same results as the current code (and therefore is deterministic).

TRTLIL

The real challenge is providing read-only thread-safe access to RTLIL in a maintainable way. Various parts of RTLIL are not safe for multithreaded access, even if you only use const methods. Some problems I know about that would be really hard to fix without large-scale code updates or performance penalties:

  • SigSpec can switch between two internal representations, e.g. when calling operator[] const.
  • SigMap lookups use union-find with internal pointer updates for amortized efficiency.
  • IdString copying and deletion can bump refcounts.

These are core datatypes and trying to avoid accidentally triggering these behaviours would be error-prone. Instead I have created a layer of wrappers providing thread-safe read-only access to RTLIL, which I’m calling TRTLIL. Parallel threads are required to stick to TRTLIL to avoid thread safety issues.

TRTLIL contains two kinds of types. “R” types are just read-only wrappers around underlying RTLIL types. E.g. RIdString wraps an IdString. Unlike the regular IdString, copying an RIdString does not bump refcounts. In fact there is no way to change IdString refcounts in parallel threads — IdStrings can’t be created or destroyed. We also have RCell, RFfInitVals, RCellTypes, and RModule.
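
To make that concrete, here is roughly the shape of such a wrapper. This is only illustrative; the real RIdString differs in detail. It stores only the IdString’s integer index, captured on the main thread, so copying, comparing and destroying it in a worker thread never touches the global refcounts.

struct RIdString {
    int index = 0;                 // the underlying IdString index, no ownership

    bool operator==(RIdString other) const { return index == other.index; }
    bool operator!=(RIdString other) const { return index != other.index; }
    bool operator<(RIdString other) const { return index < other.index; }
    unsigned int hash() const { return (unsigned int)index; }
};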

“T” types reimplement existing RTLIL types in a thread-safe way. We have TSigBit, TSigSpec and TSigMap. TSigSpec avoids the problems of SigSpec by only having the “vector of TSigBit” representation. For TSigMap, we can’t use union-find, so instead we represent the equivalence classes of signals as explicit sets of TSigBits with eager union. With some care we ensure that building a TSigMap is O(N log N) in the number of union operations.
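
As an illustration of the eager-union idea (only a sketch; bits are shown as packed 64-bit keys rather than real TSigBits to keep it short): every class is an explicit set of bits, and merging always moves the smaller class into the larger one, so N unions do at most O(N log N) moves because each bit can change class at most log2(N) times. Lookups are const and touch no shared mutable state, so they are safe from parallel threads.

#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

struct TSigMapSketch {
    std::unordered_map<uint64_t, size_t> cls_of;   // bit -> index into classes
    std::vector<std::vector<uint64_t>> classes;    // classes[i].front() is the representative

    void add(uint64_t a, uint64_t b) {             // record "a and b are connected"
        size_t ca = class_of(a), cb = class_of(b);
        if (ca == cb)
            return;
        if (classes[ca].size() < classes[cb].size())
            std::swap(ca, cb);
        for (uint64_t bit : classes[cb]) {         // eager union: move every member now
            classes[ca].push_back(bit);
            cls_of[bit] = ca;
        }
        classes[cb].clear();
    }

    uint64_t repr(uint64_t bit) const {            // read-only, thread-safe lookup
        auto it = cls_of.find(bit);
        return it == cls_of.end() ? bit : classes[it->second].front();
    }

private:
    size_t class_of(uint64_t bit) {
        auto it = cls_of.find(bit);
        if (it != cls_of.end())
            return it->second;
        classes.push_back({bit});
        cls_of.emplace(bit, classes.size() - 1);
        return classes.size() - 1;
    }
};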

TRTLIL objects don’t expose direct access to RTLIL values, so it’s not possible for a parallel thread to accidentally start accessing the underlying RTLIL in non-thread-safe ways.

We can’t provide direct access to SigSpecs for TRTLIL users, and we don’t want a fundamental operation like cell getPort to make expensive copies, so RCell doesn’t expose getPort at all currently. It turns out that OptMerge never uses the raw signal of getPort directly — it always immediately applies a SigMap or FfInitVals. So RCell provides port accessors that also apply a TSigMap or RFfInitVals. These accessors internally iterate over the SigSpec and apply the mapping to build a TSigSpec, so there’s no extra overhead compared to the current code and no thread-unsafe data is leaked to callers.

Risks

TRTLIL has to know that the const RTLIL methods it calls are thread-safe. In most cases that’s easy to verify by hand, but the real danger is that in the future someone changes RTLIL code to invalidate that analysis, e.g. by adding a call to a const SigSpec method somewhere that changes the representation, or by accidentally copying an IdString, in code reachable by TRTLIL. I don’t have a great solution to this problem. One thing we could do is have a global flag that we set during parallel read-only passes, and the thread-unsafe const RTLIL methods would abort if that flag is set.
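
A sketch of what that guard could look like (hypothetical names, not existing Yosys API): the flag is set by the pass around its parallel read-only sections, and a check is added to the const RTLIL methods that are known not to be thread-safe.

#include <atomic>
#include <cassert>

inline std::atomic<bool> parallel_readonly_phase{false};

inline void assert_not_in_parallel_phase()
{
    // Cheap relaxed load; fires only if a thread-unsafe method is reached while
    // worker threads may be running.
    assert(!parallel_readonly_phase.load(std::memory_order_relaxed));
}

// e.g. at the top of a repacking SigSpec accessor or an IdString copy constructor:
//   assert_not_in_parallel_phase();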

Looking ahead

OptMerge is particularly simple. Next I plan to look at OptClean and OptExpr, which are also expensive and frequently executed, but a lot more complicated. I don’t know yet how well this TRTLIL approach (phases of read-only parallel execution interleaved with main-thread work) will extend to those passes. It might be reasonable to hold off on upstreaming this parallel OptMerge approach until we see how well it extends to those passes.

For ensuring the thread safety of the const RTLIL methods, could we potentially create unit tests that access the method in question from multiple threads, and run the tests under TSAN?

TSAN could then trigger a failure, providing us at least some protection against future changes.

This is a first round of high-level feedback without having taken a closer look at the code. I have some initial feedback on the general approach, on using this to parallelize the passes that run during opt -fast specifically, and on managing the risks of accidentally breaking the thread-safety of the read-only views.

The General Approach

I think the general approach for enabling parallel read access to RTLIL makes a lot of sense. The idea of providing wrappers for a thread-safe view came up on the parallelizing ABC PR before, so that’s also very much what I was expecting here. I hadn’t considered additional thread-safe variants of the RTLIL types, but it makes sense that a worker thread should be able to perform some operations locally on the data it has read, without the need to reinvent the world in every pass.

Pending a review of the actual implementation details and addressing anything that comes up during that, I think we can merge this as soon as we have a first pass that makes effective use of it, but I also wouldn’t mind waiting until this has had the chance to prove itself with multiple passes.

Parallelizing opt -fast

Your results show a very nice speedup, and enabling that in the near term is a sufficiently good argument to work towards merging it. What I don’t think, though, is that the way this parallelizes opt_merge is the way forward in the long term. The same goes for any way of parallelizing opt_expr that is based on how it currently schedules constant propagation and the simplification of cells based on what was propagated.

To put it (maybe a bit too) bluntly, I think this is mostly parallelizing redundant work that we shouldn’t be doing in the first place, and that avoiding that will get us at least as much speedup without any parallelization at all. It also would mean that the current approach to parallelizing opt_merge would no longer apply. At the same time, I also think that it will take a lot more work to get us there, and I don’t want a hypothetical[1] future single-threaded speedup to get in the way of providing an actual speedup today (or in the near future).

This means I am very much leaning towards us merging something like this in the short term, but I expect that, eventually, we would want to replace this with something that actually avoids and not just parallelizes redundant work.

Not Breaking the “R”-Wrappers

I view both the “R”-wrappers and the “T”-alternatives as a temporary stop-gap solution. I am convinced that the right thing to do is evolving core RTLIL itself to be thread-safe for read only uses and to allow concurrent modifications of different modules, as long as the modified module stays exclusive to a worker and no one is modifying shared modules.

This is not trivial, but I suspect that everything we currently do that isn’t thread-safe is also a performance problem, or at least somewhat of a performance foot-gun. I don’t mean that we can just keep everything as it is but add some thread-safety at no cost, but rather that for everything where we rely on inherently non-thread-safe approaches, there is an alternative with advantages that go beyond just the thread-safety aspect.

With that I would expect that most changes to RTLIL would reduce friction for the “R”-wrappers and that the “T”-alternatives could be a useful testbed to explore some of the ideas for evolving core RTLIL itself.

We should still make sure to test with TSAN and should be adding manual safeguards where we can, but I would be wary of relying on that exclusively if everything else would point in the direction of things breaking regularly.


  1. My claims about single threaded performance gains are based on actual experiments in performing the equivalent of opt -fast in a formal verification backend I’ve worked on. In that sense they aren’t hypothetical, but there are a lot of technical challenges to overcome that were a lot easier to solve in a from scratch design for something that has a much smaller scope than Yosys does. I can share more details, but would prefer to do that separately to keep this on topic. ↩︎

Thanks, this is excellent feedback.

To put it (maybe a bit too) bluntly, I think this is mostly parallelizing redundant work that we shouldn’t be doing in the first place, and that avoiding that will get us at least as much speedup without any parallelization at all.

I’d like to hear more about what you envision here. Could be another thread, but I’d like to understand the bigger picture before deciding how to proceed here.

I am convinced that the right thing to do is evolving core RTLIL itself to be thread-safe for read only uses

I assume you mean the standard approach where we would require “const” methods to be safe for multithreaded read-only access, and one way or another fix the places where that’s not true currently. I like that direction, but it’s very hard to do given the combination of a) violations in some very frequently used APIs b) exposure of the internals of some core data structures and c) source-incompatible changes requiring a deprecation period for the sake of third-party plugins. Still, it’s worth talking about what will be required to get there.

I fixed up dict and pool lookup, and was able to fix Const. I had a good look at SigSpec, with the idea of replacing all usage of operator[] const with usage of a const_iterator that can work (efficiently) with either representation so we don’t need to unpack under the hood. But there is a lot of code calling operator[] const, some of it is tricky to convert, and there’s probably third-party plugin code that would need to go through a deprecation period before we can actually remove operator[] const. So I decided not to pursue that for now.

For SigMap we could switch to an implementation like TSigMap without breaking users. There could be a detectable performance regression in some cases, although it probably wouldn’t be serious. I don’t think there’s any way to provide the same functionality while being thread-safe and always just as fast as union-find. Maybe the whole SigMap approach to canonicalizing signals is the wrong way to go, but changing that would mean rewriting large chunks of Yosys AFAICT.

IdString is another tough one. Safe multithreaded creation of IdStrings for new strings basically requires a growable concurrent hashtable, which will have detectable overhead, unless maybe we have a global mode switch to switch into and out of multithreaded mode. Switching refcount management to atomics would also have detectable overhead… and accidental IdString copying could introduce hard-to-diagnose performance issues due to ping-ponging. It might be possible to implement some kind of deferred reference counting that keeps thread-local refcount state and reconciles globally when you switch out of multithreaded mode, but that’s complicated and I don’t know how well it would work.

and to allow concurrent modifications of different modules, as long as the modified module stays exclusive to a worker and no one is modifying shared modules.

I’m not very enthusiastic about parallelizing across modules because that doesn’t work well when you have highly unbalanced module sizes. But yes, if we are able to fix IdString and other things to be safe for multithreaded read-only access, we would probably get nearly for free the ability to write to different modules on different threads.

A brief summary is to make things incremental so that the cost of e.g. one opt_merge iteration becomes roughly proportional to the delta the netlist has seen since the previous pass, instead of proportional to the size of the full module. For opt_merge this is more or less the same as the “rebuilding” algorithm for e-graphs that egg introduced, and that can be extended to handle constant propagation and some other confluence rewrites such as replacing cells with simpler cells based on their local connectivity as well. (The paper is about equality saturation, but the parts that I’m talking about can be used without adding any equivalences that are not structural equivalences and I’m also not talking about performing any equality saturation.)

The difference between what yosys currently does and what this approach does is also similar to the distinction between naive and semi-naive evaluation of datalog or approaches to incremental view maintenance in databases.

While I have implemented a stand-alone bit-level version of this before, there are a lot of open questions about how to best retrofit everything that is needed to replicate this in yosys, extending it to all practically supported RTLIL. E.g. how to extend everything to also deal with word-level operations, how to evolve our APIs so that we can make the necessary changes to the underlying data structures to support an efficient implementation of this. Also whether to try to get there more incrementally or whether it will ultimately require rewriting some core parts as a whole.

That’s exactly what I mean.

While we do prefer having a deprecation period for source-incompatible changes, that specifically is not a hard requirement. It’s more that this is the default, since it is relatively safe to do; if we can’t do that, we need to have a discussion about the necessary preparation, the overall impact, and the additional support work that comes with breaking changes without a deprecation period (whether to APIs used by plugins or to behavior exposed in other ways). So I agree that it is worth talking about what it takes to get there.

I will go into some of the ideas I have here in a follow up post since it’s getting late here.

In this context I was thinking of modules more as a primitive to hold parts of logic to move between threads where those modules don’t necessarily have to correspond to modules of the original design. E.g. by splitting large modules using balanced partitioning or community detection (potentially duplicating high-fanout signals across partitions or excluding those entirely, depending on the use case). This is something I had also considered not only in the context of parallelization but also as a heuristic to improve the scalability of anything that has inherently poor scalability with module size, e.g. because it requires iterating over all pairs of cells or even worse than that.

The difference between what yosys currently does and what this approach does is also similar to the distinction between naive and semi-naive evaluation of datalog or approaches to incremental view maintenance in databases.

It’s also similar to incremental layout in browser engines, which I worked on for a long time.

If I understand correctly, you’d still have an Opt pass or passes, but you’d track “dirty” cells and wires in the design so those passes were incremental.

You might want parallelism as well as incrementality in some cases. My opt_merge benchmark simply loads an RTLIL file and then runs opt_merge on it — there would be no benefit from incrementality there. But obviously most opt runs would benefit from incrementality.

Food for thought.

It’s also similar to incremental layout in browser engines, which I worked on for a long time.

While I’m not familiar with the techniques used there, it does sound like this type of problem.

If I understand correctly, you’d still have an Opt pass or passes, but you’d track “dirty” cells and wires in the design so those passes were incremental.

Yes, but you also need persistent fan-in and fan-out indices that allow you to efficiently navigate to cells adjacent to dirty cells.

You might want parallelism as well as incrementality in some cases.

Certainly! I wasn’t trying to argue against parallelism, just that I expect the opportunity of parallelizing full module scans to go away if we stop doing full module scans in these passes. And that if you compare just parallelism vs just working incrementally, I’d expect working incrementally to win out most of the time, and with quite a margin in some cases. You could still e.g. parallelize work that iterates over a list of dirty cells, but for many of those scans that list will likely be small enough that you won’t see speedups similar to the great numbers you are getting now.

My opt_merge benchmark simply loads an RTLIL file and then runs opt_merge on it — there would be no benefit from incrementality there.

But you mentioned that it took 15 iterations of the outer loop in opt_merge, so even there, starting from the second iteration, there should be a lot less work to be done when it is done incrementally.

Additionally, the persistent fan-in and fan-out indices can be used to quickly exclude all dirty cells that have a fan-in for which they are the unique fan-out for the port connecting them. Depending on how expensive the computation of the opt_merge cell hashes is in practice, I could even imagine that this would help with the current approach where each iteration indexes everything from scratch. Though, without keeping that index around and thus without updating it incrementally, I would expect that whether this would help or not would depend to some degree on the connectivity structure of the design.

IdString is another tough one.

I also think that this is the trickiest one. I’m not at all sure what the best solution is, but here is one thing we could do:

How we’re creating new IdStrings

An important realization is that, apart from NEW_ID, we’re not really creating any new distinct IdStrings during optimization passes, nor are we initializing IdStrings from plain strings. One exception is temporary IdStrings used exclusively to look up something by name from a string. In the passes that do create new IdStrings, this is done to create new identifiers based on the identifiers already present in the design, or we’re talking about frontend or frontend-adjacent passes that necessarily have to create IdStrings for user-provided identifiers. The IdStrings created there are generally long-lived, so it wouldn’t be an issue if we end up extending their lifetime for a bit.

Unnamed Auto-IDs

My proposal would thus be to partition the 32-bit IdString space into 31 bits of named identifiers, which resolve to a stored string and are reference counted, and 31 bits of auto-IDs, which have no backing storage and live forever. (There is also a small sub-range of statically allocated IDs, which do have backing storage but don’t need reference counting, so we could allocate those from either range depending on which extra check we want to avoid.)
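
For illustration, the split could look something like the following; the exact encoding is not important and the names are hypothetical.

#include <cstdint>

constexpr uint32_t AUTO_ID_FLAG = 0x80000000u;

inline bool is_auto_id(uint32_t index)  { return (index & AUTO_ID_FLAG) != 0; }
inline bool is_named_id(uint32_t index) { return (index & AUTO_ID_FLAG) == 0; }

// IdString copy/destroy would then only touch refcounts for named IDs, e.g.:
//   if (is_named_id(index_))
//       bump_refcount(index_);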

Unnamed NEW_ID

We then change NEW_ID to return those auto-IDs. This would mean we no longer see the source location of the NEW_ID in the cell name, but I would argue that this has been wasteful anyway. Instead, for debugging, we could optionally track the auto-id to source location mapping separately using some dict<IdString, std::pair<const char *, int>> which would also avoid any per NEW_ID allocations even if we track that debug information. (What we lose is the ability to track the IdString lifetimes of this auto-id to source location mapping, but since this is only for debugging and there are workarounds if this does end up causing excessive memory use, that seems a price worth paying).

Moving ID ownership out of IdString

A remaining issue at this point is that we would still frequently update reference counts for the IdStrings that contain user provided identifiers. We could avoid that if we turn IdString into an int wrapper with trivial copying and destructor. To manage the lifetime of IdStrings, we’d introduce a separate OwnedIdString, which does update reference counts, but will now use atomic reference counting. To make sure that the IdStrings we’re working with during optimization stay alive, we’d add a pool<OwnedIdString> owned_ids to Module and add all contained non-auto-ID IdStrings of added module items to that (maybe with a way to opt out of that for code that is known to not create new non-auto-IDs). We’d defer removal to an explicit gc_owned_ids() that e.g. opt_clean could invoke, ensuring we don’t end up constructing and deleting OwnedIdString all the time.
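
A rough sketch of that split (hypothetical names; a real migration would keep the RTLIL::IdString spelling for the trivial wrapper, and would free names lazily from an explicit gc_owned_ids() between passes):

#include <atomic>
#include <vector>

// Per-ID refcounts for named IDs; placeholder for the real global storage.
static std::vector<std::atomic<int>> id_refcounts(1 << 20);

// The future plain IdString: just an index, trivially copyable, no refcounting.
struct IdStringView {
    int index = 0;
};

// Owns one reference to a named ID; the only type that touches refcounts,
// and it does so atomically, so it is safe to hold in any thread.
struct OwnedIdString {
    int index;
    explicit OwnedIdString(IdStringView id) : index(id.index) {
        id_refcounts[index].fetch_add(1, std::memory_order_relaxed);
    }
    OwnedIdString(const OwnedIdString &o) : OwnedIdString(IdStringView{o.index}) {}
    OwnedIdString &operator=(const OwnedIdString &) = delete;   // kept short for the sketch
    ~OwnedIdString() {
        id_refcounts[index].fetch_sub(1, std::memory_order_acq_rel);
        // A real implementation would only free the backing string lazily,
        // e.g. from gc_owned_ids() invoked by opt_clean.
    }
};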

One remaining issue is the IdString(const char*) constructor (and variants of it). For that my suggestion would be to

  • avoid calling it when it’s just used for lookup; instead we’d have an IdString IdString::lookup(const char*) that returns the empty IdString when no such ID already exists. In many places the temporary-IdString-for-lookup construction happens as an implicit conversion, so for a lot of instances we wouldn’t have to change passes to call IdString::lookup but instead could add new overloads to getPort() etc.
  • ensure it keeps working by adding a global pool<OwnedIdString> orphan_ids and making the IdString(const char*) constructor add the new ID to orphan_ids. We could still clear orphan_ids between top-level passes since there should be no IDs stored outside of the design between top-level pass invocations.
  • deprecate that constructor, to make callers pick a better place to store the OwnedIdString or avoid needing it in the first place, so we can eventually get rid of orphan_ids

This would still require concurrent access to the hash table used for finding existing IDs and to allocate and insert new IDs, but it would skew almost entirely towards read-only accesses so we could protect this with a std::shared_mutex (or something equivalent if that isn’t sufficiently optimized for read-heavy workloads and suffers from contention between concurrent readers).


I’m not sure if this is the best approach, I only played around with some [[deprecated]] attributes and adding new overloads to gather some statistics, but I haven’t tried actually implementing this. From what I can tell so far, though, this is something we could realistically retrofit. Existing code should keep working, most should keep working well, but some code might end up leaking too many orphan_ids. Since that code would also see new deprecation warnings, it shouldn’t be hard to identify and fix that and there shouldn’t be too many affected places.

I might have missed something, so I welcome any attempts at poking holes into that proposal.

For SigMap we could switch to an implementation like TSigMap without breaking users.

I’m not a huge fan of allocating std::vector per non-singleton net, but the same approach of merging the smaller into the larger set can also be done when everything is in a single flat buffer where each slot has a repr and next link, so we can go with whatever comes out ahead when benchmarked.

I also thought union-find/disjoint-sets with path compression is safe for concurrent lookups, assuming you use atomics with relaxed ordering for both reads and writes. Performance should be reasonable: you can only see limited cacheline ping-ponging here, since there are only so many path compression updates to be performed until everything in the cache line is pointing at the representative, and relaxed accesses are the same as normal accesses after the compiler is done optimizing. Am I missing something here?
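
For reference, this is roughly the kind of thing I mean (a sketch, not benchmarked here): find() uses relaxed atomics and path halving, and merge() is restricted to the single-writer phase.

#include <atomic>
#include <vector>

struct ConcurrentFindUF {
    // mutable: find() compresses paths even though it is logically read-only.
    mutable std::vector<std::atomic<int>> parent;

    explicit ConcurrentFindUF(size_t n) : parent(n) {
        for (size_t i = 0; i < n; i++)
            parent[i].store((int)i, std::memory_order_relaxed);
    }

    // Safe to call from many threads at once as long as no thread calls merge().
    int find(int x) const {
        for (;;) {
            int p = parent[x].load(std::memory_order_relaxed);
            if (p == x)
                return x;
            int gp = parent[p].load(std::memory_order_relaxed);
            if (gp == p)
                return p;
            // Path halving: point x at its grandparent. A racing reader may
            // overwrite this with another ancestor; either way the structure
            // stays valid, compression is only delayed.
            parent[x].store(gp, std::memory_order_relaxed);
            x = gp;
        }
    }

    // Not safe to run concurrently with anything else; single-writer phase only.
    void merge(int a, int b) {
        a = find(a);
        b = find(b);
        if (a != b)
            parent[b].store(a, std::memory_order_relaxed);
    }
};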

I fixed up dict and pool lookup, and was able to fix Const. I had a good look at SigSpec, with the idea of replacing all usage of operator[] const with usage of a const_iterator that can work (efficiently) with either representation so we don’t need to unpack under the hood. But there is a lot of code calling operator[] const, some of it is tricky to convert, and there’s probably third-party plugin code that would need to go through a deprecation period before we can actually remove operator[] const. So I decided not to pursue that for now.

My thought here was to change the packed representation to include the bit offset for each chunk so we can binary search to a specific bit. With that we should be able to implement all const operations reasonably fast for both implementations, but we might need to change some const SigBit & to SigBit which I hope wouldn’t cause too many issues.

For assignments my idea would be that we eagerly pick the smaller of the two representations. The biggest issue seems to be things like the non-const operator[], where it isn’t reasonable to eagerly pick the smaller of the two representations since that will often be called in a loop. We could add a non-const repack() that picks the smaller representation, to be called after you’re done writing individual bits, to return to a packed representation, but that seems easy to forget. With the current behavior you’d end up with either representation more or less randomly, so maybe that’s not an issue.

It might also be worth just deprecating the non-const operator[] and requiring a modify_bits() method going forward that returns a reference wrapper that will provide access to the unpacked bits and will call repack() when destroyed.
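
Sketched out, the guard could look like this (hypothetical API; SigSpec itself is not shown): the returned guard gives write access to the unpacked bits and calls repack() exactly once when it goes out of scope, so callers cannot forget the repacking step.

#include <cstddef>
#include <vector>

struct ToySig {
    std::vector<int> bits;                         // stand-in for the unpacked representation

    struct WriteGuard {
        ToySig &sig;
        explicit WriteGuard(ToySig &s) : sig(s) { /* real code: unpack here */ }
        WriteGuard(const WriteGuard &) = delete;   // exactly one repack per guard
        int &operator[](size_t i) { return sig.bits[i]; }
        ~WriteGuard() { sig.repack(); }
    };

    WriteGuard modify_bits() { return WriteGuard(*this); }
    void repack() { /* real code: pick the smaller of the two representations */ }
};

// Usage:
//   ToySig s;
//   s.bits.resize(8);
//   {
//       auto w = s.modify_bits();                  // unpack once
//       for (size_t i = 0; i < 8; i++) w[i] = (int)i;
//   }                                              // guard destructor repacks once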

you also need persistent fan-in and fan-out indices that allow you to efficiently navigate to cells adjacent to dirty cells.

So you want an explicit hypergraph where the hyperedges are the nets connecting cells?

But you mentioned that it took 15 iterations of the outer loop in opt_merge, so even there, starting from the second iteration, there should be a lot less work to be done when it is done incrementally.

Yes, great point.

I might have missed something, so I welcome any attempts at poking holes into that proposal.

It seems like a generally good proposal. I really like the idea of leveraging the points between passes to GC. And I really like the idea of treating “auto IDs”/NEW_ID specially, leveraging the fact that we know it should create an ID that can’t match any existing ID, so we don’t have to do a synchronized hash lookup.

(What we lose is the ability to track the IdString lifetimes of this auto-id to source location mapping, but since this is only for debugging and there are workarounds if this does end up causing excessive memory use, that seems a price worth paying).

I think we might be able to fix this. Let’s ignore compatibility issues temporarily… What if we had no IdString refcounting at all? (A rough sketch follows the list below.)

  • Have a global IdString factory object with per-thread subfactories hanging off it.
  • Give the factory a “lookup_or_create” method that does the concurrent hashtable thing to return non-auto-IDs.
  • Give the factory a “lookup” method to return non-auto-IDs via the hashtable when the ID must already exist, or null.
  • Give the per-thread subfactories subranges of ID indices to hand out via a “create” method, and let them track source location information and whatever else you want, even name text. No synchronization needed.
  • Give the factory getter methods to lookup the name text and metadata for any ID. For auto IDs it can reach into the per-thread subfactories to get the info.
  • Periodically trace all existing designs to determine which IDs are dead and remove them. This can be easily parallelized.
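
A rough sketch of how those pieces could fit together (all names hypothetical; the periodic tracing GC from the last bullet is not shown). Named IDs go through one shared, read-mostly table; auto-IDs are handed out from per-thread index blocks with no synchronization at all.

#include <atomic>
#include <cstdint>
#include <mutex>
#include <shared_mutex>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

class IdFactory {
public:
    IdFactory() { names_.push_back(""); }               // index 0 is reserved for "no ID"

    // Read-mostly shared path for named IDs.
    uint32_t lookup_or_create(const std::string &name) {
        {
            std::shared_lock<std::shared_mutex> lk(mutex_);
            auto it = by_name_.find(name);
            if (it != by_name_.end())
                return it->second;
        }
        std::unique_lock<std::shared_mutex> lk(mutex_);
        auto ins = by_name_.emplace(name, (uint32_t)names_.size());
        if (ins.second)
            names_.push_back(name);
        return ins.first->second;
    }

    uint32_t lookup(const std::string &name) const {     // returns 0 if the ID does not exist
        std::shared_lock<std::shared_mutex> lk(mutex_);
        auto it = by_name_.find(name);
        return it == by_name_.end() ? 0 : it->second;
    }

    // Per-thread auto-ID allocation: each worker grabs a disjoint block up front
    // and then never synchronizes while creating IDs.
    struct SubFactory {
        uint32_t next, end;
        std::vector<std::pair<const char *, int>> src;    // optional NEW_ID debug info
        uint32_t create(const char *file, int line) {
            src.push_back({file, line});
            return next++;                                // request a new block when exhausted
        }
    };
    SubFactory make_subfactory(uint32_t block_size) {
        uint32_t base = next_auto_.fetch_add(block_size, std::memory_order_relaxed);
        return SubFactory{base, base + block_size, {}};
    }

private:
    mutable std::shared_mutex mutex_;
    std::unordered_map<std::string, uint32_t> by_name_;
    std::vector<std::string> names_;
    std::atomic<uint32_t> next_auto_{0x80000000u};        // the auto-ID half of the index space
};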

BTW I think it might be a good idea to make the IdString index 64 bits.

This would move away from being able to manipulate independent designs on different threads, rather than toward it, which would be unfortunate, but maybe that’s not worth worrying about?

A first step could be to disable IdString refcounting and implement the GC tracing instead in opt_clean, and see how that performs?

I’m not a huge fan of allocating std::vector per non-singleton net, but the same approach of merging the smaller into the larger set can also be done when everything is in a single flat buffer where each slot has a repr and next link, so we can go with whatever comes out ahead when benchmarked.

Yes, it could go either way. At least my results show that using std::vector is not awful.

Am I missing something here?

No. I was assuming that the relaxed accesses would cause some benchmark to get measurably worse, but that might not be a valid assumption. I suppose I should test it with a microbenchmark. (Although it sounds like we eventually want explicit nets for other reasons anyway.)

My thought here was to change the packed representation to include the bit offset for each chunk so we can binary search to a specific bit.

I have been assuming that doing that binary search for every bit would cause a detectable regression on some benchmark.

Basically the problem is that just about everywhere operator[] const is used, the caller is accessing a sequence of consecutive bits, so we can implement that efficiently with the packed representation if we have somewhere to cache iterator state, but we don’t :(

I’m not worried about non-const operator[]. It’s fine for that (and any other non-const method) to modify the internal representation just like it does today.

I tested this and I think there is probably an effect. 9 runs of ./yosys -t -p 'read_rtlil ~/tmp/big-circuit.rtlil; opt_merge;', measuring the opt_merge elapsed time:

Run       Nonatomic mfp (s)   Atomic mfp (s)
1         89.84026            89.93022
2         89.468148           90.66727
3         89.142483           90.294409
4         89.681085           91.220939
5         89.188635           90.408466
6         89.58044            90.282015
7         88.674005           90.33209
8         88.941043           90.280866
9         89.14741            90.543878
Average   89.29594544         90.440017
Stddev    0.3750160627        0.3564235981

Representation for Efficient Traversal

So you want an explicit hypergraph where the hyperedges are the nets connecting cells?

Given a cell output port, I want to be able to efficiently iterate over all cell input ports that are driven by that cell output port, and given a cell input port, I want to be able to efficiently get to the driving cell output port. If with “explicit hypergraph” you just mean some data structure that supports efficient edge traversal in any direction, that is what I want. If you have something more specific in mind, I’d need some clarification on what that would be to answer, because I can think of many different data structures that fit the general description.

I also prefer to view this as a directed graph of cells with edges labeled by the ports they connect to on both ends, since for well-behaved designs there is a unique driver per net, and identifying drivers and nets simplifies things a lot. You can still represent the edge cases where that isn’t true with the help of some form of pseudo-cells, either as an implementation detail that a layer between the low-level data structures and the API used by passes translates between, or by explicitly specifying that this is how bidirectional connections / undriven nets / driver conflicts / etc. are to be represented going forward. Both approaches have some advantages and disadvantages.

As an implementation detail, it can also make sense to identify cells and output ports, by saying a cell’s single output is the concatenation of all outputs and then using separate slicing cells to project out single outputs. If you do that and then also represent all concatenation and slicing using cells, you have a very simple and uniform representation that makes it a lot easier to experiment with the underlying data structures to enable better performance for many of the tasks we care about. You could still implement an RTLIL like view on top of this, not just read only, but the performance characteristics for the APIs that provide low-level ways to manipulate the design would likely be sufficiently different that this wouldn’t be a drop-in replacement for straight-forward migration of existing passes, so I’m not sure this makes sense for Yosys, but I also wouldn’t want to rule this out.

I’m not at all sure what the best choice for adding this type of efficient traversal to Yosys / RTLIL is. I know this is roughly where I’d start a from scratch design, but at that point we’re talking about replacing RTLIL with RTLIL 2 and while that makes this part easier, it also opens up a lot of other new questions[1].

Ongoing Experiments for Efficient Traversal

We did add an experimental feature called buffer normalization (bufnorm) to Yosys about a year ago, which is entirely opt-in and gets you an incrementally updated index for getting from a wire to a driving port and ensuring that every wire is fully driven by a full output port. To ensure this it is inserting buffers and creating new private wires as necessary whenever (incremental) re-normalization is requested.

While this made development of some passes easier, I’m not too happy with it since it falls short of what’s needed, e.g. it only provides the easier of the two traversal directions. The initial implementation also wasn’t robust enough to handle arbitrary RTLIL; I fixed that recently by rewriting large parts of it, but it also became more complicated.

Prompted by just having spent some time fixing issues with bufnorm, together with the discussion in this thread, I just did some experiments with what I’ve called signorm. It’s similar to bufnorm, but instead of inserting buffers it rewrites cell input port SigSpecs to directly point to the driving wires. This itself requires traversal in both directions, so that is part of what it maintains, though currently not in a particularly well optimized way. It still inserts private wires as necessary but doesn’t need to introduce any buffers. It also has a much simpler implementation, and I think it takes less effort to make an existing pass compatible with the incremental normalization, but since it needs to do some rewriting eagerly, fewer passes are compatible with it staying active even without requesting incremental normalization. With that I have a very hacky, somewhat limited and not very efficient prototype of an incremental opt_merge and opt_expr. The last thing I tried was to check how many issues it causes for passes I didn’t touch (enough to break everything), so I still have to revert some parts and do some minor cleanup before I can share something that demonstrates what does already work with this approach.

Overall my preliminary conclusion from this experiment is that this might very well be an easier to maintain and more powerful alternative to bufnorm (which we can still replace entirely if we port the few experimental passes that use it over to some replacement), but I don’t think anything retrofitted as an optional add-on to the otherwise unchanged RTLIL data structures can be made efficient enough to realize the performance gains that I’d like to get out of making passes incremental.

IdString Refcounting

What if we had no IdString refcounting at all?

This is relatively close to where I started to get to my proposal. I added the OwnedIdString to get to something that has an easier migration path, but if the only place to store those was a global orphan_ids it’s already looking very similar.

Regarding support for working on independent designs across threads, I think we could always make things that are global be part of some context that is stored in a thread local, that can be shared across multiple threads, but where there can be only one active context per thread. If the designs are truly independent, that would be a non-issue. It’s only code that deals with multiple designs simultaneously that would become more tricky since you need to be careful to switch the current thread’s context accordingly, but such code is relatively rare and already has to pay more attention to ownership between parts of the different designs.

I’m not yet sure what to think about assigning names to independently allocated unique IDs. I think it would be fine if, from a worker thread, you can only assign a name hint, which would already be used for printing and logging but would be clearly distinguished from a final name, with some globally synchronized step you need to perform to assign actual names based on those hints, where collisions would be resolved by e.g. adding some suffix.

Overall I don’t have any major complaints (not having thought about the compatibility aspect that I’m still ignoring here).

Binary Searching in Packed SigSpecs

I wasn’t trying to suggest using binary search for every lookup and calling it a day; rather, I think if we are able to binary search, we can still provide an API that is mostly compatible with the current API while making sure that there are no accidentally quadratic performance cliffs in some more rarely used passes. I also think eliminating ping-ponging between representations could give us some room to afford some repeated binary searching, but I’d be surprised if it’s enough to afford it for every operator[] const.

For this, I’d still make sure that we have a convenient to use and efficient iterator API, and would expect us to update all performance critical use of indexing to that. And even for iterators, I think we want to be able to binary search to a specific start and end position, since there is code that will iterate over sub-slices of potentially much larger SigSpecs.

Regarding storing iterator state, if we store per-chunk offsets, so that we can verify a stored iterator state, there are two options that would at least work, but I can’t tell at all whether they’d work well:

  1. We could cache the last accessed chunk in a relaxed atomic and then check that chunk first; if it’s not the right one, move to the adjacent one in the right direction, and if that’s also not it, start a binary search from what we already know. Under the assumption that different threads are unlikely to be iterating over the same SigSpec concurrently, that would work. If that assumption fails we’re still not doing worse than binary search asymptotically. We would end up with ping-ponging the cached state between caches, but I have no idea how much of an effect that would have overall.
  2. We could have a small number of thread-local cache slots, say 4, which store an address of the SigSpec and the last accessed chunk. On access we do a branch-free linear scan to find the matching slot, falling back to the least recently overwritten slot, and then use the cached value as in 1. This adds a bit more overhead, but avoids any contention between threads and doesn’t add any extra data to every SigSpec. No idea how this would perform either. (A rough sketch of this option follows below.)
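
A rough sketch of option 2 (hypothetical, not existing code): a few thread-local slots remember which chunk a given SigSpec was last accessed in, so repeated nearby lookups can skip most of the binary search without putting any cached state into the SigSpec itself.

#include <cstdint>

struct ChunkHintCache {
    struct Slot { const void *spec = nullptr; uint32_t chunk = 0; };
    Slot slots[4];
    unsigned next_victim = 0;

    // Returns the last chunk seen for `spec`, or UINT32_MAX if unknown.
    uint32_t lookup(const void *spec) const {
        uint32_t hit = UINT32_MAX;
        for (const Slot &s : slots)
            if (s.spec == spec)
                hit = s.chunk;                      // fixed-length scan, friendly to cmov
        return hit;
    }

    // Record where the last access to `spec` ended up, evicting the oldest slot on miss.
    void update(const void *spec, uint32_t chunk) {
        for (Slot &s : slots)
            if (s.spec == spec) { s.chunk = chunk; return; }
        slots[next_victim] = {spec, chunk};
        next_victim = (next_victim + 1) & 3;
    }
};

// One cache per thread: no contention, nothing added to SigSpec.
inline thread_local ChunkHintCache chunk_hints;

// A packed-SigSpec lookup would first try chunk_hints.lookup(this); on a hit it
// verifies that chunk (and its neighbours) against the stored per-chunk offsets,
// otherwise it falls back to binary search, then calls chunk_hints.update().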

Concurrent Finds for Path Compressing Union Find

Yes, it could go either way. At least my results show that using std::vector is not awful.

I agree that it’s not horrible and I could see it coming out ahead in some instances. I’d phrased my comment differently if that weren’t so. In any case I would expect the path compressing union find using relaxed atomics to perform better than both so that is also what I’d ultimately prefer (unless a benchmark proves me wrong of course).

I was assuming that the relaxed accesses would cause some benchmark to get measurably worse, but that might not be a valid assumption.

Since every memory access fulfills the requirements for a relaxed atomic on x64 and arm64, the only reason to see performance differences here is the effect it has on legal (or implemented) compiler optimizations. Still, this is something where I’d also expect an effect, but I wouldn’t expect an easily measurable one.

I have stared at the disassembly for union find with relaxed atomics before, and the only difference I saw was that you lose elimination of redundant reads, both for values read and values written before.

That is an optimization that you do benefit from with union-find whenever you perform multiple identical lookups in a row and that is visible to the compiler. That is not as rare as it might sound, because union calls find on both arguments, and it’s not uncommon for the caller of union to have already called find, in which case the compiler can figure out that there are two cases for the preceding find that both result in the read value being known and identical to the find argument.

Even if I don’t expect that to be significant, it is possible to avoid this. We only need atomic accesses for the shared lookups; anything that happens while we have exclusive access, which includes the case where we expect the missed compiler optimizations to hurt the most, doesn’t need atomic accesses. This needs either C++20’s std::atomic_ref or alternatively compiler builtins. This is all well-defined behavior according to the requirements of std::atomic_ref, or maybe easier to check because you can implement this in Rust using the stdlib’s atomics without using unsafe. It should even be possible to get clang and gcc to eliminate redundant shared lookups by adding a [[gnu::pure]] attribute, but there is no equivalent in Rust, which I had been using when I inspected the disassembly for this.

Knowing that, I first tried to reproduce numbers similar to the ones you shared. After I thought I did that, I applied the changes I described on top of your branch to see if that makes the effect disappear, which it seemed to do, so everything was according to my theory. But when I tested with both C++17 and C++20, I realized that for me all of this only reproduced for C++17, and in C++20 mode your atomic branch was reliably within a stddev of main, but ahead(!), which only makes sense if what I’m seeing are random changes in code layout and not any actual effect due to optimization of normal vs relaxed atomic accesses:

Benchmark 1: yosys-main -p 'read_rtlil rsd.tm.il; opt_merge'
  Time (mean ± σ):      7.497 s ±  0.060 s    [User: 7.138 s, System: 0.314 s]
  Range (min … max):    7.389 s …  7.565 s    10 runs

Benchmark 2: yosys-atomic-mfp -p 'read_rtlil rsd.tm.il; opt_merge'
  Time (mean ± σ):      7.806 s ±  0.048 s    [User: 7.447 s, System: 0.310 s]
  Range (min … max):    7.698 s …  7.864 s    10 runs

Benchmark 3: yosys-atomic-mfp2 -p 'read_rtlil rsd.tm.il; opt_merge'
  Time (mean ± σ):      7.516 s ±  0.045 s    [User: 7.164 s, System: 0.307 s]
  Range (min … max):    7.460 s …  7.574 s    10 runs

Summary
  yosys-main -p 'read_rtlil rsd.tm.il; opt_merge' ran
    1.00 ± 0.01 times faster than yosys-atomic-mfp2 -p 'read_rtlil rsd.tm.il; opt_merge'
    1.04 ± 0.01 times faster than yosys-atomic-mfp -p 'read_rtlil rsd.tm.il; opt_merge'



Benchmark 1: yosys-main-c++20 -p 'read_rtlil rsd.tm.il; opt_merge'
  Time (mean ± σ):      7.600 s ±  0.072 s    [User: 7.241 s, System: 0.309 s]
  Range (min … max):    7.514 s …  7.765 s    10 runs

Benchmark 2: yosys-atomic-mfp-c++20 -p 'read_rtlil rsd.tm.il; opt_merge'
  Time (mean ± σ):      7.514 s ±  0.062 s    [User: 7.165 s, System: 0.305 s]
  Range (min … max):    7.443 s …  7.593 s    10 runs

Benchmark 3: yosys-atomic-mfp2-c++20 -p 'read_rtlil rsd.tm.il; opt_merge'
  Time (mean ± σ):      7.532 s ±  0.058 s    [User: 7.177 s, System: 0.310 s]
  Range (min … max):    7.422 s …  7.613 s    10 runs

Summary
  yosys-atomic-mfp-c++20 -p 'read_rtlil rsd.tm.il; opt_merge' ran
    1.00 ± 0.01 times faster than yosys-atomic-mfp2-c++20 -p 'read_rtlil rsd.tm.il; opt_merge'
    1.01 ± 0.01 times faster than yosys-main-c++20 -p 'read_rtlil rsd.tm.il; opt_merge'


  1. I do think it could make sense to introduce a major revision of RTLIL, but I also think it will require more experimentation and iterating on a design before it makes sense to commit to that and before that could lead to any user-visible improvements. ↩︎

A hypergraph is just a graph where an edge can connect more than two nodes. So what you want can be described as a directed hypergraph, with the extra restriction that each edge has at most one driver.

Thanks for all this info!!

I’ll start a separate thread about removing IdString refcounting.

Ok, makes sense.

I like this idea, combined with trying to ensure that any multithreaded code uses an explicit iterator instead of indexing with operator[] const.

Ok, it sounds like I should just go ahead and submit that as a PR then.

Overall it seems like there’s some hope we can fix RTLIL “in place” so const methods are thread-safe. If we can achieve that, parallel opt_merge becomes a much smaller change. So let’s try pushing in this direction.

Ah, it seems I didn’t quite get my point across. I am familiar with the concept, having implemented canonical hypergraph labeling before, but I wasn’t sure whether your question about an “explicit hypergraph” representation was primarily about the “explicit” part, about the “hypergraph” part, or both, so I decided it would be easier to describe what I was thinking of instead of guessing.

What I was trying to point out is that we can also add the restriction that every port is connected to at most one directed hyperedge. With that, if we label every incidence with a port name, which we have to do anyway if our vertices are cells, we can split the directed hyperedges into multiple simple edges without losing any information.

Thus, a directed hypergraph is more general than what’s required here, so if we only consider data structures that can represent directed hypergraphs we would miss some useful choices, and we would have an explicit invariant to maintain on top of the general directed hypergraph. Whereas the directed edge-labeled multigraph approach precisely captures what I was writing about without the need for any additional invariant.

I also like to combine a graph perspective with a relational perspective where the difference that I’m talking about becomes even clearer, IMO. With only fixed-arity relations, to represent a hypergraph, you need to assign some form of hyperedge ID (which could correspond to an address in an actual implementation). Assuming we label incidences with port names, you would then have a relation (Vertex, Label, Hyperedge) for the vertex→edge incidences and a relation (Hyperedge, Label, Vertex) for the edge→vertex incidences.

The restriction that you mention is that we have exactly one incoming incidence per hyperedge, so we have the functional dependency Hyperedge → (Vertex, Label). The restriction that I want to add, because it simplifies so many things and is already used by bufnorm and signorm, is that we can’t connect the same cell port to multiple hyperedges, so we also have a functional dependency (Vertex, Label) → Hyperedge. At that point we can identify a Hyperedge by a (Vertex, Label) pair, and substituting that makes the vertex→edge incidences entirely redundant and turns the edge→vertex incidence relation into (Vertex, Label, Label, Vertex). That is also how you’d model an edge-labeled directed multigraph relationally, and the reason why I prefer that as a graph model.

I’ll start a separate thread about removing IdString refcounting.

Makes sense, I want to bring that up also internally to double check that I’m not missing anything, and so that we don’t have to be overly conservative to address hypothetical but unlikely breakage but can focus on not breaking actual and expected use cases. I’ll respond to the new thread afterwards.

Ok, it sounds like I should just go ahead and submit that as a PR then.

[…]

Overall it seems like there’s some hope we can fix RTLIL “in place” so const methods are thread-safe. If we can achieve that, parallel opt_merge becomes a much smaller change. So let’s try pushing in this direction.

I agree, that sounds like a good plan for the next steps.