Representation for Efficient Traversal
So you want an explicit hypergraph where the hyperedges are the nets connecting cells?
Given a cell output port, I want to be able to efficiently iterate over all cell input ports that are driven by that output port, and given a cell input port I want to be able to efficiently get to the driving cell output port. If by “explicit hypergraph” you just mean some data structure that supports efficient edge traversal in either direction, then that is what I want. If you have something more specific in mind, I’d need some clarification on what that would be before I can answer, because I can think of many different data structures that fit the general description.
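To make the intent concrete, here is a minimal sketch of the two lookups, using hypothetical stand-in types rather than anything that exists in Yosys today:

```cpp
// Hypothetical sketch, not existing Yosys/RTLIL API: the two traversal
// directions I care about, assuming a unique driver per net.
#include <cstdint>
#include <optional>
#include <unordered_map>
#include <vector>

struct PortBit {
    uint32_t cell;   // index of the cell
    uint32_t port;   // index of the port on that cell
    uint32_t offset; // bit offset within the port

    bool operator==(const PortBit &o) const {
        return cell == o.cell && port == o.port && offset == o.offset;
    }
};

struct PortBitHash {
    size_t operator()(const PortBit &p) const {
        return (size_t(p.cell) * 0x9e3779b97f4a7c15ull) ^ (size_t(p.port) << 20) ^ p.offset;
    }
};

// Bidirectional connectivity index over the cell graph.
struct NetIndex {
    // input bit -> unique driving output bit (missing if undriven)
    std::unordered_map<PortBit, PortBit, PortBitHash> driver;
    // output bit -> all input bits it drives
    std::unordered_map<PortBit, std::vector<PortBit>, PortBitHash> readers;

    void connect(PortBit out, PortBit in) {
        driver[in] = out;
        readers[out].push_back(in);
    }

    std::optional<PortBit> driver_of(PortBit in) const {
        auto it = driver.find(in);
        if (it == driver.end())
            return std::nullopt;
        return it->second;
    }

    const std::vector<PortBit> &readers_of(PortBit out) const {
        static const std::vector<PortBit> empty;
        auto it = readers.find(out);
        return it == readers.end() ? empty : it->second;
    }
};
```

Anything with these two operations, kept up to date under incremental changes and cheap to iterate, would fit; the hash maps here are just the simplest thing that type-checks.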
I also prefer to view this as a directed graph of cells with edges labeled by the ports they connect to on both ends, since for well-behaved designs there is a unique driver per net, and identifying drivers and nets simplifies things a lot. You can still represent the edge cases where that isn’t true with the help of some form of pseudo-cells, either as an implementation detail that a layer between the low-level data structures and the API used by passes translates between, or by explicitly specifying that this is how bidirectional connections / undriven nets / driver conflicts / etc. are to be represented going forward. Both approaches have some advantages and disadvantages.
As an implementation detail, it can also make sense to identify cells with their output ports, by saying that a cell’s single output is the concatenation of all its outputs and then using separate slicing cells to project out individual outputs. If you do that and also represent all concatenation and slicing using cells, you get a very simple and uniform representation that makes it a lot easier to experiment with the underlying data structures to enable better performance for many of the tasks we care about. You could still implement an RTLIL-like view on top of this, and not just a read-only one, but the performance characteristics of the APIs that provide low-level ways to manipulate the design would likely be different enough that this wouldn’t be a drop-in replacement allowing straightforward migration of existing passes, so I’m not sure this makes sense for Yosys, but I also wouldn’t want to rule it out.
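As a rough sketch of what I mean by that uniform representation (hypothetical types, not a concrete proposal for RTLIL 2 data structures):

```cpp
// Hypothetical sketch of the "everything is a cell with one output" view:
// slicing, concatenation and the ill-behaved edge cases are all pseudo-cells,
// so an edge is just (driving cell, reading cell, input port index).
#include <cstdint>
#include <string>
#include <vector>

enum class CellKind {
    Logic,    // ordinary cell; its single output concatenates all its outputs
    Slice,    // projects a bit range out of its single input
    Concat,   // concatenates its inputs
    Undriven, // pseudo-cell standing in for an undriven net
    Conflict, // pseudo-cell standing in for multiple drivers
};

struct Cell {
    CellKind kind;
    std::string type;             // e.g. "$and", only meaningful for Logic
    uint32_t slice_offset = 0;    // only meaningful for Slice
    uint32_t width = 0;           // width of the single output
    std::vector<uint32_t> inputs; // driving cell per input port
};

struct Design {
    std::vector<Cell> cells; // cell id == index; inputs refer back into this
};
```

The point is not this particular layout, but that with a single output per cell the connectivity collapses to one integer per input port, which leaves a lot of freedom for how to store and index it.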
I’m not at all sure what the best choice for adding this type of efficient traversal to Yosys / RTLIL is. I know this is roughly where I’d start a from-scratch design, but at that point we’re talking about replacing RTLIL with an RTLIL 2, and while that makes this part easier, it also opens up a lot of other new questions.
Ongoing Experiments for Efficient Traversal
We did add an experimental feature called buffer normalization (bufnorm) to Yosys about a year ago. It is entirely opt-in and gives you an incrementally updated index for getting from a wire to its driving port, while ensuring that every wire is fully driven by a full output port. To maintain this invariant, it inserts buffers and creates new private wires as necessary whenever (incremental) re-normalization is requested.
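Purely as a conceptual illustration of the invariant this maintains, not the actual implementation:

```cpp
// Conceptual illustration only (not the actual bufnorm code): after
// (re-)normalization every wire has a unique driver entry naming a whole
// output port of a single cell; where the design's connectivity doesn't
// already have that shape, a buffer cell plus a fresh private wire are
// inserted so that it does.
#include "kernel/yosys.h"
USING_YOSYS_NAMESPACE

struct FullDriver {
    RTLIL::Cell *cell;    // driving cell, possibly an inserted buffer
    RTLIL::IdString port; // output port of that cell, driving the wire in full
};

// Roughly the index bufnorm provides: wire -> its full-width driving port.
dict<RTLIL::Wire *, FullDriver> wire_driver;
```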
While this made development of some passes easier, I’m not too happy with it since it falls short of what’s needed, e.g. it only provides the easier of the two traversal directions. The initial implementation also wasn’t robust enough to handle arbitrary RTLIL; I fixed that recently by rewriting large parts of it, but it also became more complicated in the process.
Prompted by just having spent some time fixing issues with bufnorm, together with the discussion in this thread, I just did some experiments with what I’ve called signorm. It’s similar to bufnorm, but instead of inserting buffers it rewrites cell input port SigSpecs to directly point to the driving wires. This itself requires traversal in both directions, so that is part of what it maintains, currently not in a particularly well optimized way. It still inserts private wires as necessary but doesn’t need to introduce any buffers. It also has a much simpler implementation, and I think it takes less effort to make an existing pass compatible with the incremental normalization, but since it needs to do some rewriting eagerly, fewer passes are compatible with it staying active even without requesting incremental normalization.

With that I have a very hacky, somewhat limited and not very efficient prototype of an incremental opt_merge and opt_expr. The last thing I tried was to check how many issues it causes for passes I didn’t touch (enough to break everything), so I still have to revert some parts and do some minor cleanup before I can share something that demonstrates what already works with this approach.
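As a rough illustration of the kind of eager rewriting involved, here is a non-incremental version that uses SigMap canonicalization as a crude stand-in for looking up driving wires (the actual prototype maintains its own driver index incrementally and differs in the details):

```cpp
// Conceptual, non-incremental illustration of rewriting every cell input so
// that its SigSpec refers directly to canonical wires. SigMap is only a
// stand-in here; it picks a canonical representative per connected net rather
// than specifically the driver wire.
#include "kernel/yosys.h"
#include "kernel/sigtools.h"
USING_YOSYS_NAMESPACE

void rewrite_inputs_to_canonical(RTLIL::Module *module)
{
    SigMap sigmap(module);
    for (auto cell : module->cells()) {
        auto conns = cell->connections(); // copy, since setPort modifies the dict
        for (auto &conn : conns)
            if (cell->input(conn.first))
                cell->setPort(conn.first, sigmap(conn.second));
    }
}
```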
Overall, my preliminary conclusion from this experiment is that this might very well be an easier-to-maintain and more powerful alternative to bufnorm (which we can still replace entirely if we port the few experimental passes that use it over to some replacement), but I don’t think anything retrofitted as an optional add-on to the otherwise unchanged RTLIL data structures can be made efficient enough to realize the performance gains that I’d like to get out of making passes incremental.
IdString Refcounting
What if we had no IdString refcounting at all?
This is relatively close to the point I started from on the way to my proposal. I added the OwnedIdString to get something with an easier migration path, but if the only place to store those were a global orphan_ids, it would already look very similar.
Regarding support for working on independent designs across threads, I think we could always make the things that are currently global part of some context that is stored in a thread-local, which can be shared across multiple threads but where there can only be one active context per thread. If the designs are truly independent, that would be a non-issue. Only code that deals with multiple designs simultaneously would become trickier, since you need to be careful to switch the current thread’s context accordingly, but such code is relatively rare and already has to pay more attention to ownership between parts of the different designs.
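A minimal sketch of that pattern, assuming a hypothetical Context that would hold what is currently global state:

```cpp
// Hypothetical sketch, not existing Yosys code: global-ish state lives in a
// Context that can be shared between threads, and a RAII guard installs it as
// the current thread's active context.
struct Context {
    // would hold what is currently global, e.g. the IdString tables
};

inline thread_local Context *current_context = nullptr;

class ContextScope {
    Context *previous;
public:
    explicit ContextScope(Context *ctx) : previous(current_context) {
        current_context = ctx;
    }
    ~ContextScope() { current_context = previous; }
};

// Usage: code dealing with multiple designs has to switch explicitly.
// {
//     ContextScope scope(&context_for_design_a);
//     /* work on design A */
// }
```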
I’m not yet sure what to think about assigning names to independently allocated unique IDs. I think it would be fine if, from a thread, you can only assign a name hint that is already used for printing and logging but is clearly distinguished from a final name, with some globally synchronized step you need to perform to assign actual names based on those hints, where collisions are resolved by e.g. adding some suffix.
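A small sketch of what such a globally synchronized finalization step could look like, with hypothetical names and a simple numeric suffix scheme:

```cpp
// Hypothetical sketch: threads only record (id, hint) pairs; a single
// synchronized finalization pass later assigns unique final names,
// suffixing hints on collision.
#include <cstdint>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

struct PendingName {
    uint64_t id;      // independently allocated unique id
    std::string hint; // assigned from any thread, not guaranteed unique
};

std::unordered_map<uint64_t, std::string>
finalize_names(const std::vector<PendingName> &pending)
{
    std::unordered_set<std::string> used;
    std::unordered_map<uint64_t, std::string> final_names;
    for (auto &p : pending) {
        std::string name = p.hint;
        // Keep appending "_<n>" until the candidate is unused.
        for (int suffix = 1; !used.insert(name).second; ++suffix)
            name = p.hint + "_" + std::to_string(suffix);
        final_names[p.id] = name;
    }
    return final_names;
}
```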
Overall I don’t have any major complaints (setting aside the compatibility aspect, which I’m still ignoring here).
Binary Searching in Packed SigSpecs
I wasn’t trying to suggest using binary search for every lookup and calling it a day; rather, I think if we are able to binary search, we can still provide an API that is mostly compatible with the current one while making sure that there are no accidentally quadratic performance cliffs in some rarely used passes. I also think eliminating the ping-ponging between representations could give us some room to afford repeated binary searching, but I’d be surprised if it’s enough to afford it for every operator[] const.
For this, I’d still make sure that we have a convenient and efficient iterator API, and would expect us to update all performance-critical uses of indexing to that. And even for iterators, I think we want to be able to binary search to a specific start and end position, since there is code that iterates over sub-slices of potentially much larger SigSpecs.
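As a sketch of what binary searching would look like if we store a cumulative bit offset per chunk (hypothetical layout, not the current SigSpec representation):

```cpp
// Hypothetical packed layout with per-chunk cumulative offsets: finding the
// chunk containing bit index `pos`, e.g. to start an iterator at a sub-slice,
// is an upper_bound over the offsets. Assumes 0 <= pos < total width.
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

struct PackedChunk {
    uint32_t wire_id;  // or a constant marker
    uint32_t wire_off; // offset within the wire
    uint32_t width;
};

struct PackedSig {
    std::vector<PackedChunk> chunks;
    std::vector<uint32_t> offsets; // offsets[i] = first bit index covered by chunks[i]

    // Returns (chunk index, bit offset within that chunk) for bit `pos`.
    std::pair<size_t, uint32_t> locate(uint32_t pos) const {
        auto it = std::upper_bound(offsets.begin(), offsets.end(), pos);
        size_t idx = size_t(it - offsets.begin()) - 1;
        return {idx, pos - offsets[idx]};
    }
};
```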
Regarding storing iterator state: if we store per-chunk offsets, so that a stored iterator state can be verified, there are two options that would at least work, but I can’t tell at all whether they’d work well:
- We could cache the last accessed chunk in a relaxed atomic and check that chunk first; if it’s not the right one, move to the adjacent chunk in the right direction, and if that’s also not it, start a binary search from what we already know (see the sketch after this list). Under the assumption that different threads are unlikely to be iterating over the same SigSpec concurrently, that would work. If that assumption fails, we’re still not doing worse than binary search asymptotically. We would end up ping-ponging the cached state between caches, but I have no idea how much of an effect that would have overall.
- We could have a small number of thread-local cache slots, say 4, which store the address of the SigSpec and the last accessed chunk. On access we do a branch-free linear scan to find the matching slot, falling back to the least recently overwritten slot, and then use the cached value as in the first option. This adds a bit more overhead, but avoids any contention between threads and doesn’t add any extra data to every SigSpec. No idea how this would perform either.
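Here is a sketch of the first option, building on the PackedSig sketch above; for brevity the fallback is a plain binary search rather than one narrowed by the known direction:

```cpp
// Sketch of option one: cache the last accessed chunk index in a relaxed
// atomic, check it and its neighbor first, and only then fall back to
// binary search. Extends the hypothetical PackedSig from the earlier sketch.
#include <atomic>

struct PackedSigCached : PackedSig {
    mutable std::atomic<uint32_t> last_chunk{0};

    std::pair<size_t, uint32_t> locate_cached(uint32_t pos) const {
        size_t idx = last_chunk.load(std::memory_order_relaxed);
        if (idx >= chunks.size())
            idx = 0;
        // Check the cached chunk, then the adjacent one in the right direction.
        for (int tries = 0; tries < 2; ++tries) {
            uint32_t begin = offsets[idx];
            uint32_t end = begin + chunks[idx].width;
            if (pos >= begin && pos < end) {
                last_chunk.store(uint32_t(idx), std::memory_order_relaxed);
                return {idx, pos - begin};
            }
            if (pos >= end && idx + 1 < chunks.size())
                ++idx;
            else if (pos < begin && idx > 0)
                --idx;
            else
                break;
        }
        // Fall back to binary search over the whole chunk list.
        auto result = locate(pos);
        last_chunk.store(uint32_t(result.first), std::memory_order_relaxed);
        return result;
    }
};
```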
Concurrent Finds for Path Compressing Union Find
Yes, it could go either way. At least my results show that using std::vector is not awful.
I agree that it’s not horrible, and I could see it coming out ahead in some instances; I’d have phrased my comment differently if that weren’t the case. In any case, I would expect the path compressing union find using relaxed atomics to perform better than both, so that is also what I’d ultimately prefer (unless a benchmark proves me wrong, of course).
I was assuming that the relaxed accesses would cause some benchmark to get measurably worse, but that might not be a valid assumption.
Since every memory access already fulfills the requirements for a relaxed atomic on x64 and arm64, the only reason to see performance differences here is the effect this has on legal (or actually implemented) compiler optimizations. Still, this is somewhere I’d also expect an effect, just not an easily measurable one.
I have stared at the disassembly for union find with relaxed atomics before, and the only difference I saw was that you lose the elimination of redundant reads, both of values read earlier and of values written earlier.
That is an optimization union find does benefit from whenever you perform multiple identical lookups in a row and that is visible to the compiler. That is not as rare as it might sound, because union calls find on both arguments, and it’s not uncommon for the caller of union to have already called find, in which case the compiler can figure out that there are two cases for the preceding find which both result in the read value being known and identical to the find argument.
Even if I don’t expect that to be significant, it is possible to avoid it: we only need atomic accesses for the shared lookups; anything that happens while we have exclusive access, which includes the case where we expect the missed compiler optimizations to hurt the most, doesn’t need atomic accesses. This needs either C++20’s std::atomic_ref or alternatively compiler builtins. This is all well-defined behavior according to the requirements of std::atomic_ref, or, maybe easier to check, you can implement the same thing in Rust using the stdlib’s atomics without any unsafe. It should even be possible to get clang and gcc to eliminate redundant shared lookups by adding a [[gnu::pure]] attribute, but there is no equivalent in Rust, which I had been using when I inspected the disassembly for this.
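A sketch of that split (C++20, hypothetical code, union by rank/size omitted): relaxed std::atomic_ref accesses for the shared, concurrent finds, and plain accesses for phases with exclusive access, where the compiler can again eliminate redundant reads:

```cpp
// Hypothetical union find sketch: the parent array is a plain int vector;
// shared lookups go through std::atomic_ref with relaxed ordering, exclusive
// phases use ordinary accesses on the very same storage.
#include <atomic>
#include <cstdint>
#include <numeric>
#include <vector>

struct UnionFind {
    std::vector<int32_t> parent;

    explicit UnionFind(size_t n) : parent(n) {
        std::iota(parent.begin(), parent.end(), 0);
    }

    // Safe to call from multiple threads concurrently as long as no merges
    // happen in between; path halving via relaxed stores at worst loses work.
    int32_t find_shared(int32_t x) {
        for (;;) {
            int32_t p = std::atomic_ref(parent[x]).load(std::memory_order_relaxed);
            if (p == x)
                return x;
            int32_t gp = std::atomic_ref(parent[p]).load(std::memory_order_relaxed);
            std::atomic_ref(parent[x]).store(gp, std::memory_order_relaxed);
            x = gp;
        }
    }

    // Only callable while this thread has exclusive access; plain accesses,
    // so redundant identical lookups can still be optimized away.
    int32_t find_exclusive(int32_t x) {
        while (parent[x] != x) {
            parent[x] = parent[parent[x]]; // path halving
            x = parent[x];
        }
        return x;
    }

    void merge(int32_t a, int32_t b) { // requires exclusive access
        parent[find_exclusive(a)] = find_exclusive(b);
    }
};
```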
Knowing all that, I first tried to reproduce numbers similar to the ones you shared. After I thought I had done that, I applied the changes I described on top of your branch to see whether that makes the effect disappear, which it seemed to do, so far everything according to my theory. But when I tested with both C++17 and C++20, I realized that for me all of this only reproduced for C++17; in C++20 mode your atomic branch was reliably within a stddev of main, but ahead(!), which only makes sense if what I’m seeing are random changes in code layout and not any actual effect of normal vs. relaxed atomic accesses on optimization:
Benchmark 1: yosys-main -p 'read_rtlil rsd.tm.il; opt_merge'
Time (mean ± σ): 7.497 s ± 0.060 s [User: 7.138 s, System: 0.314 s]
Range (min … max): 7.389 s … 7.565 s 10 runs
Benchmark 2: yosys-atomic-mfp -p 'read_rtlil rsd.tm.il; opt_merge'
Time (mean ± σ): 7.806 s ± 0.048 s [User: 7.447 s, System: 0.310 s]
Range (min … max): 7.698 s … 7.864 s 10 runs
Benchmark 3: yosys-atomic-mfp2 -p 'read_rtlil rsd.tm.il; opt_merge'
Time (mean ± σ): 7.516 s ± 0.045 s [User: 7.164 s, System: 0.307 s]
Range (min … max): 7.460 s … 7.574 s 10 runs
Summary
yosys-main -p 'read_rtlil rsd.tm.il; opt_merge' ran
1.00 ± 0.01 times faster than yosys-atomic-mfp2 -p 'read_rtlil rsd.tm.il; opt_merge'
1.04 ± 0.01 times faster than yosys-atomic-mfp -p 'read_rtlil rsd.tm.il; opt_merge'
Benchmark 1: yosys-main-c++20 -p 'read_rtlil rsd.tm.il; opt_merge'
Time (mean ± σ): 7.600 s ± 0.072 s [User: 7.241 s, System: 0.309 s]
Range (min … max): 7.514 s … 7.765 s 10 runs
Benchmark 2: yosys-atomic-mfp-c++20 -p 'read_rtlil rsd.tm.il; opt_merge'
Time (mean ± σ): 7.514 s ± 0.062 s [User: 7.165 s, System: 0.305 s]
Range (min … max): 7.443 s … 7.593 s 10 runs
Benchmark 3: yosys-atomic-mfp2-c++20 -p 'read_rtlil rsd.tm.il; opt_merge'
Time (mean ± σ): 7.532 s ± 0.058 s [User: 7.177 s, System: 0.310 s]
Range (min … max): 7.422 s … 7.613 s 10 runs
Summary
yosys-atomic-mfp-c++20 -p 'read_rtlil rsd.tm.il; opt_merge' ran
1.00 ± 0.01 times faster than yosys-atomic-mfp2-c++20 -p 'read_rtlil rsd.tm.il; opt_merge'
1.01 ± 0.01 times faster than yosys-main-c++20 -p 'read_rtlil rsd.tm.il; opt_merge'