Archetypal vs Grouped ECS Architectures, my take

(Thomas Gillen) #1

Links for context:

An ECS library can essentially be thought of as an API for performing a loop over a homogeneous set of entities, filtering them by some condition, and pulling out a subset of the data associated with each entity. The goal of the library is to provide a usable API for this, and to do it as fast as possible. Performance is largely determined by cache hit/miss rate, which in turn is determined by the amount of memory accessed, and how linear and sequential those accesses are.

Both “archetypal” and “grouped” ECS architectures arrange each type of component in its own packed array, such that the scene as a whole is formed as a structure of these arrays. They both aim to perform these entity loops by iterating through these component arrays for the required components linearly and in-order, and doing this without loading unnecessary data and without skipping to an unpredictable memory address (and thus causing a prefetch misprediction and a cache miss). How successful they are at this is a good measure of what you should expect from their performance.

An archetypal ECS splits each component array into sub-slices such that all of the entities in each slice have the same component layout (an “archetype”). This allows each slice to be indexed together with the corresponding slices of other component types. During a loop, therefore, performance can be summarised as never loading unnecessary data, but one jump to a new memory location (and thus a likely cache miss) happens per component, per unique entity layout which exists within the result set of the loop. In my experience, slices usually contain between a few dozen and a few hundred components, so a cache miss caused by archetype fragmentation occurs only once every few dozen to few hundred entities.
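As a minimal illustration of that layout (hypothetical component types and hand-rolled archetype structs; real libraries such as Legion type-erase their columns, so this is a sketch of the idea rather than any actual implementation):

    // Each archetype owns one packed column per component type; index i in
    // every column refers to the same entity.
    #[derive(Clone, Copy)]
    struct Position(f32, f32, f32);
    #[derive(Clone, Copy)]
    struct Velocity(f32, f32, f32);

    // One archetype: every entity stored here has exactly Position + Velocity.
    struct PosVelArchetype {
        positions: Vec<Position>,
        velocities: Vec<Velocity>,
    }

    // A query for (Position, Velocity) visits only archetypes whose layout
    // matches. Within each archetype the walk is linear over packed data, so
    // the only unpredictable jump happens when moving on to the next archetype.
    fn integrate(archetypes: &mut [PosVelArchetype], dt: f32) {
        for arch in archetypes.iter_mut() {
            for (p, v) in arch.positions.iter_mut().zip(arch.velocities.iter()) {
                p.0 += v.0 * dt;
                p.1 += v.1 * dt;
                p.2 += v.2 * dt;
            }
        }
    }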

A grouped ECS keeps all components of a single type in a single array. However, it divides this array into two slices. One slice contains all components which belong to entities which also have all of the other components in the same “group” as the component in question. The other slice contains the components for all entities which do not have all of the group’s components (they may have some, but not others, or none). Each component type can thus be assigned to only one group alongside other component types. Each group can recursively be subdivided into exactly one sub-group by adding one or more additional components, defining a new group with a larger component set (which matches a subset of the parent group’s entities).

A loop through entities then has two performance modes: a fast path and a slow path. If the loop requests only components which are an exact match to some group, then the loop can achieve optimal performance; it will only load component data which is needed and the full iteration will be entirely linear. However, if the loop requires any components which are not part of a single group, then it will perform indirections. Usually, excepting some “non-owned” group cases, the loop will pick the component type which has the fewest instances in the scene. It will then loop through all entities with that component and, for each of the other requested component types, perform a lookup in an indirection table to determine the index (and existence) of that component before fetching it. This performance can be summarised as fetching significantly more data than needed, plus one likely cache miss per component type per entity, plus one more likely cache miss per entity per component for entities which do in fact have the desired components. EnTT allows looser group definitions which can omit the indirection table lookup used to determine whether the entity has each component, and can potentially restrict the loop more tightly to the entities which belong to the parent group, but the indirect component accesses will still be out of order (and there are limits on when these groups can be used).
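To make the two modes concrete, here is a rough sketch of the kind of sparse-set storage this implies (illustrative only; the names and layout are not EnTT’s or EnRS’s actual code):

    // One pool per component type: a packed dense/data pair plus a sparse
    // index from entity id to dense index. The first `group_len` entries of
    // every pool owned by a group are kept in the same entity order.
    struct Pool<T> {
        sparse: Vec<Option<usize>>, // entity id -> index into dense/data
        dense: Vec<u32>,            // packed entity ids
        data: Vec<T>,               // packed component data, parallel to dense
        group_len: usize,           // entries [0, group_len) belong to the group
    }

    // Fast path: the query exactly matches a group, so every pool agrees on
    // the order of its first `group_len` entries and the loop is a straight
    // lockstep walk over packed data.
    fn fast_path(a: &Pool<f32>, b: &mut Pool<f32>) {
        for i in 0..a.group_len {
            b.data[i] += a.data[i];
        }
    }

    // Slow path: the query does not match a group, so we walk the smaller
    // pool and, per entity, hop through the other pool's sparse index: an
    // out-of-order access (and likely cache miss) per entity.
    fn slow_path(a: &Pool<f32>, b: &mut Pool<f32>) {
        for (i, &entity) in a.dense.iter().enumerate() {
            if let Some(j) = b.sparse.get(entity as usize).copied().flatten() {
                b.data[j] += a.data[i];
            }
        }
    }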

The difference in performance between the fast and slow paths of a grouped ECS is very significant (there are benchmarks down below). The issue I have with the architecture is that, in my experience working with production game ECS code, a very large portion of all loops will fall onto the slow path. More importantly, this issue is unavoidable. You can’t realistically refactor the code to fit the ECS architecture and avoid the issue. You can only define groups to optimise for a small portion of loops, and the cost to unoptimised loops can be an order of magnitude hit to performance.

The problem is most clearly demonstrated with a component such as Position. Such a component is used pervasively throughout most game code. It is probable that there are several dozen, perhaps hundreds of loops in a realistic scale game code base which access the Position component. These loops are spread throughout many totally different modules of the game, and access completely different components together with Position. Perhaps the loop which computes new transformation matrices from Position and Orientation is the one which touches the most entities, so you define a <Position, Orientation, Transform> group. Now that loop performs optimally. However, one of the loops in your physics engine needs <Position, Velocity, Acceleration>. Position cannot belong to both groups. This loop, even though it likely accesses almost as many entities as the transform loop, will now fall onto the slow path. Similarly, a loop in the pathfinding module which finds an entity’s current navmesh polygon is going to end up on the slow path. A piece of game logic which opens a door if an entity is nearby is now on the slow path. As is almost every loop that requires Position in the game, even though many of these loops are almost as large as the one which has been optimised.

Position may be the most obvious culprit, but there are many such components in most games. Even for “private” components which are used to track state within a single module, it is common for that module to use that component in multiple internal loops, and only some of those will fall under the nested sub-group special cases which allow them to benefit from the layout optimisation of the optimised loop.

The problem is in fact worse than this. Take our transform update loop for example. It calculates a new transform matrix for entities with Position and Orientation. But your game most likely also has entities with Position and a Transform, but not Orientation. You can use a significantly simpler update loop for these entities, as you can omit all of the matrix calculations and simply write the translation vector into the matrix. How do you write these two loops? Now consider variants of both of these, but with an additional Scale component.

What you want is some way to iterate through all entities with Position and Transform but exclude those with Orientation. This allows you to replace dynamic branching on the existence of the orientation component with two linear batched loops. An archetypal ECS can do this very easily - it simply includes the exclusion in its archetype filtering logic. Each loop then does not access any component data for entities which are ultimately unwanted, and the per-archetype slices that it does match are iterated linearly without performing per-entity checks or skipping any elements. A grouped ECS, however, cannot do this naturally. It will need to perform an indirection loop for all loops which want to use filter conditions such as component exclusions (even if the components actually accessed are a group), or run the filter logic per-entity (which multiplies overheads and causes accesses to no longer be sequential). In my experience, these loops account for perhaps 1 in 5 loops in a game.
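As a self-contained sketch of that archetype-level exclusion (a hypothetical bitmask layout, not Legion’s actual query syntax or internals):

    #[derive(Clone, Copy)]
    struct Layout(u32);
    const POSITION: u32 = 1 << 0;
    const ORIENTATION: u32 = 1 << 1;
    const TRANSFORM: u32 = 1 << 2;

    impl Layout {
        fn has(self, bits: u32) -> bool {
            self.0 & bits == bits
        }
    }

    struct Archetype {
        layout: Layout,
        positions: Vec<[f32; 3]>,
        transforms: Vec<[[f32; 4]; 4]>, // column-major; translation in the last column
    }

    // "Position + Transform, but not Orientation": the filter runs once per
    // archetype, and the matching slices are then walked linearly with no
    // per-entity branch and no skipped elements.
    fn write_translations(archetypes: &mut [Archetype]) {
        for arch in archetypes
            .iter_mut()
            .filter(|a| a.layout.has(POSITION | TRANSFORM) && !a.layout.has(ORIENTATION))
        {
            for (p, m) in arch.positions.iter().zip(arch.transforms.iter_mut()) {
                m[3][0] = p[0];
                m[3][1] = p[1];
                m[3][2] = p[2];
            }
        }
    }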

The other side of the performance equation is entity insertion and removal performance. Both archetypal and grouped ECS architectures should inherently perform similarly for insertion or deletion of new entities. However, archetypal ECSs must move all components for an entity into a new archetype storage location when a component is added or removed from an existing entity. A grouped ECS will perform a similar operation if the modification moves the entity in or out of a group, but otherwise the component insert/remove is significantly cheaper.

In summary, in my opinion, if your game either performs component insertion/removal for existing entities for a large portion of the total entities in the scene every frame (and those components are not in a group), OR if the loops in your game are near exclusively accessing private components which no other loop accesses, then a grouped ECS may work very well for you. However, in the general case I believe the performance impact of falling off the fast path, and the difficulty of avoiding this across all performance critical parts of a large and complex game project, is greatly underestimated. Legion’s archetypal loops perform close to the fast path performance of EnRS, but it maintains that performance more reliably across a greater variety of situations and does not have sudden and severe performance pitfalls to avoid.

OvermindDL1 has done a lot of good work on porting EnTT to rust with EnRS (and it would be nice if we could have a read through the code ;)), and has done some benchmarking comparing some ECS libraries which implement both architectures.

Some benchmark results comparing EnRS (grouped ECS, on its fast path) and legion (archetypal):

Entity-Iterate-10_000/EnRS-OwningGroup/9
                        time:   [1.4854 us 1.5125 us 1.5439 us]
Entity-Iterate-10_000/Specs/9
                        time:   [61.393 us 67.829 us 77.196 us]
Entity-Iterate-10_000/Legion/9
                        time:   [3.6240 us 3.7318 us 3.8698 us]
Entity-Iterate-10_000/EnRS-OwningGroup/10
                        time:   [41.202 us 41.439 us 41.719 us]
Entity-Iterate-10_000/Specs/10
                        time:   [253.61 us 267.73 us 282.68 us]
Entity-Iterate-10_000/Legion/10
                        time:   [52.814 us 53.251 us 53.841 us]
Entity-Iterate-10_000/EnRS-OwningGroup/7
                        time:   [18.512 us 18.688 us 18.898 us]
Entity-Iterate-10_000/Specs/7
                        time:   [59.487 us 63.663 us 69.482 us]
Entity-Iterate-10_000/Legion/7
                        time:   [4.0805 us 4.3267 us 4.6397 us]
Entity-Iterate-10_000/EnRS-OwningGroup/8
                        time:   [90.072 us 91.346 us 93.184 us]
Entity-Iterate-10_000/Specs/8
                        time:   [208.96 us 299.24 us 466.74 us]
Entity-Iterate-10_000/Legion/8
                        time:   [115.29 us 118.12 us 121.49 us]

I expect in all four of these results, both legion and EnRS could potentially switch places with further optimisation work to either library.

And a benchmark when a loop falls off the fast path (plus Shipyard, another grouped ECS):

Entity-Iterate-10_000/EnRS-Indirect/10                                                                            
                        time:   [427.53 us 430.69 us 433.92 us]
Entity-Iterate-10_000/EnRS-OwningGroup/10                                                                            
                        time:   [38.392 us 38.781 us 39.189 us]
Entity-Iterate-10_000/Specs/10                                                                           
                        time:   [324.36 us 344.22 us 364.56 us]
Entity-Iterate-10_000/Legion/10                                                                            
                        time:   [52.360 us 53.232 us 53.918 us]
Entity-Iterate-10_000/Shipyard/10                                                                            
                        time:   [416.09 us 420.04 us 425.04 us]
(OvermindDL1) #2

To provide more definition between archetype and grouped for anyone else reading:

  • Archetype: Takes all the components that an Entity has and packs them conceptually into a single struct, and stores that struct containing all of the component data for the Entity. This makes general iteration fast, as you only need to test the overall ‘chunk’ against the query rather than each entity individually, but it has a couple of costs: adding/removing components (very common in most ECS patterns) is extremely expensive, as the entity has to be copied in full to another chunk, and it means you are loading data into the CPU cache and then skipping over the data you don’t want (i.e. a huge stride size relative to the data size).

  • Grouped: This is where each component is in its own data structure, most generally a dense Vec with a secondary index that is rarely used. The ‘group’ing part of this is where you ‘assign’ a set of components, say A, B, and C, together to a group, and the entities that share all of them get sorted to either the beginning or end of each component pool, meaning they can be iterated without checking validity, as they are already known valid; and since they are packed they are loaded into the CPU cache extremely efficiently, with stride length == data length, so no needless reads. They also have fast changing of components (updating a group is at most a single swap in each grouped pool, if not less). You can also have sub-groups within a group, so within the A/B/C group you can also have a D/E, to make an A/B/C/D/E subgroup, and since it is completely contained within the A/B/C group you can sub-sort its range within the main group for fast iteration of it. However, you can’t share components between groups (so you can’t, for example, have both an A/B/C and an A/B/D group, though in this case you can have an A/B group and either a C or D subgroup, whichever you expect to be more common, and use a secondary index lookup for the rarer one); for one version or the other you have to hit the secondary index to perform the lookup, which is slower. (A rough sketch of this layout follows after this list.)
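A rough sketch of how those owned groups and nested subgroups lay out in the pools (purely illustrative; the names are made up, and in this sketch the sorted region sits at the front of each pool, though it can equally sit at either end):

    // Each owned pool keeps its group members packed at the front; a subgroup
    // is just a smaller prefix inside its parent's prefix.
    //
    //   pool for A: [ A/B/C/D/E entities | remaining A/B/C entities | ungrouped A ]
    //                 0 .. sub_len         sub_len .. group_len       group_len .. len
    struct GroupRanges {
        group_len: usize, // entities that have all of A, B and C
        sub_len: usize,   // of those, entities that also have D and E
    }

    impl GroupRanges {
        // Iterating the A/B/C group touches indices 0..group_len in each owned
        // pool; iterating the A/B/C/D/E subgroup touches 0..sub_len. Both walks
        // are linear over packed data.
        fn group(&self) -> std::ops::Range<usize> {
            0..self.group_len
        }
        fn subgroup(&self) -> std::ops::Range<usize> {
            0..self.sub_len
        }
    }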

In general, archetype ECS’s sacrifice all else for iteration speed over ‘all’ related data of a component. There is a slight speed hit when not accessing some of that data (as it still gets read into the CPU cache), and adding/removing components is extremely slow. Archetype ECS’s are fantastic for dataflow-style games that have little to no component changes (think Factorio’s current engine design, though it’s argued that the flexibility Factorio would get with a grouping ECS design would make for far more efficient mods).

In general, grouping ECS’s try to get the iteration bonus of archetype ECS’s while gaining fast component add/remove speeds, though the sacrifice is that you have to know the best way to group the components for your specific game, and you might have to hit secondary indexes at times depending on the game design. Grouping ECS’s are fantastic for more flexible styles of games where components change somewhat routinely or more often (think heavily modded Minecraft games, for example).

In my opinion, having used both styles over a very long period of time: if the programmer is willing to design their game around it, then the grouping ECS option can be significantly more flexible in design and great for supporting modding, at the expense of needing to know what to group to make the fast paths the fastest, and in many cases it can be quite a bit faster than archetype-style ECS’s. Otherwise, if you just want to dump components in and not worry about it, especially if you are going to have a ‘single archetype per type of game object’ (though in that case you don’t need an ECS at all, just use the dataflow pattern directly, as Factorio does; it has even less overhead than an ECS, though less flexibility, with no component changes at all), then archetype isn’t bad. In my experience though, component changes happen, and they tend to happen a lot more often than you’d think as games increase in size, as they are great for storing temporary or transient states and more.

My experience has largely been the opposite: set up the groups well with a good engine design and the fast paths are nearly always, if not exclusively, hit.

Such a Position component is actually a good example, given that these components are common enough that you’d probably want to give all positionable entities all of Orientation, Transform, Velocity, and Acceleration, which means you can make a group out of ‘all’ of those. (Often I’ll combine Position/Orientation/Transform all into one component personally.) In comparison to an archetype ECS, this group iteration will still efficiently go over all the entities on the ‘fast’ path, and in addition it will skip the data you are not iterating. For example I’d probably do this:

  • Components of: Position, Orientation, Transform, Velocity, and Acceleration.
  • Groups of:
    • Group A: Position + Orientation + Transform (as everything will have these)
    • Group AB: Position + Orientation + Transform + Velocity + Acceleration (probably only moving things will have these, but they could be on everything of course; usually velocity/acceleration lives in a physics system in my usual setups)

When you want to iterate over Position/Orientation/Transform then just do the equivalent of group_a.iter::<(Position, Orientation, Transform)>(). When you want to iterate over Position/Velocity/Acceleration then you’d do group_ab.iter::<(Position, Velocity, Acceleration)>(), and in each case only what you are wanting to access gets loaded into the CPU cache, making it as efficient as possible.

This same pattern holds for a vast amount of component designs, and if you keep grouping in mind when actually ‘making’ components then it remains on the ‘fast path’ in nearly all, if not outright all, cases. Say it’s a minecraft style game, and you want a block that looks for entities above it to set on fire on occasion; you might have an ApplyEffectInRelation component (or ApplyFireInRelation or whatever), and you’d make a group extended from the above group A with that effect. Sure, you don’t care about Orientation/Transform, but anything with a position will have those as well (and even if they don’t need them in the extremely rare cases, identity values for those are fine), and iterating over Position and Apply*InRelation will iterate just over those, not everything else an entity might have, skipping over the unwanted data.

In general you can have many many groups, and many many subgroups even of a single group as long as the subgroups are disjoint. Keeping the groups fine grained (generally only 1 or 2 components per subgroup ‘level’) makes it very easy to make fast paths where you might not have originally expected.

Even then, the slow paths aren’t that slow, let things remain as such for the not-often components, you can add a group there if the need arises (even dynamically during the gameplay if you so want, they aren’t hardcoded).

Not just ‘some’ but the great deal ‘most’, potentially even ‘all’ if the game is designed for it.

Make both an Orientation subgroup of Position/Transform and an Exclude<Orientation> subgroup then; by having both you can iterate over each pattern efficiently, and you can then have physics subgroups of each as well. Though I find this example a bit convoluted, as orientation is part of the transform in almost all cases, though you could use an enum to differentiate the cases as well (though then you skip a bit of ‘unused data’ in some cases). Other things can be done as well.

Same in a grouping, you just make an excluded subgroup. Groups have 3 parts to them:

  • Owned: Entities with these components are not only added to this group, but sorted as well.
  • Unowned: Entities with these components are added to this group, but no sorting, secondary index access.
  • Excluded: Entities only without these components are added to this group, of course no sorting.

Not at all, these are very natural cases in grouped ECS’s as well as described above.

Except that moving in or out of a group, when it’s an owned component, means only a single swap. The specific ordering of entities within a group doesn’t matter; what matters is that the ordering of entities is the same in all of the group’s owned pools, which costs a single swap operation per pool (to either the beginning or end of the ‘sorted’ region, depending). It’s very efficient to perform and significantly cheaper than copying all component data.

A ‘group’ does not hold the component data; it only enforces that the component pools themselves remain in the same order. That is cheaply done: when a component is added/removed, and that causes an entity to enter/exit a group, a simple swap is performed in each pool (except the one that had the component just added/removed, if it was an owned component, as it’s already at the end of the sorted group segment then).
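As a rough sketch of that single swap, assuming the same kind of sparse-set pool sketched earlier in the thread (illustrative only, not EnTT’s or EnRS’s actual code):

    struct Pool<T> {
        sparse: Vec<Option<usize>>, // entity id -> index into dense/data
        dense: Vec<u32>,            // packed entity ids
        data: Vec<T>,               // packed component data, parallel to dense
        group_len: usize,           // entries [0, group_len) form the sorted group region
    }

    impl<T> Pool<T> {
        // The entity has just satisfied the group's requirements: swap its slot
        // with the first slot outside the sorted region and grow the region by
        // one. One swap per owned pool; nothing else is copied.
        fn move_into_group(&mut self, entity: u32) {
            let idx = self.sparse[entity as usize].expect("entity owns this component");
            let boundary = self.group_len;
            self.dense.swap(idx, boundary);
            self.data.swap(idx, boundary);
            self.sparse[self.dense[idx] as usize] = Some(idx);
            self.sparse[self.dense[boundary] as usize] = Some(boundary);
            self.group_len += 1;
        }
    }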

However, adding/removing components is an extremely common case in most ECS-style games; it’s extremely convenient to be able to add/remove functionality based on huge varieties of conditions. And in my benchmarks (which you cited in your post) the component counts were low, where in general when I make such playground games I tend to have anywhere from ten to a hundred times that amount (more in some rare cases), and they get mixed in a significant number of different ways. That both means the archetype iteration gets even slower, due to having a large stride/wanted_data difference, and means having to test more variations of chunks for the right mix (which may waste a huge amount of memory in legion, as it makes chunks of a set size even when very often only 1 variation of an archetype may exist in a significant number of cases, further slowing down iterations of hot paths like positional rendering and more).

In addition, a grouping ECS doesn’t need to store data packed. In mine, when GATs (nightly Rust) are enabled, you can have, for example, positions stored in a packed vec for fast iteration plus a secondary index such as an R*-tree for significantly faster range and area queries. I haven’t yet seen a way to do these kinds of performance enhancements in an archetype-based ECS, meaning you have to keep these secondary lookups externally and try to keep them in sync, especially as a variety of systems/iterations may be doing something like adjusting positions.


Of course grouping isn’t best in all cases, but I do think it is the best in ‘most’ cases. Archetype I’ve tried using in the past but in almost all cases of it I ended up falling down to a basic dataflow pattern as it fit those cases even better while running faster than a chunk based archetype ECS like legion. Dataflow patterns are utterly fantastic for certain styles of games, unmatched in performance and readability when it can be exclusively used (technically ECS is a generic form of the dataflow pattern).

Porting EnTT was initially started as a learning task, as I’ve used EnTT in the past a good bit (it replaced my own homegrown ECS that I made in the late 90’s, as it was the only one I’d found that actually outperformed it in my engine tasks by the time I found it a bit back), so its code is… a bit horrifying right now, and it’s integrated into my general learning repo. In addition I also #![forbid(unsafe_code)] and constrained myself to only stable Rust, which means that I’m missing out on certain features that I want (like GATs, which are needed for user-definable component storage types). I hope to get time soon to untangle it from the rest of my learning repo and make it standalone, and maybe finish out the unstable Rust features behind a feature flag as well.

For note, the Shipyard ECS made by leudz is also a port of EnTT to Rust. It does things differently than I do (which gives me a ‘tiny’ speed boost in my benchmarks, but in general it’s inconsequential) and is more fleshed out than mine (using unsafe and such). I was planning to take mine ‘further’ (adding in more EnTT features that aren’t just ECS things, as well as porting some of my own EnTT extensions to it), but it might be better to submit my changes to Shipyard instead. For now (and likely in the future; I’m still planning to release EnRS but undecided whether I want to keep working on it or move to working on Shipyard, I need to look at its internals more to decide) definitely look at Shipyard instead; it has the same performance characteristics as mine (although it calls groups packs instead), even if it lacks one or two features that mine has right now.

For note, your benchmarks aren’t quite accurate either; those are from when I was dispatching each mutation in EnRS over a multithreaded channel (I’m making a batcher instead), 6 times per entity (6 components), which added a lot of overhead. I haven’t added Shipyard to my entity creation benchmarks, but if it follows EnTT there I’d imagine it’s the same as mine. Here’s some current benchmarks:

New 10_000 Entity Creation : 0 components

Entity-New-10_000/EnRS/0                                                                           
                        time:   [383.96 us 386.27 us 389.66 us]
Entity-New-10_000/EnRS-batched/0                                                                           
                        time:   [57.770 us 58.046 us 58.411 us]
Entity-New-10_000/Specs/0                                                                           
                        time:   [384.24 us 387.10 us 389.54 us]
Entity-New-10_000/Specs-batched/0                                                                           
                        time:   [120.77 us 122.79 us 124.31 us]
Entity-New-10_000/Legion/0                                                                            
                        time:   [642.44 us 643.06 us 643.67 us]
Entity-New-10_000/Legion-batched/0
                        time:   [468.68 us 470.32 us 472.31 us]

New 10_000 Entity Creation : 1 component (4 bytes total)

Entity-New-10_000/EnRS/1
                        time:   [842.74 us 845.44 us 847.77 us]
Entity-New-10_000/EnRS-batched/1
                        time:   [220.81 us 221.39 us 222.38 us]
Entity-New-10_000/Specs/1
                        time:   [1.0682 ms 1.0706 ms 1.0729 ms]
Entity-New-10_000/Specs-batched/1
                        time:   [308.85 us 310.98 us 312.54 us]
Entity-New-10_000/Legion/1
                        time:   [2.0604 ms 2.0742 ms 2.0855 ms]
Entity-New-10_000/Legion-batched/1
                        time:   [541.07 us 543.85 us 547.46 us]

New 10_000 Entity Creation : 6 components (14 bytes total)

Entity-New-10_000/EnRS/6
                        time:   [3.8841 ms 3.9116 ms 3.9476 ms]
Entity-New-10_000/EnRS-batched/6
                        time:   [1.3318 ms 1.3344 ms 1.3364 ms]
Entity-New-10_000/Specs/6
                        time:   [4.5179 ms 4.5322 ms 4.5634 ms]
Entity-New-10_000/Specs-batched/6
                        time:   [1.0444 ms 1.0678 ms 1.0814 ms]
Entity-New-10_000/Legion/6
                        time:   [3.8684 ms 3.8991 ms 3.9627 ms]
Entity-New-10_000/Legion-batched/6
                        time:   [1.1274 ms 1.1396 ms 1.1540 ms]

Iteration 10_000 Entities: 1 component (4 bytes, 1 mutation on component)

Entity-Iterate-10000/EnRS-Indirect/1
                        time:   [3.7396 us 3.7603 us 3.7931 us]
Entity-Iterate-10000/EnRS-OwningGroup/1
                        time:   [3.9118 us 3.9315 us 3.9665 us]
Entity-Iterate-10000/Specs/1
                        time:   [106.03 us 110.81 us 116.52 us]
Entity-Iterate-10000/Legion/1
                        time:   [13.235 us 13.326 us 13.438 us]
Entity-Iterate-10000/Shipyard-Packed/1
                        time:   [4.8280 us 4.9031 us 5.0084 us]

Iteration 10_000 Entities: 6 component (14 bytes, only 6 components, 6 mutations, one on each component)

Entity-Iterate-10000/EnRS-Indirect/6
                        time:   [400.13 us 400.88 us 401.76 us]
Entity-Iterate-10000/EnRS-OwningGroup/6
                        time:   [68.318 us 68.676 us 69.128 us]
Entity-Iterate-10000/Specs/6
                        time:   [244.52 us 251.59 us 261.28 us]
Entity-Iterate-10000/Legion/6
                        time:   [63.802 us 64.060 us 64.404 us]
Entity-Iterate-10000/Shipyard-Packed/6
                        time:   [76.824 us 77.344 us 78.076 us]

Iteration 10_000 Entities: 6 component (14 bytes, only 6 components, however component 2 is on 1/2 of the entities, component 3 is on 1/3, component 4 on 1/4, component 5 on 1/5, and component 6 on 1/6, 6 mutations, one on each component)

Entity-Iterate-10000/EnRS-Indirect/7
                        time:   [25.049 us 25.120 us 25.206 us]
Entity-Iterate-10000/EnRS-OwningGroup/7
                        time:   [1.7331 us 1.7397 us 1.7474 us]
Entity-Iterate-10000/Specs/7
                        time:   [58.791 us 63.399 us 68.721 us]
Entity-Iterate-10000/Legion/7
                        time:   [3.7814 us 3.9859 us 4.2058 us]
Entity-Iterate-10000/Shipyard-Packed/7
                        time:   [4.4613 us 4.7342 us 4.9372 us]

Iteration 10_000 Entities: 6 components (14 bytes, 7 components exist on each, 1038 bytes total, 6 mutations, one on each component)

Entity-Iterate-10000/EnRS-Indirect/8
                        time:   [481.54 us 484.95 us 488.56 us]
Entity-Iterate-10000/EnRS-OwningGroup/8
                        time:   [69.481 us 70.267 us 71.549 us]
Entity-Iterate-10000/Specs/8
                        time:   [377.40 us 386.04 us 399.62 us]
Entity-Iterate-10000/Legion/8
                        time:   [102.51 us 103.22 us 103.98 us]
Entity-Iterate-10000/Shipyard-Packed/8
                        time:   [79.896 us 80.349 us 80.993 us]

Iteration 10_000 Entities: 6 components (14 bytes, only 6 components, however component 2 is on 1/2 of the entities, component 3 is on 1/3, component 4 on 1/4, component 5 on 1/5, and component 6 on 1/6, 6 reads, 1 on each component, no mutations)

Entity-Iterate-10000/EnRS-Indirect/9
                        time:   [27.223 us 27.692 us 28.184 us]
Entity-Iterate-10000/EnRS-OwningGroup/9
                        time:   [1.3540 us 1.3652 us 1.3738 us]
Entity-Iterate-10000/Specs/9
                        time:   [57.192 us 60.423 us 64.210 us]
Entity-Iterate-10000/Legion/9
                        time:   [3.3196 us 3.3402 us 3.3591 us]
Entity-Iterate-10000/Shipyard-Packed/9
                        time:   [5.3513 us 5.4355 us 5.5131 us]

Iteration 10_000 Entities: 6 components (14 bytes, only 6 components, 6 reads, 1 on each component, no mutations)

Entity-Iterate-10000/EnRS-Indirect/10
                        time:   [441.30 us 446.83 us 451.78 us]
Entity-Iterate-10000/EnRS-OwningGroup/10
                        time:   [39.604 us 39.779 us 40.012 us]
Entity-Iterate-10000/Specs/10
                        time:   [333.47 us 349.49 us 371.10 us]
Entity-Iterate-10000/Legion/10
                        time:   [51.512 us 51.988 us 52.501 us]
Entity-Iterate-10000/Shipyard-Packed/10
                        time:   [81.441 us 82.204 us 83.016 us]

When mine isn’t sending a channel message on every single mutation it’s a lot faster than before, lol. ^.^;

Technically the EnRS batched insertion isn’t batched; the entities are, but the insertion of components isn’t. I need to batch those too sometime, but eh, it’s fast enough for now…

EDIT: Added a Legion-Packed to the new versions, even though that’s excessively hard to pull off in legion because it seems to require allocating a vec with all the components as a tuple, where specs/enrs take a function to construct as-needed, so they are much easier on memory and creation of things (though it does mean legion can memcpy them).

(Thomas Gillen) #3

This is not true. Components that are not accessed are not loaded at all.

Component changes do not happen frequently in heavily modded minecraft games. A tiny percentage of the total entities in a scene will go through a state transition which is well represented via a component change within an individual frame, whereas every entity is iterated over numerous times. A typical FPS game, for example, does not have more than a fraction of a percent of its active entities go through some kind of major state transition which will cause it to change its behaviour, within a frame.

More importantly, I think you are overestimating the performance differences between the two designs when it comes to these kinds of use cases:

In cases where your component add/remove will move the entity in or out of an (owned) group, the work swapping the entity’s components around the group boundary is fundamentally very similar to the work required to move the entity’s components between archetypes. In this case, the two designs converge into performing essentially the same operation.

You only gain fast component add/remove performance in cases where the component does not belong to a group. In such cases, you only retain high iteration performance in loops which only access that one component - something which will be extremely rare if you are using these high-frequency components as marker tags for state transitions. If a loop requires access to any other component, then it will perform indirection. The per-entity lookup into an indirection table, followed by an index into the actual component data array, will be significantly slower than a linear iteration through Option components with a branch inside the loop body. By comparison, actual component insertions and deletions are also far slower than simply setting the Option.

So there is not really any realistic situation where the grouped design gives you better performance when you are using high-frequency marker types to transition entity state. Either you have good iteration performance, and insertion times converge in both designs, or you would be better off just using Option.

It is actually easier to mix usage of Options for high-frequency state marking and component addition/removal for low frequency states in order to perform better entity culling in the loop with an archetype model than a group model, because you don’t need to worry about any combination of states pushing you out of the fast path due to incompatible group definitions (not to mention the combinatorial explosion of such group definitions).
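A minimal sketch of that Option-as-state-marker pattern (the component names are made up for illustration): the marker lives inline in a packed column, so the loop stays linear and “adding” or “removing” the state is just a write.

    #[derive(Clone, Copy)]
    struct Burning {
        seconds_left: f32,
    }

    // Two packed columns, parallel per entity; None means "not currently burning".
    struct FlammableStore {
        burning: Vec<Option<Burning>>,
        health: Vec<f32>,
    }

    fn tick_fire(store: &mut FlammableStore, dt: f32) {
        for (state, hp) in store.burning.iter_mut().zip(store.health.iter_mut()) {
            // Linear walk with one branch per entity; no indirection table
            // lookup and no out-of-order component fetch.
            let mut expired = false;
            if let Some(fire) = state.as_mut() {
                *hp -= 2.0 * dt;
                fire.seconds_left -= dt;
                expired = fire.seconds_left <= 0.0;
            }
            if expired {
                *state = None; // "remove" the state: no archetype move, no group swap
            }
        }
    }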

You most certainly would not want to do that! By attaching additional components onto an entity, you are associating both data and new behaviour with that entity. New behaviour which is not needed. Entities which don’t rotate are now all having their transforms calculated with a much more expensive computation than they need. All your entities with a position are now having their acceleration and velocity integrated, even though most entities in a scene are static and will have 0 velocity. The extra work being performed here will totally overshadow any performance differences between the two ECS architectures.

Additionally, if you placed all position/orientation/transform into a single component, you are going to cause most loops which access this component to pull in far more data than they need. Most logic which needs position, for example, only needs the float3 translation - not the orientation and certainly not a full 4x4 32bit float transformation matrix. That matrix alone is an entire cache line.
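The sizes behind that claim are easy to check (assuming plain f32 fields; the component names here are only illustrative): a bare translation is 12 bytes, while a 4x4 f32 matrix is 64 bytes, so a loop that only needs positions touches roughly five times less memory when the matrix lives in its own component.

    #[derive(Clone, Copy)]
    struct Position([f32; 3]);       // 12 bytes
    #[derive(Clone, Copy)]
    struct Transform([[f32; 4]; 4]); // 64 bytes: a whole cache line by itself

    fn main() {
        assert_eq!(std::mem::size_of::<Position>(), 12);
        assert_eq!(std::mem::size_of::<Transform>(), 64);
    }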

Splitting data into granular pieces so that you only load the data you need, and only iterating over the entities which actually need the computation you are performing, are perhaps the central tenets of data oriented design that the ECS tries to facilitate. If your ECS is pushing you to organise your code sub-optimally in order to stay on its fast path, then that is a critical failure of the ECS’s design.

Excepting my above objection, which largely applies to this group layout, this is only possible in a very limited case where you can arrange your two loops such that one is running over a superset of the other. What about the other hundred loops in the game which access position? What about your AI code, gameplay logic, rendering, etc? A component can only be owned by one group, and all other cases perform indirection (some worse than others) and so are on the slow path. It is impossible to optimise for all cases, but the grouped design will eke out a small amount of extra performance in one (or two) loops at the expense of performing terribly everywhere else.

They are an order of magnitude slower in your own benchmark.

That is certainly not how any of it is described in the EnTT blog posts. If you do not keep the components that are part of a group sync’d in the same order for each entity, then you cannot iterate through the slices together. You would need to look up the correct index in an indirection table of the sparse set and then perform a random access. Doing so throws away your performance, just the same as if an archetypal ECS somehow had a unique layout for literally every entity.

To my understanding, this is what unowned groups do. The only advantage of unowned components in a group vs not being in a group at all was, as far as I could see, that the iterator does not need to iterate over all components of one type (whichever has the fewest instances) and perform the indirection check against all other components. Instead the iterator can assume that all entities in the group have all components. The accesses are still out of order, though, so this still performs poorly.

No, that is not extremely common. It is common to have entities with maybe 50 or so components because, say, your AI navigation code alone uses a set of 12 components (and there are a few modules interacting with that entity at a similar scale). However, most entities with this AI navigation use those same 12 components, or only 2 or 3 different variants. The number of unique layouts is not anywhere near that large, and you typically have a healthy number of entities in each chunk, with only a very small number of poorly occupied chunks.

Take for example your average FPS. You might have an entity which represents, say, a chair prop. That chair will likely have a transformation matrix component, a model, perhaps a handle to a rigid body in the physics engine, and maybe a component which indicates the material for the sound engine to play the correct sound effects if the player bumps into it. That is about it; a handful of components. Entities like this make up the majority of entities in the scene. There are going to be many other props, in different locations and with different models and physics representations, all with the same layout. That layout also virtually never changes. Maybe if the player shoots the chair, it breaks. This would likely be implemented by deleting the entity and replacing it with a few “chair parts” entities.

Even highly dynamic entities do not typically go through component layout changes. Say your game is a multiplayer game with 20 players. They are all firing machine guns simultaneously. You represent every bullet fired with an entity (which is totally reasonable). Each frame, you might have at most 20 new bullets fired, with more likely just 1 or 2. You create a handful of bullet entities. All of the active bullet entities update their positions, they update their transforms, they are rendered by a particle system, the audio engine plays the next frame of their audio and spatialises it at their current position. Some of those bullets hit something in this frame. New “bullet hit event” entities are spawned and those bullets are deleted. At no point was a component added or removed from an existing entity. However, multiple loops in multiple different modules iterated through all of those bullets.

I am not sure what kind of games you build with these toy projects; they sound like they might be doing something rather interesting. But they are certainly not typical.

Allocating memory in pages is something orthogonal to grouped vs archetypal.

(OvermindDL1) #4

Not loaded by your code, sure, but they are loaded into the CPU cache because the memory is adjacent. That eats up a significant amount of cache with memory you aren’t accessing, which then makes accessing the rest even slower; and the strides of memory loaded into the closer caches won’t even hold a single entity at times, etc., etc.

Vanilla MC does not use an ECS. Vanilla MC PE interestingly uses EnTT, however I’ve no clue in what form they use it (I would expect quick component swapping in a variety of areas, especially the AI, to determine actions based on various states, unless they went the Java version’s way where everything is tested every single tick, yay efficiency…).

It’s not just behaviour changes; components also often hold various data, like what an entity is targeting, what’s targeting it, what navmesh path to use, lots of events, etc., etc. A lot of that can change extremely rapidly on a lot of things. With many components in use, that means a lot of data is being needlessly copied.

I’ve used both styles in the past (mostly between the late 90’s and early 2010’s; real life slowed down my work on such things the past 6 or so years), and the archetype style is faster in certain styles of engines. However, those engines are very inflexible when you start stepping outside their design, as the performance hit becomes rather extreme, and if you are going to follow that pattern anyway it’s often better just to use the dataflow pattern directly, as referenced in my prior post, as it will have even less overhead.

They are definitely not similar: in the archetype-based design you are moving all components of an entity, while in the group-based design you are only swapping the grouped components that are not already in order (and if something keeps getting enabled/disabled it’s often already in order anyway), which touches significantly fewer components.

Not at all, components being added and removed is still extremely fast even when they are in groups. Only the groups the component is in are touched at all, and only the components of those groups are touched, which is generally a significantly smaller amount of memory than in an archetype model.

Such an indirection is not that expensive either (it is in mine because I’m not using unsafe code, though in an actual release unsafe code would be used in a few of these parts to remove useless bounds checks); in the standard group view there are no conditionals, just a couple of offsets (additions) and a lookup in the index and then the main array. Specs, for example, does the lookups faster thanks to its use of unsafe on certain calls (and I could get faster than it, as this system knows more information about the structure). Even then, secondary index lookups are comparatively rare, as the hot paths should always be owned groups, which is a pattern that holds for users of EnTT (both games and engines based on EnTT).

There are quite a significant number of realistic situations where the grouped design is faster when using high-frequency marker types. (Do note: for a ZST marker type the secondary index *is* the store of whether it exists or not; there is no other data stored, so there is no indirection either, nor is there any if you only want to know whether the component exists without actually accessing its data.) There is significantly less data to copy on change, and iteration is still faster than the archetype style due to the significantly lower stride length (stride length == data size in a grouped ECS, while stride length is significantly larger than data size in an archetype ECS, which is why the iteration benchmarks always show grouped faster than archetype), among other design choices that can be made in certain situations that can’t be done in archetype styles.

Using Option has a three-fold speed hit:

  1. You are iterating over things you don’t care about.
  2. You are adding in a conditional to check if it is None or Some when a conditional is not needed (and a potentially very random conditional, so that’s awesome to blow the CPU pipeline out).
  3. You have yet even more data that you don’t care about to stride over in every-single-other-iteration elsewhere.

Transformations include rotation already; that’s part of being a transformation. Unless you meant a Translation component (linear translation without rotation or scale), which then means you’ve got to pack it up into a transformation matrix anyway for rendering, which just makes that slower (and translating a transformation matrix isn’t any slower than translating a vector in entity space). I’ve never seen rotation separated from translation; I’ve only ever seen it packed into a transformation matrix (generally with a changed flag, in non-ECS’s or in ECS’s that don’t have update indicators, for updating to the GPU on change, and this step is much more costly when the values are separated).

Entities with a position would not have acceleration and velocity integrated. Combining Position and Orientation and Scale into a single Transformation component (say, a 4x4 matrix) does not mean combining in Velocity and Acceleration (those two themselves would also generally be combined into a single PhysicsMotion component or so). And making a group of Position/Orientation/Transform just means that you have all the information needed to render the object; making another group of Position/Orientation/Transform/Velocity/Acceleration then just means that you can perform physics over everything that needs to have physics done (else remove the Velocity/Acceleration).

Where you say:

All your entities with a position are now having their acceleration and velocity integrated, even though most entities in a scene are static and will have 0 velocity.

Makes absolutely no sense, or I’m confused about what you are saying. A grouping of Position/Orientation/Transform/Velocity/Acceleration does not mean that Position/Orientation/Transform is slower, and you absolutely should not have Velocity/Acceleration on things that are not moving at all. Remember that subgroups are just that, a group within a group, and since the Position/Orientation/Transform group already contains every entity that Position/Orientation/Transform/Velocity/Acceleration has, the latter can just sort those entities within the former (a sorted region within another sorted region, in as many layers as you want). It does not mean they are connected in any other way.

Considering you are saying things like:

All your entities with a position are now having their acceleration and velocity integrated, even though most entities in a scene are static and will have 0 velocity.

Makes me think there is a failure in understanding how grouping works. Entities that don’t move shouldn’t have such physics components, and I don’t understand why you would require them on everything with positions?

For something like, say, a Block Position in minecraft, you’d absolutely not have rotation (well, maybe a 24-state rotation for rendering and interaction purposes, but that’s parallel to this), and as such it would be its own component, distinct from entities that have a full transformation (like a zombie). Even if for some reason someone gave an integral block position a full floating point transformation matrix, the data stride still easily fits into cache, with loading times of less than 100ns per cache line load, which will happen much faster than working over that data considering it’s only 64 bytes, though you definitely should not combine those as they are conceptually very different in the game.

In addition, even on my 13-year-old desktop CPU, its L1 cache line (stride) is 128 bytes and takes 3 cycles to fill, much faster than it matters, and newer CPUs are much, much better, especially for aligned data such as this.

Which is precisely what archetype does wrong, all data is packed together with huge stride sizes. This is shown in benchmark case 8:

Entity-Iterate-10000/EnRS-OwningGroup/8                                                                           
                        time:   [71.216 us 71.800 us 72.502 us]
Entity-Iterate-10000/Legion/8
                        time:   [106.27 us 107.23 us 108.25 us]

(107.23:71.8 is almost 50% faster for the owned groups)

Which should be an entirely optimal case for legion. The reasons mine is faster are:

  1. Significantly smaller stride length as I can actually get loaded into cache just what I want.
  2. I don’t have to test the component sets of the chunks for disparate archetypes (not an issue in this test, as every entity has identical components; the effect shows up in case 7, which has 2^5 (32) different archetypes):
Entity-Iterate-10000/EnRS-OwningGroup/7
                        time:   [1.6935 us 1.6998 us 1.7059 us]
Entity-Iterate-10000/Legion/7
                        time:   [3.5821 us 3.6087 us 3.6373 us]

(3.6087:1.6998, roughly 2.1x faster)

This is a very important bit about archetype ECS’s:

And as you add more archetypes (of which there would be thousands if not tens of thousands in some of my engines), it will just get slower and slower uniformly throughout, even without doing anything different. Archetype count doesn’t matter for owned groups; what matters is how well the programmer can structure the data (which isn’t at all hard once you’re used to it). In comparison, even the slowest of all accesses in grouped systems (a secondary index lookup on a completely unsorted pool) is still constant time (as are all the other accesses).

Excepting my above objection, which largely applies to this group layout, this is only possible in a very limited case where you can arrange your two loops such that one is running over a superset of the other.

This is an extremely common case in mine and others’ experience across many games and engines (I invite you to the EnTT Gitter chat). Sure, the programmer has to actually think about how to structure their data better, but that’s what they have to do anyway, and in exchange they get faster speed and more direct code.

What about the other hundred loops in the game which access position? What about your AI code, gameplay logic, rendering, etc?

This is actually why the transformation components are usually at the ‘bottom’ of a set of groups, so that you can group over them with a huge variety of other components as well, gaining perfect iteration with perfect strides the whole way. You can structure a surprisingly huge amount of the engine this way. :slight_smile:

The ‘owned’ bit is very common. You can have many subgroups, not just a single layer but potentially many. In addition, looking up in a secondary index across groups only involves a single indirection for anything accessed with that group, not one per component. In my benchmarks the OwningGroup test and the Indirect test are literally the two extremes, the best-case and worst-case scenarios; in reality performance will fall between them, and most often very near the OwningGroup side, as the hot paths in the program should always be owned groups. It is under the programmer’s control, instead of just hoping that something else takes care of it with its constantly growing access cost as the archetype count and stride sizes increase.

It is impossible to optimise for all cases, but the grouped design will eke out a small amount of extra performance in one (or two) loops at the expense of performing terrible everywhere else.

Not one or two; the significant majority of loops will have full owned performance. If only one or two out of the usually hundreds of iterations are owned, then the engine is designed entirely backwards. Take for example a minecraft-like engine: you wouldn’t have a, say, tube block constantly asking for the block at the next position; that would be extremely slow, and not even the mods of Minecraft do that, as it is dreadfully slow. They cache the pointer to the TE (tile entity) that sits beside them, just as you’d do in an ECS-style version of the engine: you’d hold the entity that sits next to you and update it when it changes, just as MC mods do now.

Yes, one order of magnitude, and that is because I’m using completely safe Rust even when I absolutely know that accesses have no bounds issues but the compiler does not (confirmed by looking at the assembly). There are multiple things in these paths that I can speed up with unsafe code that is entirely sound; I’m #![forbid(unsafe_code)] right now to speed it up as much as I can that way first, and then and only then will I add unsafe code in areas I know to be entirely safe. The fact it’s already this fast while hitting secondary indexes when it doesn’t need to, and more, is still astounding to me, although the C++ version is still a lot faster on those accesses; at worst I should be able to match its speed (and my owned group access is already faster than in C++, likely due to Rust’s aliasing guarantees). Even if they were never optimised they are still entirely sufficient for the non-hot-paths, else groups should be used.

That is certainly not how any of it is described in the EnTT blog posts.

Wait what? Can you link that post? I know his blogs talk about a lot of Ideas and not necessarily his implementation, but when his blog posts speak of EnTT specifically he’s usually quite clear about it, so definitely link the article and where at in the article and I’ll tell him about it.

If you do not keep the components that are part of a group sync’d in the same order for each entity, then you cannot iterate through the slices together.

The ‘entire’ pools aren’t kept synced; only (in my and EnTT’s implementations) the ‘end’ of the packed pools is kept sorted. Only the part that the group cares about is sorted at all; the rest can be sorted an entirely different way or left unsorted.

just the same as if an archetypal ECS somehow had a unique layout for literally every entity.

Which can happen with enough permutations of enough components. ^.^

Unowned groups are fairly rare; generally you will have some combination of owned and unowned (and excluded). But even in a pure unowned case the group still knows beyond any shadow of a doubt that the pools contain the entities, so the secondary index lookup becomes pure math: no conditionals, no CPU pipelining failures (although in my purely safe version there are bounds checks; the rest of my lookup code has no conditionals in those cases, and I haven’t shown unowned benchmarks yet as I need to make the coding API pretty for them still). That becomes ‘almost’ as fast as pure iteration for low component counts, and at worst around a 100-cycle load per component access, in the worst case of accessing components in a huge pool that doesn’t entirely fit in cache, accessed in such a way that you only touch what’s not in cache. In real life that is actually crazy rare; it’s normally fairly close to owned iteration times, within a multiple or two on average.

So even pure unowned groups are a great deal faster than the worst case of an indirection lookup and test for existence every time like my Indirect test does (quite literally the worst case access in entirely safe rust, lots and lots of conditionals and tests and lookups all over the place, I’m honestly surprised at how fast it is already, it definitely has quite a number of optimization opportunities with unsafe code).

For the AI systems I’ve dealt with in the past there were around 150 components just for handling AI, significantly not 12 (even minecraft, which is not an ECS, allocates almost a hundred classes to handle its comparatively meager AI, each mob type has a different ‘archetype’ of them, which in modded worlds is many hundreds in most setups, or in some cases like infernal mobs can be thousands to tens of thousands of ‘archetypes’ as the effects it adds are dozens of different AI handlers that can be added in any permutation randomly on any spawned entity).

I’m not much for FPS games, but I would imagine a chair as a static entity with no functionality; sure it’s common, but it’s not where the cost is going to be, as it does nothing anyway until, say, it is interacted with or receives a physics event or so (which I’d imagine would then ‘add’ physics functionality to it, rather than giving it that from the start and letting it sit there eating its tiny bit of the physics simulation, but eh, I’m not an FPS’er so unsure what is common there). And nicely these kinds of things would be in an owned subgroup, so they’d be iterated perfectly, no chunk jumps or anything needed (although why they’d need to be iterated over at all I’m unsure…).

Even highly dynamic entities do not typically go through component layout changes.

Highly dynamic entities are generally swapping components on and off all the time for various state handling. For an FPS I’d imagine that, for example, there is a Health component that holds their max health, and a Damaged component that doesn’t exist when they aren’t damaged; that component would hold how damaged they are and perhaps when they last took damage. There could be a system that iterates over entities that have the Health and Damaged components but not the Regenning component and checks the time; if it’s more than, say, 2 seconds (or whatever time FPS games use before HP regeneration starts nowadays), then it adds a Regenning component. Another system operates over entities with Health, Damaged, and Regenning components and heals the damage until it’s fully healed, at which point it removes the Damaged and Regenning components.
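A self-contained sketch of that state machine, using plain HashMaps as stand-in component pools (the names and numbers are made up; in a real grouped ECS these would be packed pools and, ideally, owned groups):

    use std::collections::HashMap;

    type Entity = u32;

    #[derive(Default)]
    struct World {
        health: HashMap<Entity, (f32, f32)>, // (current, max)
        damaged: HashMap<Entity, f32>,       // seconds since the last hit (updated elsewhere)
        regenning: HashMap<Entity, ()>,      // marker component, present only while regenerating
    }

    // System 1: Health + Damaged, but not Regenning -> start regenerating after 2 seconds.
    fn start_regen(world: &mut World) {
        let starting: Vec<Entity> = world
            .damaged
            .iter()
            .filter(|&(&e, &since)| {
                since >= 2.0
                    && world.health.contains_key(&e)
                    && !world.regenning.contains_key(&e)
            })
            .map(|(&e, _)| e)
            .collect();
        for e in starting {
            world.regenning.insert(e, ()); // component added to an existing entity
        }
    }

    // System 2: Health + Damaged + Regenning -> heal, then drop both markers when full.
    fn tick_regen(world: &mut World, dt: f32) {
        let mut healed = Vec::new();
        for (&e, _) in &world.regenning {
            if let Some((current, max)) = world.health.get_mut(&e) {
                *current = (*current + 10.0 * dt).min(*max);
                if *current >= *max {
                    healed.push(e);
                }
            }
        }
        for e in healed {
            world.damaged.remove(&e);
            world.regenning.remove(&e); // components removed from an existing entity
        }
    }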

And that example is just one of many, many that I would foresee being on just a normal FPS player entity. All kinds of events and states and actions should be components, so systems only operate on them if they are interested in them. You want to remove as many conditional checks as possible, especially the more unpredictable ones, and reduce memory loading (which an archetype ECS fails at horribly due to loading huge amounts of unwanted data into the cache because of the stride lines).

I tend to make sandboxes like heavily modded Minecraft (though mostly in 2D; I've only ever made one 3D one, with procedurally generated cubic planets and solar systems and such, down to life on each), and long ago I made Factorio-style things. Mostly I've made self-running simulations; I like watching things unfold and evolve under their own code and genetic algorithms. ^.^

So yeah, perhaps not typical, but intensely fun for me and my friends and family. I need to get my old CVS server running again to pull off its old projects and migrate my SVN server to git someday…

Quite true.

(OvermindDL1) #5

I was curious, and it’s been a month since I ran perf, sooo… ^.^

On case 8:

Entity-Iterate-10000/EnRS-OwningGroup/8
                        time:   [71.765 us 73.013 us 74.423 us]
Entity-Iterate-10000/Legion/8
                        time:   [109.76 us 111.80 us 115.01 us]

Some interesting things:

perf                        EnRS              Legion
L1-dcache-loads             18,420,844,027    28,100,746,684
stalled-cycles-frontend     937,794,276       3,618,488,369
Instructions                27,829,811,094    39,721,454,618

So immediately we see that Legion loads a significantly larger amount of data. Its cache miss rate is lower by percentage, but because the total is so much higher this is the major cause of its slowdown. This is the stride issue I was talking about: Legion, as it is programmed now, is causing the CPU to load into its cache all the data in a chunk rather than just the data that is wanted. Each cache miss stalls the CPU pipeline and adds over 70ns of cost per miss.

Legion is also running a lot more instructions, I’m curious about that… Even with hopping around chunks I wouldn’t think it would be that high…


Hmm, let's test the immutable version of case 8, which is case 10. Ah, much better: the instruction counts are almost identical, even though the results are still quite far apart considering that the instructions run were nearly the same:

Entity-Iterate-10000/EnRS-OwningGroup/10
                        time:   [34.801 us 35.111 us 35.551 us]
Entity-Iterate-10000/Legion/10
                        time:   [53.385 us 54.338 us 55.207 us]

Some interesting things:

perf                        EnRS              Legion
stalled-cycles-frontend     1,288,481,455     3,814,114,809
Instructions                51,154,752,370    52,262,857,959

So the stalled cycles are still huge for some reason…

Let me take some instruction recordings to see differences between them. First I’ll do EnRS since it’s faster here:

Wow! At 18.5% my most costly call is _$LT$u32$u20$as$u20$enrs..entity..entity..Entity$GT$::idx::h7da2a1b301823677, however that's in the setup of the benchmark and not the benchmark itself (and this is some of the safe Rust code that would be faster as unsafe). Constraining the view to just the benchmark instructions, I can't pick out the most costly call within the closure of the iteration itself, because it was all inlined… The iteration part took 4.05% of the total benchmark runtime.

Same with Legion, it was also all inlined… The iteration part took 2.29% of the total benchmark runtime.

Makes it hard to get per-instruction costs, hmm, can keep looking later, busy at work again.

(OvermindDL1) #6

My mistake! Legion does group components together and not entities; all the unsafe code was hard to read back when I was newer to Rust. :sweat_smile:

In that case that’s one performance hit that legion shouldn’t have… So why is it slower…

(Thomas Gillen) #7

I wrote this reply yesterday, but the forums went down as I tried to post it:

No, it doesn’t. Component slices are aligned to cache lines and are not interleaved with any data that is not needed. A query which does not access a component does not touch that memory at all. Different component types do not ever occupy the same cache lines. Fetching one component does not cause the CPU to load other components out of memory. They do not need to be strided over.
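
As a minimal sketch of what I mean (not legion's actual internals; the types here are invented for illustration), each component type within an archetype lives in its own packed array, so a query over positions never touches velocity memory at all:

    struct Position { x: f32, y: f32, z: f32 }
    struct Velocity { x: f32, y: f32, z: f32 }

    // One archetype: every entity in it has the same component layout.
    // Each component type gets its own separate, densely packed array.
    struct Archetype {
        positions: Vec<Position>,
        velocities: Vec<Velocity>,
    }

    // A loop that only needs positions simply never reads the velocity allocation.
    fn positions_only(arch: &Archetype) -> std::slice::Iter<'_, Position> {
        arch.positions.iter()
    }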

None of those are things which are added or removed from a large number of pre-existing entities each frame. It is a tiny, tiny percentage of the entities which are processed each frame which go through a layout-changing state transition.

That is precisely the issue I have with grouped ECS designs. Only loops which match an owned group will access all components in-order. Anything else and you totally destroy your performance, and major loops with incompatible group requirements are extremely common.

Only if you are happy with ordering not being maintained across components in different groups, which results in out-of-order access in any loop which needs to join groups or access ungrouped components - totally destroying the performance of such loops.

This is the central issue I have with the design; you need to be extremely careful about what data your loops access and how you define your groups, else you are hit with massive performance penalties. The penalty for missing this is similar to the most extreme archetypal fragmentation possible - as if every entity had a unique layout. Given that there are strict limitations on what groups can exist concurrently, avoiding this pitfall everywhere is just impossible.

Only owned groups avoid this. Unowned groups only assist with avoiding overheads which already never exist in an archetypal model, and still suffer from poor performance.

Avoiding bounds checks is a totally inconsequential performance advantage. The major performance hit comes from the “and a lookup in the index then main array”. Those lookups are only ordered if you are looping through an owned group.

Only if your components are far too broadly defined, which is precisely what you suggest below.

No it isn’t. Stride length is the component size. The issue is that an archetype will perform one hop to a new memory location, and likely incur one cache miss, each time it moves to a new archetype. The cost of that single cache miss is amortised over all entities in the archetype. You can model this overhead as one cache miss per n entities, where n is the average number of entities accessed by a specific loop. As long as n is enough to run through a few cache lines (which only needs to be a handful of entities), then this overhead is going to be negligible.
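
To put rough numbers on that (both values below are assumptions for illustration, not measurements):

    // Illustrative only: amortising one archetype-jump cache miss over a slice of entities.
    fn main() {
        let miss_cost_ns = 100.0;       // assumed cost of one miss to main memory
        let entities_per_slice = 128.0; // assumed average archetype slice size
        println!("fragmentation overhead: {:.2} ns per entity", miss_cost_ns / entities_per_slice);
        // prints roughly 0.78 ns per entity
    }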

EnRS, on the other hand, does not cause any additional cache misses if you are looping through an owned group. In any other case, it is likely to perform one jump to a new memory location for every component. This is why it runs an order of magnitude slower when this happens.

The performance difference between legion (in every benchmark) and EnRS (in optimal owned group cases) is small enough to be caused by varying levels of optimisation in either library. For example, legion (on master) performs full archetype filtering on every execution of its loop. This can be cached, causing every loop to simply index into a component slice for each archetype and then linearly walk through the slice. Neither library is consistently faster than the other across all tests.

  1. You are iterating linearly through components you don’t care about, which can easily be better than iterating randomly through only those you do. Which is true depends on how effectively you can cull the entity set with other component accesses or filter conditions. You will only get good performance from adding or removing components here if the loop is through owned groups - which for state flags plus the data you actually need to process is just never going to happen due to the number of incompatible permutations that you need to put into different owned groups.
  2. A cache miss due to unordered access is hundreds of times slower than a branch misprediction.
  3. Not true. I don’t know why you think this is the case.

A position is an (x, y, z) vector. Most code which needs to know the position of an entity does not need to know about the entity’s orientation, and certainly not the full transformation matrix. Passing a full transformation matrix around will cause you to load just over 5 times as much data as you actually need. That is the antithesis of the goals of using an ECS and of data oriented design. Instead, you use a minimal Position(x, y, z) component attached to entities which have a position. You might also attach Rotation(x, y, z, w) and Scale(x, y, z) components to entities which actually need that. But not to entities which don’t. Entities which need a transformation matrix, e.g. ones with a model to render, will also have a Transform(matrix4x4) component. But not those which don’t. Some might also have a Parent(Entity) component, if they exist as part of a hierarchy. But not those at the root. A system will dispatch loops to calculate the transformation matrix based upon the canonical position/scale/rotation/parent before rendering is performed, with one loop for each component permutation. You don’t need to perform all the trigonometry calculations for a full transformation matrix if the entity in question does not have any rotation, and as most of your entities are not gaining or losing the ability to be rotated every frame, you should be able to rely on the filtering capabilities of the ECS to select which case you are processing.
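
As a minimal sketch of that kind of split, and of dispatching one loop per permutation (illustrative types and maths only; a real implementation would use a proper maths library):

    #[derive(Clone, Copy)] struct Position { x: f32, y: f32, z: f32 }
    #[derive(Clone, Copy)] struct Rotation { x: f32, y: f32, z: f32, w: f32 } // quaternion
    #[derive(Clone, Copy)] struct Transform([[f32; 4]; 4]);

    // Loop for entities with (Position, Transform) but no Rotation or Scale:
    // just write the translation, no trigonometry or quaternion maths at all.
    fn update_translation_only(items: &mut [(Position, Transform)]) {
        for (pos, tf) in items.iter_mut() {
            let mut m = [[0.0; 4]; 4];
            m[0][0] = 1.0; m[1][1] = 1.0; m[2][2] = 1.0; m[3][3] = 1.0;
            m[0][3] = pos.x; m[1][3] = pos.y; m[2][3] = pos.z;
            tf.0 = m;
        }
    }

    // Entities which also have Rotation (or Scale, or Parent) would be handled by
    // separate loops with their own, more expensive bodies, selected by the ECS.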

The important part here is a) leverage the ECS to select the minimal amount of entities needed for each piece of work and b) don’t load any data you don’t need; velocity integration only needs 3 floats for position, not a 16 float matrix.

You can’t do either if you are dumping all somewhat related pieces of data into a single component. The lost performance due to executing a larger loop body than needed for many entities because your loops are not selecting with fine enough granularity will totally blow away the performance difference between owned group loops vs archetype loops.

But this brings us back to the major concern that I have with grouped ECS. If you split your types up to perform sufficiently granular loops as you ideally should, then you massively increase the number of component permutations you are looping over, and thus the number of owned group permutations you need to be able to define to maintain high performance in-order access. There is an extremely high probability that if you try and use the ECS like this, you will have many loops which try and access e.g. (Position, Rotation, Transform) and (Position, Scale, Transform) - which are two groups which cannot both be owned groups at the same time. At best you might be able to have one or two of these loops leverage owned groups, and hopefully the rest use unowned components for some of the variations - but these unowned loops perform very slow out-of-order accesses. The performance of which is significantly worse than any plausible case in an archetypal ECS.

The grouped ECS forces you to write suboptimal ECS code which does not take full advantage of the potential of the system, in order to stay on the “fast path” of owned groups.

I understand how subgroups work. I was responding to:

… in which you suggest adding Velocity and Acceleration to all “positionable” entities. I explained why you should not be doing that.

My objection to the subgroup layout you suggest is the same as I explained in the paragraph below that one. You might be able to add motion data to position data to optimise both the transform update loop and the velocity integration loop… but in most cases of loops wanting position they do not neatly form supersets of their desired components with one another, so you simply cannot optimise all of them. Any that are left out of having a proper owned group are going to perform poorly and many of those will be major loops in the game.

Not to mention that even for the transform and velocity loops, your proposed group layout is far from ideal; entities with positions and velocities will now only actually move if they also have a transform… even though transform is not used by this logic. Entities without transforms won’t be in these groups at all. So what this loop actually wants is a group which owns (Position, Velocity) - which now means any other group you want to define which contains an owned position (which you need if you want those loops to be fast) must also require that the entity has a Velocity, because of the restrictions on how subgroups can be defined.

The ECS is forcing you to execute a more complex loop body than your entity really needs (due to putting things like rotation on everything with a transform), to load more data than you need (by packing too much data into larger components), and it causes nonsensical behaviour, such as a point-mass entity with just a position and velocity not moving unless you also attach things like an orientation and transform matrix to it. You do this because if you don't, the ECS performs terribly.

The issue is not so much whether or not a component can fit into a cache line, but rather that if your component is twice as large, you can fit half as many into a cache line. So you will need to fetch twice as many cache lines, and that generally results in an increased number of misses (even on linear access). Also, 100ns is a very long time relative to the cost of processing a single matrix, which is why a cache miss can so easily dominate CPU cycles when it happens. A cache miss can sometimes cost you a hundred loop iterations.
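
For a concrete (purely illustrative) comparison of how many components fit in a 64-byte cache line:

    // Illustrative sizes only.
    fn main() {
        let cache_line = 64;
        let position = 3 * 4;   // Position(x, y, z) as three f32s = 12 bytes
        let transform = 16 * 4; // 4x4 f32 matrix = 64 bytes
        println!("positions per line:  {}", cache_line / position);  // 5
        println!("transforms per line: {}", cache_line / transform); // 1
    }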

L1 cache misses are a total non-issue here. Fetching from L2 on an L1 miss is fast enough that it is not a major concern, and you will get good L1 hit rates without optimising your code. The concern is L2 misses, which require the CPU to go out to main memory. Modern CPUs are actually worse at this, as memory speeds have not improved nearly as fast over those 13 years as CPU speeds, so the cost of a miss in terms of lost CPU cycles has gotten higher.

Given that you are misunderstanding something about how legion organises component data - it does not stride over unused components - this difference is most likely caused by legion paginating its memory into chunks. There are pros and cons to doing this which are entirely orthogonal to either ECS architecture and the page size in legion is not something which has been carefully tuned.

There is a critical mistake being made by trying so hard to remove the cache misses caused by jumping between archetypes:

The total execution time of a loop scales primarily by the number of entities accessed.

The consequence of this is that performance is impacted far harder by changes which affect the per-entity cost than by those which scale independently of entity count. Archetype count scales highly sub-linearly with entity count, and you see a smaller and smaller increase in archetype count as you ramp up to higher entity counts.

This is why it is only easy to demonstrate examples of EnRS significantly out-performing legion when you are working on loops which access a small number of entities, in a test which tries to include a large number of layouts in the result set; these are cases where the number of archetypes is significant relative to the loop size. These loops are not the performance-critical loops in your game. In fact, such loops are unlikely to be those you prioritise to have their own owned groups, so a benchmark against unowned groups (or none) would be more accurate here.

However, EnRS chooses to remove this sub-linear scaling cost from certain loops. The cost for this is that when those optimisations fail (which I contend above will either happen a lot or you have to not use the ECS to its potential), you instead incur out-of-order accesses. These accesses are a cost incurred per entity.

The more important a loop is to the overall performance of the game, the harder the penalty is for not giving it an owned group. Conversely, for an archetypal ECS, the larger a loop is the less of an impact archetype jumps have.

It seems like you can only hit a decent portion of the “hot paths” in the game by designing highly sub-optimal component sets and running loops which select over both more components and more entities than they really need to.

Yes, that is what I was talking about. The components which are actually part of the group must be kept sorted. It is explained in part 2:

All what is needed now is to literally move the entity e7 within the group.
To do that, we can swap the entity after the one pointed by the group with the one to which we’ve just attached the component, then advance the group. This must be done for both the sparse sets

By construction, the sparse sets contain the entities that are part of the group at the top of their packed arrays, all ordered the same.

Furthermore, the sparse sets involved by the group will be such that the first N elements of their packed arrays also contain the entities and components of interest, ordered the same.

This is, surely, the property of owned groups which allows them to be iterated through in-order without any indirection?
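
A heavily simplified sketch of that bookkeeping (illustrative only, not EnTT's actual implementation): group members sit in the first group_len slots of each owned pool, and adding the last missing component swaps the entity into that region in every owned pool, keeping them all ordered the same.

    // Sketch of one sparse-set pool participating in an owned group.
    struct Pool<T> {
        sparse: Vec<usize>, // entity id -> index into `dense`/`data`
        dense: Vec<u32>,    // packed entity ids
        data: Vec<T>,       // packed components, parallel to `dense`
        group_len: usize,   // the first `group_len` entries belong to the group
    }

    impl<T> Pool<T> {
        // Called on every owned pool when `entity` now matches the whole group.
        fn swap_into_group(&mut self, entity: u32) {
            let idx = self.sparse[entity as usize];
            let target = self.group_len;
            self.dense.swap(idx, target);
            self.data.swap(idx, target);
            // fix up the sparse indices for both swapped entities
            self.sparse[self.dense[idx] as usize] = idx;
            self.sparse[self.dense[target] as usize] = target;
            self.group_len += 1;
        }
    }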

From what I can see, this appears to only hold true because the limitations on what owned groups can co-exist has caused you to structure your game within those limits - else performance would tank. The problem being that this encourages you to significantly under-utilise what an ECS might otherwise be able to give you.

You give examples of the rewritten version of minecraft as proof that grouped ECS designs can be workable with good hit rates on owned groups. But if minecraft has been designed with groups as you have suggested, then it is massively under-utilising the potential benefits of ECS entity filtering and queries. That you can get good owned group hit rates on major loops at the cost of poorly specified loops and component designs does not demonstrate that you would have been worse off had the game been built with an archetypal ECS from the start instead.

Pipeline stalls from branch mispredictions will be totally overwhelmed by the stalls caused by a L2 cache miss. If you are accessing components out of order, then this will happen a lot. It will be very similar to the performance you would see from an archetypal ECS if every single entity had a unique layout (which is totally unrealistic), except it will reliably do this for any loop over an unowned group.

I was talking about the number of components and layout variations caused by enabling pathfinding on an entity, not AI in its entirety.

None of these state changes occur frequently enough that you see many in a single frame. That was the point I was trying to make. A single entity will be iterated over several dozen times each frame, and will usually last for many frames for each state transition that it goes through. Iteration performance therefore is massively more important than component add/remove. This is even more true if you are expressing transient events as entities, rather than as layout mutations.

(Michele) #8

Since I've been invited by a user of this community to comment, I'll do my best to add something valuable to the thread.
However, I'll try to avoid falling into the trap of "my solution is better than yours". I've run into this discussion several times and I don't think it's in any way constructive if put in these terms.


If I can say something, without wanting to offend anyone: from what I read it seems that @Ayfid doesn't know well how a full-featured system based on sparse sets and grouping functionalities works (or has used poorly refined implementations at least), since many of the comments or criticisms are quite odd and easy to reject.
On the other hand, I’m also pretty sure that I would make the same mistakes if I spoke of worst or best cases in the use of the model that is more convenient for him, for the exact same reasons.
Probably the same is happening to @OvermindDL1 but I can’t spot his errors because of a lack of experience with Legion.

Instead, it was very interesting to read how, according to each of you, one solution is better than the other just because you’re used to doing things in one way rather than another. This is precisely what makes them both good enough.
Let's make a very stupid example. Imagine I'm used to using my component channel (the registry in EnTT terminology) to drive most of the functionalities, or at least those for which it makes sense to an extent. Let's say one of them is a messaging layer. Therefore I can trigger, consume and destroy tons of messages (from zero to thousands) during any tick (it depends on the application ofc) and they can be attached to any element, as well as to standalone entities.
It would be annoying to do that with a solution that moves all components around every time, right? I should use only standalone entities at least, then find a way to refer to other elements. Does that make that model worse than the other? No, probably the guy reading this message right now just dislikes the idea of using entities and components for messaging, so who cares.

The best solution is the one that fits better your mental patterns and coding style.
@Ayfid is clearly more oriented to almost-fixed archetypes that don't change much during execution. Definitely not a messaging system designed in terms of entities and components. I can see it as an approach; not the one I prefer, but surely a widely used one, and there is nothing wrong with it.
I’m more similar to @OvermindDL1 instead and I like to use entities and components in a more fancy way.
It’s funny to see how you’re trying to convince each other that your way to do things is the right one or rather the only one. It is not, that’s all. :slight_smile:
From what I see you’re both defending a solution and a position with very subjective arguments, based on your (immensely larger than mine) experience and really valid in my opinion.
The basic mistake is in trying to explain why one solution is better than the other, which you’re persistently trying to do. It would be enough to recognize that both approaches are excellent and have different pros and cons but neither is really perfect.

For example, does one or the other allow you to sort elements? Yes. Good, so that one is better? No; unless you need or plan to use that feature and know how to do it without hurting yourself, this is completely irrelevant.
What about memory allocation instead? In one case you've fixed-size chunks that are really easy to use when it comes to working with MT, but you cannot (easily) have a per-type allocation strategy that doesn't ruin the performance; the latter is what I prefer, even though it requires me to spend more effort on doing MT right. One is even better suited for use with virtual memory while the other can be tricky to adapt, but what if you don't plan to use it at all?
Should we then move the discussion on what’s the best allocation strategy? The neverending post. :smile:

Ok, let's talk about iteration times then. Using 3 components (I plan to consider a common case, not what happens when you iterate 10 types together; that's fine only for benchmarks probably) we appreciate some differences when we iterate 1M entities, at least from what I've seen in the C++ community when the different implementations have been compared. All of this from ad-hoc benchmarks that have nothing to do with real world cases.
Of course, a group performs better. Numbers and theory tell us so. There is nothing to add here. Put aside ad-hoc, pointless benchmarks, though: real world cases spread instances of the same type over different chunks and different archetypes. This introduces more jumps compared to iterating a contiguous array, but the difference is really irrelevant and we should take the numbers in the previous post for what they are.
Similarly, when you've a single free type and can constrain all the others, everything slows down a bit and the two solutions are comparable. It goes without saying that the results are the opposite when the number of unconstrained types grows. Most likely, the sum of the times will be the same at the end of the day, regardless of where you spent it.
What about random access instead? Straightforward in one case (sparse sets), trickier but easy to optimize and make comparable in terms of performance in the other.
All in all, the two solutions are in the same ballpark and it’s really unlikely iterations will ever be your bottleneck in a real world software when you reach these numbers. That’s why you’re debating on the number of components you have in a query, how many archetypes you’re going to create (will you have 100 or 100000 different patterns?) and when and how you can or cannot use subgroups or partial owning groups but this is… well, you know. :slight_smile:

So, construction and destruction of components? What about it? What can you tell us?
It makes me laugh when I read that it’s not a problem with archetypes because you can do batch creation. That’s true, batch creation can be slightly faster in this case but I do many more sporadic assignments and deletions in my software, so it can be a problem if they don’t perform well. Is it the case? I don’t think so.
Just recognize that the archetype model is way slower than any other solution when adding/removing components to/from entities because it moves everything around and accept it. If I know that a model has some issues, I can avoid them and do things differently. Just tell me about them, don’t try to minimize.
I’ve seen many people that just load a bunch of elements at fixed points in time and never construct/destroy entities and components for a while. Svelto.ECS is entirely designed on a similar concept. If you’re not one of those guys that want to use the component channel for example to dispatch messages, probably you’ll never have a problem from this. Creating a component is slower with archetypes but… come on! It doesn’t mean that it costs 3ms every instance, you’ve still to create or destroy tons of things every tick to appreciate the difference anyway. When we say slower, we mean sub-sub-sub-sub-…-sub ms for one component, it’s irrelevant!
Of course, if you plan to make a game and want to spawn and destroy thousands and thousands of entities and components every tick, probably an archetype based model isn’t the right solution or you should reconsider your design at least. So, again, just make it clear rather than saying we can optimize a lot and you can use batch creation and bla bla bla. It’s slower, that’s all, it’s not a problem but within certain limits, tell users what these limits are and that’s all.


If it seems to you that I dislike archetype based models from what I’ve said, continue to read, please! I’m going to amaze you! :smile:
It’s true, I don’t like that model because of the way I write code, I find that sparse sets and grouping functionalities are by far a better fit for me. The two solutions aren’t comparable, archetypes aren’t suitable for what I want. It’s me though, my personal opinion and this doesn’t mean that an archetype based model can’t be the right choice for everyone else.

That said, I find that archetypes have an outstanding feature that groups cannot offer any time soon.
The latter require the user to know how to define them, and therefore to know what he/she is doing, while the former have this exquisite approach for which we try to optimize everything a bit, since the user's access patterns are certainly also in there somewhere.
In other terms, sparse sets and grouping functionalities don't optimize anything for you; you must specify what your usage patterns are. Archetypes also optimize the patterns in which you aren't interested, and this has a cost you pay somewhere else, but for sure they also optimize your usage patterns for free, no need to tell them what they are.

This is not to be underestimated. You're discussing a solution for a general purpose engine. This feature is really valuable in this case. Not something I'd want (why on Earth would you optimize also what I don't want to use?) but probably something many people want.
I wouldn't even take it into consideration for an in-house engine or for a custom game, because I usually know what I'm doing, but it's a really nice-to-have feature for a general purpose tool and groups won't give you this. It means that groups aren't that (let me say) noob-friendly and this can be scary for newcomers.

My two cents.


Please, don’t take this comment as an invitation to revive the discussion and try to show that one thing is better than the other.
I find both solutions interesting and valid, I made my choice for reasons that I hope are clear and I don’t want to enter this type of battles again (unfortunately I’ve done it too many times). The fact that I made a choice doesn’t mean that I find the other approach stupid or shitty. It’s just that the model I picked up has some features that the other has not and those are more valuable for me.

If you've any question I am willing to reply, but I don't want to be drawn into a war of numbers, ifs and buts just because. :slight_smile:

(Thomas Gillen) #9

1.7us and 3.6us are such small timings that I would expect both of those execution times are dominated by one-time setup costs for each loop. I might expect that, except that actually the ~2us difference is quite similar to the expected overhead of jumping through ~20 archetypes. If these loops were pulling every component of all 10k entities, that ~2us overhead from archetype jumps would be approximately the same. That is an example of what I mean when I say that the performance loss from archetype fragmentation scales highly sub-linearly, and becomes less and less significant the more significant the loop becomes to the overall performance of the game.

Your perf results might in part be caused by legion not caching archetype search results for a loop. Although even then your tests don’t have that many archetypes for it to search through, so it shouldn’t account for that big a difference.

I have had some difficulty doing proper benchmarking and tuning because everything gets so aggressively inlined that it is not obvious where time is being spent. Of course, the inlining is intentional; removing some of the #[inline]s on the iterators hits performance significantly, but that means doing so also gives inaccurate profiler results.

(Thomas Gillen) #10

I think a major part of the disagreement here is that I don’t think wanting to read ~10 components is terribly unusual at all. 3 components is just about the smallest loop you would realistically encounter.

Judging by the group layouts @OvermindDL1 has suggested, it seems to me that the limitations on what kind of owned groups can co-exist has caused you to design your ECS code around grouping together related components into larger components, and avoiding making heavy usage of the ECS to select minimal loop workloads, and perhaps even requiring additional components in a loop so that the loop can access a sub-(or super-)group of another optimised loop.

This seems to be the core of why the performance cost to having a loop access non-owned group components concerns me so much. The primary concern when I decide how to divide up entity data into components is how those components capture each permutation of operations I am going to perform on those entities. The components are designed such that each loop will only access the data it needs, no more, and that the loops only iterate over the very minimal set of entities which actually need that behaviour. My examples of omitting the trigonometry calculations for transform matrices when the entity never rotates is but one of many such cases. Work that is never performed is far faster than a potential cache miss every dozen-to-hundred entities.

The issue here is that if you try and maximise this, you do inevitably end up with wanting many incompatible owned groups across many large performance-critical loops. When you provide examples of games which were specifically designed from the start under the limitations of what owned groups can be defined, and point out that those games achieve high group hit rates in their most important loops, this does not in any way show that you would not have been able to achieve similar or superior performance had the game been built from the start with an archetypal ECS.

The pros and cons of each would have resulted in different game code. An archetypal ECS would not rely heavily on adding or removing components from live entities, and would minimise the number of components attached to "event entities" (which are heavily used). On the other hand, it would lean much more heavily on being able to use finely grained selection criteria for every loop, resulting in less work performed overall and less memory being loaded by the CPU during iterations.

Aside from that, I am also confused as to why an archetypal ECS would have trouble with custom component storage. The only requirement is that components in an archetype can be iterated through in the same order as the other components also attached to the archetype. This is the same requirement as exists within owned components in a group (archetypes are sort-of-like implicit owned groups). Components are not stored interleaved with other component types - they each have their own separate storage. The main reason why I have not already implemented this in legion already is that the most ergonomic way for the user to configure this would be via specialization (by overriding the Component trait implementation on the component type), but specialization is still not available on stable rust.
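
To make that requirement concrete, here is a hedged sketch of what such a per-component storage hook could look like (the trait and names are hypothetical, not legion's actual API); the only contract is that the storage can be iterated in the same entity order as the rest of the archetype:

    // Hypothetical per-component storage abstraction.
    trait ComponentStorage<T> {
        fn push(&mut self, value: T);
        fn iter(&self) -> std::slice::Iter<'_, T>; // must yield in archetype entity order
    }

    // The default: a plain packed vector.
    struct VecStorage<T>(Vec<T>);

    impl<T> ComponentStorage<T> for VecStorage<T> {
        fn push(&mut self, value: T) { self.0.push(value) }
        fn iter(&self) -> std::slice::Iter<'_, T> { self.0.iter() }
    }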


I have been toying with implementing owned groups in legion. It would not be a “best of both worlds” though, as ordering within archetype slices of the component data would still need to be maintained, so insertions times of components which belong to a group would actually be significantly higher than they already are as it would need to shift all of the other components over instead of just swapping (although ungrouped components would not need to be moved as they are stored separately). On the other hand, it would allow you to optimise certain loops and achieve the same iteration performance for such loops as EnTT/EnRS gets, but still maintain legion’s current performance for unoptimised loops. I am not yet sure if this is worth it. There is no performance disadvantage relative to how things currently perform in the sense that groups are opt-in and have no impact on other entities/components/loops, but it would increase code complexity inside legion so it would only make sense to implement if enough benefit could be demonstrated.
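
The insertion-cost difference I mean, in the most stripped-down illustrative form (plain vectors standing in for component storage):

    // Keeping group order means an insert shifts every later component over: O(n).
    fn ordered_insert<T>(slice: &mut Vec<T>, index: usize, value: T) {
        slice.insert(index, value);
    }

    // An archetype that doesn't guarantee group order can just append: O(1) amortised.
    fn unordered_insert<T>(slice: &mut Vec<T>, value: T) {
        slice.push(value);
    }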

(Michele) #11

Wut? 10 components? 3 (sometimes 4) components is the largest number of types I have, not the minimum. I've literally tons of single type iterations!
Ironically, I had a call a while ago with a game company and they told me that using two components was almost a shame! :smile: They were always trying to do things with a single component iteration. Just to be clear, they aren't using EnTT nor a group based implementation. The call was exactly to discuss this topic.

The major part of the disagreement imho is that you think/behave as if your experience is that of everyone else, but it is not, and there are a lot of people out there that just do things differently, not necessarily wrongly. :slight_smile:

Aha. No, really no. I first design components the way I like, then write code the way I like, and rarely use groups in fact. Then, and only then, I profile and see where I can optimize things, if I've bottlenecks or for some reason want to optimize things.
I can assure you that I don’t design things in a specific way because I’m forced by a tool. Instead, I pick a tool because of how it fits with how I design things.

Tell me instead about when one finds oneself iterating the archetypes and then the arrays within the archetypes with this nasty nested loop, no matter how deeply buried it is in a codebase…
Or when one is forced to design things in such a way that creating or destroying many things in a tick isn’t a requirement because its model doesn’t support it very well.
Or when one tries to limit the number of component patterns because it’s known that fragmentation is the worst thing that can happen to you with archetypes.
And so on…

See, we can accuse each other of something like - you do things in this way because .
Imho this is wrong in both cases. You know the limitations of your tool and avoid its pitfalls, I do the same my side.
However, this doesn’t mean that we are forced. I’m not actually thinking to it while coding. I’m just doing things the way I like and it happens that the tool I’m using fits well with my mental patterns.

What I don’t understand is why you’re concerned with non-owning groups. I’m even tempted to remove them from my solution since I never use them, really, never.
There is no case in which you don’t have at least a type to use (to own) for your iterations if you designed correctly your systems (at least in my experience) and rarely I’ve more than one free type in a group. A single free type ruins the performance enough to make them comparable with (rather than better than) that of an archetype based solution, so we are still far from having a performance problem here.
You iterate again and again on this point but I can’t really see the reasons, put aside theoretical worst cases that exist for all systems anyway.

When you provide examples of games which were specifically designed from the start under the limitations of what you can do with archetypes (since they have several limitations as any other system), and point out that those games achieve high performance in their most important loops, this does not in any way show that you would not have been able to achieve similar or superior performance had the game been built from the start with sparse sets and grouping functionalities.

How does this differ from the original sentence? It’s just true in both cases, no matter what.
Even the big array model of entityx is better than both archetypes and groups in some cases and for 95% of the games out there is more than enough.

This is the issue you had probably, not the issue everybody has. Probably grouping functionalities fit my mental patterns better than yours, but I've never had this specific issue of many incompatible owned groups. Moreover, you keep speaking as if partially owned groups didn't exist and just completely ignore what their performance is and how they compare to what you like, how they enter the game and drastically change your assumptions.
This invalidates your entire analysis, still not more than what ignoring creation/destruction and random accesses already do for example.

This is utterly wrong and (again) just ignores more than a good half of the problem, bringing us to odd conclusions for which there isn’t a proper explanation in the sentence itself.
Don’t take offense but I don’t think you’ve ever tried hard to consider or really use a group based model.

A fully chunk based model à la Unity has a chunk that contains N entities and all their components. So, your allocation strategy is on a per-chunk basis. It doesn't matter what the given chunk contains.
Even if you used archetypes with internal separated arrays rather than chunks, you would have a fragmented model that is by far different from what you can do with a sparse set based solution where all components are managed together.
Neither better nor worse, just way different and with different possibilities. I like more the second but it’s a matter of tastes.
I don’t get what’s unclear though, the differences are pretty obvious here.

As you said, you’ve been toying with implementing something in a library designed for something else.
I don’t think your experience with archetypes is as valuable as that with groups to be honest and the same applies for me (at the opposite of course).

As far as I can see, we can continue to tell each other that you’re wrong, no, you’re wrong, trying to explain this with faulty motivations but I’ve already said what I think of this approach.
And you've just drawn me into exactly the kind of discussion I wanted to avoid, so please, just ignore this message and continue the fight with @OvermindDL1 :smile:

(Thomas Gillen) #12

If you are only looping through 3 or 4 components at most, then you certainly are not benefiting from what an ECS could potentially offer you. A single component iteration would be like looping through just position; there is nothing useful you can do with that. The game company which was frequently only iterating through single components most definitely was not used to using an ECS and was likely approaching the code from an OOP mindset. If you are only iterating through < 4 components, then your components are too broadly defined and you are going to be doing a lot more work in your loops than you could be doing otherwise.

Because they are inevitable if you try to leverage your ECS to minimise the amount of work your loops are doing. If you never do this, then you must not be taking full advantage of loop filters, or you are loading data and performing logic for entities which do not really need it.

I did not claim otherwise. You were the one to claim that archetypal models are slower, that this is somehow an irrefutable fact, and that minecraft somehow proves that. It doesn’t.

I have provided multiple examples where you can use tighter loop bounds with more granular components in order to perform substantially less work. Examples where trying to do this would result in incompatible owned groups. That you are used to and are comfortable with designing components which don’t take advantage of this does not mean that this performance potential does not exist.

Every example @OvermindDL1 has given for group layouts and component designs, every anecdote you have given, are all examples that you can, as I said, get high hit rates on owned groups if you are used to designing your code around group limitations. None of that means those layouts are optimal. None of those mean that you are not leaving potential performance on the table because you are used to designing within the constraints of the performance capabilities of your ECS design.

If you are frequently looping through single components, then you are with total certainty not doing data oriented design to its full potential. Rather than explaining how you could organise these examples in a grouped ECS in a way which gives both efficient linear iteration, while maintaining tight entity selection and without loading any unneeded data from oversized components… all I get in response is “nah you’re wrong”.

How is it wrong? I have already given concrete examples which demonstrate this earlier. Examples of multiple loops with incompatible groups, and @OvermindDL1’s solution only works for some of those loops, results in poor filtering and thus processing significantly more entities than needed, and requires the loops to load larger components which include data not needed by many loops.

You are not used to writing code which takes advantage of what you can do with an archetypal ECS because you don’t use an archetypal ECS. Just responding with “I don’t design my code to like that” (which is essentially the entirety of your post) is proving my point here.

As to insertion performance, I have been ignoring it because nobody has yet provided an example of how a typical game would perform enough insertions in a frame for it to be a performance issue. Insertions could be half the speed they already are in legion and the vast majority of games would not suffer for it.

I use components to indicate mutable state on entities all the time. But so few of these states actually change per-frame that it is literally never a performance issue.

The only situation where most games might try and insert so many entities that it causes a performance issue is when they are doing things like streaming in large sets of entities as they load map tiles or some such. This is one of the reasons legion chunks its storage, because it can perform extremely efficient world merges without risking a frame drop when the new entities are “activated”. You can run the entirety of your loading and initialisation logic on background threads, decoupled from the frame loop entirely, and then merge thousands of entities into the main world near instantly when ready. If you can do this, then the performance of inserting large numbers of entities in a single frame (which is the only case where insertion performance is important) is a total non-issue. It is not in the game’s hot path at all.
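
A rough sketch of that pattern (the channel and Vec below are just stand-ins; the final extend is a placeholder for whatever bulk world-merge operation the ECS provides, not a specific API):

    use std::sync::mpsc;
    use std::thread;

    fn main() {
        let (tx, rx) = mpsc::channel();

        // Background thread: build entities into a private staging "world",
        // completely decoupled from the frame loop.
        thread::spawn(move || {
            let staging: Vec<u64> = (0..10_000).collect(); // stands in for spawning entities
            tx.send(staging).unwrap();
        });

        // Frame loop: when the staging world is ready, splice it in at a frame boundary.
        let mut main_world: Vec<u64> = Vec::new();
        if let Ok(staging) = rx.recv() {
            main_world.extend(staging); // stands in for a world merge
        }
        println!("entities in main world: {}", main_world.len());
    }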

Whether or not you chunk storage as above is entirely orthogonal to using groups or archetypes, but it makes optimising for insertions much less important.

You stated that custom storage for each component is difficult in an archetypal model. It is not clear how that is true. It is no more difficult to have a custom storage implementation for each component type in an archetypal ECS than in a grouped ECS. Literally the only requirement is that it is possible to iterate the storage in entity order - which is the same requirement that grouped ECS needs for the components that are inside the owned group portion of their components. I am not sure why you bring up chunking and fragmentation issues in relation to this.

(Michele) #13

If your common case is a query with 8 or more types and 2/3 types are the exceptions, you’re either exaggerating with granularity or splitting your data in a very wrong way.

Similarly, if you can't imagine a single type iteration, you're just confirming that you certainly don't know what benefits an ECS could potentially offer you.

Indeed no. The right approach here would have been to ask: why do they do that, and how do they get all the data they need during a loop? There is an answer, but since you didn't ask and assumed the guys from this company are stupid, I assume you aren't interested in why they were doing this and the reasons for which it could make sense in their design.

Again, the strangeness is where the common case is having queries with 8+ types, not the opposite.
This is the typical error of those that make their first attempt with splitting their data in components and exaggerate with granularity just in case.

First of all, I didn’t say anything about Minecraft since I signed an NDA with Mojang and I can’t tell anything about how they used EnTT. So I’m pretty sure it wasn’t me.

That said, I invite you to re-read what I wrote. I said that an archetype based model is slower at iterating stuff if compared to a group and the numbers tell this to us if we exclude pointless ad hoc benchmarks.
Let's take a real world example and a single component to keep things simple. It's likely that the instances of this component are spread over two or more archetypes by construction (because we aren't considering a pointless ad hoc benchmark where we create only one archetype to demonstrate how good it is), right? Ok, let's suppose it's split across two archetypes for simplicity. So, you've two packed arrays vs one, and the sum of the lengths of the former equals that of the latter. You've an extra jump somewhere in memory in the first case. So, it's slower (in theory, of course; a single cache miss isn't even appreciable, but I'm sure you get what I mean).
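
What that extra jump looks like in the most stripped-down form (illustrative only, nothing to do with any real implementation):

    // Same N components either way; the split just adds one jump to a second allocation.
    fn sum_one_array(all: &[f32]) -> f32 {
        all.iter().sum()
    }

    fn sum_two_slices(a: &[f32], b: &[f32]) -> f32 {
        a.iter().chain(b.iter()).sum()
    }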

Your answer to this point has been that you can rewrite a game based on groups with archetypes and obtain the same performance. True. I just replied that the opposite is the same and that from the beginning I’m saying that the two models are in the same ballpark and there doesn’t exist the best one. It’s just a matter of what feature you prefer since they offer a quite different set of them.

To sum up, you answered my point with a sentence that was completely unrelated, and when I tried to point that out I still obtained the same result, plus a mention of something I never said. Sooooo…

Yeah. I think I've already told you that the way you split stuff isn't the same as everyone else does, but let me underline this again.
Moreover, consider now you’ve two patterns:

  • With the former I use a full-owning group that has better performance than an archetype.
  • With the latter I use a partial owning group with a single free type that has the same performance of an archetype based iteration.

So, all in all I'm benefiting from the underlying model more, but since I'm not using only full-owning groups it's not enough for you. It's… odd, at least. However, again, your whole point is based on the assumption that partial-owning groups and nested groups don't exist (because of convenience in the discussion probably), and on this assumption you construct your conclusions, which are therefore all the way wrong.

Have I already asked you to tell me how you get around the countless limitations of an archetype model?
I guess so, but you forgot to answer apparently. And you forgot also to read what I wrote in my last comment as far as I can see, otherwise this comment wouldn't make sense.

If you're never looping through a single component or queries with 2/3 types, you've not even the minimal idea of what a data oriented design can offer you, trust me.
However, you want me to explain what exactly? How I don't load unneeded data from oversized components? Sure, how can I do it? I'm willing to do that. Can I make you sign an NDA and show my code if the company agrees? I don't get this exactly, but I appreciate the fact that your sentence is based on the idea that @OvermindDL1 and I are blatantly stupid human beings and you're the only one that does things right. It's not even offending. :smile:
In fact, I also appreciate the fact that you don’t just say nah you’re wrong and instead use a more elegant if you do this you don’t understand how things work. I don’t get exactly the difference but probably it’s my difficulty with the language.

I've already told you why your example and conclusions were all the way wrong, but you keep ignoring my answers and saying that I didn't answer, so…

Again!? :astonished: Larger components? Unneeded data? I’m still trying to figure out how you deduced this from my words since it seems just an assumption made to justify the fact that you’re ignoring parts of the group based solution, their performance and how they enter the game, how they sum up and what the final result is.

You are not used to writing code which takes advantage of what you can do with a group based ECS because you don't use a group based ECS. Just ignoring large parts of the solution for your own advantage in the discussion, and saying that you've only been toying with the model (which is what you said explicitly), is proving my point here.

Sure, join the EnTT channel. There is a guy that is working on a game with thousands of creations/destructions per tick. I like the fact that if you've never worked on something, then it doesn't exist in the world from your point of view. :smile:
Insertion performance is one of the problems of an archetype based model. That's a fact, what about just admitting it? This model has many problems, as do all the others, groups included. It's not a shame to recognize them. Instead, it shows that you fully understand the system you're using, and probably this isn't the case.

There are millions of developers out there with tons of other software. You know that a single example, in particular one of yours that is slightly biased towards a particular problem, isn't even eligible to be called valid statistics, right?

The world is bigger than you think then, other than full of people, situations and examples you haven't considered.
I'm pretty sure you're going to say that anyone who creates or destroys many entities in a tick is not using an ECS properly, or something similar. This seems to be the answer to all cases that differ slightly from what you do, but still, shit happens and someone does it, it's a shame.

No. I said that in a chunk based archetype model you've a per-chunk allocation strategy, and a per-type allocation strategy isn't that easy or meaningful at all. In an array based archetype model you can have per-type allocation strategies, but the fragmentation doesn't make them particularly refined or interesting and it's still worth using other strategies.
I’ve no idea what model Legion implements though, so I’m not talking about it in particular.

(Kae) #14

As a point of reference, Unity’s DOTS transform system splits things very granularly because the chunked archetype model encourages it: https://docs.unity3d.com/Packages/com.unity.entities@0.0/manual/transform_system.html

(Thomas Gillen) #15

Wrong how? Why is that wrong? It is because you end up with too many groups to try and keep optimised, perhaps?

No, the ideal ECS layout would be to put every single field in its own component, and then have every loop pull out only precisely what data it needs and iterate only over precisely the entities which actually have need for all off those fields. Your loop only needs x position, well only pull out that field. You need all 3 components? Fetch all 3.

That we don’t take things to that extreme is due to limitations in the efficiency of the ECS, and at the very extreme the extra code (and mental) complexity is hitting significant diminishing returns.

Really, every argument you make here is because you are not used to writing code which takes data oriented design to such an extreme - but it is quite practical in an archetypal ECS and encouraged.

Sorry, I was getting some of what you said and some of what @OvermindDL1 mixed up.

And that doesn't mean it is actually slower. Yes, if you run exactly the same code in both cases, the archetype will have one extra cache miss per archetype in the result set, and run slower. But you would not write the same code in both cases. The code written with an archetypal ECS is highly likely to do less work due to the freedom to use much more granular components and selection without worrying about hitting the slow path. The performance degradation from archetypal fragmentation is much less significant than you seem to believe, and the amount of work that can be saved in many cases is huge. @kabergstrom brought up how Unity use their ECS to calculate transformations; this module is one of the most data-heavy loops in a game and the granularity with which they dispatch work saves massive effort. Have a look through that code and you will see that within that module alone there are multiple large loops which access incompatible groups of components. All of those components are also ones which are global, i.e. they are accessed all over the place by other game systems in combination with a great variety of other components. It is impossible to optimise this code with groups. If you tried to write this code (especially together with other code also accessing those components) it would perform terribly in EnTT and you would need to totally re-think how it works. But this code actually performs excellently in an archetypal ECS.

As I mentioned, this is based on the proposed solutions to these issues given by @OvermindDL1. That and that you appear to expect a very small number of components accessed in each loop, which is entirely in line with what Overmind suggested.

The only code you might write in a grouped ECS which would be highly inadvisable in an archetypal ECS is extremely high frequency component addition and removal. I have not only not “ignored” that, but I have addressed it multiple times in every post I have made in this thread.

I did not say that there is no such thing as a game which does this. I said nobody has given an example of a typical game which requires very fast insertion or removal. I have also acknowledged that grouped ECS can perform insertions more quickly. Multiple times. I was, in fact, the first person here to point out this weakness in the OP. I am not sure how you could miss this. My complaint is that it doesn’t matter enough to be a problem. Not to mention that if you are performing a huge number of allocations in one frame, you can do them in another world asynchronously and then merge it - totally side stepping the issue.

Given that we are discussing which design makes more sense for a general purpose engine, being able to point out esoteric games doesn’t much prove the point.

(Michele) #16

Yup, the size of the chunks in Unity encourages it at all levels afaik. Chunks are pretty small so that they fit well, if I remember right, and larger components would somewhat defeat the purpose otherwise.
You're forced (to use a trending word here) to define your components in a very specific way rather than in the way you'd like, so as to match the design of the underlying tool. It's probably not mandatory of course, but encouraged, as you correctly pointed out.

I understood from another comment that some people don’t think it’s a good thing when the tool forces you to do things in a specific way.
This proves quite the opposite. The tool is designed in such a way that it gives you the best results if you follow some rules.
It makes perfect sense, and the reasons they did it are known.

Btw, things like position, translation, rotation and so on aren’t the best example of what splitting something means.
There is (more or less) a consensus on this group of components, with some variations on the topic, but those aren’t that important. It’s usually far more interesting to look at how people split things in higher-level systems.

(Thomas Gillen) #17

No, this case performs much worse. That free type will be accessed out of order, which is far slower than anything you ever actually see with archetypes.
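For reference, the kind of group I mean looks like this in EnTT (the component types are placeholders): the owned component is iterated linearly, while the free type behind entt::get is looked up through its sparse set for each entity, so those accesses can land anywhere in its array:

```cpp
#include <entt/entt.hpp>

// Placeholder component types for the example.
struct Position { float x, y, z; };
struct Velocity { float x, y, z; };

void move(entt::registry &registry) {
    // Partial-owning group: Position is owned (packed and walked in order),
    // Velocity is a "free" type fetched per entity, so its accesses are not
    // guaranteed to be sequential.
    auto group = registry.group<Position>(entt::get<Velocity>);
    group.each([](Position &p, Velocity &v) {
        p.x += v.x;
        p.y += v.y;
        p.z += v.z;
    });
}
```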

I did not assume they were stupid. I assumed they were adapting to a new paradigm. Given that ECS is still barely used in the industry outside of hobbyists and some indie games, and that they were presumably asking you for advice on how to use ECS, it seems totally reasonable for me to assume, in the absence of any other information, that they were likely adapting from OOP.

I don’t really need to know why they were doing this. You said they thought even two components was unusual, so they must have been doing it for almost everything. There is simply no way that they were using an ECS correctly if most of their loops were over one component. Loops over one component can only do something useful if that one component really contains the data of multiple components… which is something a grouped ECS appears to encourage.

(Thomas Gillen) #18

That’s not true at all. If they had instead rolled position, orientation and scale all into one component, and done similar for a few of the others, like you might do with EnTT, it would still perform pretty close to EnTT. The issue there would be that they would now be performing the full workload for most entities, rather than only the bare minimum for the functionality each entity needs. That would far outweigh the cost of cache misses for each archetype.
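A small sketch of the difference (the component names are made up): with the fat component, every entity carries and pays for the whole transform, whereas with split components a loop only ever visits the entities that actually carry the data it updates:

```cpp
#include <entt/entt.hpp>

// The "fat" variant: every entity with a transform carries all of this, and
// any update over it touches all of it, whether or not the entity uses it.
struct FatTransform { float position[3]; float rotation[4]; float scale[3]; };

// The granular variant: an entity only carries the pieces it actually uses.
struct Translation { float x, y, z; };
struct Rotation    { float x, y, z, w; };
struct Scale       { float x, y, z; };

void update_rotations(entt::registry &registry) {
    // Only entities that actually have a Rotation are visited here; an entity
    // that only ever translates never shows up in this loop at all.
    registry.view<Rotation>().each([](Rotation &r) { /* integrate spin, etc. */ });
}
```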

It might be more interesting, but Unity’s transform code demonstrates that you can extract great benefit from decomposing at this level. There is little reason not to do it with archetypes.

The ideal ECS would allow you to decompose to whatever level you need to exactly match the data needs of your program logic, and then serve you precisely that data as efficiently as possible. You can’t properly take advantage of this when you are using groups, because they cannot efficiently serve the number of loop permutations this creates. You wouldn’t write code that took it this far with a grouped ECS because it would be slow, but you absolutely do write this style of code with an archetypal ECS because it is fast.

(Michele) #19

Ahah. No, you’re a bit obsessed with these groups. I can also hold opinions that abstract away from the specific implementation. Can’t you?

Just to be clear, extreme granularity isn’t good even for archetypes if it leads to an explosion of combinations, because that is the worst thing for the fragmentation problem.
Whether this is the case mostly depends on a lot of factors, though, and of course none of them is considered by your toy example with position and rotation.

No, really no. This is the ideal (meh) when it comes to talking about fully data oriented design, not ECS. You’re mixing things up. They aren’t the same beast; I’m surprised you said that.
To be honest, the fact that it’s the ideal is arguable even in this case, since it’s the ideal only in theory, while in practice it’s good for some specific use cases that you certainly know, even though you chose to generalize.

Btw, if the vast majority of your queries involve 8+ types, you’re definitely exaggerating with granularity. If you’ve never written a query that involves a single type or only 2/3 of them, you’re not using the full potential of an ECS.
This is a fact. Why? C’mon. Either you’re provoking me or you’re pretending not to know, now that you’ve written it.
Otherwise I’d wonder whether you have ever written a system other than one that updates the position and rotation of an entity, because I can’t really believe you want to convince us that queries with fewer than 8 types are wrong or uncommon. This doesn’t make any sense. Even the link from @kabergstrom contains a lot of examples of these queries, and almost all the publicly available examples of component-based software make large use of queries with fewer than 6/7/8 types. They are by far more common than those with more types, especially in high-level systems.

Fewer types in a query doesn’t mean more data loaded, though.
If you think so (admittedly, I think you’re just pretending not to know this isn’t the case, because we’ve reached a point of no return on this), you probably have no experience with this kind of tool, and it would have been better to say so from the beginning.
However, since I don’t want to make the same assumption you do, and thus I don’t want to treat you as if you don’t know what you’re talking about, I accept the fact that you’ll never admit that queries with 2/3 types which load only the data required for the particular task are extremely common, and that we are just at a point in the discussion where we cannot really say it.

So we can agree to disagree (:wink: :wink:) here and let it be.

Really, you can’t justify why 8+ types are good (did you do that? Oops, no, curious), but you keep being insulting and pretending that others have no experience or that queries with 1/2/3 types are extremely uncommon.
I also gave an example, but you completely missed it and can’t even see that you have it at hand.

This is an extremely unpleasant approach, and one could argue that this insulting technique is meant to hide the fact that you’re not used to properly designing components at all, so that merely saying others are not used to writing code which takes data oriented design to such an extreme proves you right in your own mind.
Moreover, stop mixing up ECS and data oriented design as if they were the same thing. That says a lot more than you think.

Do you see how you contradict yourself?
So, if we get something that works well with groups and does not with archetypes, that’s because it’s not meant for archetypes and the example is wrong.
However, if we get something that isn’t meant for groups and is designed around archetypes, then the example shows exactly your point. Chapeau. :smile: :clap:

Assumptions. Random sentences and hypotheses (in one case you’ll hit a slow path for <put here your reason, I forgot to mention it, oops>). Conclusions without evidence. A “likely” put in the middle of the sentence to make it clear that you’re just guessing, and I won’t even get into the fact that you underestimate the impact of fragmentation as usual (I also remember a thread on the Unity forum from a guy who hit this problem a while ago, but yeah, we are overestimating it, it’s not real).
The funniest thing, though, is that the only example (if we can call it an example) you’ve brought us to prove all this so far is that you (who admitted, and demonstrated, having little to zero experience working with groups) tried something a while ago and one of the two models gave better results.
So, could I claim the opposite and stand here pretending that I’m right as well, without evidence? No thanks. It’s as funny as it gets that you’re still hammering on this point after 15 comments.

Is that the only one? Then you should go a bit deeper into the theory of these models, but this only proves that (as you correctly admitted) you didn’t try hard to explore both of them.
I’m sure you can’t even imagine a case in which the entityx model beats both groups and archetypes, right? Surely it doesn’t exist in your book.

Setting aside the fact that a game isn’t esoteric just because it doesn’t fit the model you like or doesn’t resemble the projects you’ve worked on so far (but I’m accustomed to your insulting attitude at this point; I’m just sad that this time it isn’t directed at me), we finally agree on something.
Have you read what I wrote a few posts above? Probably not, otherwise you wouldn’t have spent so many words to prove nothing. Let me quote it:

I wouldn’t use a system that tries to optimize everything up to all that stuff for which I don’t want to spend my cpu cycles, a model that doesn’t let me decide what and how I want to optimize things, a solution that makes creation/deletion a problem, that suffers from fragmentation, that forces me to define component types in a specific way to match its chunk size, a design that is all against the allocation strategies I’d use

BUT (emphasis on BUT)

For a general purpose engine, imho archetypes are definitely a good fit. Probably better than any other.

Not because of their performance or all the fancy features you tried to prove with no evidence beyond your own experience, which never resulted in explicit numbers or lines of code to look into.
They are a good fit because they don’t require any particular experience on the user’s side and are in the same ballpark as the other solution. Groups are not for everybody (you proved this, if I understood what you told us about your experience correctly) and require users to know what they want and what they’re doing.
For newcomers, this isn’t the best thing ever. Quite the opposite, in fact. Archetypes are a better fit here.
It’s true that newcomers probably won’t have strict performance requirements that justify any of these solutions, but still…

(Michele) #20

Dude, really, I’ve been involved in a large benchmarking session on this exact point. You’re just guessing about a topic you don’t know here. Stop pretending you know how things work without ever trying them.

First of all, they weren’t using a grouped ECS. More importantly, try to open your mind.
I can easily write an application that loops only over single components and uses the exact same types you would use in yours.
It has nothing to do with groups; it has to do with your lack of experience on this topic, the same lack the industry suffers from, as you correctly pointed out. The fact that you can’t imagine how to do this doesn’t mean the only solution is the one in which you’d put everything in a fat type.
Again, you’re assuming others here are stupid, and they were not. Just… odd, I grant you that.

Are you curious now? :smile: