The RCU-barrier menagerie [LWN.net]

November 12, 2013

This article was contributed by Paul E. McKenney, Mathieu Desnoyers, Lai Jiangshan, and Josh Triplett

There are many issues with excessive use of memory barriers, one of which is their overhead. It turns out that RCU can be substituted for memory barriers in a great many cases, and this substitution allows one of the memory barriers of the pair to be removed. Unfortunately, in most cases, the other memory barrier must be replaced by an expensive grace-period-wait primitive such as synchronize_rcu(), but there are a few special cases where synchronize_rcu() is not needed.

The key point behind this first substitution is the fundamental law of RCU, namely that RCU read-side critical sections cannot span grace periods. As a result, if any statement in a given RCU read-side critical section precedes the start of any grace period, then the entire critical section must precede the end of that same grace period, as shown in the diagram below, where synchronize_rcu() provides the grace period:

$ sudo subscribe today
Subscribe today and elevate your LWN privileges. You’ll have access to all of LWN’s high-quality articles as soon as they’re published, and help support LWN in the process. Act now and you can start with a free trial subscription.

More specifically, given X and Y initially zero, recalling that CAO() is shorthand for CMM_ACCESS_ONCE().

This corresponds to the following code fragment:

CPU 0                    CPU 1

rcu_read_lock();         CAO(Y) = 1;
r1 = CAO(X);             synchronize_rcu();
r2 = CAO(Y);             CAO(X) = 1;
rcu_read_unlock();

In contrast with scenarios involving only memory barriers, the same guarantee is provided even if CPU 0's reads are reversed:

CPU 0                    CPU 1

rcu_read_lock();         CAO(Y) = 1;
r2 = CAO(Y);             synchronize_rcu();
r1 = CAO(X);             CAO(X) = 1;
rcu_read_unlock();

Although RCU is getting the same result as do memory barriers, this example shows that the mechanism differs subtly. Either way, if r1 == 1, then it must also be the case that r2 == 1.

The RCU guarantee can be reversed: If any statement in a given RCU read-side critical section follows the end of any grace period, then the entire critical section must follow the start of that same grace period. For example, referring to the previous diagram, if r2 == 1 (so that a portion of the RCU read-side critical section follows the end of the grace period), then RCU guarantees that r1 == 1 (so that all of the RCU read-side critical section follows the beginning of the grace period). This is illustrated by the diagram on the right.

Therefore, although RCU operates in a very different manner than do memory barriers, it can produce very similar results. This can be seen in the menagerie below.

The menagerie

The following table summarizes the two-variable RCU scenarios. In all cases, the initial values of both x and y are zero. The values of the variables in the assertions are their final values, taken after all execution has ceased and the values from all assignment statements have propagated throughout the entire system. Although CPU 0 and CPU 1 are the main actors in these scenarios, some scenarios require additional stores, with Scenario 0 being the most extreme case.

Each access to x and y must have an CMM_ACCESS_ONCE() to prevent the compiler from playing nasty games. As a service to those with finite displays, we substitute CAO() via the following definition:

    #define CAO(a) CMM_ACCESS_ONCE(a)

Scenario	CPU 0	CPU 1	CPU 2	CPU 3	`BUG_ON()` Expression
0-RCU	`rcu_read_lock(); r1 = CAO(x); r2 = CAO(y); rcu_read_unlock()`	`r3 = CAO(y); synchronize_rcu(); r4 = CAO(x);`	`CAO(x) = 1;`	`CAO(y) = 1;`	`r1 == 1 && r2 == 0 && r3 == 1 && r4 == 0`	IRIW (mostly academic)
1-RCU	`rcu_read_lock(); r1 = CAO(x); r2 = CAO(y); rcu_read_unlock()`	`r3 = CAO(y); synchronize_rcu(); CAO(x) = 1;`		`CAO(y) = 1;`	`r1 == 1 && r2 == 0 && r3 == 1`
2-RCU	`rcu_read_lock(); r1 = CAO(x); r2 = CAO(y); rcu_read_unlock();`	`CAO(y) = 1; synchronize_rcu(); r4 = CAO(x);`	`CAO(x) = 1;`		`r1 == 1 && r2 == 0 && r4 == 0`
3-RCU	`rcu_read_lock(); r1 = CAO(x); r2 = CAO(y); rcu_read_unlock();`	`CAO(y) = 1; synchronize_rcu(); CAO(x) = 1;`			`r1 == 1 && r2 == 0`	Message (`y`) and flag (`x`)

4-RCU	`rcu_read_lock(); r1 = CAO(x); CAO(y) = 1; rcu_read_unlock();`	`r3 = CAO(y); synchronize_rcu(); r4 = CAO(x);`	`CAO(x) = 1;`		`r1 == 1 && r3 == 1 && r4 == 0`
5-RCU	`rcu_read_lock(); r1 = CAO(x); CAO(y) = 1; rcu_read_unlock();`	`r3 = CAO(y); synchronize_rcu(); CAO(x) = 1;`			`r1 == 1 && r3 == 1`	Inverse Dekker
6-RCU	`rcu_read_lock(); r1 = CAO(x); CAO(y) = 1; rcu_read_unlock();`	`CAO(y) = 2; synchronize_rcu(); r4 = CAO(x);`	`CAO(x) = 1;`		`y == 2 && r1 == 1 && r4 == 0`
7-RCU	`rcu_read_lock(); r1 = CAO(x); CAO(y) = 1; rcu_read_unlock();`	`CAO(y) = 2; synchronize_rcu(); CAO(x) = 1;`			`y == 2 && r1 == 1`

8-RCU	`rcu_read_lock(); CAO(x) = 1; r2 = CAO(y); rcu_read_unlock();`	`r3 = CAO(y); synchronize_rcu(); r4 = CAO(x);`		`CAO(y) = 1;`	`r2 == 0 && r3 == 1 && r4 == 0`
9-RCU	`rcu_read_lock(); CAO(x) = 2; r2 = CAO(y); rcu_read_unlock();`	`r3 = CAO(y); synchronize_rcu(); CAO(x) = 1;`		`CAO(y) = 1;`	`x == 2 && r2 == 0 && r3 == 1`
10-RCU	`rcu_read_lock(); CAO(x) = 1; r2 = CAO(y); rcu_read_unlock();`	`CAO(y) = 1; synchronize_rcu(); r4 = CAO(x);`			`r2 == 0 && r4 == 0`	Dekker
11-RCU	`rcu_read_lock(); CAO(x) = 2; r2 = CAO(y); rcu_read_unlock();`	`CAO(y) = 1; synchronize_rcu(); CAO(x) = 1;`			`x == 2 && r2 == 0`

12-RCU	`rcu_read_lock(); CAO(x) = 1; CAO(y) = 1; rcu_read_unlock();`	`r3 = CAO(y); synchronize_rcu(); r4 = CAO(x);`			`r3 == 1 && r4 == 0`	Message (x) and flag (y)
13-RCU	`rcu_read_lock(); CAO(x) = 2; CAO(y) = 1; rcu_read_unlock();`	`r3 = CAO(y); synchronize_rcu(); CAO(x) = 1;`			`x == 2 && r3 == 1`
14-RCU	`rcu_read_lock(); CAO(x) = 1; CAO(y) = 1; rcu_read_unlock();`	`CAO(y) = 2; synchronize_rcu(); r4 = CAO(x);`			`y == 2 && r4 == 0`
15-RCU	`rcu_read_lock(); CAO(x) = 1; CAO(y) = 2; rcu_read_unlock();`	`CAO(y) = 1; synchronize_rcu(); CAO(x) = 2;`			`x == 1 && y == 1`

It is important to note that although synchronize_rcu() provides transitivity, RCU read-side critical sections do not. To see this, consider the following example:

CPU 0                    CPU 1                    CPU 2

rcu_read_lock();         rcu_read_lock();         r3 = z;
x = 1;                   r1 = y;                  synchronize_rcu();
y = 1;                   z = 1;                   r4 = x;
rcu_read_unlock();       rcu_read_unlock();

One might hope that r1 == 1 && r3 == 1 would imply that r4 == 1, but this is not guaranteed. The reason for this is that although r3 == 1 guarantees that CPU 1's z = 1 happened before the beginning of CPU 2's grace period, the fundamental law of RCU only guarantees that r1 = y happened before the end of CPU 2's grace period, This in turn means that CPU 0's y = 1 must have happened before the end of CPU 2's grace period, but that says nothing about when CPU 0's x = 1 might have happened. Therefore, r1 == 1 && r3 == 1 && r4 == 0 is a perfectly legal (if counterintuitive) result of the preceding code fragment.

Be careful. RCU provides very lightweight readers, but it does so by providing correspondingly lightweight guarantees.

Quick Quiz 1: But what if I need transitivity so that r1 == 1 && r3 == 1 && r4 == 0 cannot happen? What can I do?
Answer
Quick Quiz 2: This approach is nice in that it gets rid of one set of memory barriers, but it turns the other set of memory barriers into horribly expensive synchronize_rcu() invocations. Can't we do better than this?
Answer

The lighter side of the menagerie

In his dissertation [PDF], Josh Triplett described a key rule for using RCU on linked data structures in which the paths traversed by readers form a directed acyclic graph. This rule assumes that a given reader can tolerate seeing no more than one change during its traversal, an assumption that holds in a great many RCU use cases. The rule is as follows:

When an update modifies data elements in the opposite order that readers traverse them, a memory barrier is required between each pair of modifications making up the update.
When an update modifies data elements in the same order that readers traverse them, a full grace period must elapse between each pair of modifications making up the update.

Insertion into an RCU-protected linked list falls under the first part of this rule. Readers first pick up a pointer, and only then access the referenced data element. In contrast, updaters initialize the data element, and only then store a pointer to it. Hence, updaters are updating in the opposite order from the readers' traversals, which is why rcu_assign_pointer() needs only a memory barrier.

In contrast, deletion from an RCU-protected linked list falls under the second part of this rule. Readers still pick up a pointer and only then access the referenced data element, but writers update the pointer first and reclaim the referenced data element afterward. This means that updaters are updating in the same order as the readers' traversals, which is why you must wait for a grace period between cds_list_del_rcu() and free().

However, each entry in the menagerie has CPUs 0 and 1 access x and y in opposite orders, which means that we should be able to use the first part of the above rule in at least some cases. Clearly the reading CPU needs to read a pointer as its first operation, and the writing CPU needs to write that same pointer as its second operation. Thus Scenarios 1, 3, 5, and 7 are possible candidates for lightweight synchronization.

The lightweight versions of these scenarios are shown in the following table, with the initial value of x being &z, and those of y and z being zero.

Scenario	CPU 0	CPU 1	CPU 3	`BUG_ON()` Expression
1-light	`rcu_read_lock(); r1 = rcu_dereference(x); r2 = CAO(*r1); rcu_read_unlock()`	`r3 = CAO(y); cmm_smp_mb(); rcu_assign_pointer(x, &y);`	`CAO(y) = 1;`	`r1 == &y && r2 == 0 && r3 == 1`
3-light	`rcu_read_lock(); r1 = rcu_dereference(x); r2 = *r1; rcu_read_unlock();`	`y = 1; rcu_assign_pointer(x, &y);`		`r1 == &y && r2 == 0`	Message (`y`) and message pointer (`x`)

5-light	`rcu_read_lock(); r1 = rcu_dereference(x); CAO(*r1) = 1; rcu_read_unlock();`	`r3 = CAO(y); cmm_smp_mb(); CAO(x) = &y;`		`r3 == 1`
7-light	`rcu_read_lock(); r1 = rcu_dereference(x); *r1 = 1; rcu_read_unlock();`	`y = 2; rcu_assign_pointer(x, &y);`		`y == 2 && r1 == &y`

Quick Quiz 3: It looks to me like the rcu_read_lock() and rcu_read_unlock() are not really needed. Why are they there?
Answer
Quick Quiz 4: Why is there no CAO() for the assignments in Scenarios 3-light and 7-light?
Answer

Scenarios 3-light and 7-light both correspond to the standard insertion into an RCU-protected linked list. Scenario 3-light is perhaps the most familiar, guaranteeing that the RCU reader will see the initialization corresponding to the pointer returned by rcu_dereference(). However, Scenario 7-light is equally important, as it guarantees that any RCU read-side update to a given data element won't get overwritten by that element's initialization.

The assertions for Scenarios 1-light and 3-light can be simplified by setting the initial value of z to the value one, which allows the r1 == &y term to be removed.

Quick Quiz 5: How can we remove the ugly r1 == &y term from the assertion for Scenario 7-light?
Answer

Scenario 1-light is less familiar, but if things continue as they have for the past 20 years, it will find serious use soon enough.

Conclusions

The RCU menagerie shows how RCU can reduce overheads of readers at the expense of that of updaters, while the -light menagerie shows that there are some special cases where RCU can reduce overheads of readers with no corresponding increase in update-side overhead.

Although use of RCU as a memory barrier is less common than the more conventional use as read-side protection for read-mostly data structures, it is starting to appear in the Linux kernel.

More information

Is Parallel Programming Hard, And, If So, What Can You Do About It? by Paul E. McKenney. See Chapter 8.
User-Level Implementations of Read-Copy Update [PDF] by Mathieu Desnoyers, Paul E. McKenney, Alan Stern, Michel R. Dagenais, and Jonathan Walpole.
Verifying concurrent memory reclamation algorithms with grace [PDF] by Alexey Gotsman, Noam Rinetzky, and Hongseok Yang.

Answers to Quick Quizzes

Quick Quiz 1: But what if I need transitivity so that r1 == 1 && r3 == 1 && r4 == 0 cannot happen? What can I do?

Answer: Add more grace periods, for example, as follows:

CPU 0                    CPU 1                    CPU 2

rcu_read_lock();         r1 = y;                  r3 = z;
x = 1;                   synchronize_rcu();       synchronize_rcu();
y = 1;                   z = 1;                   r4 = x;
rcu_read_unlock();

Here the fact that r3 == 1 means that CPU 1's z = 1 happened before the start of CPU 2's grace period, but CPU 1's synchronize_rcu() now guarantees that CPU 1's r1 = y also happened before the start of CPU 2's grace period. This in turn means that CPU 0's y = 1 happened before the start of CPU 2's grace period, which means that RCU guarantees that CPU 0's x = 1 happens before the end of CPU 2's grace period, thus guaranteeing that r4 == 1, as desired.

Additional guarantees come at the cost of additional overhead. Almost like real life!

Back to Quick Quiz 1.

Quick Quiz 2: This approach is nice in that it gets rid of one set of memory barriers, but it turns the other set of memory barriers into horribly expensive synchronize_rcu() invocations. Can't we do better than this?

Answer: Indeed we can! One approach is to use call_rcu(), thus getting rid of the asynchronous latency.

Another approach is to go back to the original RCU use case, which exploits dependency ordering, as discussed in the next section.

Back to Quick Quiz 2.

Quick Quiz 3: It looks to me like the rcu_read_lock() and rcu_read_unlock() are not really needed. Why are they there?

Answer: Indeed, the rcu_read_lock() and rcu_read_unlock() have nothing to do in these specific scenarios because there are no grace periods. However, grace periods would be needed if it was necessary to free() or otherwise reclaim or reuse z. Given the current unpopularity of memory leaks, such reclamation or reuse is almost always required, and thus so are the rcu_read_lock() and rcu_read_unlock().

Therefore, the rcu_read_lock() and rcu_read_unlock() will remain. Better safe than sorry, especially when the cost of safety is only the tiny overhead of a rcu_read_lock() and a rcu_read_unlock().

Back to Quick Quiz 3.

Quick Quiz 4: Why is there no CAO() for the assignments in Scenarios 3-light and 7-light?

Answer: They are not needed because there is no way that the assignments to and from y can execute concurrently. In contrast, in Scenario 1-light, CPU 3 could assign to y at any time, including when CPUs 0 and 1 are referencing it.

Scenario 5-light is a special case. Here, the cmm_smp_wmb() included in rcu_assign_pointer() is not sufficient to order the prior load from y against the subsequent store into x. Therefore, we open-code a full barrier and CMM_ACCESS_ONCE(). Thus, Scenario 5-light shows a guarantee that the normal RCU primitives do not make: When using rcu_assign_pointer(), a load prior to the rcu_assign_pointer() could see the effects of a store in a concurrent RCU read-side critical section, even if that critical section saw the subsequent store into the RCU-protected pointer.

Back to Quick Quiz 4.

Quick Quiz 5: How can we remove the ugly r1 == &y term from the assertion for Scenario 7-light?

Answer: We could replace it with z == 1, but only if the initial value of z remained zero.

However, simply removing the r1 == &y term fails because the resulting BUG_ON() condition would be y == 2, but that is a perfectly legitimate outcome.

Back to Quick Quiz 5.