The RCU-barrier menagerie
There are many issues with excessive use of memory barriers, one of which
is their overhead.
It turns out that RCU can be substituted for
memory barriers in a great many cases, and this substitution allows one
of the memory barriers of the pair to be removed.
Unfortunately, in most cases, the other memory barrier must be replaced
by an expensive grace-period-wait primitive such as
synchronize_rcu(),
but there are a few special cases where synchronize_rcu()
is not needed.
The key point behind this first substitution is the fundamental law
of RCU, namely that RCU read-side critical sections cannot span grace
periods.
As a result, if any statement in a given
RCU read-side critical section precedes the start of any grace period,
then the entire critical section must precede the end of that same
grace period, as shown in the diagram below, where
synchronize_rcu() provides the grace period:
$ sudo subscribe todaySubscribe today and elevate your LWN privileges. You’ll have access to all of LWN’s high-quality articles as soon as they’re published, and help support LWN in the process. Act now and you can start with a free trial subscription.
More specifically, given X and Y initially zero,
recalling that CAO() is shorthand for
CMM_ACCESS_ONCE().
This corresponds to the following code fragment:
CPU 0 CPU 1 rcu_read_lock(); CAO(Y) = 1; r1 = CAO(X); synchronize_rcu(); r2 = CAO(Y); CAO(X) = 1; rcu_read_unlock();
In contrast with scenarios involving only memory barriers, the same guarantee is provided even if CPU 0's reads are reversed:
CPU 0 CPU 1 rcu_read_lock(); CAO(Y) = 1; r2 = CAO(Y); synchronize_rcu(); r1 = CAO(X); CAO(X) = 1; rcu_read_unlock();
Although RCU is getting the same result as do memory barriers,
this example shows that the mechanism differs subtly.
Either way, if r1 == 1, then it must also be the case that
r2 == 1.
The RCU guarantee can be reversed: If any statement in a given
RCU read-side critical section follows the end of any grace period,
then the entire critical section must follow the start of that same
grace period.
For example, referring to the previous diagram, if r2 == 1
(so that a portion of the RCU read-side critical section follows
the end of the grace period),
then RCU guarantees that r1 == 1
(so that all of the RCU read-side critical section follows the beginning
of the grace period).
This is illustrated by the diagram on the right.
Therefore, although RCU operates in a very different manner than do memory barriers, it can produce very similar results. This can be seen in the menagerie below.
The menagerie
The following table summarizes the two-variable RCU scenarios.
In all cases, the initial values of both x and y
are zero.
The values of the variables in the assertions are their final values,
taken after all execution has ceased and the values from all assignment
statements have propagated throughout the entire system.
Although CPU 0 and CPU 1 are the main actors in these scenarios,
some scenarios require additional stores, with Scenario 0 being the
most extreme case.
Each access to x and y must have an
CMM_ACCESS_ONCE() to prevent the compiler from
playing
nasty games.
As a service to those with finite displays, we substitute CAO()
via the following definition:
#define CAO(a) CMM_ACCESS_ONCE(a)
| Scenario | CPU 0 | CPU 1 | CPU 2 | CPU 3 | BUG_ON() Expression | |
|---|---|---|---|---|---|---|
| 0-RCU | rcu_read_lock(); |
r3 = CAO(y); |
CAO(x) = 1; |
CAO(y) = 1; |
r1 == 1 && r2 == 0 && |
IRIW (mostly academic) |
| 1-RCU | rcu_read_lock(); |
r3 = CAO(y); |
CAO(y) = 1; |
r1 == 1 && r2 == 0 && |
||
| 2-RCU | rcu_read_lock(); |
CAO(y) = 1; |
CAO(x) = 1; |
r1 == 1 && r2 == 0 && |
||
| 3-RCU | rcu_read_lock(); |
CAO(y) = 1; |
r1 == 1 && r2 == 0 |
Message (y) and flag (x) | ||
| 4-RCU | rcu_read_lock(); |
r3 = CAO(y); |
CAO(x) = 1; |
r1 == 1 && r3 == 1 && |
||
| 5-RCU | rcu_read_lock(); |
r3 = CAO(y); |
r1 == 1 && r3 == 1 |
Inverse Dekker | ||
| 6-RCU | rcu_read_lock(); |
CAO(y) = 2; |
CAO(x) = 1; |
y == 2 && r1 == 1 && |
||
| 7-RCU | rcu_read_lock(); |
CAO(y) = 2; |
y == 2 && r1 == 1 |
|||
| 8-RCU | rcu_read_lock(); |
r3 = CAO(y); |
CAO(y) = 1; |
r2 == 0 && r3 == 1 && |
||
| 9-RCU | rcu_read_lock(); |
r3 = CAO(y); |
CAO(y) = 1; |
x == 2 && r2 == 0 && |
||
| 10-RCU | rcu_read_lock(); |
CAO(y) = 1; |
r2 == 0 && r4 == 0 |
Dekker | ||
| 11-RCU | rcu_read_lock(); |
CAO(y) = 1; |
x == 2 && r2 == 0 |
|||
| 12-RCU | rcu_read_lock(); |
r3 = CAO(y); |
r3 == 1 && r4 == 0 |
Message (x) and flag (y) | ||
| 13-RCU | rcu_read_lock(); |
r3 = CAO(y); |
x == 2 && r3 == 1 |
|||
| 14-RCU | rcu_read_lock(); |
CAO(y) = 2; |
y == 2 && r4 == 0 |
|||
| 15-RCU | rcu_read_lock(); |
CAO(y) = 1; |
x == 1 && y == 1 |
|||
It is important to note that although synchronize_rcu()
provides transitivity, RCU read-side critical sections do not.
To see this, consider the following example:
CPU 0 CPU 1 CPU 2 rcu_read_lock(); rcu_read_lock(); r3 = z; x = 1; r1 = y; synchronize_rcu(); y = 1; z = 1; r4 = x; rcu_read_unlock(); rcu_read_unlock();
One might hope that r1 == 1 && r3 == 1 would imply
that r4 == 1, but this is not guaranteed.
The reason for this is that although r3 == 1 guarantees
that CPU 1's z = 1 happened before the beginning
of CPU 2's grace period, the fundamental law of RCU only guarantees
that r1 = y happened before the end of
CPU 2's grace period,
This in turn means that CPU 0's y = 1 must have happened
before the end of CPU 2's grace period, but that says nothing about
when CPU 0's x = 1 might have happened.
Therefore, r1 == 1 && r3 == 1 && r4 == 0
is a perfectly legal (if counterintuitive) result of the preceding
code fragment.
Be careful. RCU provides very lightweight readers, but it does so by providing correspondingly lightweight guarantees.
Quick Quiz 1: But what if I need transitivity so thatr1 == 1 && r3 == 1 && r4 == 0cannot happen? What can I do?Quick Quiz 2: This approach is nice in that it gets rid of one set of memory barriers, but it turns the other set of memory barriers into horribly expensive
synchronize_rcu()invocations. Can't we do better than this?
The lighter side of the menagerie
In his dissertation [PDF], Josh Triplett described a key rule for using RCU on linked data structures in which the paths traversed by readers form a directed acyclic graph. This rule assumes that a given reader can tolerate seeing no more than one change during its traversal, an assumption that holds in a great many RCU use cases. The rule is as follows:
- When an update modifies data elements in the opposite order that
readers traverse them, a memory barrier is required
between each pair of modifications making up the update.
- When an update modifies data elements in the same order that readers traverse them, a full grace period must elapse between each pair of modifications making up the update.
Insertion into an RCU-protected linked list falls under the first
part of this rule.
Readers first pick up a pointer, and only then access the referenced
data element.
In contrast, updaters initialize the data element, and only then
store a pointer to it.
Hence, updaters are updating in the opposite order from the readers'
traversals, which is why rcu_assign_pointer() needs only
a memory barrier.
In contrast, deletion from an RCU-protected linked list falls under
the second part of this rule.
Readers still pick up a pointer and only then access the referenced
data element, but writers update the pointer first and reclaim
the referenced data element afterward.
This means that updaters are updating in the same order as the readers'
traversals, which is why you must wait for a grace period between
cds_list_del_rcu() and free().
However, each entry in the menagerie has CPUs 0 and 1
access x and y in opposite orders, which
means that we should be able to use the first part of the above rule
in at least some cases.
Clearly the reading CPU needs to read a pointer as its first operation,
and the writing CPU needs to write that same pointer as its second
operation.
Thus Scenarios 1, 3, 5, and 7 are possible candidates
for lightweight synchronization.
The lightweight versions of these scenarios are shown in the following
table, with the initial value of x being &z,
and those of y and z being zero.
| Scenario | CPU 0 | CPU 1 | CPU 2 | CPU 3 | BUG_ON() Expression | |
|---|---|---|---|---|---|---|
| 1-light | rcu_read_lock(); |
r3 = CAO(y); |
CAO(y) = 1; |
r1 == &y && r2 == 0 && |
||
| 3-light | rcu_read_lock(); |
y = 1; |
r1 == &y && r2 == 0 |
Message (y) andmessage pointer ( x) | ||
| 5-light | rcu_read_lock(); |
r3 = CAO(y); |
r3 == 1 |
|||
| 7-light | rcu_read_lock(); |
y = 2; |
y == 2 && r1 == &y |
|||
Quick Quiz 3: It looks to me like thercu_read_lock()andrcu_read_unlock()are not really needed. Why are they there?Quick Quiz 4: Why is there no
CAO()for the assignments in Scenarios 3-light and 7-light?
Scenarios 3-light and 7-light
both correspond to the standard
insertion into an RCU-protected linked list.
Scenario 3-light is perhaps the most familiar, guaranteeing that the RCU
reader will see the initialization corresponding to the pointer returned
by rcu_dereference().
However, Scenario 7-light is equally important,
as it guarantees that
any RCU read-side update to a given data element won't get overwritten
by that element's initialization.
The assertions for Scenarios 1-light and 3-light can be
simplified by setting the initial value of z to the value one,
which allows the r1 == &y term to be removed.
Quick Quiz 5:
How can we remove the ugly r1 == &y term from the
assertion for Scenario 7-light?
Scenario 1-light is less familiar, but if things continue as they have for the past 20 years, it will find serious use soon enough.
Conclusions
The RCU menagerie shows how RCU can reduce overheads of readers at the expense of that of updaters, while the -light menagerie shows that there are some special cases where RCU can reduce overheads of readers with no corresponding increase in update-side overhead.
Although use of RCU as a memory barrier is less common than the more conventional use as read-side protection for read-mostly data structures, it is starting to appear in the Linux kernel.
More information
-
Is Parallel Programming Hard, And, If So, What Can You Do About It?
by Paul E. McKenney. See Chapter 8.
-
User-Level Implementations of Read-Copy Update [PDF]
by Mathieu Desnoyers, Paul E. McKenney, Alan Stern,
Michel R. Dagenais, and Jonathan Walpole.
- Verifying concurrent memory reclamation algorithms with grace [PDF] by Alexey Gotsman, Noam Rinetzky, and Hongseok Yang.
Answers to Quick Quizzes
Quick Quiz 1:
But what if I need transitivity so that
r1 == 1 && r3 == 1 && r4 == 0 cannot happen?
What can I do?
Answer: Add more grace periods, for example, as follows:
CPU 0 CPU 1 CPU 2 rcu_read_lock(); r1 = y; r3 = z; x = 1; synchronize_rcu(); synchronize_rcu(); y = 1; z = 1; r4 = x; rcu_read_unlock();
Here the fact that r3 == 1 means that CPU 1's
z = 1 happened before the start of CPU 2's grace period, but
CPU 1's synchronize_rcu() now guarantees that
CPU 1's r1 = y also happened before the start of CPU 2's
grace period.
This in turn means that CPU 0's y = 1 happened before
the start of CPU 2's grace period, which means that RCU guarantees that
CPU 0's x = 1 happens before the end of CPU 2's
grace period, thus guaranteeing that r4 == 1, as desired.
Additional guarantees come at the cost of additional overhead. Almost like real life!
Quick Quiz 2:
This approach is nice in that it gets rid of one set of memory barriers,
but it turns the other set of memory barriers into horribly expensive
synchronize_rcu() invocations.
Can't we do better than this?
Answer:
Indeed we can!
One approach is to use call_rcu(), thus getting rid of
the asynchronous latency.
Another approach is to go back to the original RCU use case, which exploits dependency ordering, as discussed in the next section.
Quick Quiz 3:
It looks to me like the rcu_read_lock() and
rcu_read_unlock() are not really needed.
Why are they there?
Answer:
Indeed, the rcu_read_lock() and rcu_read_unlock()
have nothing to do in these specific scenarios because there are no
grace periods.
However, grace periods would be needed if it was necessary to
free() or otherwise reclaim or reuse z.
Given the current unpopularity of memory leaks, such reclamation or
reuse is almost always required, and thus so are the
rcu_read_lock() and rcu_read_unlock().
Therefore, the rcu_read_lock() and
rcu_read_unlock() will remain.
Better safe than sorry, especially when the cost of safety
is only the tiny overhead of a rcu_read_lock() and a
rcu_read_unlock().
Quick Quiz 4:
Why is there no CAO() for the assignments in
Scenarios 3-light and 7-light?
Answer:
They are not needed because there is no way that the assignments to and
from y can execute concurrently.
In contrast, in Scenario 1-light, CPU 3 could assign to
y at any time, including when CPUs 0 and 1
are referencing it.
Scenario 5-light is a special case.
Here, the cmm_smp_wmb() included in
rcu_assign_pointer() is not sufficient to order the prior
load from y against the subsequent store into x.
Therefore, we open-code a full barrier and CMM_ACCESS_ONCE().
Thus, Scenario 5-light shows a guarantee that the normal RCU
primitives do not make: When using rcu_assign_pointer(),
a load prior to the rcu_assign_pointer() could see the
effects of a store in a concurrent RCU read-side critical section,
even if that critical section saw the subsequent store into the
RCU-protected pointer.
Quick Quiz 5:
How can we remove the ugly r1 == &y term from the
assertion for Scenario 7-light?
Answer:
We could replace it with z == 1, but only if the initial
value of z remained zero.
However, simply removing the r1 == &y term fails
because the resulting BUG_ON() condition would be
y == 2, but that is a perfectly legitimate outcome.
