In this section it will be convenient to number matrix entries (or subblocks)
and processors from
to
instead of
to
.
On distributed memory machines the cost model is more complicated than on
shared memory machines, because we will need to worry about the
data layout, or how the matrices are
partitioned across the machine. This will determine both the amount
of parallelism and the cost of communication. Recall from the chapter
on Computer Architecture
that the cost of sending a message of
words
from one processor to another is
, where
is the start-up cost or latency, and
is the per-word cost,
or reciprocal of bandwidth. Therefore to assess the cost of an algorithm
we need to count the number of floating point operations, the number of
messages sent (at a cost of
per message), and the total length
of messages sent (at a cost of
per word).
We begin by showing how best to implement matrix multiplication without
regard to the layout's suitability for other matrix operations, and
return to the question of layouts in the next section.
The algorithm is due to Cannon [20] and is well suited for computers
laid out in a square
mesh, i.e. where each processor communicates
most efficiently with the four other processors immediately north,
east, south and west of itself. We also assume the processors
at the edges of the grid are directly connected to the processors on
the opposite edge; this makes the topology that of a two-dimensional
torus. Let
be partitioned into square subblocks
as before, with
stored on processor
. Let
and
be partitioned similarly. The algorithm is given below. It
is easily seen that whenever
and
`meet' in
processor
, they are multiplied and accumulated in
;
the products for the different
are accumulated in different
orders.
Figure 1
illustrates the functioning of this algorithm
for
.
.
View Figure
The time spent by this algorithm is computed as follows. Assume we multiply
-by-
matrices on an
-by-
processor mesh, where for simplicity
divides
evenly. Assuming messages are sent between nearest neighbors
only, and that a processor can only send a single message at a time, the total
number of messages sent (in parallel) is
, the total number of words
sent (in parallel) is
, and the total number of
flops (in parallel) is
.
(See exercises 3, 4, 5, and
6.)
A variation of this algorithm suitable for machines that are
efficient at spreading subblocks across rows (or down columns)
is to do this instead of the preshifting and rotation of
(or
)
[21].
Cannon's algorithm may also be easily adapted to a hypercube [3].
The simplest way is to embed a grid (or two-dimensional torus) in a hypercube,
i.e. map the processors in a grid to the processors in a hypercube,
and the connections in a grid to a subset of the connections in a hypercube
[23][22]. This approach (which is useful for more than matrix
multiplication) uses only a subset of the connections in a hypercube, which
makes the communication slower than it need be. Several
sophisticated improvements on this basic algorithm have been developed
[25][24], the latter of which fully utilizes the
available bandwidth of the hypercube to reduce the number of messages
sent by a factor of 2 and the number of words sent by a factor of nearly
. This assumes each processor in the hypercube can send messages
to all its neighbors simultaneously, as was the case on the CM-2.
If the architecture permits us to overlap communication and computation,
we can save up to another factor of two in speed.
In this section we have shown one can optimize matrix multiplication in a series of steps tuning it ever more highly for a particular computer architecture, until essentially every communication link and floating point unit is utilized. The algorithms are scalable, in that they continue to run efficiently on larger machines and larger problems, with communication costs becoming ever smaller with respect to computation. On the other hand, let us ask what we lose by optimizing so heavily for one architecture. Our high performance depends on the matrices having just the right dimensions, being laid out just right in memory, and leaving them in a scrambled final position at the end (although a modest amount of extra communication could repair this). It is unreasonable to expect users, who want to do several computations of which this is but one, to satisfy all these requirements. Therefore a practical algorithm will have to deal with many irregularities, and be quite complicated. Our ability to do this extreme optimization is limited to a few simple and regular problems like matrix multiplication on a hypercube, as well as other heavily used kernels like the BLAS, which have indeed been highly optimized for many architectures. We do not expect equal success for more complicated algorithms on all architectures of interest, at least within a reasonable amount of time. Also, the algorithm is highly tuned to a particular interconnection network topology, which may require redesign for another machine. In view of this, a number of recent machines try to make communication time appear as independent of topology as possible, so the user sees essentially a completely connected topology.