4.2 Matrix Multiplication on a Distributed Memory Machine

In this section it will be convenient to number matrix entries (or subblocks) and processors from

instead of

On distributed memory machines the cost model is more complicated than on shared memory machines, because we will need to worry about the data layout, or how the matrices are partitioned across the machine. This will determine both the amount of parallelism and the cost of communication. Recall from the chapter on Computer Architecture

that the cost of sending a message of

words from one processor to another is

, where

is the start-up cost or latency, and

is the per-word cost, or reciprocal of bandwidth. Therefore to assess the cost of an algorithm we need to count the number of floating point operations, the number of messages sent (at a cost of

per message), and the total length of messages sent (at a cost of

per word). We begin by showing how best to implement matrix multiplication without regard to the layout's suitability for other matrix operations, and return to the question of layouts in the next section.

The algorithm is due to Cannon [20] and is well suited for computers laid out in a square

mesh, i.e. where each processor communicates most efficiently with the four other processors immediately north, east, south and west of itself. We also assume the processors at the edges of the grid are directly connected to the processors on the opposite edge; this makes the topology that of a two-dimensional torus. Let

be partitioned into square subblocks

as before, with

stored on processor

. Let

and

be partitioned similarly. The algorithm is given below. It is easily seen that whenever

and

`meet' in processor

, they are multiplied and accumulated in

; the products for the different

are accumulated in different orders.

The time spent by this algorithm is computed as follows. Assume we multiply

-by-

matrices on an

-by-

processor mesh, where for simplicity

divides

evenly. Assuming messages are sent between nearest neighbors only, and that a processor can only send a single message at a time, the total number of messages sent (in parallel) is

, the total number of words sent (in parallel) is

, and the total number of flops (in parallel) is

A variation of this algorithm suitable for machines that are efficient at spreading subblocks across rows (or down columns) is to do this instead of the preshifting and rotation of

(or

) [21].

Cannon's algorithm may also be easily adapted to a hypercube [3]. The simplest way is to embed a grid (or two-dimensional torus) in a hypercube, i.e. map the processors in a grid to the processors in a hypercube, and the connections in a grid to a subset of the connections in a hypercube [23][22]. This approach (which is useful for more than matrix multiplication) uses only a subset of the connections in a hypercube, which makes the communication slower than it need be. Several sophisticated improvements on this basic algorithm have been developed [25][24], the latter of which fully utilizes the available bandwidth of the hypercube to reduce the number of messages sent by a factor of 2 and the number of words sent by a factor of nearly

. This assumes each processor in the hypercube can send messages to all its neighbors simultaneously, as was the case on the CM-2. If the architecture permits us to overlap communication and computation, we can save up to another factor of two in speed.

In this section we have shown one can optimize matrix multiplication in a series of steps tuning it ever more highly for a particular computer architecture, until essentially every communication link and floating point unit is utilized. The algorithms are scalable, in that they continue to run efficiently on larger machines and larger problems, with communication costs becoming ever smaller with respect to computation. On the other hand, let us ask what we lose by optimizing so heavily for one architecture. Our high performance depends on the matrices having just the right dimensions, being laid out just right in memory, and leaving them in a scrambled final position at the end (although a modest amount of extra communication could repair this). It is unreasonable to expect users, who want to do several computations of which this is but one, to satisfy all these requirements. Therefore a practical algorithm will have to deal with many irregularities, and be quite complicated. Our ability to do this extreme optimization is limited to a few simple and regular problems like matrix multiplication on a hypercube, as well as other heavily used kernels like the BLAS, which have indeed been highly optimized for many architectures. We do not expect equal success for more complicated algorithms on all architectures of interest, at least within a reasonable amount of time. Also, the algorithm is highly tuned to a particular interconnection network topology, which may require redesign for another machine. In view of this, a number of recent machines try to make communication time appear as independent of topology as possible, so the user sees essentially a completely connected topology.

Mar	APR	May
	01
2017	2018	2019