### Abstract:

Modular multiplication is a mathematical operation fundamental to the RSA cryptosystern,
a public-key cryptosystem with many applications in privacy, security, and authenticity.
However, cryptosecurity requires that the numbers involved be extremely large,
typically ranging from 512-1024 bits in length. Calculations on numbers of this magnitude
are cumbersome and lengthy; this limits the speed of RSA.
This thesis examines the problem of speeding up modular multiplication of large numbers
in hardware, using the classical (add-and-shift) multiplication algorithm. The problem is
broken down, and it is shown that the primary computational bottleneck occurs in the
modular reduction step performed on each cycle. This reduction consists of an integer
division step, a broadcast step, and a multiplication step.
Various methods of speeding up these steps are examined, both for the special case of
radix-2 multipliers (those shifting a single bit at a time) and the general case of radix-2r
multipliers (those shifting r bits on every cycle.) The impacts of these techniques, both on
cycle time and on chip area, are discussed. The scalability of these systems is examined,
and several implementations of modular multiplication found in the literature are analyzed.
Most significantly, the technique of pipelining of modular multipliers is examined. It is
shown that it is possible to pipeline the modular reduction sequence, effectively eliminating
the cycle time's dependence on either the size of the modulus, or on the size of the radius.
Furthermore, a technique for constructing such multipliers is given. It is demonstrated that
this technique is scalable with respect to time, and that pipelining eliminates many of the
disadvantages inherent in previous high-radix implementations. It is also demonstrated that
such multipliers have an area requirement which is linear with respect to both radix and
modulus size.