MultiplierDoc
XEROX PARC, MAY, 1987
The Tamarin Multiplier
Mark Ross
© Copyright 1987 Xerox Corporation. All rights reserved.
Abstract: This report discusses the theory and implementation details of the Tamarin multiplier. It is a sequential multiplier which operates on four bits per cycle and can be clocked at twice the cycle frequency of the processor. Either the upper or lower portions of the result can be computed and provisions have been made to facilitate the multiplying of mantissas for floating point computations. The resulting layout is a 40 bit data path.
XEROX   Xerox Corporation
    Palo Alto Research Center
    3333 Coyote Hill Road
    Palo Alto, California 94304

DRAFT — For Internal Xerox Use Only — DRAFT
1. Theory
All multipliers operate by generating partial products and summing these to form the result. They are distinguished by how the partial products are generated (e.g., two-bit Booth encoding) and by whether they are combinatorial or sequential. Booth encoding is a technique wherein groups of bits can be interrogated to determine the appropriate partial products instead of proceeding one bit at a time.
The Tamarin multiplier is a sequential multiplier and uses a pair of two-bit Booth encoders to process four bits at a time. To increase speed, data movement is pipelined along with the control signals. Only carry save adders are used on the intermediate products so that carry propagation need take place only once at the very end after all of the work is done.
2. Implementation Details
Those wishing to follow all of these details should have a copy of the complete multiplier schematics handy. From the previous description, the components necessary to form the multiplier are registers to hold the multiplicands and products, a Booth encoder and shifter to generate the partial products, and a set of adders to perform a reduction on the generated partial products. This is exactly what you see on the first page of the schematics. Note that no overflow detection is present but thoughts on what needs to be done to include this are in the conclusion.
2.1 Interface Issues
Those unfamiliar with the internal structure of the Tamarin processor need know something about the bus structure assumed by the multiplier. Three system busses are present: d1, d2, and r. They represent the pair of operands fetched from the memory (d1 and d2) and the result bus (r). The d1 operand is encoded and should be the smaller magnitude if this is known apriori and if the state machine can take advantage of this.
In addition to the d1 and d2 operands, there are busses for fltd1 and fltd2 which are the floating point representations of d1 and d2. These should be wired up to shift the mantissa to the correct position and to insert the hidden bit. Two control lines, EUControl and state complete the top level interface (nEUControl is just the complement of EUControl). Seven bits make up the EUControl signal to specify the particular unit and operation being performed. A summary of the relevant bits are shown below for reference (a complete listing of the bit encodings can be found on [Phylum]<CTamarin>Documentation>EUControlSpec.tioga).
EUControl[0:2] =>
 110 Load Multiplier
 111 Unload Multiplier (EUControl[3:6] are all don't cares).
EUControl[3:6] =>
 x00x Load signed 32 bit number
 x01x Load unsigned 32 bit number
 x1?x Load Floating Point number
  ? should be set to 0 or 1 depending on whether or not mantissa is padded by a 0.
   Use 0 if it is not padded and 1 if it is padded.
A state signal is also present. State must come from an external state machine to control the internal sequencing of the multiplier. Encodings for the state machine are shown below:
state[0:1] =>
00 done
01 doingLower
10 doingUpper
11 not allowed (no valid results for this combination)
An example of a proper state transition sequence would be to perform the load while in state "done". The next 8 states should be either "doingLower" or "doingUpper" and a "done" should then come true and remain true until another load takes place. Eight "doingLower" or "doingUppers" is the maximum number of intermediate states required and could be less if the numbers being multiplied have a precision less than the full 32 bits (an n bit signed number requires only n/4 cycles to complete).
The last detail about state concerns how unsigned multiplies are accomplished. These multiplies are important only for the upper product. Those familiar with the Booth algorithm know that it always assumes that the input numbers are signed (whether they are or not). When they are not signed, it is necessary to correct the sign by adding in the d2 input to the result already obtained. This can be accomplished by simply adding a state of "doingLower" after the 8 states of "doingUpper."
2.2 Timing
Timing is based on a single clock (which in the Tamarin Processor is actually the double frequency clock) with positive edge triggered registers used throughout. As stated before, it takes one cycle to perform the load, a maximum of 8 cycles to accumulate the partial products, one or possibly two cycles to perform the final add, and one cycle to unload the result. Inside Tamarin, whose basic cycle speed is half of the clock frequency, it takes one cycle to load the inputs, a maximum of four cycles to accumulate the products, and one cycle to unload the result (the final add is included in this time) for a total of six cycles worst case.
2.3 Functional Units
A description of the individual functional units is contained in the sections which follow and starts at the top of the first sheet of the schematics.
2.3.1 Booth Encoder
The Booth encoder consists of a register to hold the d1 operand and a section of logic to encode the bit groupings to generate the pos/neg signal, the zero signal, and the times 2 signal. Booth encoding looks at 3 bits at a time (one being the overlap between bit pairs) and since the d1 operand is right shifted four bits every cycle, the overlap bit between nibbles is stored in a separate register.
2.3.2 A-Register
This is used to store the multiplicand which will be shifted. In the case of "doingLower," this value is left shifted by four bits every cycle as the multiply progresses. No movement takes place if "doingUpper." Overflow logic would need to interrogate the four bits which are left shifted out of the A-Register (and also the sign of the result). Sign extension is performed on the most significant side of this register.
2.3.3 Shifter
Bi-directional shifting is performed by this unit. When "doingLower," the partial products are left shifted while they are right shifted when "doingUpper."
2.3.4 Adders and Product Registers
The summing of the generated and stored partial products takes place in this section. Two full adders (Carry Save Adders) and a half adder to perform all of the necessary magic. It should be clear why the 2 CSAs are required (takes 4 products down to 2) but may be less clear on what the half adder is doing. During "doingLower," the products are simply recirculated each cycle. This is different while "doingUpper." Then it is necessary to right shift the stored products by four bits every cycle before accumulating with the new partial products. Since information is being shifted off of the right side, these bits must be checked to determine if they will contribute to the overall result. Any necessary adjustment is made possible by the half adder.
2.3.5 Carry Propagate Adder
The two remaining values after all of the partial products are added must be summed with a carry propagate adder to form the final product. This is done by the MAdder. It is a 36 bit adder.
3. Conclusions
A high performance serial multiplier has been designed and simulated at the schematic level. It forms a 40 bit data path (with a bit pitch of 90 microns per bit). No overflow is detected but it should be straightforward as to how to add this feature. The four bits which shift off of the end must be checked to see that while partial products are being generated, all have the same value and are equal to the sign of the original d2 value. In addition, the sign of the final result should be checked to insure against overflow of the final addition.