MultiplierDoc
XEROX PARC, MAY, 1987 
The Tamarin Multiplier

Mark Ross

©  Copyright 1987 Xerox Corporation. All rights reserved.

Abstract: This report discusses the theory and implementation details of the Tamarin multiplier.  It is a sequential multiplier 
which operates on four bits per cycle and can be clocked at twice the cycle frequency of the processor.  Either the upper or lower 
portions of the result can be computed and provisions have been made to facilitate the multiplying of mantissas for floating point 
computations.   The resulting layout is a 40 bit data path.     

XEROX            Xerox Corporation
                Palo Alto Research Center
                3333 Coyote Hill Road
                Palo Alto, California 94304

DRAFT  For Internal Xerox Use Only  DRAFT
1. Theory
    All multipliers operate by generating partial products and summing these to form the result.  They are distinguished by how the 
    partial products are generated (e.g., two-bit Booth encoding) and by whether they are combinatorial or sequential.  Booth 
    encoding is a technique wherein groups of bits can be interrogated to determine the appropriate partial products instead of 
    proceeding one bit at a time.  
    The Tamarin multiplier is a sequential multiplier and uses a pair of two-bit Booth encoders to process four bits at a time.  To 
    increase speed, data movement is pipelined along with the control signals.   Only carry save adders are used on the 
    intermediate products so that carry propagation need take place only once at the very end after all of the work is done.   
2. Implementation Details
    Those wishing to follow all of these details should have a copy of the complete multiplier schematics handy.  From the previous 
    description, the components necessary to form the multiplier are registers to hold the multiplicands and products, a Booth 
    encoder and shifter to generate the partial products, and a set of adders to perform a reduction on the generated partial 
    products.  This is exactly what you see on the first page of the schematics.  Note that no overflow detection is present but 
    thoughts on what needs to be done to include this are in the conclusion.  
2.1 Interface Issues
      Those unfamiliar with the internal structure of the Tamarin processor need know something about the bus structure assumed by the 
    multiplier.  Three system busses are present: d1, d2, and r.  They represent the pair of operands fetched from the memory (d1 
    and d2) and the result bus (r).  The d1 operand is encoded and should be the smaller magnitude if this is known apriori and if 
    the state machine can take advantage of this.    
    In addition to the d1 and d2 operands, there are busses for fltd1 and fltd2 which are the floating point representations of d1 and 
    d2.  These should be wired up to shift the mantissa to the correct position and to insert the hidden bit.  Two control lines, 
    EUControl and state complete the top level interface (nEUControl is just the complement of EUControl).  Seven bits make up the 
    EUControl signal to specify the particular unit and operation being performed.  A summary of the relevant bits are shown below 
    for reference (a complete listing of the bit encodings can be found on [Phylum]<CTamarin>Documentation>EUControlSpec.tioga).  
        EUControl[0:2] =>
            110    Load Multiplier 
            111    Unload Multiplier (EUControl[3:6] are all don't cares).
        EUControl[3:6] =>
            x00x    Load signed 32 bit number 
            x01x    Load unsigned 32 bit number
            x1?x    Load Floating Point number
                ? should be set to 0 or 1 depending on whether or not mantissa is padded by a 0.
                    Use 0 if it is not padded and 1 if it is padded.

    A state signal is also present.  State must come from an external state machine to control the internal sequencing of the 
    multiplier.  Encodings for the state machine are shown below:
        state[0:1] =>
            00    done
            01    doingLower
            10    doingUpper
            11    not allowed (no valid results for this combination)

    An example of a proper state transition sequence would be to perform the load while in state "done".  The next 8 states should be 
    either "doingLower" or "doingUpper" and a "done" should then come true and remain true until another load takes place.  Eight 
    "doingLower" or "doingUppers" is the maximum number of intermediate states required and could be less if the numbers being 
    multiplied have a precision less than the full 32 bits (an n bit signed number requires only n/4 cycles to complete).  
    The last detail about state concerns how unsigned multiplies are accomplished.  These multiplies are important only for the upper 
    product.  Those familiar with the Booth algorithm know that it always assumes that the input numbers are signed (whether they 
    are or not).  When they are not signed, it is necessary to correct the sign by adding in the d2 input to the result already 
    obtained.  This can be accomplished by simply adding a state of "doingLower" after the 8 states of "doingUpper."
2.2 Timing
    Timing is based on a single clock (which in the Tamarin Processor is actually the double frequency clock) with positive edge 
    triggered registers used throughout.  As stated before, it takes one cycle to perform the load, a maximum of 8 cycles to 
    accumulate the partial products, one or possibly two cycles to perform the final add, and one cycle to unload the result.    
    Inside Tamarin, whose basic cycle speed is half of the clock frequency, it takes one cycle to load the inputs, a maximum of 
    four cycles to accumulate the products, and one cycle to unload the result (the final add is included in this time) for a total 
    of six cycles worst case.  
2.3 Functional Units
    A description of the individual functional units is contained in the sections which follow and starts at the top of the first sheet 
    of the schematics.    
2.3.1 Booth Encoder
    The Booth encoder consists of a register to hold the d1 operand and a section of logic to encode the bit groupings to generate the 
    pos/neg signal, the zero signal, and the times 2 signal.  Booth encoding looks at 3 bits at a time (one being the overlap 
    between bit pairs) and since the d1 operand is right shifted four bits every cycle, the overlap bit between nibbles is stored 
    in a separate register.  
2.3.2 A-Register
    This is used to store the multiplicand which will be shifted.  In the case of "doingLower," this value is left shifted by four bits 
    every cycle as the multiply progresses.  No movement takes place if "doingUpper."  Overflow logic would need to interrogate the 
    four bits which are left shifted out of the A-Register (and also the sign of the result). Sign extension is performed on the  
    most significant side of this register.
2.3.3 Shifter
    Bi-directional shifting is performed by this unit.  When "doingLower," the partial products are left shifted while they are right 
    shifted when "doingUpper."  
2.3.4 Adders and Product Registers
    The summing of the generated and stored partial products takes place in this section.  Two full adders (Carry Save Adders) and a 
    half adder to perform all of the necessary magic.   It should be clear why the 2 CSAs are required (takes 4 products down to 2) 
    but may be less clear on what the half adder is doing.  During "doingLower," the products are simply recirculated each cycle.  
    This is different while "doingUpper."  Then it is necessary to right shift the stored products by four bits every cycle before 
    accumulating with the new partial products.  Since information is being shifted off of the right side, these bits must be 
    checked to determine if they will contribute to the overall result.  Any necessary adjustment is made possible by the half 
    adder.  
2.3.5 Carry Propagate Adder
    The two remaining values after all of the partial products are added must be summed with a carry propagate adder to form the final 
    product.  This is done by the MAdder.  It is a 36 bit adder.    
3. Conclusions
    A high performance serial multiplier has been designed and simulated at the schematic level.  It forms a 40 bit data path (with a 
    bit pitch of 90 microns per bit).  No overflow is detected but it should be straightforward as to how to add this feature.   
    The four bits which shift off of the end must be checked to see that while partial products are being generated, all have the 
    same value and are equal to the sign of the original d2 value.  In addition, the sign of the final result should be checked to 
    insure against overflow of the final addition.