% Proposal.TeX
\documentstyle[12pt]{article}
\title{An Architecture for High-Performance Single-Chip VLSI Testers}
\author{James A. Gasbarro \and Richard M. Barth}
\begin{document}
\maketitle
\newcommand{\stamp}{Printed on \today.}
\newcommand{\voh}{\(V_{OH}\)}
\newcommand{\vol}{\(V_{OL}\)}
\newcommand{\micron}{\(\mu\)}
\newcommand{\micronsq}{\(\mu^{2}\)}
\newcommand{\ram}{{\sc ram}}
\newcommand{\rams}{{\sc rams}}
\newcommand{\dram}{{\sc dram}}
\newcommand{\IPFig}[4]{\begin{figure}
\centering
\makebox[#2][l]{\rule{0pt}{#3}\verb"<==<#1.ip<"}
\caption{#4}
\label{#1}
\end{figure}Figure~\ref{#1}}
\section{Executive Summary}
Testing is an important factor in the production of usable custom integrated circuits. Verification of the functional and {\sc ac} parametric performance characteristics of a device is usually performed on large and expensive test systems. This proposal presents a new approach to tester architecture that seeks to greatly reduce both the size and cost of these systems. The principal idea is to base the tester design on the same high-density technology as that of the devices to be tested. Through the use of novel test vector compression techniques and closed-loop timing calibration methods, it is possible to achieve both high performance and high density in a CMOS technology. The proof is the implementation of a single-chip, multi-channel tester which has the size and cost attributes of very low-end testers, yet implements many of the features found only on the most expensive machines.
The high level of integration achieved results in a number of other advantages as well: the close proximity of the tester to the test device eliminates most of the signal transmission and loading issues encountered in larger systems; the extremely compact size enables in-circuit probing and performance analysis of the test device without custom fixturing; finally, and perhaps most importantly, by implementing the tester in the same technology as that of the device to be tested, future upgrades of the tester capability can evolve along with the capabilities of the subject devices.
As bad as the situation is presently, it is likely to worsen as ASIC development costs continue to drop while performance increases. The result will be a widening disparity between design and testing costs. The fundamental problem is that the current generation of testers is based on speed-aggressive technologies, such as Emitter Coupled Logic, to achieve the performance required to test ASICs. Due to thermal and device limitations, as well as market forces, such bipolar logic families are not experiencing the same growth in integration density and speed as their MOS counterparts. The result is that it is difficult to track the performance improvement of the device under test by simply scaling the devices used in the tester.
This proposal focuses on a new approach to tester architecture: basing the tester design on the same technology as that of the devices to be tested. Ordinarily this would seem difficult, if not impossible, since one would not expect a system as complex as a tester to have performance characteristics superior to those of any other device fabricated in the same technology. However, this proposal demonstrates that this goal is achievable through novel architecture and circuit design.
\section{Introduction}
Since the early 1960s, when the first integrated circuits were produced, a large number of different test systems have been developed. The earliest testers were fairly simple descendants of board-level testers. It was not long, however, before differences in board and integrated circuit testing caused the two types of testers to take on separate identities. Diversification in IC product lines resulted in further subdivision of the integrated circuit tester field into specific product categories. General purpose digital testers were developed for the growing SSI (Small Scale Integration) and MSI (Medium Scale Integration) market. Memory testers were devised with special purpose pattern generators to meet the needs of testing array structures. More recently, mixed analog/digital testers have appeared for testing a variety of products which combine linear and/or high voltage circuits with digital circuits on the same die.

Each of these market segments faces its own challenges: general-purpose digital testers must track the test requirements of increasingly complex VLSI devices; memory testers must keep pace with the increasing speed and density requirements of new \ram\ technologies; and mixed analog/digital testers must improve in their I/O flexibility and signal analysis capabilities. The general-purpose digital field, though, is in many ways the most challenging. Here, nearly all aspects of the tester architecture are changing at exponential rates. Ten years ago the state of the art in LSI (Large Scale Integration) technology was 50K transistors/chip, clock speeds of around 5 MHz, and pin-counts which rarely exceeded 64. By the year 2000, as digital IC technology matures from VLSI (Very Large Scale Integration) to ULSI (Ultra Large Scale Integration), the number of transistors on a chip will exceed 100 million, new packaging technology will push pin counts into the thousands, and shrinking geometries will allow devices to operate at frequencies well in excess of 100 MHz. Coupled with these increases is the fact that these devices will encompass the memory and analog domains as well. It is not at all uncommon even today to find large embedded memories or analog I/Os in devices which are primarily digital. These demands of state-size, speed, pin-count, and I/O flexibility conspire to make the problem of building general-purpose testers quite formidable.
Tester architectures can be divided into two major components, the data generator and the pin electronics. The data generator produces the digital stimulus vectors for the {\em Device Under Test} (DUT), while the pin electronics is responsible for formatting these vectors to produce a waveform of the desired shape and timing, and for sampling the DUT outputs at the desired time.
\section{Pin Electronics}
In most test systems there is a separate printed circuit card for every one or two pin electronics channels. In a tester with several hundred pins, it is easy to see why the pin electronics portion of a tester represents a significant fraction of the overall system cost. In many tester systems, resources such as timing generators and voltage sources are shared among pins, restricting the flexibility of the tester but lowering its cost. Another cost-reducing technique is to provide individual input and output pin types rather than a single general I/O pin. Testers of this class also frequently provide hardwired signal groups, where all pins of the group have the same timing characteristics, in order to reduce the number of timing generators in the system. In such systems it is up to the user to manually wire signals from the input/output groups to the proper DUT pins. Even in medium range machines, the most expensive resources, such as timing generators, are often shared among pins, requiring complex switching matrices to distribute clock edges. Machines of this nature are said to have a {\em shared resource} architecture. At the high end of the tester spectrum are machines which provide pin electronics with a general I/O architecture and little or no sharing of resources. Since each pin is equivalent in power to every other pin, the organization is known as a {\em tester-per-pin}, or simply a {\em per-pin}, architecture.
The term {\em pin electronics} has traditionally referred only to that portion of the tester system which is pin specific. In shared resource systems this implies three major sub-circuits: the output driver, the input comparator, and a loading device. The output driver determines the output high (\voh) and output low (\vol) drive levels of the tester, while the input comparator determines the tester input sample threshold. Some testers have multiple input comparators which can be used either for measuring the \voh\ and \vol\ of the DUT or for measuring its output risetimes. A programmable current load device is sometimes included in the pin electronics to simulate the effects of a bipolar load on an output of the test device.
In a per-pin architecture, pin electronics takes on a broader meaning. In such a system, the functions of edge timing generation and formatting are encompassed in the pin electronics in addition to the drive/sense and loading functions. Formatting is a means for obtaining higher I/O bandwidth to the DUT and also for reducing the size of the high-speed memory necessary to hold the test vectors. The format unit takes the drive level supplied by the test vector and combines it with the output of one or more timing generators to produce a pulse waveform of the desired shape. The position and duration of the pulse within the test cycle are controlled by the outputs of individual timing generators. Some of the standard format modes are Non-Return to Zero (NRZ), Return to Zero (RZ), Return to One (RO), and Return to Tri-State (RT).
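
To make the formatting operation concrete, the following Python sketch models how a format unit might combine a drive bit with two timing edges. The mode conventions and parameter values are illustrative assumptions, not the behavior of any particular tester.

\begin{verbatim}
# Illustrative sketch of a format unit (assumed conventions, not an
# actual tester's logic).  A drive bit from the test vector is combined
# with two timing-generator edges t1 and t2 (ns) to form one cycle.

def format_pin(mode, drive, t1, t2, cycle=10.0, step=0.5):
    """Return the waveform as levels '0', '1', 'Z', one per `step` ns."""
    wave = []
    t = 0.0
    while t < cycle:
        if mode == "NRZ":    # hold the drive level for the whole cycle
            level = "1" if drive else "0"
        elif mode == "RZ":   # drive of 1 becomes a high pulse in [t1,t2)
            level = "1" if (drive and t1 <= t < t2) else "0"
        elif mode == "RO":   # drive of 0 becomes a low pulse in [t1,t2)
            level = "0" if (not drive and t1 <= t < t2) else "1"
        elif mode == "RT":   # drive only during [t1,t2), tri-state else
            level = ("1" if drive else "0") if t1 <= t < t2 else "Z"
        else:
            raise ValueError("unknown format mode: " + mode)
        wave.append(level)
        t += step
    return wave

print("".join(format_pin("RZ", 1, 2.0, 6.0)))
# -> 00001111111100000000
\end{verbatim}
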
The timing generators provide a clock edge at the desired point during the tester cycle. The quality of the placement of these edges is an important aspect of the tester performance, particularly in production oriented testers. Current systems have poor performance primarily due to the large number of gate delays and physical components between the timing generators and the DUT. Gate counts as high as 50 are not uncommon, nor are tens of feet of interconnect path. Each such element introduces drift and jitter terms which limit the maximum attainable accuracy. A solution to this problem is to integrate as much of the pin electronics circuitry as possible. By nature of their small size and uniform thermal characteristics, integrated circuits are an ideal means for reducing component-induced errors.
\section{Data Generator}
The concept of the data generator for a high-speed tester seems simple enough: a data generator is a device which produces a digital test vector specifying the drive and response data for each pin during every tester cycle. The difficulty comes when size and speed constraints are placed on the system. Most testers store three to four bits per pin in the vector memory to specify the output drive level, the expected input level, whether the pin is input or output, and whether or not to signal an error if the input compare operation fails. A state-of-the-art tester with 512 pins would therefore require a memory word width of 2048 bits. To test the most complex VLSI devices, the tester should be capable of delivering test sequences as long as one million vectors. Furthermore, if the tester is to cycle at speeds up to 100 MHz, high-speed (but low-density) ECL or GaAs \rams\ would be necessary to supply the required bandwidth. It is easy to see how a data generator can stretch the limits of available memory technology. Many approaches have been taken to alleviate these constraints, all aimed at reducing the required amount of high-speed memory.
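
The figures quoted above follow from simple arithmetic; the short sketch below merely restates them, with the numbers taken directly from the text.

\begin{verbatim}
# Back-of-the-envelope check of the vector-memory figures in the text.
pins         = 512          # tester channels
bits_per_pin = 4            # drive, expect, direction, error mask
vectors      = 1000000      # test sequence length
rate_hz      = 100e6        # tester cycle rate

word_bits  = pins * bits_per_pin              # 2048-bit memory word
total_bits = word_bits * vectors              # ~2 Gbit of vector store
gbytes_s   = word_bits * rate_hz / 8 / 1e9    # sustained RAM bandwidth

print("word width :", word_bits, "bits")      # 2048 bits
print("total store:", total_bits / 8e6, "MB") # 256 MB
print("bandwidth  : %.1f GB/s" % gbytes_s)    # 25.6 GB/s
\end{verbatim}
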
Schemes which have been implemented for reducing the memory size requirement of data generators fall into four main classifications: reduced functionality, memory overlay, algorithmic generation, and general compression. Some low-end testers reduce the data generator word width by simply eliminating functionality. Pins of the tester are made either fixed-direction or unmaskable in order to reduce the number of high-speed memory chips. Such testers represent the least elegant approach to the problem, since they reduce the potential tester applications. The memory overlay schemes, which are somewhat more general, employ a small amount of very high-speed \ram, a larger amount of lower-speed memory, and some technique for loading the fast \ram\ from the slower one. The third class of memory reduction scheme takes advantage of the algorithmic nature of many test sequences. Memory testers are the classic example of such generators. In memory testers the test patterns must exhaustively search the DUT memory array for sensitivities between adjacent bits or lines of the array. Relatively little work has been done in the realm of general-purpose compression schemes for testing purposes. This type of reduction scheme takes advantage of the spatial and temporal coherence in the test vectors to reduce the storage requirements.
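
The text leaves the general compression scheme unspecified at this point; as a generic illustration of exploiting temporal coherence, the sketch below run-length encodes repeated vectors. This is an assumed example, not Testarossa's decompressor.

\begin{verbatim}
# Generic illustration of temporal-coherence compression (run-length
# encoding of repeated vectors); not Testarossa's actual algorithm.

def rle_compress(vectors):
    """Collapse runs of identical vectors into (vector, count) pairs."""
    runs = []
    for v in vectors:
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)
        else:
            runs.append((v, 1))
    return runs

def rle_expand(runs):
    return [v for v, n in runs for _ in range(n)]

vecs = ["1100", "1100", "1100", "1101", "1101", "0000"]
packed = rle_compress(vecs)
assert rle_expand(packed) == vecs
print(packed)   # [('1100', 3), ('1101', 2), ('0000', 1)]
\end{verbatim}
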
\section{Limits of Current Systems}
Users of the current generation of integrated circuit testers are becoming increasingly aware of their limitations for testing VLSI devices. As a result of simultaneous increases in both pin-count and device speed, testers are becoming ever larger and more expensive. Size is important not only because of the cost implications, but also because of the overall performance degradation due to the increased length of the transmission line between the pin electronics and the DUT. When testing ECL or GaAs devices, which have low impedance outputs, it is easy for the tester to maintain a matched impedance environment for both input and output signals to the DUT. Both the tester and the DUT are capable of driving the terminated transmission line which connects them. The only electrical effect of the transmission line is a pure delay which can be calibrated and compensated before actual testing is performed. This is not the case, however, when testing typical CMOS VLSI devices. Because of the limited drive capability of such devices, which is in turn related to their high pin counts, it is typically not possible for the DUT to properly drive a 50 or even a 90 ohm terminated line.
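
The drive problem is easy to quantify. Assuming a full 5 V CMOS swing (an illustrative value, not taken from the text), holding a terminated line at the high level requires a steady current of the swing voltage divided by the line impedance:

\begin{verbatim}
# Why a CMOS output cannot drive a terminated line to full swing.
# The 5 V swing is an assumed, illustrative value.
swing_v = 5.0
for z_ohm in (50.0, 90.0):
    i_ma = swing_v / z_ohm * 1e3     # current needed to hold the level
    print("%.0f ohm line: %.0f mA" % (z_ohm, i_ma))
# 50 ohm line: 100 mA
# 90 ohm line: 56 mA
# A typical CMOS pad driver sources far less, so full levels cannot be
# maintained into a terminated 50 or 90 ohm environment.
\end{verbatim}
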
The physical size of the test system, particularly the pin electronics, has other influences on the utility of the test system in addition to the electrical problems discussed above. In failure analysis work, for example, it is necessary to connect the test head to the DUT with a good electrical signal environment and still provide microscope and micro-probe access to the circuit. Even with custom fixturing, this is an extremely difficult task. In the near future, device geometries will shrink below one micron, making mechanical micro-probing obsolete. When this happens, more exotic techniques such as electron beam probing will become necessary. In this type of system the test device must be isolated in a vacuum chamber, further complicating the mating of the tester and DUT. The need for smaller testers is evident.
Ideally, all of the high bandwidth components of the tester should be on the same scale as the DUT itself. The following section discusses the basic problems encountered when taking this approach to tester design, and then introduces the basic concepts for a fully integrated CMOS tester.
\section{Don't Think Big}
If one were to survey the market for high-end test systems available today, one would find many mainframe-sized testers costing on the order of a few million dollars each. The basic structure of these machines is essentially the same: several racks full of electronics to generate the test vectors, connected via a large amount of high speed cabling to a bulky pin electronics head which in turn interfaces to the single small test device. The obvious question is, ``is all of this really necessary?'' Indeed, is it the case that the designers of such equipment have built themselves a self-fulfilling prophecy? In setting out to create a world-class tester they have assumed, based on previous designs, that it would be a physically large machine. The large size imposes a number of problems having to do with signal skew, loading, accuracy, and distribution that require additional hardware and software subsystems to correct. As the number of subsystems increases, the communication and control costs increase as well, leading to further system complexity. In the end, the tester must be a physically large machine to accommodate all of the overhead hardware, thus justifying the designers' initial assumption. In order to test this hypothesis, the initial assumption must be changed. What would a VLSI test system look like that was designed to be as {\em small} as possible?
\section{Think Small}
In order to build the smallest possible system, the obvious option is to employ high density VLSI technologies. The number of transistors that can be implemented in a bipolar technology is fairly limited due to power dissipation, so a higher density technology like CMOS seems the likely candidate. The main disadvantage of choosing CMOS is that it is not a particularly fast technology. If the principal application of the tester is testing CMOS ASIC components, then the technology of the tester is only on par with that of the test device. This can be turned into an advantage, though: if the desired performance is achievable, then the tester depends only on the same technology as the device to be tested. No other, more exotic technology is required, thus allowing the tester technology to track that of the DUT. Let us put aside the speed difficulty for now, by assuming that it can be handled through clever circuit design and parallelism.
High-speed interconnect in any system is a difficult problem. In commercial test systems one typically finds large masses of bulky coaxial cables tying various pieces of the system together. This type of interconnect cannot be tolerated in a test system that is to be as small as possible. The best way to avoid it is to constrain all of the high-speed components of the system to fit together on a single printed circuit card where signal impedances can be carefully controlled and properly terminated. Another advantage of placing all of the high-speed portion of the system on a single board is that it drives the system towards a clean, low speed interface to the outside world. Presumably, the system is controlled by a host mainframe, workstation, or PC, so eliminating all real-time constraints from the host's interface ensures that this part of the system can be simple and low cost as well.
With the single board system concept in mind, attention can now be focused on the individual components that make up the building blocks of the system. There are three main functions that these components must implement: timing generation, pin drive/sense, and vector storage. The ideal device for a single-chip tester would incorporate all three functions for several tester channels on a single chip. In addition, it would contain the necessary interface and control logic so that it could connect directly, via a low speed interface, to the host computer and directly, through short traces, to the DUT. A number of these devices would be used in parallel to build testers of arbitrary width. A small number of clock and calibration signals would be common to all elements of the system, but since they are relatively few in number, a good deal of care could be taken in their generation and distribution. The master timing source would be either a custom device or a small amount of board level logic, since only one copy would be required. The system would easily fit on a probe card for wafer analysis and in-circuit debugging. It could also be added as a single board to a workstation for packaged part testing. Such a tester could have good performance characteristics, yet still be made small and low cost.
\section{System Overview}
The goal in designing Testarossa was to produce a single device that could be used as a building block for constructing high speed, high pin-count VLSI test systems. \IPFig{Board}{150mm} {200mm}{Single board integrated test system} illustrates the configuration for a 256 pin test system using the Testarossa chip. The system consists of sixteen tester chips, each providing sixteen I/O channels. The chips are arranged in a circular fashion around the central DUT, thereby equalizing the lead lengths while maintaining a total trace length of less than ten centimeters. This limits the time-of-flight of a signal between DUT and pin electronics to only a few hundred picoseconds, obviating the need for terminated transmission lines. The short trace length also limits the stray load capacitance on the DUT outputs to the order of a few picofarads. This ensures high-fidelity waveform transmission for both driven and sensed DUT signals with a minimum of signal loading. The Testarossa chip is composed of three main parts: the pin electronics, the decompressor and associated sequence control logic, and the vector storage \ram\ (\IPFig{TRBD} {150mm}{105mm}{Block diagram of Testarossa chip.}).
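
The time-of-flight figure follows from the trace length and the signal propagation velocity on the board; the sketch below assumes a typical velocity of about half the speed of light, which is an assumption rather than a measured value for this board.

\begin{verbatim}
# Time of flight over a <10 cm trace, assuming propagation at ~0.5c
# (a typical printed circuit board figure, assumed for illustration).
C = 3.0e8                    # speed of light, m/s
velocity = 0.5 * C           # assumed on-board propagation velocity
length_m = 0.10              # maximum DUT-to-pin-electronics trace

tof_ps = length_m / velocity * 1e12
print("time of flight: %.0f ps" % tof_ps)   # ~670 ps
\end{verbatim}
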
\section{Results}
The layout and simulation of Testarossa were completed in about six months and the device was fabricated through the MOSIS implementation service. The first silicon was nearly 100\% functional, which is a testimony to the methodology of schematic entry, layout versus schematic verification, and simulation, simulation, and more simulation. The only defect in the chip was a timing error in the \dram\ which caused some crosstalk between the data columns of the \ram. This prevented more than one column of the \dram\ from being used. Fortunately, the column selects were derived from the high order address bits, so that the first 128 words in the \ram\ could be used reliably. This provided enough vector storage space to allow thorough testing of the decompressor.
The maximum speed of the device was just over 17 MHz, somewhat slower than the anticipated 25 MHz. The limiting factor was the poor access time of the \dram, which was related to the crosstalk problem mentioned above. When the decompressor was operated in a debugging mode that bypassed the \dram\ and executed a four-instruction loop out of the \ram\ Write Data register, a maximum speed of 33 MHz was attained. This is indicative of what could be achieved with a better \ram\ technology.
The DUT waveform exhibited good edge and timing resolution characteristics. The rising edge kink effect observed in the experimental pin electronics chip was eliminated in this device, yielding an unloaded rising edge speed of 3 ns. The edge resolution was 600 ps, the same as that of the earlier device, providing an overall calibrated accuracy of about 1 ns. Of 27 parts tested, four were fully functional, for a yield of about 15\%.
\section{Future Work}
There are many ways in which the architecture can be improved and extended as new technologies become available. The current 2\micron\ Testarossa implementation is rather behind the state of the art in terms of process technology. A 0.8\micron\ process would provide a four-fold increase in effective die area while still allowing a 20\% shrink in the actual die dimensions. One way to use the additional die area would be to simply replicate the existing logic to provide more DUT channels per chip, allowing an even denser system to be built. However, this does not seem to be the most advantageous use of the added silicon area. A better approach would be to use the space to increase the on-chip resources for improving system capabilities. For example, the vector storage \ram\ size could be increased. This is the most likely candidate for consuming the majority of the added area, since the 10K-vector storage capacity of the current implementation is rather limited compared to the majority of testers available. In Testarossa, the \dram\ occupies only 14\% of the total die area. If all of the additional space were devoted to vector storage, the storage capacity would increase by more than a factor of 20, providing over 200K vectors on-chip (at the average compression factor).

As the percentage of die area devoted to vector storage increases, the benefit of the vector decompressor becomes more evident. The ratio of vector \dram\ to decompressor areas in Testarossa is approximately 1:1. Given that the average compression factor is about 5, the effective compression ratio is only 2.5:1, since the \dram\ area could be doubled if the decompressor were not present. The effect of a process shrink is to increase the \dram\ size while decreasing the decompressor area, making the effective compression ratio much closer to the average. Even greater storage capacity could be achieved by employing a technology which provides a real \dram\ storage cell, such as a trench capacitor process, as opposed to the poly-diffusion cell capacitor used in Testarossa. This would improve the bit-density by another factor of eight to ten, pushing the vector storage capacity well into the mega-vector regime. Table~\ref{RamComparison} shows a comparison of static and dynamic \ram\ implementations, contrasting the generic 2\micron\ digital technology used for Testarossa with the best \ram\ technologies currently available.
\begin{table}[htp]
\centering
\begin{tabular}{||r|c|c|r|r||} \hline
Technology & Type & Cell & Size (\micronsq, est.) & Cycle (ns, est.) \\
\hline
2\micron & Static & 6T & 1600 & 15 \\
2\micron & Dynamic & Poly-Dif & 300 & 35 \\
0.8\micron & Static & 4T & 70 & 8 \\
0.8\micron & Dynamic & Trench & 15 & 30 \\
\hline
\end{tabular}
\caption{Comparison of static and dynamic \ram\ implementations}
\label{RamComparison}
\end{table}
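
The effective-compression argument above can be restated numerically. The figures below are those quoted in the text; the shrunk-process example values at the end are assumptions for illustration.

\begin{verbatim}
# Effective compression ratio, using the figures quoted in the text.
avg_compression = 5.0    # average vector compression factor
dram_area       = 1.0    # vector DRAM area (normalized)
decomp_area     = 1.0    # decompressor area; ~1:1 ratio in Testarossa

# Without the decompressor, its area could hold DRAM instead, so the
# fair figure is compressed capacity vs. plain DRAM of the same area.
eff = avg_compression * dram_area / (dram_area + decomp_area)
print("effective compression: %.1f:1" % eff)  # 2.5:1, as in the text

# After a shrink the DRAM grows and the decompressor shrinks; assumed
# example: dram_area = 10, decomp_area = 0.5 gives about 4.8:1,
# approaching the 5:1 average.
\end{verbatim}
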
An alternative approach for the vector storage is to use static \ram\ instead of \dram. The major tradeoff between these two choices is storage density versus access speed. The data path of the tester and decompressor is a fairly simple pipelined architecture, so the limiting factor in the tester performance is the \ram\ cycle time. By going to a fast static \ram\ process, the performance of the system could be dramatically improved, but at the expense of reduced vector storage capacity.
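
Using the 0.8\micron\ entries of Table~\ref{RamComparison}, the density-versus-speed tradeoff can be made concrete; note that the table values are themselves estimates.

\begin{verbatim}
# Density vs. speed for the 0.8 micron entries of the table (estimates).
sram_cell_um2, sram_cycle_ns = 70.0, 8.0    # 4T static cell
dram_cell_um2, dram_cycle_ns = 15.0, 30.0   # trench dynamic cell

print("DRAM density advantage: %.1fx" % (sram_cell_um2 / dram_cell_um2))
print("SRAM speed advantage  : %.1fx" % (dram_cycle_ns / sram_cycle_ns))
# DRAM stores ~4.7x more bits per unit area; SRAM cycles ~3.8x faster.
\end{verbatim}
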
Advanced process technology can be applied to other aspects of the architecture as well. Bipolar transistors, such as those available in a Bi-CMOS process, would help in several areas of the design. For example, the higher gain of the bipolar devices would allow the current drive of the output pad driver transistor to be improved while at the same time reducing the capacitive load of the driver on its input. This would enable faster output rise-times and lower DUT output loading. Bipolar transistors could also be used in the calibration phase comparator and the input sample comparator. The higher gain-bandwidth product of the bipolar devices would result in greater timing accuracy.
In addition to architectural improvements to the Testarossa device, some additional work remains to be completed before a complete multi-hundred pin tester system can be assembled. The bulk of the work is in the design of the reference clock generator circuit. The requirements of this circuit are that it produce a square waveform at the tester cycle frequency with a phase relationship to the system clock that is programmable with sub-nanosecond resolution and accuracy over a range of a microsecond or so. It is no surprise that these are the same requirements as those of the tens or hundreds of individual clock generator circuits used in commercial testers. It should be relatively easy, then, to leverage these earlier designs to implement a high quality generator using a semi-custom ECL device. Alternatively, there are CMOS approaches employing phase-locked loop principles that would keep the design of the entire system in one technology.
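
One way to view the required programmability is as a coarse count of system-clock cycles plus a fine interpolated fraction of a cycle. The sketch below computes such a setting; the clock period and number of fine steps are assumed values, and this is an arithmetic illustration rather than a circuit design.

\begin{verbatim}
# Hypothetical phase-offset programming: coarse cycles of the system
# clock plus a fine interpolation step.  Parameters are assumed.

def phase_setting(offset_ns, sysclk_ns=10.0, fine_steps=16):
    """Split a requested offset into (coarse cycles, fine step)."""
    coarse = int(offset_ns // sysclk_ns)
    fine = round((offset_ns - coarse * sysclk_ns) / sysclk_ns * fine_steps)
    coarse += fine // fine_steps     # carry if rounding reached a cycle
    fine %= fine_steps
    return coarse, fine              # resolution: sysclk_ns / fine_steps

# 16 fine steps on a 10 ns clock give 625 ps resolution, and a modest
# coarse counter covers the microsecond range called for above.
print(phase_setting(123.4))          # (12, 5): 12 cycles + 5 * 625 ps
\end{verbatim}
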
\section{Summary}
This proposal has presented a new architecture for building high-performance single-chip VLSI testers. New circuit and system designs for the pin electronics have been shown which solve the problems of achieving high timing accuracy for output edge placement and input sample capture in a relatively slow base technology such as CMOS. These techniques provide sufficient density advantages over traditional methods to enable the implementation of a multi-channel device with true tester-per-pin characteristics. A reduction in the size of the data generator portion of the tester has also been achieved through the use of data compression. All of these ideas have been brought together in the design and successful implementation of Testarossa: a single-chip, 16-channel tester containing all of the functions of pin electronics, vector storage, decompression, and acquisition logic.
\end{document}