Rick Beach, February 15, 1987 8:09:08 pm PST
Dialogue Management for Gestural Interfaces
Jim Rhyne
T. J. Watson Research Center
IBM Corporation
P. O. Box 218
Yorktown Heights NY 10598
Gestural interfaces are electronic analogues to pencil and paper. Since the effectiveness of such interfaces depends heavily on enduser familiarity with pencil markup of printed documents, the interface must conform to the user's behavior and not rely on educating the enduser. Spatial relationships among the gestural forms partially determine the syntactic interpretation of gestures, along with information about the context in the neighborhood of the gesture. Temporal grouping of gestural forms is more important than their temporal sequence. Such characteristics suggest a form of dialogue recognition in which rules do not specify temporal ordering of forms, and in which multiple parses are carried out in parallel.
Gestural interfaces are a member of the family of direct manipulation interfaces [10,23]. They employ a stylus on a tablet, and are capable of recognizing meaningful configurations of strokes, including handwritten text, pointing, and other stroke configurations which we will term gestures.
The ideal input-output medium for gestural interfaces is a thin, flat display on which the stylus is directly used. Ideally, the display/tablet package would be the size and weight of a large book and could be carried around by the enduser as an electronic clipboard. Alan Kay proposed a similar notion about ten years ago, which was called the DYNABOOK [12,13]. Kay and Goldberg also gave some consideration to the use of gesture and handwriting recognition in this project, but their interest in the development of the Smalltalk programming language led them away from further work on gestural interfaces (A. Goldberg, personal communication, December 1986).
Gestural interfaces have a number of potential advantages:
 A single gesture can be equivalent to many keystrokes and mouse actions;
 The package can be held in one hand and operated with the other, permitting its use in shops, warehouses and manufacturing facilities, or where the other hand must be free to use a telephone, calculator, etc.;
 The interface is silent, facilitating its use in group meetings.
And a couple of potential disadvantages:
 Handwriting is 2-5 times slower than keyboard entry for text;
 Gestural interfaces will be more expensive than keyboard/mouse interfaces.
We have no data to predict the tradeoff of advantages and disadvantages. Our goal is to understand the human factors of gestural interfaces, the methods and costs of constructing them, and the aspects of the input and display technology which limit their use [20]. Several pilot human factors experiments have been done using pencil and paper and simulated interfaces, to help us understand the issues. In addition, these experiments have been a valuable source of gestural examples, which we need to determine the functions our interface software must perform [26]. This report addresses the software architecture of gestural interfaces as we presently understand it [21].
Dialogue Structure in Gestural Interfaces
Dialogue structure is best described at two levels: the functional structure, and the stroke (surface) structure. The functional structure defines the functional roles played by gestures. The stroke structure characterizes the types and grouping of strokes to form a gesture. The task of the dialogue manager is to transform the stroke structure into the functional structure.
The fundamental element of a gesture is a stroke, i.e. a timestamped series of positions beginning with contact of the stylus or finger with the input surface, and ending when contact is broken. Strokes are initially categorized by their areal extent as pointing strokes, handwriting strokes, or gestural strokes. This classification only gives clues to the stroke function. A pointing gesture may really be the dotting of an i, or a decimal point in a number. The functional role of a stroke can only be realised by extraction of its geometric features, examination of its potential role in the rest of the gestural dialogue, and investigation of the display objects with which it may be correlated.
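The stroke definition and the preliminary classification by areal extent can be sketched as follows. This is a minimal illustration, not the project's implementation: the names Stroke, extent, and preliminary_class are hypothetical, and so are the two thresholds, since no values are given here.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical thresholds in tablet units; the paper gives no values.
POINT_MAX_EXTENT = 3.0
HANDWRITING_MAX_EXTENT = 40.0


@dataclass
class Stroke:
    """Timestamped positions from stylus contact until contact is broken."""
    samples: List[Tuple[float, float, float]]  # (time, x, y)

    def extent(self) -> float:
        """Larger side of the stroke's bounding box."""
        xs = [x for _, x, _ in self.samples]
        ys = [y for _, _, y in self.samples]
        return max(max(xs) - min(xs), max(ys) - min(ys))


def preliminary_class(stroke: Stroke) -> str:
    """First guess only; later stages may overrule it (e.g. the dot of an i)."""
    size = stroke.extent()
    if size <= POINT_MAX_EXTENT:
        return "pointing"
    if size <= HANDWRITING_MAX_EXTENT:
        return "handwriting"
    return "gestural"
```

The result is only a clue to the stroke's function, so a caller would treat it as a routing hint rather than a final decision.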
Functional Elements of Gesture
The functional elements correspond to those found in command languages. The functional elements consist of operations, scopes, targets, literals, and modifiers. The following examples will illustrate functions associated with strokes in a gestural interface.
[Figure 1: twelve panels, labeled 1a-1d, 2a-2d, and 3a-3d, each showing gestures drawn over sample text or columns of numbers.]
Figure 1. Examples of Gesture
Scoping is the act of selecting a set of objects to be acted upon. A scope is a functional element, not to be confused with a surface interaction technique such as picking or menu selection. Scopes may be specified at the surface level by describing the objects to comprise the set (e.g. all objects having a given property), by using the external names of the objects, or by pointing, directly with the finger or stylus, or indirectly with a mouse and cursor. Scoping in a gestural interface consists mainly of markings that enclose displayed objects, markings that traverse displayed objects, and markings that indicate the spatial extent of a set of displayed objects. Some examples of gestural scoping are shown in 1a-d above.
Target indication
Target indication principally takes two forms: pointing with the stylus at the location of interest, and entering a gesture such as a caret or arrow. In some cases, an operation symbol itself may indicate a target position. Examples 2a-c above show target indication in conjunction with other functional elements.
Operation specifications may be explicit gestures, or they may be implied by the group of related gestures. In 2a, the combination of the arrowhead and the handwritten letter i implies the insertion of the letter in the text. Explicit operation specifications are often symbols (3c-d), or gestures (3a-b).
Literals and modifiers
A literal element is a handwritten word or character to be added to text or replace text. Examples of literals are seen in 2a and 2d.
Modifiers are typically handwritten words or characters, but are interpreted as parameters for the operation, as seen in example 2c.
Issues in gestural dialogue
Syntactic compression in gesture
A single stroke may contain many functional elements. The following example shows two gesture variants, one a single stroke and the other made of three strokes.
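[Figure: the column of numbers 123 and 456 shown twice, marked once with a single-stroke gesture and once with a three-stroke gesture.]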
The capability of the user to specify a complete command with a single stroke is one of the significant efficiencies of gesture. Extracting the functional elements from the stroke is a complex process. Our present approach is to submit the gesture to a preliminary extraction of geometric features (e.g. the nearly closed curve, and the remaining arc) and treat these features as tokens, grouped by the fact of their being extracted from a single stroke. The three stroke variant also produces three tokens, temporally and spatially grouped. The primary dialogue rule that looks for a scope and a target and, optionally, an interconnecting arc then works in either case.
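One way to picture why the same rule serves both variants is sketched below. It assumes that feature extraction has already produced tokens and assigned them to a group. The Token fields, the feature names "scope", "arc", and "target", and the function match_scope_target_rule are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Token:
    kind: str       # "scope", "arc", or "target" (hypothetical feature names)
    stroke_id: int  # stroke the feature was extracted from
    group: int      # temporal/spatial group assigned upstream


def match_scope_target_rule(tokens: List[Token], group: int) -> Optional[dict]:
    """Find a scope and a target, plus an optional connecting arc, in one group.

    Stroke ids are ignored, so one stroke yielding three features and
    three strokes yielding one feature each are matched identically.
    """
    found = {}
    for tok in tokens:
        if tok.group == group and tok.kind not in found:
            found[tok.kind] = tok
    if "scope" in found and "target" in found:
        return {"scope": found["scope"], "target": found["target"],
                "arc": found.get("arc")}
    return None


single_stroke = [Token("scope", 1, 0), Token("arc", 1, 0), Token("target", 1, 0)]
three_strokes = [Token("scope", 1, 0), Token("arc", 2, 0), Token("target", 3, 0)]
```

Both token lists satisfy the rule, and a list with no arc token still matches because the arc is optional.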
Contextual effects
The use of context (i.e. the function and structure of displayed objects in the neighborhood of a gesture) is crucially important in the interpretation of gesture. This may be seen in many of the preceding examples, especially those involving insertion, where the point of insertion is not the coordinates of the caret or arc tip, but rather the logical point between two letters or two words, or in the proper row and column.
Context may also trade off against the preciseness of scoping gestures. Our subjects frequently cut off parts of displayed objects they intended to include in a scope, and marked over parts of material they did not intend to include in a scope. Yet, it is clear to the human observer what was intended by the scope.
Context also takes the form of application dependent metarules which control the interpretation of dialogue. We asked some subjects to specify the summation of two columns of numbers in a display using the scope gesture and summation symbol. Some of them used the sequence: scope, sum, scope, sum. Others, however, used the sequence: scope, scope, sum, sum. Had the temporal sequence been used to associate each scope with its corresponding operation, the result would have been wrong. The key to understanding this dialogue lies in the fact that each summation symbol was written in the same column and below the set of numbers intended to be summed. We will be giving a lot of attention to this topic over the next phase of our research.
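The column-summing observation suggests pairing by position rather than by time. The sketch below is an assumption about how such a metarule might look, not the project's rule: Scope, SumSymbol, and pair_by_position are hypothetical names, and positions are reduced to spreadsheet rows and columns.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Scope:
    name: str
    column: int      # spreadsheet column covered by the scope
    bottom_row: int  # last row enclosed by the scope


@dataclass(frozen=True)
class SumSymbol:
    name: str
    column: int
    row: int


def pair_by_position(scopes: List[Scope], sums: List[SumSymbol]) -> Dict[str, str]:
    """Pair each sum symbol with the scope in its column that ends above it.

    Input order (the temporal sequence) plays no part in the pairing.
    """
    pairs = {}
    for sym in sums:
        above = [s for s in scopes
                 if s.column == sym.column and s.bottom_row < sym.row]
        if above:
            # Prefer the scope that ends closest to the symbol.
            nearest = max(above, key=lambda s: s.bottom_row)
            pairs[sym.name] = nearest.name
    return pairs
```

With this pairing, the sequences scope, sum, scope, sum and scope, scope, sum, sum give the same result.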
Closure is an event which signals the end of a dialogue phrase. The newline or enter key is a common explicit closure signal for many keyboard based enduser interfaces. Determining closure is difficult in examples like move/copy, where the dialogues are identical except one has an additional final event. We are using an explicit closure (the user taps the pen on a closure icon) until we better understand gestural interface design.
Embedded dialogues
It frequently happens in direct manipulation interfaces that the objects to be scoped and/or target locations are not fully visible. In this case, the enduser may use object names, or may scroll the display. Experimental evidence suggests that scrolling and other more complicated dialogues may occur within another gestural dialogue. Such dialogues are typically completed before the enduser continues the original or host dialogue, hence they are called embedded dialogues.
Embedded dialogues are difficult for the computer to handle because of the decision which must be made at each user event whether this event belongs to the present dialogue or is the beginning of a new one. One rather unsatisfactory technique is to permit embedded dialogues only at specified states in the host dialogue, e.g. scrolling is only permitted at certain states in spreadsheet dialogues. Direct manipulation interfaces may use another technique, namely, confining dialogues to particular regions, e.g. the application work area, the pull-down menu, or the scroll bars. Gestural interfaces may add the use of distinguishable gestures that indicate the start of an independent dialogue.
Our present implementation uses rules that may create several simultaneous dialogue threads. It is not easy for an interface designer to specify asynchronous dialogues, because of the difficulty of testing for potential ambiguities. We must try to build tools to assist the designer in building and debugging multi-threaded dialogues.
Software architecture for gestural interfaces
The central software component of the interface is the dialogue machine, which communicates with the other subsystems:
 the input/output subsystem, which manages the input and display devices, performs inking of the stylus trace, and converts input device manipulations into canonical interface events;
 the recognizer subsystem, comprising a trainable, pattern matching recognizer for handwriting and a feature analysis recognizer for gesture [24];
 the application subsystem.
Having two recognizers complicates the dialogue design. The preliminary classification of strokes may result in a stroke being sent to the wrong recognizer, and the dialogue must be written to handle this. Some heuristics about handwriting, i.e. the left-to-right sequence, adjacency, and baselines are built into a post-classifier routine that examines strokes to see if they belong to previously entered handwriting strokes.
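A post-classifier check of this kind might look like the sketch below. It is a simplification under stated assumptions: strokes are reduced to bounding boxes, the bottom edge stands in for the baseline, and the two tolerances are hypothetical values. Marks written above the line, such as the dot of an i, would need an extra case.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    left: float
    right: float
    top: float      # smaller y is higher on the tablet (assumed)
    bottom: float   # bottom edge approximates the baseline


# Hypothetical tolerances; the paper does not state any.
MAX_GAP = 15.0
MAX_BASELINE_DRIFT = 5.0


def continues_handwriting(prev: Box, new: Box) -> bool:
    """True if `new` plausibly extends the handwriting that ended with `prev`."""
    left_to_right = new.left >= prev.left
    adjacent = new.left - prev.right <= MAX_GAP
    same_baseline = abs(new.bottom - prev.bottom) <= MAX_BASELINE_DRIFT
    return left_to_right and adjacent and same_baseline
```

A stroke that passes the check would be routed to the handwriting recognizer even if its extent alone suggested a gesture.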
Dialogue specification and processing
A command is a grouping of functional elements that is meaningful and complete, typically represented by a tree or DAG. The tree or DAG also captures the surface ordering of the tokens of the command through the notion of a traversal order. The parsing technology developed over the last 30 years is profoundly dependent on the traversal ordering. In gestural languages, however, the surface ordering and functional grouping may not coincide, and syntax directed parsing will not be ideal for gestural dialogues. This conclusion has been reached by others attempting to implement direct manipulation interfaces.
Our requirement for order-free, asynchronous dialogues has led us to a rule based system driving a parallel parser. The parser is similar in operation to parsers designed in the 1950s and 1960s to produce all possible syntactically valid parses. The design of our parser is completely different, however, in that semantic constraints are incorporated within the rules, and the parallel parse is employed to handle order freedom and multiple dialogue threads.
Events are timestamped and linked in temporal order, but the parser sees all events categorized by event types. When a set of events is present that satisfies the requirements of a dialogue rule, the temporal sequence links may be checked to see if the events form a temporal group. The timestamps may also be checked by the rules where the dialogue requires certain actions to be carried out within a prescribed time.
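The event bookkeeping described here can be sketched as follows. The Event fields and the three helper functions are hypothetical, and the temporal chain is modeled with a sequence number rather than explicit links.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Event:
    kind: str
    seq: int      # position in the temporal chain of events
    time: float   # timestamp in seconds


def index_by_kind(events: List[Event]) -> Dict[str, List[Event]]:
    """The view the parser works from: events bucketed by type, not order."""
    buckets: Dict[str, List[Event]] = {}
    for ev in events:
        buckets.setdefault(ev.kind, []).append(ev)
    return buckets


def is_temporal_group(chosen: List[Event]) -> bool:
    """True if the events are consecutive in the temporal chain, in any order."""
    seqs = sorted(ev.seq for ev in chosen)
    return all(b - a == 1 for a, b in zip(seqs, seqs[1:]))


def within_time(chosen: List[Event], limit: float) -> bool:
    """True if the events span no more than `limit` seconds."""
    times = [ev.time for ev in chosen]
    return max(times) - min(times) <= limit
```

A rule would first pick candidate events from the buckets and only then apply the grouping and timing checks it cares about.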
Where parallel parses are permitted, deletion of erroneous parses must be handled. In our case, the parses proceed as a kind of competition; the first parse to reach a stage of completion wins and eliminates the other parses. The execution of a rule constructs a graph linking the tokens input to the rule to the tokens built by the rule. This graph may be used to undo the rule, and alternative parses are eliminated by undoing the rules which created them. Parse elimination can be an expensive task. Careful design of the interface should minimize the number of cases in which parse elimination has cascading, exponential behavior.
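The undo graph can be illustrated with a small sketch. It assumes tokens are identified by name and that each rule firing records its inputs and outputs. ParseState, RuleFiring, and undo_from are hypothetical names, not taken from the system described.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class RuleFiring:
    rule: str
    inputs: List[str]    # tokens consumed by the rule
    outputs: List[str]   # tokens built by the rule


@dataclass
class ParseState:
    tokens: Set[str] = field(default_factory=set)
    firings: List[RuleFiring] = field(default_factory=list)

    def fire(self, rule: str, inputs: List[str], outputs: List[str]) -> None:
        """Record the input-to-output links and add the built tokens."""
        self.firings.append(RuleFiring(rule, inputs, outputs))
        self.tokens.update(outputs)

    def undo_from(self, token: str) -> None:
        """Undo the firing that built `token` and every firing that used it."""
        doomed = {token}
        changed = True
        while changed:
            changed = False
            for f in self.firings:
                if doomed & set(f.inputs) and not set(f.outputs) <= doomed:
                    doomed |= set(f.outputs)
                    changed = True
        self.firings = [f for f in self.firings if not doomed & set(f.outputs)]
        self.tokens -= doomed
```

Undoing a token near the leaves of the graph is cheap, while undoing one near the raw strokes removes everything derived from it, which is the cascading cost the text warns about.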
The following brief example demonstrates some of the features of the dialogue rules. The application is a spreadsheet, and the task is to sum a column of numbers in a particular cell. The dialogue scenario consists of three strokes: a scope gesture, a Greek Sigma symbol, and a closure stroke in which the user points at a closure icon displayed on the screen.
[Figure: a spreadsheet column holding 11.50 and 99.44, with the parse of the example dialogue. A hwstroke yields a symbol, which yields ssop; a geststroke yields a gesture, which yields ssrange; ssop, ssrange, and closestroke together yield sscmd.]
1 hwstroke(x) =: symbol(y):hwreco(x);
2 geststroke(w) =: gesture(z):gestreco(w);
3 symbol(y) =: ssop(v)|sspick(v):sscorr(y);
4 gesture(z) =: ssrange(u)|sspick(u):sscorr(z);
5 {ssop(v), ssrange(u)} closestroke(t) =:
sscmd(s):sscmd("\0x95 %a @SUM( %a . %a ) \n",
sscell(v), ssulrange(u), sslrrange(u));
6 sscmd(s):(rcsscmd(s) = 0):CLOSE;
The left side of a rule contains conditions for the rule execution, written as predicates with free variables. The free variables are used to specify co-occurrence constraints and to transmit the events to the right sides of the rules. The scope of a free variable is a single rule. Conditions are matched in temporal order except where curly braces are used to delimit order free groups. The left side ends with the ``=:'' symbol.
The right side of a rule consists of actions and their resulting events, written as <event-union>:<action>, where an <event-union> is either a single event or a collection of alternative events resulting from execution of the action. Actions are presently atomic procedures in the system, written in a programming language. Later, the rule language will be extended to include specification of action procedures.
Rules 1 and 2 match handwriting and gesture strokes, and invoke the respective recognizers (the actions hwreco and gestreco, which create symbol and gesture events respectively). The event is created when the action is complete, and other rules may be invoked while an action is being carried out. If the action fails, the event is not created, and the rule is undone.
In rules 3 and 4, the recognized symbol and gesture are sent to a function that analyzes their placement on the spreadsheet display, sscorr. This function may generate any one of several events, e.g. ssop, sspick, or ssrange depending on the kind of symbol or gesture and its location on the spreadsheet display.
In rule 5, the ssrange and ssop events may occur in any order, so long as they are adjacent to each other. They must be followed by the closure event described earlier. The result of this rule is to create an event token, sscmd, by invoking a function sscmd that actually formats a sequence of keystrokes to be sent to the spreadsheet. This sequence of keystrokes causes the cell pointer to be moved to the cell containing the sigma, and a formula of the form @SUM(ul.lr) to be entered into that cell, where ul and lr are cell addresses of the extent of the range denoted by the ssrange token.
Rule 6 tests for proper completion of the command execution. The CLOSE action eliminates any competing parses.
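The order-free matching in rule 5 can be sketched in a few lines. This is an illustration only: events are reduced to (type, attributes) pairs, the attribute names cell, ul, and lr are hypothetical, and the leading control code of the real keystroke string is left out.

```python
from typing import List, Optional, Tuple

Event = Tuple[str, dict]  # (event type, attributes); a simplified stand-in


def match_rule5(events: List[Event]) -> Optional[str]:
    """Match ssop and ssrange in either order, followed by closestroke.

    Returns the spreadsheet keystrokes, or None if the rule does not apply.
    """
    for i in range(len(events) - 2):
        first, second, third = events[i:i + 3]
        kinds = {first[0], second[0]}
        if kinds == {"ssop", "ssrange"} and third[0] == "closestroke":
            op = first[1] if first[0] == "ssop" else second[1]
            rng = second[1] if first[0] == "ssop" else first[1]
            # Go to the cell holding the sigma, then enter the formula.
            return f"{op['cell']} @SUM({rng['ul']}.{rng['lr']})\n"
    return None
```

Swapping the first two events leaves the result unchanged, which is the behavior the order free group is meant to provide.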
In this report, we have tried to describe some interesting results from a project that is not yet complete. There are many issues in the software design and human factors design of such interfaces for which we do not have answers. These answers will come as we press ahead toward our goal of building an operating prototype.
Our approach to the interface software is based on the technology of language translation, as were other efforts [8,17,18]. Language translation provides a structure for representing dialogue, but it has not been helpful in characterizing the initial processing of input or the presentation of graphics. Other paradigms for representing dialogues are objects [6,9,15], event driven processes [7,22], constraint systems [1,5,16], and programming language extensions [4,11,19].
Gestural interfaces have appeared now and again in particular domains, such as text editors [14], graphical editors [3], musical score editors [2], and as interfaces to text based applications [25]. Our work differs from these primarily in the breadth of our attack on the problem. Other systems do not try to combine handwriting and gesture, and this is crucial if one is to avoid the necessity of the enduser switching continually back and forth between pen and keyboard. We are fortunate to be able to take advantage of several years of work at our laboratory on handwriting recognition, and feature based recognition of symbols.
One can argue forcefully for the benefits of gestural interfaces, but such arguments depend on being able to implement such interfaces at reasonable cost and in reasonably useful technologies. It appears that the technologies are at hand now or soon will be. There are many problems yet to be solved in the implementation of software. We hope that others will be encouraged to address these problems in order to realize the tremendous potential of gestural interfaces.
This work could not have been realized without the contributions of our colleagues, particularly Chuck Tappert, Joonki Kim, Shawhan Fox, and Steve Levy, who contributed recognition algorithms and software, Cathy Wolf, whose experimental subjects gave crucial insights on the nature of gestural dialogues, John Gould, who explored the human factors of handwriting and studied users of spreadsheets, and Robin Davies, who guided the entire project. Inspirational credits are due to Alan Kay and Adele Goldberg. Bill Buxton has given many useful comments and criticisms on this endeavour, along with his strong support for its purpose.
[1] Borning, A. The Programming Language Aspects of ThingLab, A Constraint-Oriented Simulation Laboratory, ACM Transactions on Programming Languages and Systems 3, 4 (1981), 353-387.
[2] Buxton, W., Sniderman, R., Reeves, W., Patel, S. and Baecker, R. The evolution of the SSSP score editing tools, Computer Music Journal 3, 4 (1979), 14-25.
[3] Buxton, W., Fiume, E., Hill, R., Lee, A. and Woo, C. Continuous hand-gesture driven input. Proceedings of Graphics Interface'83 (1983), 191-195.
[4] Cardelli, L. and Pike, R. Squeak: a language for communicating with mice, Proceedings of SIGGRAPH'85 (San Francisco, Calif., July 22-26, 1985). In Computer Graphics 19, 3 (July 1985), 199-204.
[5] Duisberg, R. Animated Graphical Interfaces using Temporal Constraints. Proceedings CHI'86 Conference on Human Factors in Computing Systems (Boston, April 13-17, 1986), ACM, New York, 131-136.
[6] Goldberg, A. and Robson, D. Smalltalk-80: The Language and Its Implementation, Addison Wesley, 1983.
[7] Green, M. The University of Alberta user interface management system, Proceedings of SIGGRAPH'85 (San Francisco, Calif., July 22-26, 1985). In Computer Graphics 19, 3 (July 1985), 205-213.
[8] Hanau, P. and Lenorovitz, D. Prototyping and Simulation Tools for User/Computer Dialogue Design. Proceedings of SIGGRAPH'80 (Seattle, Wash., July 14-18, 1980). In Computer Graphics 14, 3 (July 1980), 271-278.
[9] Henderson, D.A. Jr. The Trillium User Interface Design Environment. In Proceedings CHI'86 Human Factors in Computing Systems (Boston, April 13-17, 1986), ACM, New York, 221-227.
[10] Hutchins, E.L., Hollan, J.D. and Norman, D.A. Direct manipulation interfaces. In User centered system design, Norman, D.A. and Draper, S.W. (eds.), Lawrence Erlbaum Associates, Hillsdale, NJ, 1986, 87-124.
[11] Kasik, D.J. A user interface management system, Proceedings of SIGGRAPH'82 (Boston, Mass., July 26-30, 1982). In Computer Graphics 16, 3 (July 1982), 99-106.
[12] Kay, A. and Goldberg, A. Personal Dynamic Media. IEEE Computer 10, 3 (1977), 31-41.
[13] Kay, A. Microelectronics and the Personal Computer. Scientific American 237, 3, (Sept. 1977), 231.
[14] Konneker, L.K. A Graphical Interaction Technique Which Uses Gestures. Proceedings of the First International Conference on Office Automation, (1984), IEEE, 51-55.
[15] Lipkie, D., Evans, Newlin, J. and Weissman, R. Star Graphics: An Object Oriented Implementation. Proceedings of SIGGRAPH'82 (Boston, Mass., July 26-30, 1982). In Computer Graphics 16, 3 (July 1982), 115-124.
[16] Nelson, G. Juno: a Constraint-Based Graphics System. Proceedings of SIGGRAPH'85 (San Francisco, Calif., July 22-26, 1985). In Computer Graphics 19, 3 (July 1985), 235-243.
[17] Olsen, D.R. Jr. and Dempsey, E. SYNGRAPH: A Graphical User Interface Generator, Proceedings of SIGGRAPH'83 (Detroit, Mich., July 25-29, 1983). In Computer Graphics 17, 3 (July 1983), 43-50.
[18] Parnas, D. On the Use of Transition Diagrams in the Design of a User Interface for an Interactive Computer System. Proceedings ACM 24th National Conference, (1969), 378-385.
[19] Pfister, G. A High Level Language Extension for Creating and Controlling Dynamic Pictures. Computer Graphics 10, 1 (Spring 1976), 1-9.
[20] Rhyne, J.R. and Wolf, C.G. Gestural Interfaces for Information Processing Applications. T. J. Watson Research Center, IBM Corporation, 1986, IBM Research Report RC-12179.
[21] Rhyne, J.R. Dialogue Management for Gestural Interfaces. IBM Research Report RC-12244 (An extended version of this article), 1986.
[22] Schulert, A.J., Rogers, G.T. and Hamilton, J.A. ADM - A dialog manager. In Proceedings CHI'85 Human Factors in Computing Systems (San Francisco, April 14-18, 1985), ACM, New York, 177-183.
[23] Shneiderman, B. Direct manipulation: a step beyond programming languages, IEEE Computer 16, 8 (1983), 57-69.
[24] Tappert, C.C., Fox, A.S., Kim, J., Levy, S.E., and Zimmerman, L.L. Handwriting recognition of transparent tablet over flat display. SID Digest of Technical Papers XVII, (May 1986), 308-312.
[25] Ward, J.R. and Blesser, B. Interactive Recognition of Handprinted Characters for Computer Input. IEEE Computer Graphics and Applications 5, 9, (Sept. 1985), 24-37.
[26] Wolf, C.G. Can People Use Gesture Commands. SIGCHI Bulletin 18, 2 (October 1986), 73-74. Also IBM Research Report RC 11867.