Dialogue Management for Gestural Interfaces

Jim Rhyne
T. J. Watson Research Center, IBM Corporation
P. O. Box 218, Yorktown Heights NY 10598

Abstract

Gestural interfaces are electronic analogues to pencil and paper. Since the effectiveness of such interfaces depends heavily on enduser familiarity with pencil markup of printed documents, the interface must conform to the user's behavior and not rely on educating the enduser. Spatial relationships among the gestural forms partially determine the syntactic interpretation of gestures, along with information about the context in the neighborhood of the gesture. Temporal grouping of gestural forms is more important than their temporal sequence. These characteristics suggest a form of dialogue recognition in which rules do not specify the temporal ordering of forms, and in which multiple parses are carried out in parallel.

Introduction

Gestural interfaces are members of the family of direct manipulation interfaces [10,23]. They employ a stylus on a tablet, and are capable of recognizing meaningful configurations of strokes, including handwritten text, pointing, and other stroke configurations which we will term gestures. The ideal input-output medium for gestural interfaces is a thin, flat display on which the stylus is used directly. Ideally, the display/tablet package would be the size and weight of a large book and could be carried around by the enduser as an electronic clipboard. Alan Kay proposed a similar notion about ten years ago, which was called the DYNABOOK [12,13]. Kay and Goldberg also gave some consideration to the use of gesture and handwriting recognition in this project, but their interest in the development of the Smalltalk programming language led them away from further work on gestural interfaces (A. Goldberg, personal communication, December 1986).

Gestural interfaces have a number of potential advantages:

· A single gesture can be equivalent to many keystrokes and mouse actions;
· The package can be held in one hand and operated with the other, permitting its use in shops, warehouses and manufacturing facilities, or where the other hand must be free to use a telephone, calculator, etc.;
· The interface is silent, facilitating its use in group meetings.

And a couple of potential disadvantages:

· Handwriting is 2-5 times slower than keyboard entry for text;
· Gestural interfaces will be more expensive than keyboard/mouse interfaces.

We have no data to predict the tradeoff of advantages and disadvantages. Our goal is to understand the human factors of gestural interfaces, the methods and costs of constructing them, and the aspects of the input and display technology which limit their use [20]. Several pilot human factors experiments have been done using pencil and paper and simulated interfaces to help us understand the issues. In addition, these experiments have been a valuable source of gestural examples, which we need to determine the functions our interface software must perform [26]. This report addresses the software architecture of gestural interfaces as we presently understand it [21].

Dialogue Structure in Gestural Interfaces

Dialogue structure is best described at two levels: the functional structure and the stroke (surface) structure. The functional structure defines the functional roles played by gestures. The stroke structure characterizes the types and grouping of strokes that form a gesture. The task of the dialogue manager is to transform the stroke structure into the functional structure.
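As a rough, purely illustrative picture of this transformation (the type and function names below are hypothetical, not those of our implementation), the stroke structure is a sequence of timestamped strokes and the functional structure is a collection of functional elements:

from dataclasses import dataclass
from typing import Any, List, Tuple

@dataclass
class Stroke:
    # Timestamped samples from pen-down to pen-up: (x, y, time).
    points: List[Tuple[float, float, float]]

@dataclass
class FunctionalElement:
    # Functional roles described in the next section:
    # operation, scope, target, literal, or modifier.
    role: str
    value: Any

def dialogue_manager(strokes: List[Stroke]) -> List[FunctionalElement]:
    """Transform the stroke (surface) structure into the functional structure.
    A placeholder: the real transformation involves recognition, grouping,
    and contextual interpretation, described in the remainder of this paper."""
    raise NotImplementedError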
The fundamental element of a gesture is a stroke, i.e. a timestamped series of positions beginning with contact of the stylus or finger with the input surface and ending when contact is broken. Strokes are initially categorized by their areal extent as pointing strokes, handwriting strokes, or gestural strokes. This classification only gives clues to the stroke's function. A pointing stroke may really be the dot of an i, or a decimal point in a number. The functional role of a stroke can only be determined by extraction of its geometric features, examination of its potential role in the rest of the gestural dialogue, and investigation of the display objects with which it may be correlated.

Functional Elements of Gesture

The functional elements correspond to those found in command languages: operations, scopes, targets, literals, and modifiers. The following examples illustrate the functions associated with strokes in a gestural interface.

[Figure 1. Examples of Gesture (examples 1a-1d, 2a-2d, and 3a-3d, referenced in the text below).]

Scoping

Scoping is the act of selecting a set of objects to be acted upon. A scope is a functional element, not to be confused with a surface interaction technique such as picking or menu selection. Scopes may be specified at the surface level by describing the objects that comprise the set (e.g. all objects having a given property), by using the external names of the objects, or by pointing, directly with the finger or stylus, or indirectly with a mouse and cursor. Scoping in a gestural interface consists mainly of markings that enclose displayed objects, markings that traverse displayed objects, and markings that indicate the spatial extent of a set of displayed objects. Some examples of gestural scoping are shown in 1a-d of Figure 1.

Target indication

Target indication principally takes two forms: pointing with the stylus at the location of interest, and entering a gesture such as a caret or arrow. In some cases, an operation symbol itself may indicate a target position. Examples 2a-c show target indication in conjunction with other functional elements.

Operations

Operation specifications may be explicit gestures, or they may be implied by a group of related gestures. In 2a, the combination of the arrowhead and the handwritten letter i implies the insertion of the letter into the text. Explicit operation specifications are often symbols (3c-d) or gestures (3a-b).

Literals and modifiers

A literal element is a handwritten word or character to be added to text or to replace text. Examples of literals are seen in 2a and 2d. Modifiers are typically handwritten words or characters, but are interpreted as parameters of the operation, as seen in example 2c.

Issues in gestural dialogue

Syntactic compression in gesture

A single stroke may contain many functional elements. The following example shows two gesture variants, one a single stroke and the other made of three strokes.

[Illustration: the same gesture (a scope connected to a target by an arc) drawn once as a single stroke and once as three strokes.]

The capability of the user to specify a complete command with a single stroke is one of the significant efficiencies of gesture. Extracting the functional elements from the stroke is a complex process.
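Before turning to our extraction approach, here is how example 2a might look once its functional elements have been extracted. This is an illustrative sketch only; the representation and the coordinate values are invented, not those of our implementation.

# Example 2a from Figure 1: a caret (arrowhead) between two characters plus a
# handwritten letter "i".  The caret supplies the target, the "i" is a literal,
# and the insert operation is implied by the grouping rather than drawn.
# Coordinates are arbitrary placeholder values.
example_2a = {
    "operation": "insert",        # implied, not an explicit symbol or gesture
    "scope":     None,            # no set of objects is selected
    "target":    (312.0, 88.5),   # position indicated by the caret
    "literal":   "i",             # handwritten text to be added
    "modifier":  None,            # no parameters for the operation
}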
Our present approach is to submit the gesture to a preliminary extraction of geometric features (e.g. the nearly closed curve and the remaining arc) and to treat these features as tokens, grouped by the fact of their being extracted from a single stroke. The three-stroke variant also produces three tokens, temporally and spatially grouped. The primary dialogue rule, which looks for a scope, a target, and, optionally, an interconnecting arc, then works in either case.

Contextual effects

The use of context (i.e. the function and structure of displayed objects in the neighborhood of a gesture) is crucially important in the interpretation of gesture. This may be seen in many of the preceding examples, especially those involving insertion, where the point of insertion is not the coordinates of the caret or arc tip, but rather the logical point between two letters or two words, or in the proper row and column.

Context may also trade off against the preciseness of scoping gestures. Our subjects frequently cut off parts of displayed objects they intended to include in a scope, and marked over parts of material they did not intend to include. Yet it is clear to a human observer what was intended by the scope.

Context also takes the form of application-dependent metarules which control the interpretation of dialogue. We asked some subjects to specify the summation of two columns of numbers in a display, using the scope gesture and the summation symbol. Some of them used the sequence: scope, sum, scope, sum. Others, however, used the sequence: scope, scope, sum, sum. Had the temporal sequence been used to associate each scope with its corresponding operation, the result would have been wrong. The key to understanding this dialogue lies in the fact that each summation symbol was written in the same column as, and below, the set of numbers intended to be summed. We will be giving considerable attention to this topic during the next phase of our research.

Closure

Closure is an event which signals the end of a dialogue phrase. The newline or enter key is a common explicit closure signal in many keyboard-based enduser interfaces. Determining closure is difficult in examples like move/copy, where the dialogues are identical except that one has an additional final event. We are using an explicit closure (the user taps the pen on a closure icon) until we better understand gestural interface design.

Embedded dialogues

It frequently happens in direct manipulation interfaces that the objects to be scoped and/or the target locations are not fully visible. In this case, the enduser may use object names, or may scroll the display. Experimental evidence suggests that scrolling and other more complicated dialogues may occur within another gestural dialogue. Such dialogues are typically completed before the enduser continues the original, or host, dialogue; hence they are called embedded dialogues.

Embedded dialogues are difficult for the computer to handle because, at each user event, a decision must be made whether the event belongs to the present dialogue or begins a new one. One rather unsatisfactory technique is to permit embedded dialogues only at specified states in the host dialogue; e.g. scrolling is only permitted at certain states in spreadsheet dialogues. Direct manipulation interfaces may use another technique, namely confining dialogues to particular regions, e.g. the application work area, the pull-down menu, or the scroll bars.
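As a minimal sketch of this region-confinement technique (illustrative Python; the region names, event shape, and dispatch policy are assumptions, not our design), each screen region owns a dialogue and events are routed by a hit test on where they occur:

from dataclasses import dataclass
from typing import Callable, List, Tuple

Event = dict  # e.g. {"kind": "stroke", "x": ..., "y": ...}

@dataclass
class Region:
    name: str                                   # e.g. "work area", "pull-down menu", "scroll bar"
    bounds: Tuple[float, float, float, float]   # (x0, y0, x1, y1)
    dialogue: Callable[[Event], None]           # handler for the dialogue confined to this region

    def contains(self, x: float, y: float) -> bool:
        x0, y0, x1, y1 = self.bounds
        return x0 <= x <= x1 and y0 <= y <= y1

def route(event: Event, regions: List[Region]) -> None:
    """Confine dialogues to regions: an event is handled by the dialogue
    owning the region in which it occurs."""
    for region in regions:
        if region.contains(event["x"], event["y"]):
            region.dialogue(event)
            return
    # Events outside all regions fall through to the host dialogue (not shown).

Under such a policy, an embedded dialogue can be started only where its region is exposed, which is one motivation for the gestural alternative discussed next.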
Gestural interfaces may add the use of distinguishable gestures that indicate the start of an independent dialogue. Our present implementation uses rules that may create several simultaneous dialogue threads. It is not easy for an interface designer to specify asynchronous dialogues, because of the difficulty of testing for potential ambiguities. We must try to build tools to assist the designer in building and debugging multi-threaded dialogues.

Software architecture for gestural interfaces

The central software component of the interface is the dialogue machine, which communicates with the other subsystems:

· the input/output subsystem, which manages the input and display devices, performs inking of the stylus trace, and converts input device manipulations into canonical interface events;
· the recognizer subsystem, comprising a trainable, pattern-matching recognizer for handwriting and a feature-analysis recognizer for gesture [24];
· the application subsystem.

Having two recognizers complicates the dialogue design. The preliminary classification of strokes may result in a stroke being sent to the wrong recognizer, and the dialogue must be written to handle this. Some heuristics about handwriting (left-to-right sequence, adjacency, and baselines) are built into a post-classifier routine that examines strokes to see whether they belong to previously entered handwriting strokes.

Dialogue specification and processing

A command is a grouping of functional elements that is meaningful and complete, typically represented by a tree or DAG. The tree or DAG also captures the surface ordering of the tokens of the command through the notion of a traversal order. The parsing technology developed over the last 30 years is profoundly dependent on this traversal ordering. In gestural languages, however, the surface ordering and the functional grouping may not coincide, and syntax-directed parsing will not be ideal for gestural dialogues. This conclusion has been reached by others attempting to implement direct manipulation interfaces.

Our requirement for order-free, asynchronous dialogues has led us to a rule-based system driving a parallel parser. The parser is similar in operation to parsers designed in the 1950s and 1960s to produce all possible syntactically valid parses. The design of our parser is completely different, however, in that semantic constraints are incorporated within the rules, and the parallel parse is employed to handle order freedom and multiple dialogue threads.

Events are timestamped and linked in temporal order, but the parser sees all events categorized by event type. When a set of events is present that satisfies the requirements of a dialogue rule, the temporal sequence links may be checked to see whether the events form a temporal group. The timestamps may also be checked by the rules where the dialogue requires certain actions to be carried out within a prescribed time.

Where parallel parses are permitted, the deletion of erroneous parses must be handled. In our case, the parses proceed as a kind of competition; the first parse to reach completion wins and eliminates the other parses. The execution of a rule constructs a graph linking the tokens input to the rule to the tokens built by the rule. This graph may be used to undo the rule, and alternative parses are eliminated by undoing the rules which created them. Parse elimination can be an expensive task.
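The undo bookkeeping can be pictured with a small sketch (an illustrative Python toy under assumed names, not the actual parser): each rule application records the tokens it consumed and produced, so a losing parse is eliminated by undoing its rule applications in reverse order.

from dataclasses import dataclass, field
from typing import List, Set

@dataclass
class RuleApplication:
    rule_name: str
    consumed: Set[str]   # ids of the tokens the rule matched
    produced: Set[str]   # ids of the tokens the rule built

@dataclass
class Parse:
    applications: List[RuleApplication] = field(default_factory=list)

def eliminate(parse: Parse, live_tokens: Set[str]) -> None:
    """Undo a losing parse: remove the tokens it built and release the
    tokens it consumed, in reverse order of rule application."""
    for app in reversed(parse.applications):
        live_tokens -= app.produced   # tokens built by the rule disappear
        live_tokens |= app.consumed   # tokens it matched become available again

def close(winner: Parse, competitors: List[Parse], live_tokens: Set[str]) -> None:
    """When one parse reaches completion, it wins and the others are undone."""
    for parse in competitors:
        if parse is not winner:
            eliminate(parse, live_tokens)

If a token removed this way had already been consumed by further rule applications, those must be undone as well, which is the source of the cascading cost discussed next.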
Careful design of the interface should minimize the number of cases in which parse elimination has cascading, exponential behavior.

The following brief example demonstrates some of the features of the dialogue rules. The application is a spreadsheet, and the task is to sum a column of numbers into a particular cell. The dialogue scenario consists of three strokes: a scope gesture, a Greek Sigma symbol, and a closure stroke in which the user points at a closure icon displayed on the screen.

[Figure: the example spreadsheet column (11.50, 3.87, 2.03, 99.44) with the three strokes drawn over it, and the event tokens they produce: hwstroke, geststroke, symbol, gesture, ssop, ssrange, closestroke, sscmd.]

1 hwstroke(x) =: symbol(y):hwreco(x);
2 geststroke(w) =: gesture(z):gestreco(w);
3 symbol(y) =: ssop(v)|sspick(v):sscorr(y);
4 gesture(z) =: ssrange(u)|sspick(u):sscorr(z);
5 {ssop(v), ssrange(u)} closestroke(t) =: sscmd(s):sscmd("\0x95 %a @SUM( %a . %a ) \n", sscell(v), ssulrange(u), sslrrange(u));
6 sscmd(s):(rcsscmd(s) = 0):CLOSE;

The left side of a rule contains conditions for rule execution, written as predicates with free variables. The free variables are used to specify co-occurrence constraints and to transmit events to the right side of the rule. The scope of a free variable is a single rule. Conditions are matched in temporal order except where curly braces are used to delimit order-free groups. The left side ends with the "=:" symbol. The right side of a rule consists of actions and their resulting events, written as event:action, where event is either a single event or a collection of alternative events resulting from execution of the action. Actions are presently atomic procedures in the system, written in a programming language. Later, the rule language will be extended to include the specification of action procedures.

Rules 1 and 2 match handwriting and gesture strokes, and invoke the respective recognizers (the actions hwreco and gestreco, which create symbol and gesture events respectively). The event is created when the action completes, and other rules may be invoked while an action is being carried out. If the action fails, the event is not created, and the rule is undone.

In rules 3 and 4, the recognized symbol and gesture are sent to a function, sscorr, that analyzes their placement on the spreadsheet display. This function may generate any one of several events, e.g. ssop, sspick, or ssrange, depending on the kind of symbol or gesture and its location on the spreadsheet display.

In rule 5, the ssrange and ssop events may occur in any order, so long as they are adjacent to each other. They must be followed by the closure event described earlier. The result of this rule is to create an event token, sscmd, by invoking a function sscmd that formats a sequence of keystrokes to be sent to the spreadsheet. This keystroke sequence causes the cell pointer to be moved to the cell containing the sigma, and a formula of the form @SUM(ul.lr) to be entered into that cell, where ul and lr are the upper-left and lower-right cell addresses of the range denoted by the ssrange token.

Rule 6 tests for proper completion of the command execution. The CLOSE action eliminates any competing parses.
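To make the keystroke formatting in rule 5 concrete, the sketch below shows what such an action might produce. This is a guess at the intent under assumed cell addresses and key codes, not the actual sscmd procedure.

# "\0x95" in rule 5 is reproduced here as a single byte; presumably it is the
# key code that moves the spreadsheet's cell pointer (an assumption).
GOTO = "\x95"

def format_sum_keystrokes(sigma_cell: str, range_ul: str, range_lr: str) -> str:
    r"""Mimic rule 5's format string "\0x95 %a @SUM( %a . %a ) \n": move the
    cell pointer to the cell containing the sigma, then enter a formula
    summing the scoped range.  The cell-address arguments are illustrative."""
    return f"{GOTO} {sigma_cell} @SUM( {range_ul} . {range_lr} ) \n"

# Example: the sigma was written in cell B7, below the scoped numbers in B3..B6.
keys = format_sum_keystrokes("B7", "B3", "B6")
# keys == "\x95 B7 @SUM( B3 . B6 ) \n"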
Summary

In this report, we have tried to describe some interesting results from a project that is not yet complete. There are many issues in the software design and the human factors design of such interfaces for which we do not have answers. These answers will come as we press ahead toward our goal of building an operating prototype.

Our approach to the interface software is based on the technology of language translation, as were other efforts [8,17,18]. Language translation provides a structure for representing dialogue, but it has not been helpful in characterizing the initial processing of input or the presentation of graphics. Other paradigms for representing dialogues are objects [6,9,15], event-driven processes [7,22], constraint systems [1,5,16], and programming language extensions [4,11,19]. Gestural interfaces have appeared now and again in particular domains, such as text editors [14], graphical editors [3], musical score editors [2], and as interfaces to text-based applications [25]. Our work differs from these primarily in the breadth of our attack on the problem. Other systems do not try to combine handwriting and gesture, and this combination is crucial if the enduser is not to switch continually back and forth between pen and keyboard. We are fortunate to be able to take advantage of several years of work at our laboratory on handwriting recognition and feature-based recognition of symbols.

One can argue forcefully for the benefits of gestural interfaces, but such arguments depend on being able to implement such interfaces at reasonable cost and in reasonably useful technologies. It appears that the technologies are now at hand, or soon will be. There are many problems yet to be solved in the implementation of the software. We hope that others will be encouraged to address these problems in order to realize the tremendous potential of gestural interfaces.

This work could not have been realized without the contributions of our colleagues, particularly Chuck Tappert, Joonki Kim, Shawhan Fox, and Steve Levy, who contributed recognition algorithms and software; Cathy Wolf, whose experimental subjects gave crucial insights on the nature of gestural dialogues; John Gould, who explored the human factors of handwriting and studied users of spreadsheets; and Robin Davies, who guided the entire project. Inspirational credits are due to Alan Kay and Adele Goldberg. Bill Buxton has given many useful comments and criticisms on this endeavour, along with his strong support for its purpose.

References

[1] Borning, A. The Programming Language Aspects of ThingLab, A Constraint-Oriented Simulation Laboratory. ACM Transactions on Programming Languages and Systems 3, 4 (1981), 353-387.
[2] Buxton, W., Sniderman, R., Reeves, W., Patel, S. and Baecker, R. The evolution of the SSSP score editing tools. Computer Music Journal 3, 4 (1979), 14-25.
[3] Buxton, W., Fiume, E., Hill, R., Lee, A. and Woo, C. Continuous hand-gesture driven input. Proceedings of Graphics Interface'83 (1983), 191-195.
[4] Cardelli, L. and Pike, R. Squeak: a language for communicating with mice. Proceedings of SIGGRAPH'85 (San Francisco, Calif., July 22-26, 1985). In Computer Graphics 19, 3 (July 1985), 199-204.
[5] Duisberg, R. Animated Graphical Interfaces using Temporal Constraints. Proceedings CHI'86 Conference on Human Factors in Computing Systems (Boston, April 13-17, 1986), ACM, New York, 131-136.
[6] Goldberg, A. and Robson, D. Smalltalk-80: The Language and Its Implementation. Addison-Wesley, 1983.
[7] Green, M. The University of Alberta user interface management system. Proceedings of SIGGRAPH'85 (San Francisco, Calif., July 22-26, 1985). In Computer Graphics 19, 3 (July 1985), 205-213.
[8] Hanau, P. and Lenorovitz, D. Prototyping and Simulation Tools for User/Computer Dialogue Design.
Proceedings of SIGGRAPH'80 (Seattle, Wash., July 14-18, 1980). In Computer Graphics 14, 3 (July 1980), 271-278.
[9] Henderson, D.A. Jr. The Trillium User Interface Design Environment. In Proceedings CHI'86 Human Factors in Computing Systems (Boston, April 13-17, 1986), ACM, New York, 221-227.
[10] Hutchins, E.L., Hollan, J.D. and Norman, D.A. Direct manipulation interfaces. In User Centered System Design, Norman, D.A. and Draper, S.W. (eds.), Lawrence Erlbaum Associates, Hillsdale, NJ, 1986, 87-124.
[11] Kasik, D.J. A user interface management system. Proceedings of SIGGRAPH'82 (Boston, Mass., July 26-30, 1982). In Computer Graphics 16, 3 (July 1982), 99-106.
[12] Kay, A. and Goldberg, A. Personal Dynamic Media. IEEE Computer 10, 3 (1977), 31-41.
[13] Kay, A. Microelectronics and the Personal Computer. Scientific American 237, 3 (Sept. 1977), 231.
[14] Konneker, L.K. A Graphical Interaction Technique Which Uses Gestures. Proceedings of the First International Conference on Office Automation (1984), IEEE, 51-55.
[15] Lipkie, D., Evans, Newlin, J. and Weissman, R. Star Graphics: An Object Oriented Implementation. Proceedings of SIGGRAPH'82 (Boston, Mass., July 26-30, 1982). In Computer Graphics 16, 3 (July 1982), 115-124.
[16] Nelson, G. Juno: a Constraint-Based Graphics System. Proceedings of SIGGRAPH'85 (San Francisco, Calif., July 22-26, 1985). In Computer Graphics 19, 3 (July 1985), 235-243.
[17] Olsen, D.R. Jr. and Dempsey, E. SYNGRAPH: A Graphical User Interface Generator. Proceedings of SIGGRAPH'83 (Detroit, Mich., July 25-29, 1983). In Computer Graphics 17, 3 (July 1983), 43-50.
[18] Parnas, D. On the Use of Transition Diagrams in the Design of a User Interface for an Interactive Computer System. Proceedings ACM 24th National Conference (1969), 378-385.
[19] Pfister, G. A High Level Language Extension for Creating and Controlling Dynamic Pictures. Computer Graphics 10, 1 (Spring 1976), 1-9.
[20] Rhyne, J.R. and Wolf, C.G. Gestural Interfaces for Information Processing Applications. IBM Research Report RC-12179, T. J. Watson Research Center, IBM Corporation, 1986.
[21] Rhyne, J.R. Dialogue Management for Gestural Interfaces. IBM Research Report RC-12244 (an extended version of this article), 1986.
[22] Schulert, A.J., Rogers, G.T. and Hamilton, J.A. ADM - A dialog manager. In Proceedings CHI'85 Human Factors in Computing Systems (San Francisco, April 14-18, 1985), ACM, New York, 177-183.
[23] Shneiderman, B. Direct manipulation: a step beyond programming languages. IEEE Computer 16, 8 (1983), 57-69.
[24] Tappert, C.C., Fox, A.S., Kim, J., Levy, S.E. and Zimmerman, L.L. Handwriting recognition of transparent tablet over flat display. SID Digest of Technical Papers XVII (May 1986), 308-312.
[25] Ward, J.R. and Blesser, B. Interactive Recognition of Handprinted Characters for Computer Input. IEEE Computer Graphics and Applications 5, 9 (Sept. 1985), 24-37.
[26] Wolf, C.G. Can People Use Gesture Commands? SIGCHI Bulletin 18, 2 (October 1986), 73-74. Also IBM Research Report RC 11867.