§3.2.1 Why are our needs different from the rest of the world?
In the DBMSs normally used in business, most of the data is in fixed formats, which makes it easier to process and store. Documents used and stored in computers are often preformatted and indexed in standard ways. Much of the success that computers have had in business rests upon this practice (as, in fact, does much of the success of standard business practices).
We believe that we need to store information in far more flexible formats than those used in standard systems. We believe a document storage system should support the following:
one document can be linked to many others;
there can be a diversity of form, context, and content;
document history and evolution are recorded;
widely varying sizes can be handled;
large numbers of documents can be easily manipulated.
These needs have been noted individually by many others. Building a system that combines these features requires both research and development that we want to do. We have developed these ideas in terms of the needs below.
§3.2.2 Medium term needs
• More natural data model for documents
To store and manipulate electronic documents today, one has to deal with file systems and/or databases with very simple data models. There are restrictions on the size of fields, types of interconnections, enforcement of naming, system cost and performance, and the size of the information space.
• We need multi-media storage
Multiple types of objects must be stored. These should include plain text, structured text, scanned image, mail message, digitized voice, digitized video, bitmaps, uninterpreted byte arrays. If the system distinguishes types, it can provide special purpose compression and data search algorithms.
• Data links
We need to manipulate links between different data elements. This is often called the hypertext model. Supporting links between text, pictures, and voice makes this more like "hypermedia". With this data model in the underlying system, the system can provide more efficient storage and retrieval, and will match users' models of stored information. This should allow support of current projects such as Notecards and Colab. It is also compatible with ideas developed for the Object Service.
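As an illustration of the link model described above, the following is a minimal sketch and not a design. It assumes typed, directed links between stored objects and an index in both directions so that "what points here" is as cheap as "what does this point to". The names Node, LinkStore, and the link types are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A stored object of some media type (hypothetical structure)."""
    node_id: str
    media_type: str          # e.g. "text", "scanned-image", "voice"
    content: bytes = b""


@dataclass
class LinkStore:
    """Keeps typed, directed links and indexes them in both directions."""
    nodes: dict = field(default_factory=dict)
    outgoing: dict = field(default_factory=dict)   # source id -> [(type, target id)]
    incoming: dict = field(default_factory=dict)   # target id -> [(type, source id)]

    def add_node(self, node):
        self.nodes[node.node_id] = node
        self.outgoing.setdefault(node.node_id, [])
        self.incoming.setdefault(node.node_id, [])

    def link(self, source, target, link_type):
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both ends of a link must be stored nodes")
        self.outgoing[source].append((link_type, target))
        self.incoming[target].append((link_type, source))

    def links_from(self, source, link_type=None):
        """Targets linked from source, optionally restricted to one link type."""
        return [t for (k, t) in self.outgoing[source] if link_type in (None, k)]

    def links_to(self, target, link_type=None):
        """Sources linking to target, optionally restricted to one link type."""
        return [s for (k, s) in self.incoming[target] if link_type in (None, k)]


# Made-up example data: a text memo, a scanned image, and a voice note.
store = LinkStore()
store.add_node(Node("memo", "text", b"See the attached scan."))
store.add_node(Node("scan", "scanned-image"))
store.add_node(Node("note", "voice"))
store.link("memo", "scan", "illustrates")
store.link("note", "memo", "annotates")
```

Because the media type is carried on the node and not on the link, the same link operations apply to text, pictures, and voice alike.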
• Versions and alternatives
Documents by their very nature evolve. As they evolve, new versions of them are created. In addition, alternatives to a document may be created. Version and alternative trees must be supported in the document base system.
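One way to picture version and alternative trees is sketched below. It assumes that every version records the version it was derived from, so that a version with several children has alternatives. The class and method names are hypothetical and the integer version ids are an arbitrary choice.

```python
class VersionTree:
    """Each version records its parent; siblings under one parent are alternatives."""

    def __init__(self, root_content):
        self.parent = {1: None}           # version id -> parent version id
        self.content = {1: root_content}  # version id -> stored content
        self.next_id = 2

    def derive(self, parent_id, content):
        """Create a new version derived from parent_id and return its id."""
        if parent_id not in self.parent:
            raise KeyError(parent_id)
        vid = self.next_id
        self.next_id += 1
        self.parent[vid] = parent_id
        self.content[vid] = content
        return vid

    def history(self, vid):
        """Return version ids from the root down to vid."""
        path = []
        while vid is not None:
            path.append(vid)
            vid = self.parent[vid]
        return path[::-1]

    def alternatives(self, vid):
        """Return the other versions derived from the same parent as vid."""
        p = self.parent[vid]
        if p is None:
            return []
        return sorted(v for v, q in self.parent.items() if q == p and v != vid)


# Made-up example: version 2 is revised twice, giving two alternatives.
tree = VersionTree("draft")
v2 = tree.derive(1, "draft, revised")
v3 = tree.derive(v2, "revised, long form")
v4 = tree.derive(v2, "revised, short form")
```

Here tree.history(v4) walks back to the root, and tree.alternatives(v3) finds the sibling derived from the same parent.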
• Naming, attributes, and indexing
We want to support multiple access styles in the document base. By allowing users to give documents names, we can support standard file server needs. By allowing attributes and keywords on documents, with appropriate indices, we can support simple rapid access by document content or history.
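The two access styles can be sketched together as follows. This is an assumption-laden illustration: unique names give file-server style lookup, and an inverted index over attribute values and keywords gives access by content or history. Catalog and its methods are hypothetical names, and the sample entries are made up.

```python
from collections import defaultdict


class Catalog:
    def __init__(self):
        self.by_name = {}                  # name -> document id
        self.by_attr = defaultdict(set)    # (attribute, value) -> document ids

    def register(self, doc_id, name, keywords=(), **attrs):
        """Record a unique name plus any keywords and attributes for a document."""
        if name in self.by_name:
            raise ValueError(f"name already in use: {name}")
        self.by_name[name] = doc_id
        for word in keywords:
            self.by_attr[("keyword", word)].add(doc_id)
        for key, value in attrs.items():
            self.by_attr[(key, value)].add(doc_id)

    def lookup(self, name):
        """File-server style access: exact name to document id."""
        return self.by_name[name]

    def find(self, **attrs):
        """Return ids of documents matching every given attribute value."""
        sets = [self.by_attr.get((k, v), set()) for k, v in attrs.items()]
        return set.intersection(*sets) if sets else set()


catalog = Catalog()
catalog.register(1, "/reports/docbase", keywords=("storage",), kind="report")
catalog.register(2, "/memos/jukebox", keywords=("storage", "optical"), kind="memo")
```

A query such as catalog.find(keyword="storage", kind="memo") intersects two index entries without scanning the documents themselves.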
• Multi-user concurrent access
Documents must be sharable. Several clients must be able to access the same document simultaneously.
• Distributed, remote servers
Documents will be stored on servers found on the internet. Servers must be able to exchange and share data and to support different protocols for users. The server must support transactions. Locking at the granularity of at least a document is required. Very large objects require page-level access (or access to document fragments).
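To make document-granularity locking concrete, here is a minimal sketch of a lock table with shared and exclusive locks keyed by document id. It is an assumption for illustration only: it is non-blocking, runs in a single process, and says nothing about transactions or distribution. LockTable and its methods are hypothetical names.

```python
class LockTable:
    """Non-blocking shared/exclusive locks, one entry per document id."""

    def __init__(self):
        self.readers = {}    # doc id -> set of clients holding a shared lock
        self.writer = {}     # doc id -> client holding the exclusive lock

    def acquire(self, client, doc_id, exclusive=False):
        """Try to take a lock; return True on success, False on conflict."""
        holder = self.writer.get(doc_id)
        if holder is not None and holder != client:
            return False                      # someone else is writing
        others = self.readers.get(doc_id, set()) - {client}
        if exclusive:
            if others:
                return False                  # someone else is reading
            self.writer[doc_id] = client
        else:
            self.readers.setdefault(doc_id, set()).add(client)
        return True

    def release(self, client, doc_id):
        """Drop whatever locks the client holds on the document."""
        if self.writer.get(doc_id) == client:
            del self.writer[doc_id]
        self.readers.get(doc_id, set()).discard(client)
```

Several clients may hold shared locks on one document at once, while an exclusive lock is granted only when no other client holds any lock on it. Page-level access would use the same table with (document id, page number) as the key.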
• Alerters and triggers
Alerters are a simple form of triggers. Some class of events triggers the sending of a "message" to clients. Events can include insert, delete, read, lock, or update of some class of objects. Alerters provide a "hook" for other systems (e.g., Notecards or Colab). Alerters can be used to assist cache and screen maintenance on workstations, and to trigger some service at document creation (e.g., document recognition).
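The alerter idea can be sketched as a registry of callbacks keyed by event and object class. This is only an illustration under assumptions: in the real service the "message" would travel over the network to a client, whereas here it is a local function call. AlerterRegistry and the message fields are hypothetical.

```python
from collections import defaultdict


class AlerterRegistry:
    def __init__(self):
        self.subscribers = defaultdict(list)   # (event, object class) -> callbacks

    def subscribe(self, event, object_class, callback):
        """Register interest in one kind of event on one class of objects."""
        self.subscribers[(event, object_class)].append(callback)

    def signal(self, event, object_class, object_id):
        """Send a message to every subscriber; return how many were notified."""
        message = {"event": event, "class": object_class, "id": object_id}
        callbacks = self.subscribers.get((event, object_class), [])
        for callback in callbacks:
            callback(message)
        return len(callbacks)


# Example use: a workstation cache marks a document stale when it is updated.
registry = AlerterRegistry()
stale = set()
registry.subscribe("update", "text", lambda message: stale.add(message["id"]))
registry.signal("update", "text", "doc-17")   # notifies the cache
registry.signal("read", "text", "doc-18")     # no subscriber for reads
```

The cache-maintenance use mentioned above corresponds to the subscriber that records stale document ids; a document recognition service would subscribe to insert events in the same way.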
• Large capacity
The requirements for digital on-line storage in this time frame are very large. We want to store very large files (e.g., images and video sequences). So as not to be limited in the next few years, we should probably plan on increasing our available storage by at least a factor of 10. This requires between 500 and 1000 gigabytes.
A storage hierarchy is required to get acceptable performance with reasonable cost. Not all of the documents must be "immediately" accessible. Delays of seconds are acceptable for infrequently referenced or archived documents. Archival storage is required to keep all relevant versions of objects. Optical media are an appropriate bottom end.
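The behavior we have in mind for the hierarchy can be sketched with two levels. The sketch assumes a small fast level (standing in for magnetic disk) managed in least-recently-used order over an archive level (standing in for optical media) that keeps everything. The write-through policy, the LRU choice, and the name TieredStore are assumptions made for illustration.

```python
from collections import OrderedDict


class TieredStore:
    def __init__(self, fast_capacity):
        self.fast = OrderedDict()   # fast level, kept in least-recently-used order
        self.archive = {}           # archive level, holds every document
        self.fast_capacity = fast_capacity

    def put(self, doc_id, data):
        """Write through to the archive and keep a copy in the fast level."""
        self.archive[doc_id] = data
        self._promote(doc_id, data)

    def get(self, doc_id):
        """Return (data, level), where level names the tier that served the read."""
        if doc_id in self.fast:
            self.fast.move_to_end(doc_id)
            return self.fast[doc_id], "fast"
        data = self.archive[doc_id]      # the slow path: seconds, not milliseconds
        self._promote(doc_id, data)
        return data, "archive"

    def _promote(self, doc_id, data):
        """Place a document in the fast level, evicting the least recently used."""
        self.fast[doc_id] = data
        self.fast.move_to_end(doc_id)
        while len(self.fast) > self.fast_capacity:
            self.fast.popitem(last=False)
```

Frequently referenced documents stay in the fast level, while an infrequently referenced one is fetched from the archive on first use and then served quickly until it is evicted again.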
• Data compression
Data compression is a useful adjunct for the system. For many classes of document (e.g., scanned images), data compression radically changes the amount of storage available for a fixed cost (an order of magnitude or more).
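A small experiment shows the kind of effect we mean. It uses Python's general-purpose zlib module, not a special-purpose image coder, and the "page" is synthetic, made-up data that is far more regular than a real scan. The ratio it yields therefore illustrates the mechanism only and is not a measurement.

```python
import zlib


def synthetic_page(rows=1000, row_bytes=200):
    """Build a mostly white bilevel bitmap with a few black bars (made-up data)."""
    white = bytes(row_bytes)                          # an all-white row
    ruled = bytes(20) + b"\xff" * 160 + bytes(20)     # a row with a black bar
    return b"".join(ruled if r % 50 == 0 else white for r in range(rows))


raw = synthetic_page()
packed = zlib.compress(raw, 9)        # 9 selects the highest compression level
ratio = len(raw) / len(packed)        # bytes stored without vs. with compression

# Compression must be lossless for archival use: check the round trip.
restored = zlib.decompress(packed)
```

If the system knows an object's type, it can pick a coder suited to that type where this sketch uses one general method for everything.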
• High performance
While we are unable to give exact numbers, we must have a system that allows easy experimentation with information retrieval strategies. Just as the Dorados, when they came on board, were higher-performance personal machines than anyone else had available, the document base server must initially overwhelm the needs of the problem.
• Building a Service
To support a service, the system must be robust, available, and replicable. It must provide administrative functions such as debugging, backup, historical logging, and monitoring. Security and access control must be provided. No server should be isolated. It must cooperate with other servers and foreign data systems.
§3.2.3 Medium term options
• Currently available systems
Conventional database management systems such as Oracle and RTI Ingres have a number of important properties:
they are robust (they recover from CPU and media failure);
have highly available data (using data replication);
and present the client an understandable data model.
However, these relational systems will not meet our document storage and retrieval needs because of the following limitations:
limits on the size of fields
a data model that does not easily support hypertext
poor extensibility of types
no support for hierarchical data models
Many desirable document operations, such as finding the transitive closure of a bibliography, will inevitably take too long for users.
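To make the bibliography example concrete: the transitive closure of a document's bibliography is the set of everything it cites, directly or through other documents. The sketch below computes it with a breadth-first traversal over a made-up citation table. The function name and the data are hypothetical.

```python
from collections import deque


def citation_closure(cites, start):
    """Return every document reachable from start by following citations."""
    seen = set()
    queue = deque(cites.get(start, ()))
    while queue:
        doc = queue.popleft()
        if doc in seen:
            continue                      # already visited; also guards against cycles
        seen.add(doc)
        queue.extend(cites.get(doc, ()))  # follow this document's own bibliography
    return seen


# Made-up citation table: document -> documents it cites.
cites = {
    "A": ["B", "C"],
    "B": ["C", "D"],
    "C": [],
    "D": ["A"],
}
closure_of_a = citation_closure(cites, "A")
```

With links stored as part of the data model, each step of this loop is a direct lookup. In a relational system without support for such traversals, each further level of citation typically costs another join or another round of queries.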
• External RDBMS systems being developed
We considered whether it would be useful to try to set up a joint project with other organizations outside of Xerox. Before we actually start our own project we will pursue each of these options somewhat further. The problems that we are worried about are:
Starburst is an IBM ARC project. It is very unlikely that they would agree to joint research.
Postgres (Berkeley) and Exodus (Wisconsin) both are university projects. Hence there is some hope that they might be willing to cooperate. However, both are RDBMS projects. Their data model does not match all that well with what we see as the requirements.
A few databases have "long fields" or images as primitive data types. Wang, Sybase, and Nixdorf are examples (we think). This solves the size problems but not the data model and other problems. However, it is difficult to get access to sources and to keep up with their system updates.
• Filenet, Cygnet, Access ...
Optical disk servers are (or shortly will be) sold by some companies. Although the jukebox and optical disk hardware are interesting, the software layer they sell does not satisfy most of our needs.
Purchase of a full system, such as from Filenet, is not an option. The scanners, printers, and workstations may not be those that integrate with Xerox's long term product plans.
Jukeboxes and optical disk hardware are not commodity items. Other technologies may make them obsolete in a few years. Service, lifetime of product, and maintenance are questionable. We think it will probably be the right thing to get one of these as a bottom end for our project, but for the long term there is some risk in dealing with startup companies. OEM prices for a Filenet jukebox are about $160K, while Cygnet is somewhat cheaper at $115K, though the features are somewhat different.
• Expand the Object Service Project
The Object Service has different goals and time scale from those presented here; the Object Service is exploring a higher level execution based model of storage, and the utility of persistent processes. It also is exploring a seamless integration of long term storage with the Smalltalk programming system. Rather than trying to combine the current needs for a document base with the long term goals of the Object Service project, it seems more appropriate in the medium term to have separate projects. The server discussed here may be able to act as a storage backend for the Object Service.
• Build a system that meets our document storage needs
This is what we recommend. The server would implement the data model developed jointly between CSL and ISL over the last few months. For large volume storage, the server would use a jukebox optical disk unit directly connected to the server, a network server with a jukebox optical disk, or a relational database system in conjunction with a jukebox or a networked jukebox. Hagmann and Kent have agreed to act as coordinators (leaders?) of this new project. A choice must be made whether to implement the service in their natural language (Cedar) or to explore other options. The viability of Cedar depends on having a strategy for porting it to new hardware, and on the availability of that hardware. To make this system usable on our Dorados, we will need 10 Mbit/s Ethernet boards on them. This is another source of risk as well as a cost item.
To build this service requires:
the purchase of a jukebox optical disk for archival storage ($120K to $200K).
possibly the purchase of index server software, an RDBMS system, or file server software ($0 to $50K).
the purchase of computational and communications platforms. These include computers with large memories for the servers, several gigabytes of magnetic disk storage (5 gigabytes cost $40K), and 10 Mbit/s Ethernet communication gear installed on servers and clients and/or other communication media (total costs for this part will be determined by decisions that have not yet been taken).
building a group of 5 to 7 qualified staff to implement the service that satisfies the medium term document storage needs detailed in this document.
The server would act as a front end, provide the data model, and manage the short, medium, and long term storage. It would most likely be written in Cedar and initially run on Dorados. However, it must be portable and must move to the server architecture that will result from the "Computational Base" committee.
The staffing of the project is critical to any timetable we might propose. However, to give an idea of phasing, some sort of milestones were requested by the OCM. With reservations, we propose the following milestones, measured from "start of coding":
9 months: large capacity file server speaking a "standard" protocol (NFS?) with backup
18 months: add hypertext data model with alerters
24 months: add versions, alternatives, compression, and other features that will become evident by that time.
This conclusion rests, in part, on the assumptions we have made. To be complete, we should understand the implications of our assumptions and validate our assumptions. Unfortunately, time and manpower have not allowed us to be as complete as we would like to be. We have relied principally on our own experiences and understanding of external research and events. It may be appropriate to have our conclusion reviewed by an outside consultant.