Querying Relational Databases with Null Values Introduction Database users, designers, and researchers have noted the need for an explicit representation for missing information. The universal relation, for example, requires the use of null values to represent inapplicable information. Other applications may need markers for other types of partial information: data not yet known, data unknowable, inclusion of data not permitted. A systematic methodology is needed for manipulating each type of null value. Unfortunately, a straightforward approach to the use of null values in queries leads to various anomalies. For example, the answer to the query RETRIEVE STUDENT WHERE GPA < 3.0 OR GPA >= 3.0 may not be all of the STUDENT relation. Similarly, R [AB] JJ R [BC] may not give us all of R, in a case where intuition would suggest otherwise. Updates are even more difficult than queries. In this paper we will not examine the null value manipulation problems that are unique to updates. Possible Goals for Relational Operator Definitions There are numerous conflicting properties which one might wish the relational operators (projection, selection, union, difference, cartesian product?) to satisfy. Intuitively correct answers. The ultimate test of any null value handling scheme is whether the users judge the answers to be correct. Unfortunately, several approaches are equally appealing to intuition. The choice of a method will ultimately depend on the desired semantics of the application. Not too slow. Some mechanisms guaranteed to produce correct answers would be exceptionally slow. Obey the set theoretic rules. One might wish the relational operators to obey the usual set theory axioms; for example, A IN B = A - (A - B). Usual results in the absence of null values. When the database instantiation happens to contain complete information, we might like the results of our queries to be the same as though the query language implementation contained no special provisions for null values. Truth functionality. This property of two valued logic may be hard to maintain in a system with null values. Null completion. Interesting properties appear when we let our database include, for each tuple t, all tuples such that for each attribute A of t, either t*[A] = t[A] or t*[A] = null. Types of Null Values Meaning of the null value. The most common uses of null values are to represent data that is unknown and to represent data that is inapplicable. Most work has been on schemes incorporating one or the other or both of these meanings. One author used nulls to mean that no information whatsoever was available; i.e., it was even unknown as to whether the attribute was inapplicable or not.. The results of these assumptions about the meaning of the null value will be explored in detail below. Ranges, subsets, and arbitrary predicates. When the value of an attribute is unknown for a particular entity, it may be that we do have partial information about that attribute. For example, we may know that a student is between twenty and thirty years old, or that he is either Canadian or American, without knowing which is true. We may wish to record more complicated information; for example, we may not know whether he is a PhD candidate yet, or his monthly stipend, but we may attach a predicate to the stipend field that says, if the student has been admitted to candidacy, then his stipend is $750 a month; otherwise it is $700 a month. While such assertions and restrictions greatly increase the exppressive power of the database, they also complicate the query answering process. The required reasoning processes become sufficiently complex to begin to draw on the accumulated skills of artificial intelligence researchers. Answering queries in this arena may be as difficult as solving the faamiliar puzzle about which nationality lives in which house and drinks what sort of drink and keeps what animal for a pet. In this paper we will focus on more traditional problems, at the loss of this expressive power which is our ultimate goal. Marked and unmarked. It may be the case that two tuples exist, both of which have a missing value for some attribute A; yet it may be known that the tuples must agree on A. To be able to tell whether two null values are equal, we must uniquely identify each distinct occurrence of a null value. This approach is useful in resolving join anomalies as discussed below. A Three Valued Logic Approach As the standard by which other approaches may be studied, let us examine a system fashioned after that proposed by Codd [197?]. Assume that null means unknown at present, and that nulls are unmarked. Then we may define the following system of three valued logic: AND T F null T T F null F F F F null null F null OR T F null T T T T F T F null null T null null NOT T F F T null null A simple selection formula such as (GPA > 3.0) can then be evaluated for a given tuple t by substituting in t.GPA. Non-null values will be handled in the usual way; the result of evaluating A op k, k a constant, will be null if A is null unless k is also null. More complicated selection formuli are evalueated by applying this rule to each clause separately, then combining the clauses according to the tables above. This approach is known as the null substitution proinciple, and may be stated more formally as [Codd]: An expression evaluates to null if and only if both of the following apply. a. Each occurrence of null in the expression can be replaced by a nonnull values (possibly a distincct one for every occurrence) so as to yuield the value T for the expression. b. Each occurrence of null in the expression can be replaced by a nonnull values (possibly a distincct one for every occurrence) so as to yuield the value F for the expression. The use of a different nonnull value in each occurrence makes this approach truth-functional: The truth value of a clause can be determined by separate consideration of each of its parts. Unfortunately, this means that a tautology such as (p and not p) or (p = p) will evaluate to null in the precense of null values, rather than to T; not every tautology of two valued logic is a tautology in three valued logic, although the converse is true. Our solution will lie in the use of a more sophisticated, non truth functional tuple substitution approach. A tuple belongs to the true-result of the query if substitution of the tuple into the suelection clause yields the value T; if the tuple evaluates to null, then it belongs to the maybe-result. The relational projection operator will be the usual one. Union? Joins are usually defined as a selection operation on the cartesian product of two relations. This definition gives us a true-result for relations with null values. In addition, Codd includes an "outer" join, whereby in computing the join of R and S, for each tuple in R that does not join with any tuples in S, that tuple is still inluded in the join by joining it with an all-null tuple from S. This is helpful if we are trying to think in terms of a universal relation. The join may give unexpected results. Consider a relation R: STUDENT COURSE PROFESSOR s1 null Papadimitriou If we join together the projection of R onto STUDENT and COURSE and the projection of R onto COURSE and PROFESSOR, we will get the empty relation. This happens because null is unmarked in this relation, and the query processor has no way to recognize that the two occurrences of null must be one and the same. To handle the remaining operators of the relational calculus, it suffices to define union and membership.??? Because nulls are all considered to be distinct, this formulation violates some of the usual tenets of set theory; for example, it is no longer the case that for a relation R, necessarily R = R, and necessarily R is a subset of the union of R and S. We might wish [Grant] to add a maybe-result for set inclusion, so that given STUDENTS PROFESSORS null null winslett winslett rathmann we will say that STUDENTS is definitely a subset of PROFESSORS, and PROFESSORS may be a subset of STUDENTS. Types of Null Values Meaning of the null value. The most common uses of null values are to represent data that is unknown and to represent data that is inapplicable. Most work has been on schemes incorporating one or the other or both of these meanings. One author used nulls to mean that no information whatsoever was available; i.e., it was even unknown as to whether the attribute was inapplicable or not.. The results of these assumptions about the meaning of the null value will be explored in detail below. Types of Null Values Meaning of the null value. The most common uses of null values are to represent data that is unknown and to represent data that is inapplicable. Most work has been on schemes incorporating one or the other or both of these meanings. One author used nulls to mean that no information whatsoever was available; i.e., it was even unknown as to whether the attribute was inapplicable or not.. The results of these assumptions about the meaning of the null value will be explored in detail below. Ê˜J˜.head˜Ibody˜ÅL˜ÕL˜‘˜3L˜¢LšÏiœ˜ªLšœU˜aLšœr˜ŽLš+œà˜‹LšœZ˜mLšœ©˜¸L˜—˜LšœÖ˜ïLš*œß˜‰L˜ÞLšœà˜ó—˜LšœœœÚ˜„LšœÅ˜ÅL˜¾L˜kLšœÃ˜ÃL˜9L˜L˜ÛL˜§L˜lL˜ù—˜L˜ïL˜—˜L˜ïL˜———…—$%