NYLXS - Free Software Events

Fri Jul 26 19:50:20 2024

EVENTS

FREE
SOFTWARE
INSTITUTE

POLITICS

JOBS

MEMBERS'
CORNER

MAILING
LIST

NYLXS Mailing Lists and Archives

NYLXS Members have a lot to say and share but we don't keep many secrets. Join the Hangout Mailing List and say your peice.

DATE 2014-11-01

LEARN

2024-07-26 | 2024-06-26 | 2024-05-26 | 2024-04-26 | 2024-03-26 | 2024-02-26 | 2024-01-26 | 2023-12-26 | 2023-11-26 | 2023-10-26 | 2023-09-26 | 2023-08-26 | 2023-07-26 | 2023-06-26 | 2023-05-26 | 2023-04-26 | 2023-03-26 | 2023-02-26 | 2023-01-26 | 2022-12-26 | 2022-11-26 | 2022-10-26 | 2022-09-26 | 2022-08-26 | 2022-07-26 | 2022-06-26 | 2022-05-26 | 2022-04-26 | 2022-03-26 | 2022-02-26 | 2022-01-26 | 2021-12-26 | 2021-11-26 | 2021-10-26 | 2021-09-26 | 2021-08-26 | 2021-07-26 | 2021-06-26 | 2021-05-26 | 2021-04-26 | 2021-03-26 | 2021-02-26 | 2021-01-26 | 2020-12-26 | 2020-11-26 | 2020-10-26 | 2020-09-26 | 2020-08-26 | 2020-07-26 | 2020-06-26 | 2020-05-26 | 2020-04-26 | 2020-03-26 | 2020-02-26 | 2020-01-26 | 2019-12-26 | 2019-11-26 | 2019-10-26 | 2019-09-26 | 2019-08-26 | 2019-07-26 | 2019-06-26 | 2019-05-26 | 2019-04-26 | 2019-03-26 | 2019-02-26 | 2019-01-26 | 2018-12-26 | 2018-11-26 | 2018-10-26 | 2018-09-26 | 2018-08-26 | 2018-07-26 | 2018-06-26 | 2018-05-26 | 2018-04-26 | 2018-03-26 | 2018-02-26 | 2018-01-26 | 2017-12-26 | 2017-11-26 | 2017-10-26 | 2017-09-26 | 2017-08-26 | 2017-07-26 | 2017-06-26 | 2017-05-26 | 2017-04-26 | 2017-03-26 | 2017-02-26 | 2017-01-26 | 2016-12-26 | 2016-11-26 | 2016-10-26 | 2016-09-26 | 2016-08-26 | 2016-07-26 | 2016-06-26 | 2016-05-26 | 2016-04-26 | 2016-03-26 | 2016-02-26 | 2016-01-26 | 2015-12-26 | 2015-11-26 | 2015-10-26 | 2015-09-26 | 2015-08-26 | 2015-07-26 | 2015-06-26 | 2015-05-26 | 2015-04-26 | 2015-03-26 | 2015-02-26 | 2015-01-26 | 2014-12-26 | 2014-11-26 | 2014-10-26

Key: Value:

Key: Value:

MESSAGE

DATE	2014-11-26
FROM	Ruben Safir
SUBJECT	Subject: [LIU Comp Sci] Problems with Normalization
From owner-learn-outgoing-at-mrbrklyn.com Wed Nov 26 01:51:56 2014 Return-Path: X-Original-To: archive-at-mrbrklyn.com Delivered-To: archive-at-mrbrklyn.com Received: by mrbrklyn.com (Postfix) id 14A93161143; Wed, 26 Nov 2014 01:51:56 -0500 (EST) Delivered-To: learn-outgoing-at-mrbrklyn.com Received: by mrbrklyn.com (Postfix, from userid 28) id EF489161157; Wed, 26 Nov 2014 01:51:55 -0500 (EST) Delivered-To: learn-at-nylxs.com Received: from mailbackend.panix.com (mailbackend.panix.com [166.84.1.89]) by mrbrklyn.com (Postfix) with ESMTP id 11125161143 for ; Wed, 26 Nov 2014 01:51:54 -0500 (EST) Received: from [10.0.0.42] (unknown [96.57.23.82]) by mailbackend.panix.com (Postfix) with ESMTP id 1246513AC1; Wed, 26 Nov 2014 01:51:52 -0500 (EST) Message-ID: <547578D7.1020604-at-panix.com> Date: Wed, 26 Nov 2014 01:53:11 -0500 From: Ruben Safir User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.1.0 MIME-Version: 1.0 To: ping-Tsai Chung , Samir Iabbassen , learn-at-nylxs.com, ping-Tsai Chung Subject: [LIU Comp Sci] Problems with Normalization Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit Sender: owner-learn-at-mrbrklyn.com Precedence: bulk Reply-To: learn-at-mrbrklyn.com http://www.bkent.net/Doc/manyform.htm NOTE that is was an iEEE publication.... William Kent, "The Many Forms of a Single Fact", Proc. IEEE COMPCON, Feb. 27-Mar. 3, 1989, San Francisco. Also HPL-SAL-88-8, Hewlett-Packard Laboratories, Oct. 21, 1988. [13 pp] The Many Forms of a Single Fact ------------------------------------------------------------------------ William Kent 1988 Or, have you heard the one about the traveling salesman's data? > 1 INTRODUCTION . . . 2 ------------------------------------------------------------------------ > 2 (SOME OF) THE DESIGN OPTIONS . . . 3 >> 2.1 When each salesman has one territory... . . . 3 >> 2.2 When each territory has one salesman... . . . 7 >> 2.3 When it's many-to-many... . . . 7 >> 2.4 When it's many-to-few... . . . 8 >> 2.5 When it's `usually' many-to-one... . . . 9 >> 2.6 When some people don't have territories... . . . 11 >> 2.7 When there are several connections... . . . 12 >> 2.8 Other cases... . . . 13 > 3 WHY IT MATTERS . . . 13 >> 3.1 Data Interchange . . . 13 >> 3.2 Data Documentation (Dictionaries) . . . 14 >> 3.3 Data Design Competence . . . 14 >> 3.4 Design Tools and Methodologies: Input . . . 14 >> 3.5 Design Tools and Methodologies: Output . . . 14 >> 3.6 Data Sharing By New Applications . . . 15 >> 3.7 Redundancy Management . . . 15 >> 3.8 Data Integration . . . 15 >> 3.9 MIS Support . . . 15 >> 3.10 High-Level Queries, Distributed Data . . . 15 >> 3.11 Entity-Relationship Theory . . . 16 >> 3.12 Conceptual Schema, Three-Schema Architectures . . . 16 > 4 CONCLUSIONS . . . 17 > 5 REFERENCES . . . 17 1 INTRODUCTION We hear much nowadays about the problems of sharing and interchanging data. Diverse representations of data items need to be reconciled, such as differing data types, field lengths, units of measure, precision, etc. Different interpretations of the meanings of certain terms, measurements, and quantities need to be reconciled. Transformations are needed between various structural organizations, such as relational, hierarchical, and network models, or various file formats, or the various implementations of lists, vectors, arrays, matrices, or different object models, and so on. There is yet another dimension of variability, an unrecognized condition that may in fact account for many problems in data sharing and interchange. A simple fact, a simple connection between two things, can be implemented in a remarkable variety of field configurations. This can occur within the confines of any one structural form, and has little to do with conversions between data representations. There is an underlying fallacy, namely the assumption that a simple binary fact (relationship or attribute) always maps simply into a pair of fields. While that is the foundation of current data design methodologies, there exist a troublesome number of exceptions in practice. Thus what we have to say also bears on the current technology of data analysis and design. We will also have something to say about the implications for data dictionary, entity-relationship modelling, and a few other points. But first, let's trot out a catalog of examples, to get a concrete feeling for the nature and extent of the situation. To show how interchangeable the options are, we will take just one fact and map it into all the configurations. We will show an incredible variety of implementations for recording the fact that certain salesmen serve certain territories, which might conceptually be represented thus: ------------ serves ------------- \| SALESMAN \|--------->\| TERRITORY \| ------------ ------------- The design options aren't all equally sensible for all facts. We may have stretched a few of them for recording salesmen's territories, but they are all plausible. Most of the options really are applicable to most facts, even to numeric data (e.g., salaries), and even to other "attributes" (e.g., birthdates). You are challenged to take any fact you choose and see how many of these formats you can apply to it. And it might be instructive to look at any database to see how many of these techniques you can find there. We limit ourselves to the design of records, as though they were in flat files. With structured data models (e.g., hierarchies, networks), there are even more design options. In a network or hierarchical database, the connection between salesmen and territories might still be expressed in one record using any of these techniques, or it might take the form of owner/member records or parent/child records in the data structure. Some examples might be redundant, and I've probably missed a few cases. That's not the point. Whether there are two dozen or three dozen alternatives, the problem's the same. Furthermore, some of the examples implement variations in the semantics. We'll get back to that point afterward. ------------------------------------------------------------------------ Many of these options can be combined, so that there are actually many more possible designs than we show here. 2 (SOME OF) THE DESIGN OPTIONS 2.1 When each salesman has one territory... [Option 1] A territory field in the salesman record. This is probably the common and expected design for this fact: --------------------------------------------- \| SLSMN-NUM \| name \| dept \| TERRITORY \| ... \| =============-------------------------------- The double line under SLSMN-NUM will be our convention for key fields. [Option 2] Unnormalized. The fact could occur in a record where neither salesman nor territory is the key, e.g., in a record about orders: ------------------------------------------------------------------- \| ordernum \| item \| quantity \| price \| SALESMAN \| TERRITORY \| ... \| ============------------------------------------------------------- [Option 3] Multi-field representation. Sometimes a single entity, in this case a territory, is represented by multiple fields. Now a simple binary fact, relating a salesman and a territory, is actually occupying four fields: --------------------------------------------------------------- \| SLSMN-NUM \| name \| dept \| COUNTRY \| REGION \| DISTRICT \| ... \| =============-------------------------------------------------- ............................... territory Note that "TERRITORY" no longer appears as a field name. That's significant if we are depending on some sort of dictionary to locate information about salesmen's territories. This situation will arise in many other options as well. [Option 4] Embedded data. If a salesman's identifier included a code for his territory (let's say salesmen never change territories), then there might not have to be a territory field: --------------------------------- \| SLSMN-NUM \| name \| dept \| ... \| =============-------------------- ..... : territory code [Option 5] Shared field. Sometimes several facts are encoded into one field, e.g., a salesman's "assignment" might consist of an embedded territory code and an embedded product code: ---------------------------------------------- \| SLSMN-NUM \| name \| dept \| ASSIGNMENT \| ... \| =============--------------------------------- ... ... : : : product code territory code [Option 6] Omitted field (data is in record type name). Salesmen's records could be partitioned into record types by territory. The territory identifier is part of the record type name: --------------------------------- WESTERN-SALESMEN: \| SLSMN-NUM \| name \| dept \| ... \| =============-------------------- --------------------------------- EASTERN-SALESMEN: \| SLSMN-NUM \| name \| dept \| ... \| =============-------------------- That option is meaningful to the extent that record names are documented, e.g., in a dictionary. [Option 7] Omitted field (data is implicit). In records maintained at territory headquarters, or at a small company with only one territory (or no territory concept) - territory is a constant. It is the same for all salesmen in the file, hence the territory is not identified anywhere in the data: --------------------------------- \| SLSMN-NUM \| name \| dept \| ... \| =============-------------------- When data is being maintained at the headquarters offices of each of the territories, a salesman's territory is implied by the geographic location of the record! This is not so far-fetched. Employee records at a given company are rarely explicit about who the employer is; bank records rarely record the identity of the bank at which each account is held, and so on. [Option 8] Omitted field (data is derivable from other data). Suppose salesmen in the same department always serve the same territory. Then the territory would be identified in the department record, not in the salesman record. The salesman's territory is derivable, based on his department: --------------------------------- \| SLSMN-NUM \| name \| dept \| ... \| =============-------------------- --------------------------------------- \| dept \| name \| mgr \| TERRITORY \| ... \| ========------------------------------- [Option 9] Default flag. Suppose that the files at territory headquarters cover all salesmen assigned to that territory, and also some salesmen outside the territory. The difference is indicated by a simple binary flag in an "LCL" field, indicating whether the salesman is "local" or not: --------------------------------------- \| SLSMN-NUM \| name \| dept \| LCL \| ... \| =============-------------------------- This kind of record provides partial information. We know the territory of all the local salesmen, but not the others. [Option 10] Multiplexed field. A given field might represent the territory of a salesman - in some record occurrences. In other occurrences, the field might serve a different purpose. Suppose we extend the previous example so that for local salesmen we keep the manager's name, but for other salesmen we keep their territories instead - in the same field: -------------------------------------------------- \| SLSMN-NUM \| name \| dept \| LCL \| MGR/TERR \| ... \| =============------------------------------------- MGR/TERR has a conditional meaning: if LCL is true, then the field contains the manager, otherwise it contains the territory. [Option 11] Parametric form. This is essentially an extension of the multiplexed form. The meaning of the field is again variable, but now another field is provided to explain its meaning: --------------------------------- \| SLSMN-NUM \| parameter \| value \| \|===========+-----------+-------\| \| xx \| TERRITORY \| Texas \| \| xx \| manager \| yy \| \| xx \| product \| toys \| --------------------------------- Note that "TERRITORY" now appears in the data, not as a field name. Will it be in the dictionary? [Option 12] Self-describing form. This in turn is a variation of the parametric form. In the parametric form, it was the meaning of the field which varied from record to record, and hence had to be indicated in another field. Sometimes other aspects of a field value can vary from record to record, and hence have to be described in another field. This can include such things as units of measure, scale factors, or field lengths for variable length fields, such as the length of the TERRITORY field: --------------------------------------------------------- \| SLSMN-NUM \| name \| dept \| TERR-LNTH \| TERRITORY \| ... \| =============-------------------------------------------- [Option 13] Redundant records. Sometimes several of these forms will be maintained, each optimized for a different purpose. Exercise: in how many ways can such options be combined? For example, a territory represented by multiple "fields" could also be embedded in the salesman number: ------------------------------------------------ \| SLSMN-NUM \| name \| dept \| ASSIGNMENT \| ... \| ===============--------------------------------- ... ... ... : : : : : district code : region code country code How many other combinations make sense? Keep this question in mind as we continue to develop more options. 2.2 When each territory has one salesman... [Option 14] There could be a territory record with a salesman field: ------------------------------------------------------ \| TERRITORY \| hq-loc \| size \| SLSMN-NUM \| name \| ... \| =============----------------------------------------- [Option 15] If it's a one-to-one relationship, then we can either put a territory field in the salesman record or a salesman field in the territory record - or both. [Option 16] If the relationship is "in perpetuity", the territory might be implicitly identified by its salesman. Thus we might speak of Smith's territory and Jones's territory. We simply would not have any distinct field for recording a salesman's territory. It's not uncommon to identify things by related attributes or entities. A common relational example about elections has just two fields, the year of the election and the winning candidate. There is in fact no distinct field in which the election itself occurs. (We are even doing it here in this paper, when we identify a sales territory by the name of the state it is in.) 2.3 When it's many-to-many... [Option 17] Intersection records. The most common format for many-to-many relationships: ------------------------------- \| TERRITORY \| SLSMN-NUM \| ... \| =========================------ The additional fields in this record might contain "attributes" of the relationship between the salesman and the territory, e.g., a quota. [Option 18] Repeating fields. If the data structure supports repeating fields, we can have territory fields in the salesman record: --------------------------------------------------------- \| SLSMN-NUM \| name \| dept \| TERRITORY \| ... \| TERRITORY \| =============-------------------------------------------- [Option 19] ...or salesman fields in the territory record: ---------------------------------------------------- \| TERRITORY \| hq-loc \| SLSMN-NUM \| ... \| SLSMN-NUM \| =============--------------------------------------- [Option 20] ...or both (redundantly). Of course, all of these options for many-to-many relationships could also be used for one-to-many relationships, but the singleness on the "one" side will not be enforced. 2.4 When it's many-to-few... These are actually many-to-many relationships. But if the number of things each entity can be related to is small or limited, then some special options are available. Suppose that each salesman can only serve a few territories. [Option 21] Fixed repeating fields. If the limit on the number of territories per salesman is, for example, three, then we can define three territory fields in the salesman record: ------------------------------------------------------ \| SLSMN-NUM \| name \| dept \| TERR-1 \| TERR-2 \| TERR-3 \| =============----------------------------------------- Here we have a fixed number of fields, in contrast to the earlier example of repeating fields. There, the number of fields in a record varied, depending on how many territories a salesman had. Here, each record has the same number of territory fields, with some possibly blank. In this design, a many-to-many relationship has been clustered with other data without violating any normal form constraint. [Option 22] If there are only a few territories altogether, then we can actually define a separate field in the salesman record for each territory, with a flag to indicate whether the salesman is assigned to that territory: ------------------------------------------------------------ \| SLSMN-NUM \| name \| dept \| NEW YORK \| CALIF \| TEXAS \| ... \| \|===========+------+------+----------+-------+-------+-----\| \| xx \| \| \| no \| yes \| yes \| \| ------------------------------------------------------------ Note that in this case the field names are actually data values. The dictionary entries for this record will reflect these "data values", but not mention "territories". [Option 23] All of a salesman's territories could be crammed into one field. The field might contain a bit-vector, with a fixed position for each territory: ------------------------------------------- \| SLSMN-NUM \| name \| dept \| TERRITORIES \| \|===========+------+------+---------------\| \| xx \| \| \| 0010001000000 \| ------------------------------------------- [Option 24] ...or the field might contain a string of unique codes for the salesman's territories, i.e., a keyword rather than positional notation: ------------------------------------------- \| SLSMN-NUM \| name \| dept \| TERRITORIES \| \|===========+------+------+---------------\| \| xx \| \| \| C,T \| ------------------------------------------- [Option 25] If a few of the territories have some special significance, we might use a combination of options. A separate field can be defined for each special territory, containing a binary flag, plus another field with a flag to indicate whether the salesman has any other territories: ------------------------------------------------------------ \| SLSMN-NUM \| name \| dept \| NEW YORK \| CALIF \| OTHER \| ... \| \|===========+------+------+----------+-------+-------+-----\| \| xx \| \| \| no \| yes \| yes \| \| ------------------------------------------------------------ [Option 26] ...and there may or may not be an intersection record to detail the "others": ------------------------- OTHER-TERRITORIES: \| SLSMN-NUM \| TERRITORY \| \|===========+-----------\| \| xx \| Texas \| ------------------------- Note that this record doesn't cover all territories or salesmen. It only has entries for salesmen assigned to other than the special territories. 2.5 When it's `usually' many-to-one... This is another variant of many-to-many relationships. It is the case where most salesmen only have one territory. The general approach here is to design the basic record as though there was a many-to-one relationship between salesmen and territories, and provide another record for the exceptions. There are several ways to identify the exceptions: [Option 27] Allow a special value in the TERRITORY field of the salesman record. One consequence is that sometimes the value in this field will not be the name of a territory: --------------------------------------------- \| SLSMN-NUM \| name \| dept \| TERRITORY \| ... \| \|===========+------+------+-----------+-----\| \| xx \| \| \| New York \| \| \| yy \| \| \| Calif \| \| \| zz \| \| \| multi \| \| --------------------------------------------- ------------------------- MULTPLE-TERRITORIES: \| SLSMN-NUM \| TERRITORY \| \|===========+-----------\| \| zz \| Calif \| \| zz \| Texas \| ------------------------- [Option 28] To avoid such "pollution" of the TERRITORY field, we could provide a separate field to flag the salesmen having multiple territories: ----------------------------------------------------- \| SLSMN-NUM \| name \| dept \| MULTI \| TERRITORY \| ... \| \|===========+------+------+-------+-----------+-----\| \| xx \| \| \| \| New York \| \| \| yy \| \| \| \| Calif \| \| \| zz \| \| \| x \| \| \| ----------------------------------------------------- ------------------------- MULTPLE-TERRITORIES: \| SLSMN-NUM \| TERRITORY \| \|===========+-----------\| \| zz \| Calif \| \| zz \| Texas \| ------------------------- [Option 29] Another alternative is to divide salesmen into two kinds, those with single territories and those with many: --------------------------------------------- SINGLE-TERR SALESMEN: \| SLSMN-NUM \| name \| dept \| TERRITORY \| ... \| =============-------------------------------- --------------------------------- MULTI-TERR SALESMEN: \| SLSMN-NUM \| name \| dept \| ... \| =============-------------------- ------------------------- MULTIPLE-TERRITORIES: \| SLSMN-NUM \| TERRITORY \| ========================= 2.6 When some people don't have territories... In effect, this amounts to various ways of dealing with subtypes. For this example we will assume that salesmen have territories and other employees don't. [Option 30] Null values. The first general approach is to integrate salesmen with other employees, and use any of the previously mentioned options for all employee records. Territory information is simply left blank, or otherwise omitted, for non-salesmen. [Option 31] Separate records for special information. Common information about salesmen is kept in employee records, with separate records provided for the information that is unique to salesmen: ------------------------------- EMPLOYEES: \| EMP-NUM \| name \| dept \| ... \| ===========-------------------- ----------------------- SALESMEN: \| EMP-NUM \| TERRITORY \| ===========------------ [Option 32] Separate records for each entity subtype: -------------------------------------------- DRIVERS: \| EMP-NUM \| name \| dept \| license no \| ... \| ===========--------------------------------- ------------------------------------------- SALESMEN: \| EMP-NUM \| name \| dept \| TERRITORY \| ... \| ===========-------------------------------- [Option 33] Redundant combinations: ------------------------------- EMPLOYEES: \| EMP-NUM \| name \| dept \| ... \| ===========-------------------- ------------------------------------------- SALESMEN: \| EMP-NUM \| name \| dept \| TERRITORY \| ... \| ===========-------------------------------- 2.7 When there are several connections... Suppose that a salesman might have a home territory different from the territory he serves. We have some variants of earlier options: [Option 34] Parameterized connections: -------------------------------------- \| SLSMN-NUM \| connection \| TERRITORY \| \|===========+------------+-----------\| \| xx \| serves \| New York \| \| yy \| home \| Calif \| -------------------------------------- [Option 35] Relationship fields: ------------------------------------------------- \| SLSMN-NUM \| name \| dept \| HOME \| SERVES \| ... \| =============------------------------------------ Note that TERRITORY does not occur in a field name. The earlier simple cases might also have been formatted as ------------------------------------------ \| SLSMN-NUM \| name \| dept \| SERVES \| ... \| ------------------------------------------ which leaves us wondering as to whether we might be talking about serving territories, customers, products, meals, tables, or ...? [Option 36] Default values. Strange things happen when two values are usually the same, i.e., a salesman normally serves his home territory. Instead of having the same value repeated so often in both fields, there may be a default convention under which the SERVES field is used only for exceptions. Thus, the territory a salesman serves is normally the one in the HOME field; the SERVES field is only meaningful if it is not blank: ----------------------------------------------------- \| SLSMN-NUM \| name \| dept \| HOME \| SERVES \| ... \| \|===========+------+------+----------+--------+-----\| \| xx \| \| \| New York \| \| \| \| yy \| \| \| Calif \| \| \| \| zz \| \| \| Calif \| Texas \| \| ----------------------------------------------------- 2.8 Other cases... [Option 37] There must be more. 3 WHY IT MATTERS Our central observation isn't particularly novel or insightful; surely such design options shouldn't come as a surprise to experienced data processing professionals. But the sheer number and variety of design options available for each and every simple fact might be a surprise. Another thing not fully appreciated is the extent to which the various options are used. We haven't perversely invented a bizarre collection of rare and exotic practices. Most practical database designs employ a sizable variety of these options. We don't have any supporting statistics, and can only leave it to the reader to conduct his own survey. Our observation has wide ranging consequences in many areas of database theory and practice... 3.1 Data Interchange As mentioned at the outset, this is a dimension of the data interchange problem which doesn't seem to be getting enough attention, and yet may be the cause of many difficulties. Showing examples that implement variations in semantics isn't cheating; it reflects a real problem in data interchange. We might be trying to interchange data between sites (applications) that do have such variations in policy. Some may have one territory per salesman, others might have many. Some might have "invariant" assignments of salesmen to territories, allowing them to embed the territory identifier in the salesman identifier, while others do not. We still need to exchange data among such sites. 3.2 Data Documentation (Dictionaries) The solutions to most of the problems discussed here depend on some form of data description, e.g., in data dictionaries. Current approaches to data documentation focus on data elements. The underlying assumption is that the appropriate way to document the meanings of data is to describe the data elements. The facts contained in a database don't always correspond to data elements in a simple or direct way. They sometimes correspond to a bundle of data elements, or to a fragment of one. Sometimes a data element has multiple or conditional meanings, and sometimes a given fact might be in any of several data elements. Sometimes the information isn't in data elements at all, but in the field name, or the record name, or otherwise implicit in the environment. And sometimes the information is scattered over several record types. Data elements certainly need to be described, but all of this suggests that a new approach to data documentation may be needed, one which doesn't simply tie one description to one data element. We might also mention that some of the relevant information involving record names and field names exists in the system catalogs of relational databases, and is subject to all the difficulties mentioned herein. 3.3 Data Design Competence If you are a data analyst or designer, would you have considered all those possibilities? Perhaps one measure of a data designer's competence is the number of design options he can enumerate. Another measure might be his understanding of the relative merits of each, and the criteria for choosing among them. (That could be the subject of a very useful follow-on to this paper.) Perhaps such a catalog of design options should be part of a designer's training. At the very least, such a catalog might serve as a helpful reference document during the design process, to insure that all the appropriate alternatives had been considered. 3.4 Design Tools and Methodologies: Input Most data design tools and methodologies are driven by data elements as their input, and they only deal with the problem of how to aggregate those elements. They assume that data elements are merely collected, as part of the specification of the user's requirements. Data elements can be chosen in a variety of ways for a given application. By requiring users to specify the data elements as part of the input, current tools and methodologies have shifted much of the design burden back on the user. It would be much more useful if design tools and methodologies would help the user choose an appropriate set of data elements for his application [K4]. 3.5 Design Tools and Methodologies: Output Current tools and methodologies tend to produce a single design as output, largely driven by the particular form in which the user has expressed his input. It would be more useful if such tools and methodologies could consider all applicable designs. An ideal tool would select the best of them, but it would also be useful to present the user with a number of alternatives and perhaps help him to choose among them [K3]. Users who can't or won't make such choices can simply choose the first design offered. Of course, one thing we have demonstrated is that the number of possible designs is astronomical, since all these options are available for each and every fact in the database. Clearly, practicing designers must intuitively use some heuristic techniques to immediately trim these options down to a few interesting alternatives. It would be useful to identify and formalize such heuristics, and incorporate them into design methodologies and tools. Expert system techniques would be most appropriate. 3.6 Data Sharing By New Applications The usual scenario for data sharing by a proposed application is to identify the required data elements and then look in the dictionary to see if such elements already exist. This is a common manifestation of the underlying assumption that there's only one way to express a fact in data elements. Imagine that data about salesmen's territories existed in all of the forms illustrated above, but you didn't know about it. If you were planning an application that needed to do something with salesmen's territories, would you have found all the existing data that you might share? Would your dictionary help you with that? 3.7 Redundancy Management If two or more of those forms had gotten implemented at your installation, how would the redundancy get detected? Does your dictionary help, or any other tool? 3.8 Data Integration If you had to integrate databases like these, would you be prepared for such a range of problems? 3.9 MIS Support Suppose management wanted to find data about salesmen and their territories. How would you locate the data in any or all of these forms? 3.10 High-Level Queries, Distributed Data The nominal goal of various high-level database interfaces is to permit users to interact with data without having to know details of its implementation. Such interfaces are appropriate for user-friendly query processors, where the user should be shielded from the details of implementation, and also for distributed data, where the details of implementation may vary from site to site for the same information. Such a goal is far beyond the capabilities of any current or currently proposed systems, as our examples illustrate. A truly data-independent query interface should be able to take a given query about salesmen's territories and extract a reply from /any/ of the data formats illustrated above. The problem for distributed data is much the same. Imagine that data about salesmen were distributed over a number of sites, each using a different one of the design options illustrated above. The challenge here is to provide an interface which can uniformly process data about salesmen in any and all of those files, masking all site-specific differences. This is a well-known problem when the different files use different types of data structures (e.g., relational, hierarchical, network). What we have illustrated here is that having the same type of data structure, and even the same database product, does not eliminate the problem. Severe problems of heterogeneity can exist between databases containing the same information in the same type of data structure under the same database processor. 3.11 Entity-Relationship Theory There are many variants of entity-relationship models [Ch, K1], but their general intent is to model the information contained in data, in a data-independent manner. They model the "facts", as we have used the term, such as the fact that a salesman serves a territory. The first figure in this paper is a form of entity-relationship diagram. Too many methodologies based on E-R theory presume a simple correspondence between the E-R model and the data design. Typically, they assume that each entity type corresponds to a record type, and each attribute corresponds to a field. If such an assumption were valid, there would only be one design for our salesman fact. We have shown otherwise. There are many possible mappings from entities and relationships to data designs. In our examples, we have seen that territories may or may not have records of their own. There may or may not be a distinct record type for salesmen, and there might even be several. Data about an individual salesman might be scattered over several records. It could be claimed that each data design corresponds to a different E-R model, e.g., to different choices of possible entity types. But this is the argument that puts the design burden back on the user, since he then has to choose the E-R model that generates the record design he wants. Furthermore, that argument inhibits effective documentation of the meanings of data. A very useful way to document the fact that different databases contain the same information is to describe them in terms of the same entities and relationships, regardless of differences in the implementation formats. We can also question the need for an "attribute" concept in E-R theory, in the sense that some facts are relationships among entities while others are attributes of entities. The principal significance of this distinction seems to be that it leads to different data designs. However, a given fact can always be viewed in either way. It may not always be obvious whether, or why, a salesman's territory should be perceived as a relationship or as an attribute. The choice among data designs needs to be based primarily on data processing concerns, rather than on artificial differences in "semantic views of reality". The full range of design options may be relevant regardless of which way the user describes his facts, and hence there doesn't seem to be an intrinsic need for an attribute concept [K1, K4]. 3.12 Conceptual Schema, Three-Schema Architectures The primary purpose of a conceptual schema in a three-schema database architecture [A] is to provide a data-independent view of the enterprise whose data is being maintained in a database. Entity-relationship models are appropriate for use in conceptual schemas. By its central position in the three-schema architecture, the conceptual schema is also expected to serve as a central data model. The stored data configuration, and the external views of applications, map to the conceptual schema. For this reason, the conceptual schema often appears as a "logical data model", expressed in a form that is really data elements and records. What we have shown here is that such a logical data model is not the same thing as a data-independent model of the enterprise, and the two functions cannot be served by one and the same schema. A given enterprise description can correspond to many logical data models, and it should be possible to "tune" the latter without adjusting the former. Even when the logical data model is in the form of fully normalized relations, we do not have a unique representation. Many of our distinct design options satisfied normal form constraints. Thus we have argued [K2] that the functions of the conceptual schema ought to be split into two schemas, yielding a four-schema database architecture. 4 CONCLUSIONS We have profusely illustrated the thesis that a given fact can be mapped to data elements in many different ways, and we have outlined its consequences in many aspects of database theory and practice. In particular, we have demonstrated the need for additional dictionary/directory facilities and design tools for the proper administration of data. We can't legislate away this diversity. Different sites or applications may really have such variations in semantic constraints, as well as different resource constraints (performance requirements) that justify various implementations. And in any case, a lot of it just simply exists already and can't be ignored. What can be more readily legislated or negotiated is a common shared semantic model of the information, together with a variety of mapping specifications to link the semantic model with the data formats. Even that is hard when the semantics vary from site to site. But given a stable semantic description, we are still faced with a substantial challenge to devise, and automate, a rich enough set of mappings from semantic constructs to field formats. And let me note that I didn't invent the complexity. Don't kill the messenger. What shall we do about the problem? We can be ostriches, ignoring its existence, and continue to be baffled by the failures of various tools and interchange conventions. We can try to devise total solutions, if we had the time and patience. Or we can try to identify common and tractable special cases for simple treatment - which is what we tend to do anyhow, except we ought to stop thinking we've solved the whole problem. 5 REFERENCES [A] /The ANSI/X3/SPARC DBMS Framework/, Report of the Study Group on Data Base Management Systems, (D. Tsichritzis and A. Klug, editors), AFIPS Press, 1977. [Ch] P.P. Chen, /Entity-Relationship Approach to Information Modelling and Analysis,/ North Holland, 1981. [K1] W. Kent, /Data and Reality/, North Holland, 1978. [excerpts: html] [K2] W. Kent, "Splitting the Conceptual Schema", Proc. Sixth Intl. Conf. on Very Large Data Bases, Oct. 1-3, 1980, Montreal, Canada. [K3] W. Kent, "Choices in Practical Data Design", Proc. Eighth Intl. Conf. on Very Large Data Bases, Sept. 8-10, 1982, Mexico City, Mexico. [K4] W. Kent, "Fact-Based Data Analysis and Design", , North Holland, 1983 (Davis, Jajodia, Ng, Yeh, eds.). /Entity-Relationship Approach to Software Engineering/[html]