Mon Mar 8 19:30:05 2021
EVENTS
 FREE
SOFTWARE
INSTITUTE

POLITICS
JOBS
MEMBERS'
CORNER

MAILING
LIST

NYLXS Mailing Lists and Archives
NYLXS Members have a lot to say and share but we don't keep many secrets. Join the Hangout Mailing List and say your peice.

DATE 2014-11-01

LEARN

2021-03-08 | 2021-02-08 | 2021-01-08 | 2020-12-08 | 2020-11-08 | 2020-10-08 | 2020-09-08 | 2020-08-08 | 2020-07-08 | 2020-06-08 | 2020-05-08 | 2020-04-08 | 2020-03-08 | 2020-02-08 | 2020-01-08 | 2019-12-08 | 2019-11-08 | 2019-10-08 | 2019-09-08 | 2019-08-08 | 2019-07-08 | 2019-06-08 | 2019-05-08 | 2019-04-08 | 2019-03-08 | 2019-02-08 | 2019-01-08 | 2018-12-08 | 2018-11-08 | 2018-10-08 | 2018-09-08 | 2018-08-08 | 2018-07-08 | 2018-06-08 | 2018-05-08 | 2018-04-08 | 2018-03-08 | 2018-02-08 | 2018-01-08 | 2017-12-08 | 2017-11-08 | 2017-10-08 | 2017-09-08 | 2017-08-08 | 2017-07-08 | 2017-06-08 | 2017-05-08 | 2017-04-08 | 2017-03-08 | 2017-02-08 | 2017-01-08 | 2016-12-08 | 2016-11-08 | 2016-10-08 | 2016-09-08 | 2016-08-08 | 2016-07-08 | 2016-06-08 | 2016-05-08 | 2016-04-08 | 2016-03-08 | 2016-02-08 | 2016-01-08 | 2015-12-08 | 2015-11-08 | 2015-10-08 | 2015-09-08 | 2015-08-08 | 2015-07-08 | 2015-06-08 | 2015-05-08 | 2015-04-08 | 2015-03-08 | 2015-02-08 | 2015-01-08 | 2014-12-08 | 2014-11-08 | 2014-10-08

Key: Value:

Key: Value:

MESSAGE
DATE 2014-11-26
FROM Ruben Safir
SUBJECT Subject: [LIU Comp Sci] Problems with Normalization
From owner-learn-outgoing-at-mrbrklyn.com Wed Nov 26 01:51:56 2014
Return-Path:
X-Original-To: archive-at-mrbrklyn.com
Delivered-To: archive-at-mrbrklyn.com
Received: by mrbrklyn.com (Postfix)
id 14A93161143; Wed, 26 Nov 2014 01:51:56 -0500 (EST)
Delivered-To: learn-outgoing-at-mrbrklyn.com
Received: by mrbrklyn.com (Postfix, from userid 28)
id EF489161157; Wed, 26 Nov 2014 01:51:55 -0500 (EST)
Delivered-To: learn-at-nylxs.com
Received: from mailbackend.panix.com (mailbackend.panix.com [166.84.1.89])
by mrbrklyn.com (Postfix) with ESMTP id 11125161143
for ; Wed, 26 Nov 2014 01:51:54 -0500 (EST)
Received: from [10.0.0.42] (unknown [96.57.23.82])
by mailbackend.panix.com (Postfix) with ESMTP id 1246513AC1;
Wed, 26 Nov 2014 01:51:52 -0500 (EST)
Message-ID: <547578D7.1020604-at-panix.com>
Date: Wed, 26 Nov 2014 01:53:11 -0500
From: Ruben Safir
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.1.0
MIME-Version: 1.0
To: ping-Tsai Chung ,
Samir Iabbassen ,
learn-at-nylxs.com, ping-Tsai Chung
Subject: [LIU Comp Sci] Problems with Normalization
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
Sender: owner-learn-at-mrbrklyn.com
Precedence: bulk
Reply-To: learn-at-mrbrklyn.com

http://www.bkent.net/Doc/manyform.htm


NOTE that is was an iEEE publication....


William Kent, "The Many Forms of a Single Fact", Proc. IEEE COMPCON,
Feb. 27-Mar. 3, 1989, San Francisco. Also HPL-SAL-88-8, Hewlett-Packard
Laboratories, Oct. 21, 1988. [13 pp]

The Many Forms of a Single Fact

------------------------------------------------------------------------

William Kent

1988

Or, have you heard the one about the traveling salesman's data?

> 1 INTRODUCTION . . . 2

------------------------------------------------------------------------

> 2 (SOME OF) THE DESIGN
OPTIONS . . . 3

>> 2.1 When each
salesman has one territory... . . . 3

>> 2.2 When each
territory has one salesman... . . . 7

>> 2.3 When it's
many-to-many... . . . 7

>> 2.4 When it's
many-to-few... . . . 8

>> 2.5 When it's
`usually' many-to-one... . . . 9

>> 2.6 When some people
don't have territories... . . . 11

>> 2.7 When there are
several connections... . . . 12

>> 2.8 Other cases... .
. . 13

> 3 WHY IT MATTERS . . . 13

>> 3.1 Data Interchange
. . . 13

>> 3.2 Data
Documentation (Dictionaries) . . . 14

>> 3.3 Data Design
Competence . . . 14

>> 3.4 Design Tools and
Methodologies: Input . . . 14

>> 3.5 Design Tools and
Methodologies: Output . . . 14

>> 3.6 Data Sharing By
New Applications . . . 15

>> 3.7 Redundancy
Management . . . 15

>> 3.8 Data Integration
. . . 15

>> 3.9 MIS Support . . . 15

>> 3.10 High-Level
Queries, Distributed Data . . . 15

>> 3.11
Entity-Relationship Theory . . . 16

>> 3.12 Conceptual
Schema, Three-Schema Architectures . . . 16

> 4 CONCLUSIONS . . . 17

> 5 REFERENCES . . . 17


1 INTRODUCTION

We hear much nowadays about the problems of sharing and interchanging
data. Diverse representations of data items need to be reconciled, such
as differing data types, field lengths, units of measure, precision,
etc. Different interpretations of the meanings of certain terms,
measurements, and quantities need to be reconciled. Transformations are
needed between various structural organizations, such as relational,
hierarchical, and network models, or various file formats, or the
various implementations of lists, vectors, arrays, matrices, or
different object models, and so on.

There is yet another dimension of variability, an unrecognized condition
that may in fact account for many problems in data sharing and
interchange. A simple fact, a simple connection between two things, can
be implemented in a remarkable variety of field configurations. This can
occur within the confines of any one structural form, and has little to
do with conversions between data representations.

There is an underlying fallacy, namely the assumption that a simple
binary fact (relationship or attribute) always maps simply into a pair
of fields. While that is the foundation of current data design
methodologies, there exist a troublesome number of exceptions in
practice. Thus what we have to say also bears on the current technology
of data analysis and design. We will also have something to say about
the implications for data dictionary, entity-relationship modelling, and
a few other points.

But first, let's trot out a catalog of examples, to get a concrete
feeling for the nature and extent of the situation. To show how
interchangeable the options are, we will take just one fact and map it
into all the configurations. We will show an incredible variety of
implementations for recording the fact that certain salesmen serve
certain territories, which might conceptually be represented thus:

------------ serves -------------
| SALESMAN |--------->| TERRITORY |
------------ -------------

The design options aren't all equally sensible for all facts. We may
have stretched a few of them for recording salesmen's territories, but
they are all plausible. Most of the options really are applicable to
most facts, even to numeric data (e.g., salaries), and even to other
"attributes" (e.g., birthdates). You are challenged to take any fact you
choose and see how many of these formats you can apply to it. And it
might be instructive to look at any database to see how many of these
techniques you can find there.

We limit ourselves to the design of records, as though they were in flat
files. With structured data models (e.g., hierarchies, networks), there
are even more design options. In a network or hierarchical database, the
connection between salesmen and territories might still be expressed in
one record using any of these techniques, or it might take the form of
owner/member records or parent/child records in the data structure.

Some examples might be redundant, and I've probably missed a few cases.
That's not the point. Whether there are two dozen or three dozen
alternatives, the problem's the same.

Furthermore, some of the examples implement variations in the semantics.
We'll get back to that point afterward.

------------------------------------------------------------------------

Many of these options can be combined, so that there are actually many
more possible designs than we show here.


2 (SOME OF) THE DESIGN OPTIONS


2.1 When each salesman has one territory...

[Option 1] A territory field in the salesman record. This is probably
the common and expected design for this fact:

---------------------------------------------
| SLSMN-NUM | name | dept | TERRITORY | ... |
=============--------------------------------

The double line under SLSMN-NUM will be our convention for key fields.

[Option 2] Unnormalized. The fact could occur in a record where neither
salesman nor territory is the key, e.g., in a record about orders:

-------------------------------------------------------------------
| ordernum | item | quantity | price | SALESMAN | TERRITORY | ... |
============-------------------------------------------------------

[Option 3] Multi-field representation. Sometimes a single entity, in
this case a territory, is represented by multiple fields. Now a simple
binary fact, relating a salesman and a territory, is actually occupying
four fields:

---------------------------------------------------------------
| SLSMN-NUM | name | dept | COUNTRY | REGION | DISTRICT | ... |
=============--------------------------------------------------
...............................
territory

Note that "TERRITORY" no longer appears as a field name. That's
significant if we are depending on some sort of dictionary to locate
information about salesmen's territories. This situation will arise in
many other options as well.

[Option 4] Embedded data. If a salesman's identifier included a code for
his territory (let's say salesmen never change territories), then there
might not have to be a territory field:

---------------------------------
| SLSMN-NUM | name | dept | ... |
=============--------------------
.....
:
territory code

[Option 5] Shared field. Sometimes several facts are encoded into one
field, e.g., a salesman's "assignment" might consist of an embedded
territory code and an embedded product code:

----------------------------------------------
| SLSMN-NUM | name | dept | ASSIGNMENT | ... |
=============---------------------------------
... ...
: :
: product code
territory code

[Option 6] Omitted field (data is in record type name). Salesmen's
records could be partitioned into record types by territory. The
territory identifier is part of the record type name:

---------------------------------
WESTERN-SALESMEN: | SLSMN-NUM | name | dept | ... |
=============--------------------

---------------------------------
EASTERN-SALESMEN: | SLSMN-NUM | name | dept | ... |
=============--------------------

That option is meaningful to the extent that record names are
documented, e.g., in a dictionary.

[Option 7] Omitted field (data is implicit). In records maintained at
territory headquarters, or at a small company with only one territory
(or no territory concept) - territory is a constant. It is the same for
all salesmen in the file, hence the territory is not identified anywhere
in the data:

---------------------------------
| SLSMN-NUM | name | dept | ... |
=============--------------------

When data is being maintained at the headquarters offices of each of the
territories, a salesman's territory is implied by the geographic
location of the record!

This is not so far-fetched. Employee records at a given company are
rarely explicit about who the employer is; bank records rarely record
the identity of the bank at which each account is held, and so on.

[Option 8] Omitted field (data is derivable from other data). Suppose
salesmen in the same department always serve the same territory. Then
the territory would be identified in the department record, not in the
salesman record. The salesman's territory is derivable, based on his
department:

---------------------------------
| SLSMN-NUM | name | dept | ... |
=============--------------------

---------------------------------------
| dept | name | mgr | TERRITORY | ... |
========-------------------------------

[Option 9] Default flag. Suppose that the files at territory
headquarters cover all salesmen assigned to that territory, and also
some salesmen outside the territory. The difference is indicated by a
simple binary flag in an "LCL" field, indicating whether the salesman is
"local" or not:

---------------------------------------
| SLSMN-NUM | name | dept | LCL | ... |
=============--------------------------

This kind of record provides partial information. We know the territory
of all the local salesmen, but not the others.

[Option 10] Multiplexed field. A given field might represent the
territory of a salesman - in some record occurrences. In other
occurrences, the field might serve a different purpose. Suppose we
extend the previous example so that for local salesmen we keep the
manager's name, but for other salesmen we keep their territories instead
- in the same field:

--------------------------------------------------
| SLSMN-NUM | name | dept | LCL | MGR/TERR | ... |
=============-------------------------------------

MGR/TERR has a conditional meaning: if LCL is true, then the field
contains the manager, otherwise it contains the territory.

[Option 11] Parametric form. This is essentially an extension of the
multiplexed form. The meaning of the field is again variable, but now
another field is provided to explain its meaning:

---------------------------------
| SLSMN-NUM | parameter | value |
|===========+-----------+-------|
| xx | TERRITORY | Texas |
| xx | manager | yy |
| xx | product | toys |
---------------------------------

Note that "TERRITORY" now appears in the data, not as a field name. Will
it be in the dictionary?

[Option 12] Self-describing form. This in turn is a variation of the
parametric form. In the parametric form, it was the meaning of the field
which varied from record to record, and hence had to be indicated in
another field. Sometimes other aspects of a field value can vary from
record to record, and hence have to be described in another field. This
can include such things as units of measure, scale factors, or field
lengths for variable length fields, such as the length of the TERRITORY
field:

---------------------------------------------------------
| SLSMN-NUM | name | dept | TERR-LNTH | TERRITORY | ... |
=============--------------------------------------------

[Option 13] Redundant records. Sometimes several of these forms will be
maintained, each optimized for a different purpose.

Exercise: in how many ways can such options be combined? For example, a
territory represented by multiple "fields" could also be embedded in the
salesman number:

------------------------------------------------
| SLSMN-NUM | name | dept | ASSIGNMENT | ... |
===============---------------------------------
... ... ...
: : :
: : district code
: region code
country code

How many other combinations make sense? Keep this question in mind as we
continue to develop more options.


2.2 When each territory has one salesman...

[Option 14] There could be a territory record with a salesman field:

------------------------------------------------------
| TERRITORY | hq-loc | size | SLSMN-NUM | name | ... |
=============-----------------------------------------

[Option 15] If it's a one-to-one relationship, then we can either put a
territory field in the salesman record or a salesman field in the
territory record - or both.

[Option 16] If the relationship is "in perpetuity", the territory might
be implicitly identified by its salesman. Thus we might speak of Smith's
territory and Jones's territory. We simply would not have any distinct
field for recording a salesman's territory.


It's not uncommon to identify things by related attributes or entities.
A common relational example about elections has just two fields, the
year of the election and the winning candidate. There is in fact no
distinct field in which the election itself occurs. (We are even doing
it here in this paper, when we identify a sales territory by the name of
the state it is in.)


2.3 When it's many-to-many...

[Option 17] Intersection records. The most common format for
many-to-many relationships:

-------------------------------
| TERRITORY | SLSMN-NUM | ... |
=========================------

The additional fields in this record might contain "attributes" of the
relationship between the salesman and the territory, e.g., a quota.

[Option 18] Repeating fields. If the data structure supports repeating
fields, we can have territory fields in the salesman record:

---------------------------------------------------------
| SLSMN-NUM | name | dept | TERRITORY | ... | TERRITORY |
=============--------------------------------------------

[Option 19] ...or salesman fields in the territory record:

----------------------------------------------------
| TERRITORY | hq-loc | SLSMN-NUM | ... | SLSMN-NUM |
=============---------------------------------------

[Option 20] ...or both (redundantly).


Of course, all of these options for many-to-many relationships could
also be used for one-to-many relationships, but the singleness on the
"one" side will not be enforced.


2.4 When it's many-to-few...

These are actually many-to-many relationships. But if the number of
things each entity can be related to is small or limited, then some
special options are available. Suppose that each salesman can only serve
a few territories.

[Option 21] Fixed repeating fields. If the limit on the number of
territories per salesman is, for example, three, then we can define
three territory fields in the salesman record:

------------------------------------------------------
| SLSMN-NUM | name | dept | TERR-1 | TERR-2 | TERR-3 |
=============-----------------------------------------

Here we have a fixed number of fields, in contrast to the earlier
example of repeating fields. There, the number of fields in a record
varied, depending on how many territories a salesman had. Here, each
record has the same number of territory fields, with some possibly blank.

In this design, a many-to-many relationship has been clustered with
other data without violating any normal form constraint.

[Option 22] If there are only a few territories altogether, then we can
actually define a separate field in the salesman record for each
territory, with a flag to indicate whether the salesman is assigned to
that territory:

------------------------------------------------------------
| SLSMN-NUM | name | dept | NEW YORK | CALIF | TEXAS | ... |
|===========+------+------+----------+-------+-------+-----|
| xx | | | no | yes | yes | |
------------------------------------------------------------

Note that in this case the field names are actually data values. The
dictionary entries for this record will reflect these "data values", but
not mention "territories".

[Option 23] All of a salesman's territories could be crammed into one
field. The field might contain a bit-vector, with a fixed position for
each territory:

-------------------------------------------
| SLSMN-NUM | name | dept | TERRITORIES |
|===========+------+------+---------------|
| xx | | | 0010001000000 |
-------------------------------------------

[Option 24] ...or the field might contain a string of unique codes for
the salesman's territories, i.e., a keyword rather than positional
notation:

-------------------------------------------
| SLSMN-NUM | name | dept | TERRITORIES |
|===========+------+------+---------------|
| xx | | | C,T |
-------------------------------------------

[Option 25] If a few of the territories have some special significance,
we might use a combination of options. A separate field can be defined
for each special territory, containing a binary flag, plus another field
with a flag to indicate whether the salesman has any other territories:

------------------------------------------------------------
| SLSMN-NUM | name | dept | NEW YORK | CALIF | OTHER | ... |
|===========+------+------+----------+-------+-------+-----|
| xx | | | no | yes | yes | |
------------------------------------------------------------

[Option 26] ...and there may or may not be an intersection record to
detail the "others":

-------------------------
OTHER-TERRITORIES: | SLSMN-NUM | TERRITORY |
|===========+-----------|
| xx | Texas |
-------------------------


Note that this record doesn't cover all territories or salesmen. It only
has entries for salesmen assigned to other than the special territories.


2.5 When it's `usually' many-to-one...

This is another variant of many-to-many relationships. It is the case
where most salesmen only have one territory.

The general approach here is to design the basic record as though there
was a many-to-one relationship between salesmen and territories, and
provide another record for the exceptions. There are several ways to
identify the exceptions:

[Option 27] Allow a special value in the TERRITORY field of the salesman
record. One consequence is that sometimes the value in this field will
not be the name of a territory:

---------------------------------------------
| SLSMN-NUM | name | dept | TERRITORY | ... |
|===========+------+------+-----------+-----|
| xx | | | New York | |
| yy | | | Calif | |
| zz | | | multi | |
---------------------------------------------

-------------------------
MULTPLE-TERRITORIES: | SLSMN-NUM | TERRITORY |
|===========+-----------|
| zz | Calif |
| zz | Texas |
-------------------------

[Option 28] To avoid such "pollution" of the TERRITORY field, we could
provide a separate field to flag the salesmen having multiple territories:

-----------------------------------------------------
| SLSMN-NUM | name | dept | MULTI | TERRITORY | ... |
|===========+------+------+-------+-----------+-----|
| xx | | | | New York | |
| yy | | | | Calif | |
| zz | | | x | | |
-----------------------------------------------------

-------------------------
MULTPLE-TERRITORIES: | SLSMN-NUM | TERRITORY |
|===========+-----------|
| zz | Calif |
| zz | Texas |
-------------------------

[Option 29] Another alternative is to divide salesmen into two kinds,
those with single territories and those with many:

---------------------------------------------
SINGLE-TERR SALESMEN: | SLSMN-NUM | name | dept | TERRITORY | ... |
=============--------------------------------

---------------------------------
MULTI-TERR SALESMEN: | SLSMN-NUM | name | dept | ... |
=============--------------------

-------------------------
MULTIPLE-TERRITORIES: | SLSMN-NUM | TERRITORY |
=========================


2.6 When some people don't have territories...

In effect, this amounts to various ways of dealing with subtypes. For
this example we will assume that salesmen have territories and other
employees don't.

[Option 30] Null values. The first general approach is to integrate
salesmen with other employees, and use any of the previously mentioned
options for all employee records. Territory information is simply left
blank, or otherwise omitted, for non-salesmen.

[Option 31] Separate records for special information. Common information
about salesmen is kept in employee records, with separate records
provided for the information that is unique to salesmen:

-------------------------------
EMPLOYEES: | EMP-NUM | name | dept | ... |
===========--------------------

-----------------------
SALESMEN: | EMP-NUM | TERRITORY |
===========------------

[Option 32] Separate records for each entity subtype:

--------------------------------------------
DRIVERS: | EMP-NUM | name | dept | license no | ... |
===========---------------------------------

-------------------------------------------
SALESMEN: | EMP-NUM | name | dept | TERRITORY | ... |
===========--------------------------------

[Option 33] Redundant combinations:

-------------------------------
EMPLOYEES: | EMP-NUM | name | dept | ... |
===========--------------------

-------------------------------------------
SALESMEN: | EMP-NUM | name | dept | TERRITORY | ... |
===========--------------------------------


2.7 When there are several connections...

Suppose that a salesman might have a home territory different from the
territory he serves. We have some variants of earlier options:

[Option 34] Parameterized connections:

--------------------------------------
| SLSMN-NUM | connection | TERRITORY |
|===========+------------+-----------|
| xx | serves | New York |
| yy | home | Calif |
--------------------------------------

[Option 35] Relationship fields:

-------------------------------------------------
| SLSMN-NUM | name | dept | HOME | SERVES | ... |
=============------------------------------------

Note that TERRITORY does not occur in a field name. The earlier simple
cases might also have been formatted as

------------------------------------------
| SLSMN-NUM | name | dept | SERVES | ... |
------------------------------------------

which leaves us wondering as to whether we might be talking about
serving territories, customers, products, meals, tables, or ...?

[Option 36] Default values. Strange things happen when two values are
usually the same, i.e., a salesman normally serves his home territory.
Instead of having the same value repeated so often in both fields, there
may be a default convention under which the SERVES field is used only
for exceptions. Thus, the territory a salesman serves is normally the
one in the HOME field; the SERVES field is only meaningful if it is not
blank:

-----------------------------------------------------
| SLSMN-NUM | name | dept | HOME | SERVES | ... |
|===========+------+------+----------+--------+-----|
| xx | | | New York | | |
| yy | | | Calif | | |
| zz | | | Calif | Texas | |
-----------------------------------------------------


2.8 Other cases...


[Option 37] There must be more.


3 WHY IT MATTERS

Our central observation isn't particularly novel or insightful; surely
such design options shouldn't come as a surprise to experienced data
processing professionals. But the sheer number and variety of design
options available for each and every simple fact might be a surprise.

Another thing not fully appreciated is the extent to which the various
options are used. We haven't perversely invented a bizarre collection of
rare and exotic practices. Most practical database designs employ a
sizable variety of these options. We don't have any supporting
statistics, and can only leave it to the reader to conduct his own survey.


Our observation has wide ranging consequences in many areas of database
theory and practice...


3.1 Data Interchange

As mentioned at the outset, this is a dimension of the data interchange
problem which doesn't seem to be getting enough attention, and yet may
be the cause of many difficulties.


Showing examples that implement variations in semantics isn't cheating;
it reflects a real problem in data interchange. We might be trying to
interchange data between sites (applications) that do have such
variations in policy. Some may have one territory per salesman, others
might have many. Some might have "invariant" assignments of salesmen to
territories, allowing them to embed the territory identifier in the
salesman identifier, while others do not. We still need to exchange data
among such sites.


3.2 Data Documentation (Dictionaries)

The solutions to most of the problems discussed here depend on some form
of data description, e.g., in data dictionaries.

Current approaches to data documentation focus on data elements. The
underlying assumption is that the appropriate way to document the
meanings of data is to describe the data elements.

The facts contained in a database don't always correspond to data
elements in a simple or direct way. They sometimes correspond to a
bundle of data elements, or to a fragment of one. Sometimes a data
element has multiple or conditional meanings, and sometimes a given fact
might be in any of several data elements. Sometimes the information
isn't in data elements at all, but in the field name, or the record
name, or otherwise implicit in the environment. And sometimes the
information is scattered over several record types.

Data elements certainly need to be described, but all of this suggests
that a new approach to data documentation may be needed, one which
doesn't simply tie one description to one data element.


We might also mention that some of the relevant information involving
record names and field names exists in the system catalogs of relational
databases, and is subject to all the difficulties mentioned herein.


3.3 Data Design Competence

If you are a data analyst or designer, would you have considered all
those possibilities?


Perhaps one measure of a data designer's competence is the number of
design options he can enumerate. Another measure might be his
understanding of the relative merits of each, and the criteria for
choosing among them. (That could be the subject of a very useful
follow-on to this paper.) Perhaps such a catalog of design options
should be part of a designer's training. At the very least, such a
catalog might serve as a helpful reference document during the design
process, to insure that all the appropriate alternatives had been
considered.


3.4 Design Tools and Methodologies: Input

Most data design tools and methodologies are driven by data elements as
their input, and they only deal with the problem of how to aggregate
those elements. They assume that data elements are merely collected, as
part of the specification of the user's requirements.


Data elements can be chosen in a variety of ways for a given
application. By requiring users to specify the data elements as part of
the input, current tools and methodologies have shifted much of the
design burden back on the user. It would be much more useful if design
tools and methodologies would help the user choose an appropriate set of
data elements for his application [K4].


3.5 Design Tools and Methodologies: Output

Current tools and methodologies tend to produce a single design as
output, largely driven by the particular form in which the user has
expressed his input. It would be more useful if such tools and
methodologies could consider all applicable designs. An ideal tool would
select the best of them, but it would also be useful to present the user
with a number of alternatives and perhaps help him to choose among them
[K3]. Users who can't or won't make such choices can simply choose the
first design offered.


Of course, one thing we have demonstrated is that the number of possible
designs is astronomical, since all these options are available for each
and every fact in the database. Clearly, practicing designers must
intuitively use some heuristic techniques to immediately trim these
options down to a few interesting alternatives. It would be useful to
identify and formalize such heuristics, and incorporate them into design
methodologies and tools. Expert system techniques would be most
appropriate.


3.6 Data Sharing By New Applications

The usual scenario for data sharing by a proposed application is to
identify the required data elements and then look in the dictionary to
see if such elements already exist. This is a common manifestation of
the underlying assumption that there's only one way to express a fact in
data elements.


Imagine that data about salesmen's territories existed in all of the
forms illustrated above, but you didn't know about it. If you were
planning an application that needed to do something with salesmen's
territories, would you have found all the existing data that you might
share? Would your dictionary help you with that?


3.7 Redundancy Management


If two or more of those forms had gotten implemented at your
installation, how would the redundancy get detected? Does your
dictionary help, or any other tool?


3.8 Data Integration


If you had to integrate databases like these, would you be prepared for
such a range of problems?


3.9 MIS Support


Suppose management wanted to find data about salesmen and their
territories. How would you locate the data in any or all of these forms?


3.10 High-Level Queries, Distributed Data

The nominal goal of various high-level database interfaces is to permit
users to interact with data without having to know details of its
implementation. Such interfaces are appropriate for user-friendly query
processors, where the user should be shielded from the details of
implementation, and also for distributed data, where the details of
implementation may vary from site to site for the same information.

Such a goal is far beyond the capabilities of any current or currently
proposed systems, as our examples illustrate.

A truly data-independent query interface should be able to take a given
query about salesmen's territories and extract a reply from /any/ of the
data formats illustrated above.

The problem for distributed data is much the same. Imagine that data
about salesmen were distributed over a number of sites, each using a
different one of the design options illustrated above. The challenge
here is to provide an interface which can uniformly process data about
salesmen in any and all of those files, masking all site-specific
differences.


This is a well-known problem when the different files use different
types of data structures (e.g., relational, hierarchical, network). What
we have illustrated here is that having the same type of data structure,
and even the same database product, does not eliminate the problem.
Severe problems of heterogeneity can exist between databases containing
the same information in the same type of data structure under the same
database processor.


3.11 Entity-Relationship Theory

There are many variants of entity-relationship models [Ch, K1], but
their general intent is to model the information contained in data, in a
data-independent manner. They model the "facts", as we have used the
term, such as the fact that a salesman serves a territory. The first
figure in this paper is a form of entity-relationship diagram.

Too many methodologies based on E-R theory presume a simple
correspondence between the E-R model and the data design. Typically,
they assume that each entity type corresponds to a record type, and each
attribute corresponds to a field. If such an assumption were valid,
there would only be one design for our salesman fact. We have shown
otherwise. There are many possible mappings from entities and
relationships to data designs.

In our examples, we have seen that territories may or may not have
records of their own. There may or may not be a distinct record type for
salesmen, and there might even be several. Data about an individual
salesman might be scattered over several records.

It could be claimed that each data design corresponds to a different E-R
model, e.g., to different choices of possible entity types. But this is
the argument that puts the design burden back on the user, since he then
has to choose the E-R model that generates the record design he wants.

Furthermore, that argument inhibits effective documentation of the
meanings of data. A very useful way to document the fact that different
databases contain the same information is to describe them in terms of
the same entities and relationships, regardless of differences in the
implementation formats.


We can also question the need for an "attribute" concept in E-R theory,
in the sense that some facts are relationships among entities while
others are attributes of entities. The principal significance of this
distinction seems to be that it leads to different data designs.
However, a given fact can always be viewed in either way. It may not
always be obvious whether, or why, a salesman's territory should be
perceived as a relationship or as an attribute. The choice among data
designs needs to be based primarily on data processing concerns, rather
than on artificial differences in "semantic views of reality". The full
range of design options may be relevant regardless of which way the user
describes his facts, and hence there doesn't seem to be an intrinsic
need for an attribute concept [K1, K4].


3.12 Conceptual Schema, Three-Schema Architectures

The primary purpose of a conceptual schema in a three-schema database
architecture [A] is to provide a data-independent view of the enterprise
whose data is being maintained in a database. Entity-relationship models
are appropriate for use in conceptual schemas.

By its central position in the three-schema architecture, the conceptual
schema is also expected to serve as a central data model. The stored
data configuration, and the external views of applications, map to the
conceptual schema. For this reason, the conceptual schema often appears
as a "logical data model", expressed in a form that is really data
elements and records.

What we have shown here is that such a logical data model is not the
same thing as a data-independent model of the enterprise, and the two
functions cannot be served by one and the same schema. A given
enterprise description can correspond to many logical data models, and
it should be possible to "tune" the latter without adjusting the former.
Even when the logical data model is in the form of fully normalized
relations, we do not have a unique representation. Many of our distinct
design options satisfied normal form constraints.


Thus we have argued [K2] that the functions of the conceptual schema
ought to be split into two schemas, yielding a four-schema database
architecture.


4 CONCLUSIONS

We have profusely illustrated the thesis that a given fact can be mapped
to data elements in many different ways, and we have outlined its
consequences in many aspects of database theory and practice. In
particular, we have demonstrated the need for additional
dictionary/directory facilities and design tools for the proper
administration of data.

We can't legislate away this diversity. Different sites or applications
may really have such variations in semantic constraints, as well as
different resource constraints (performance requirements) that justify
various implementations. And in any case, a lot of it just simply exists
already and can't be ignored.

What can be more readily legislated or negotiated is a common shared
semantic model of the information, together with a variety of mapping
specifications to link the semantic model with the data formats. Even
that is hard when the semantics vary from site to site. But given a
stable semantic description, we are still faced with a substantial
challenge to devise, and automate, a rich enough set of mappings from
semantic constructs to field formats.

And let me note that I didn't invent the complexity. Don't kill the
messenger.


What shall we do about the problem? We can be ostriches, ignoring its
existence, and continue to be baffled by the failures of various tools
and interchange conventions. We can try to devise total solutions, if we
had the time and patience. Or we can try to identify common and
tractable special cases for simple treatment - which is what we tend to
do anyhow, except we ought to stop thinking we've solved the whole problem.


5 REFERENCES

[A] /The ANSI/X3/SPARC DBMS Framework/, Report of the Study Group on
Data Base Management Systems, (D. Tsichritzis and A. Klug, editors),
AFIPS Press, 1977.

[Ch] P.P. Chen, /Entity-Relationship Approach to Information Modelling
and Analysis,/ North Holland, 1981.


[K1] W. Kent, /Data and Reality/, North Holland, 1978. [excerpts: html]


[K2] W. Kent, "Splitting the Conceptual Schema", Proc. Sixth Intl. Conf.
on Very Large Data Bases, Oct. 1-3, 1980, Montreal, Canada.

[K3] W. Kent, "Choices in Practical Data Design", Proc. Eighth Intl.
Conf. on Very Large Data Bases, Sept. 8-10, 1982, Mexico City, Mexico.

[K4] W. Kent, "Fact-Based Data Analysis and Design", , North Holland,
1983 (Davis, Jajodia, Ng, Yeh, eds.).

/Entity-Relationship Approach to Software Engineering/[html]


  1. 2014-11-08 Ruben <ruben.safir-at-my.liu.edu> Subject: [LIU Comp Sci] Re: Welcome to learn
  2. 2014-11-08 Ruben <ruben.safir-at-my.liu.edu> Subject: [LIU Comp Sci] second post!!
  3. 2014-11-22 Ruben Safir <mrbrklyn-at-panix.com> Subject: [LIU Comp Sci] Oracle Files for the Homework and Oracle Resources
  4. 2014-11-22 Ruben Safir <mrbrklyn-at-panix.com> Subject: [LIU Comp Sci] Oracle Webineir on pl/sql
  5. 2014-11-22 Ruben Safir <mrbrklyn-at-panix.com> Subject: [LIU Comp Sci] UEFI and Secure Boot
  6. 2014-11-24 Ruben <ruben.safir-at-my.liu.edu> Re: [LIU Comp Sci] Oracle DBA short cuts
  7. 2014-11-24 Ruben Safir <mrbrklyn-at-panix.com> Subject: [LIU Comp Sci] Oracle DBA short cuts
  8. 2014-11-24 Ruben Safir <mrbrklyn-at-panix.com> Subject: [LIU Comp Sci] Re: Database Management Systems: DBMS Announcement on Nov. 23
  9. 2014-11-26 Ruben Safir <mrbrklyn-at-panix.com> Re: [LIU Comp Sci] Problems with Normalization
  10. 2014-11-26 Ruben <ruben.safir-at-my.liu.edu> Subject: [LIU Comp Sci] =?UTF-8?B?UmU6IERhdGFiYXNlIE1hbmFnZW1lbnQgU3lzdGVtczogQ1MgNjQ5IEE=?=
  11. 2014-11-26 Ruben Safir <mrbrklyn-at-panix.com> Subject: [LIU Comp Sci] More problems with modeling and normalization
  12. 2014-11-26 Ruben Safir <mrbrklyn-at-panix.com> Subject: [LIU Comp Sci] Normalize because your professor said too
  13. 2014-11-26 Ruben Safir <mrbrklyn-at-panix.com> Subject: [LIU Comp Sci] Problems with Normalization
  14. 2014-11-26 Ruben Safir <mrbrklyn-at-panix.com> Subject: [LIU Comp Sci] Why Data Models Shouldn't Drive Object Models (And Vice Versa)
  15. 2014-11-26 Ruben Safir <mrbrklyn-at-panix.com> Subject: [LIU Comp Sci] Why Normalization Failed to Become the Ultimate Guide for Database
  16. 2014-11-29 Ruben Safir <mrbrklyn-at-panix.com> Subject: [LIU Comp Sci] Memory Cache theory Architecture Class
  17. 2014-11-30 Ruben Safir <mrbrklyn-at-panix.com> Subject: [LIU Comp Sci] Cache Model in C programming

NYLXS are Do'ers and the first step of Doing is Joining! Join NYLXS and make a difference in your community today!