Data Integration Musings, Circa 1991

I recently stumbled over this very old text. It is really just notes and musings, but thought it was interesting to see some of my earliest thoughts on the data integration problem. Presented as is.

Mechanical Symbol Systems

To what extent can knowledge be thought of as sentences in an internal language of thought?
Should knowledge by seen as an essentially biological, or essentially social, phenomenon?
Can a machine be said to have intentional states, or are all meanings of internal machine representations essentially rooted in human interpretations of them?

Robot Communities

How can robots and humans share knowledge?
Can artificial reasoners act as vehicles for knowledge transfer between humans? (yes, they already are – see work on training systems)

Human Symbol Systems
Structures: Concepts, Facts and Process
Human Culture
Communication Among Individuals

Discourse

The level of discourse among humans is very complex. Researchers in the natural language processing field would tell you that human discourse is very hard to capture in computer systems. Humans of course have no problem following the subject changes and shifting contexts of discourse.
Language is the means through which humans pass information to one another. Historically, verbal communication has been the primary means of conveying information. Through verbal communication, parents teach their children, conveying not just facts, but also concepts and world view. Through socialization, children learn the locally acceptable way in which to exist in the world. Through continual human contact, all persons reinforce their understanding of the world. Culture is a locally defined set of concepts, facts and processes.

Myth

One of the most important transmission devices for human communication is myth. Myth is story-telling, and therefore is largely verbal in nature.

Ritual

Ritual also is used to communicate knowledge and reiterate beliefs among individuals. Ritual is performance, and can be used to teach process.

Information Systems Structures: Concepts, Facts and Process

The conceptual level of a standard information system may be stored in a database’s data dictionary. In some cases, the data dictionary is fairly simplistic, and may actually be hidden within the processes which maintain the database, inaccessible to outside review except by skilled programmers. More sophisticated data dictionaries, such as IBM’s Repository, and other CASE tools, make explicit the machine-level representation of the data contained in the system. The concepts stored in such devices are largely elementary, and idiosynchratic.
They are elementary in that a single concept in a data dictionary will generally refer to a small item of data called variously a “column” or a “field”. What is expressed by a single entry in a data dictionary is a mapping from an application-specific concept, for instance “PART_NUMBER”, to a machine-dependent, computable format (numeric, 12 decimal digits).
A “fact” in a database sense is a single instance or example of a data dictionary concept coupled with a single value.

Communication Across Information Systems, Custom Approaches

Information systems typically have no provision either to generate or understand discursive communication. Typically, information shared between two information systems must be rigidly defined long before transmission begins. This takes human intervention to define transmission carriers, as well as format, and periodicity.

Networks

The ISO OSI seven layers of communication was an initial attempt at defining the medium of computer communication. All computers which required communications services faced the same problems. Much of the work in networking today is geared toward building this ability to communicate. For humans, communication is through the various senses, taking advantage of the natural characteristics of the environment and the physical body. The majority of computers do not share the same senses.
Distributed systems are those in which all individual systems are connected via a network of transmission lines, and in which some level of pre-defined communication has been developed. The development of distributed database systems represents the first steps toward homogenation of mechanical symbol systems.

Electronic Data Interchange

EDI takes the communication process a step farther by introducing a rudimentary level of discourse among individual enterprises. Typically discourse is restricted to payments and orders of material, and typically these interchanges are just as static as earlier developments. The difference here is that human intervention is slowly developing a cultural definition of the information format and content that may be allowed to be transferred.
As standards are developed describing the exact nature and structure of the information that any company may submit or recieve, more of a culture of discourse can be recognized in the process overall. The discourse is of course carried out by humans at this point, as they define a syntax and semantics for the proper transmission of information in the domain of supply, payment, and delivery (commerce).
Although it is ridiculous to talk of an “EDI culture” as a machine-based, self-defining, self-reinforcing collection of symbols in its own right, it is a step in that direction. What EDI, and especially the development of standards for EDI transmissions, represents is an initial attempt to define societal-like communication among computers. In effect, EDI is extending the means of human discourse into the realm of high-speed transaction processing. The standards being developed for the format and type of transactions allowed represent a formalization and agreement among the society of business enterprises on the future language of commerce.

Raising Consciousness in Mechanical Symbol Systems

In order to partake of the richness and flexibility of human symbol systems, machines must be given control of their own senses. They must become aware of their environment. They must become aware of their own “bodies”. This is the mind-body problem.
mission lines, and in which some level of pre-defined communication has been developed. The development of distributed database systems represents the first steps toward homogenation of mechanical symbol systems.

Electronic Data Interchange

EDI takes the communication process a step farther by introducing a rudimentary level… (Author note: transcript cuts off right here)

The Folk Model – What We Really Build Software From

The anthropological notion of a “folk model” can be a useful paradigm to consider when analyzing the implementation of software applications. Folk models are the proto-scientific conceptualizations of a group of people which they use to describe, understand and interact some aspect of their collective experience.

When writing software, especially but not only within the Agile approach, it is the through the elicitation and joint “discovery” of the user’s folk model that a common set of requirements for the software is defined. Ultimately, it is the closeness of fit between the folk model and the operation and symbology of the software that will determine its success or failure.

Different groups of people faced with the same or similar problems may develop largely similar folk models, and from these, different software development teams may create largely similar software applications. This is one reason why the software development process works best as a hand-crafted enterprise.

But what at first appears to be minor discrepancies between what the software model presents and what the folk model expects can grow so large that it can cause the failure of the software for those users. Especially if the folk model was flawed or in a state of flux at the time the software tried to codify it (and really, when is a folk model not in flux?).

You Can’t Store Meaning In Software

I’ve had some recent conversations at work which made me realize I needed to make some of the implications of my other posts more obvious and explicit. In this case, while I posted awhile ago about How Meaning Attaches to Data Structures I never really carried the conversation forward.

Here is the basic, fundamental mistake that we software developers make (and others) in talking about our own software. Namely, we start thinking that the data structure and programs actually and directly hold the meaning we intend. That if we do things right, that our data structures, be they tables with rows and columns or POJOs (Plain Old Java Objects) in a Domain layer, just naturally and explicitly contain the meaning.

The problem is, that whatever symbols we make in the computer, the computer can only hold structure. Our programs are only manipulating addresses in memory (or disk) and only comparing sequences of bits (themselves just voltages on wires). Now through the programming process, we developers create extremely sophisticated manipulations of these bits, and we are constantly translating one sequence of bits into another in some regular, predictable way. This includes pushing our in-memory patterns onto storage media (and typically constructing a different pattern of bits), and pushing our in-memory patterns onto video screens in forms directly interpretable by trained human users (such as displaying ASCII numbers as characters in an alphabet forming words in a language which can be read).

This is all very powerful, and useful, but it works only because we humans have projected meaning onto the bit patterns and processes. We have written the code so that our bit symbol representing a “1” can be added to another bit symbol “1” and the program will produce a new bit symbol that we, by convention, will say represents a value of “2”.

The software doesn’t know what any of this means. We could have just as easily defined the meaning of the same signs and processing logic in some other way (perhaps, for instance, to indicate that we have received signals from two different origins, maybe to trigger other processing).

Why This Is Important

The comment was made to me that “if we can just get the conceptual model right, then the programming should be correct.”  I won’t go into the conversation more deeply, but it lead me to thinking how to explain why that was not the best idea.

Here is my first attempt.

No matter how good a conceptual model you create, how complete, how general, how accurate to a domain, there is no way to put it into the computer. The only convention we have as programmers when we want to project meaning into software is that we define physical signs and processes which manipulate them in a way consistent with the meaning we intend.

This is true whether we manifest our conceptual model in a data model, or an object model, or a Semantic Web ontology, or a rules framework, or a set of tabs on an Excel file, or an XML schema, or … The point is the computer can only store the sign portion of our symbols and never the concept so if you intend to create a conceptual model of a domain, and have it inform and/or direct the operation of your software, you are basically just writing more signs and processes.

Now if you want some flexibility, there are many frameworks you can use to create a symbollic “model” of a “conceptual model” and then you can tie your actual solution to this other layer of software. But in the most basic, reductionist sense, all you’ve done is write more software manipulating one set of signs in a manner that permits them to be interpreted as representing a second set of signs, which themselves only have meaning in the human interpretation.

Context and Chomsky’s Colorless Green Ideas

Language is code. The speaker chooses the terms, sequence and intonations of their speech with the hope that the listener shares enough of the same human experience to recognize the intended meaning. Conversation is a negotiation as much as anything else. In conversation, the participants can adjust the selection of terms and details until they all reach an understanding of what is being said. This is the practical meaning of “context”, then.

Many years ago, in an effort to make a point about how syntax is different from semantics, Noam Chomsky once proposed the following sentence as an example of a grammatically correct sentence that had no discernible meaning:

Colorless green ideas sleep furiously.

In the context within which Chomsky was writing this sentence, reflective of common cultural experience of these terms among a broad community of American society, he made the claim that the sentence had no meaning. Since that time, other scholars have suggested that there may be contexts in which this construction of terms may actually be meaningful.

Here’s a quote from the english language version of Wikipedia from August 1, 2005:

This phrase can have legitimate meaning to English-Spanish bilinguals, for whom there are double-entendres about the word “green” (meaning “newly-formed”) and “sleep” (used as a verb of non-experience). An equivalent sentence [in the context understood by these English-Spanish bilinguals] would be “Newly formed, bland ideas are unexpressible in an infuriating way.”

This little example provides an excellent case study of the role context plays in communication. Never mind the fact that the sentence was first defined in a context for which it held no meaning. Since the moment of its invention, other contexts have either been recognized or constructed around the sentence in which it holds meaning.

The notion of “context” as that mileiu which drives the interpretation of a sentence such as this is the same notion that explains how the meaning of any coded message must be interpretted. This would include messages encoded in the data structures of computer systems. Data within a omputer system is constructed within and in order to support specific information recordation and transmittal of things important to a specific context. This context is the tacit agreement between the software developers and the business community on what the “typical interpretation” of those computer symbols should be.

The importance of context to the understanding of the data integration problem cannot be understated (which is why I keep coming back to it on this blog). While many theorists recognize the role context plays, and many pundits have written about the failures of computer systems when context has been ignored or mishandled, practitioners continue to develop and deploy applications with little explicit attention to context.

All computer applications written in business today are written from some point of view. This point of view establishes the context of the system. Most developers would agree with these statements. The trick is to define a system which allows the context of the system to change and evolve over time, as the business community learns and invents it. It must be a balancing act between excluding the software equivalent of Chomsky’s meaningless statement, and allowing the software to adapt as the context shifts to allow real meaning to be applied to those structures.

Example of How Meaning Is Attached to Structure

What follows is a detailed example of the thought process followed by a software developer to create a class of data structures and how meaning is attached to those structures.

Consider that the meaning of one data structure may be composed of the collection of meanings of a set of smaller structures which themselves have meaning. Take the following description as the meaning to be represented by a structure:

An employee is a human being or person. Each employee has a unique identity of their own. Each employee has a name, which may be the same as the name of a different person or employee. Being human, each employee has an age, calculated by counting the number of years since they were born up to some other point in time (such as present day).Each person of a certain age may enter into a marriage with another human being, who in turn also has their own identity and other attributes of a person.

To represent this information using data structures (i.e., to project the meaning of this information onto a data structure), we might tie the various concepts about a human being/employee to a computer-based data structure. Recognizing that a human being is an object with many additional characteristics of which we might want to know about, we might choose to project the concept of “human beings” or “people” onto a relational table and the concept of a particular individual onto one of that table’s rows (or a similar record structure).

This table would represent a set of individual human beings, and onto each row of the table would be projected the meaning of a particular human being. Saying this again in a more conventional manner, we would say that each row of the table will reference a singular and particular human being, the all of the rows will represent the set of all human beings we’ve observed in the context of our usage of the computer system.

In a more mathematical vein, we would define a projection Þ from the set of actual human beings Α onto Š, (Þ(Α) |–> Š), the set of data structures such that for any α in Α where α is a human being, there is a record or row σ in Š that represents that human being.

A record data structure being a conglomeration of fields, each of which can symbolically represent some attribute of a larger whole, then we might project additional attributes of the human being, such as their name and identifier, to particular fields within the record. If σ is the particular record structure representing a particular human being, α, then the meaning (values) of the attributes of that person could be associated with the fields, f1..fn, of that record through attribute-level projections, ψ1..ψn for attributes 1 .. n.

To represent a particular person, first we would project the reference to the person to a particular row, Þ(α) |–> σ, then we would also project the attribute facts about that person onto the individual fields of that row:

ψ1(α.1) |–> σ.f1

ψn(α.n) |–> σ.fn

Projection onto Relational Structure

When modeling a domain for incorporation into computer software, the modeler’s task is to define a set of structures which software can be written to manipulate. When that software is to use relational database management systems, then the modeler will first project the domain concepts onto abstract relational structures defined over “tuples”. These abstract structures have a well-defined mathematical nature which if followed provides very powerful manipulations. The developer projects meaning onto relations in a conventional way, such as by defining a relation of attributes to represent “PERSON” – or the set of persons, and another relation of attributes to represent “EMPLOYEE” – or the set of persons who are also employees. Having defined these relational sets, the relational algebra permits various mathematical operations/functions to be applied, such as “JOIN” and “INTERSECTION”. These functions have strictly defined properties and well-defined results over arbitrary tuples. The software developer having projected meaning onto the individual relations, he is also therefore able to project meaning on the outcomes of these operations which can then be used to manipulate large sets of data in an efficient, and semantically correct way.

As the developer creates the software however, they must keep in mind what these functions are doing on two levels, at the level of the set content and at the level of the represented domain (the referent of the sets and manipulations). Thus the intersection of the PERSON and EMPLOYEE relations should produce the subset of tuples (records, etc.) which has its own meaning derived from the initial projected meaning of the original sets. Namely, this intersection represents the set of PERSONS who are also EMPLOYEES, (which is the same, alternatively, as the set of EMPLOYEES who are also PERSONS). This is an important point about software: the meaning is not simply recorded in the data structure but the manipulations of the data by the computer themselves have specific connotations and implications on the meaning of data as it is processed.

Representational Redundancy

As a typical practice in the projection of information onto data structures within the relational model, there will usually be a repetition of the information projected onto more than one symbol. In particular, the reference to the identity of a single person will be represented both by the mere existence of a single row in the table, and also by a subset of fields on the row which the software developers have chosen (and which the software enforces) for this purpose. In other words, under common software development practices, each record/row as a conglomerate entity will represent a single person. In addition, there will be k attributes (1 <= k <= n) on that record structure whose values in combination also represent that same individual. These k attributes make up the “primary key” of the data structure. The software developer will use and repeat these columns on multiple data structures to permit additional concepts regarding the relationship between that person and other ideas also being recorded. For example, a copy of one person record’s primary key could be placed on another person record and be labelled “spouse”. The attributes which make up the primary key often have less mechanical meanings as well (for example, perhaps the primary key for our person includes the name attribute. As part of the primary key, the name value of the person merely helps to reference that person. It also in its own right represents the name of the person.

How Meaning Attaches to Data Structures: A Summary

What follows is a high level summary of how humans attach meaning to various kinds of data structures within a computer. It will serve as a good baseline account, though certainly not an exhaustive one, providing a model upon which more detailed dicussion can begin. 

 Background Terminology

Computer systems provide functionality to support the performance and record of business processes. They do that through three inter-related features: DATA, LOGIC, and PRESENTATION. The presentation consists of information displays permitting both an information visualization aspect and an information capture aspect. The logic consists of several aspects, much of it having to do with support of the presentation and manipulation of displays, but also a lot of it having to do with creation, transformation and storage of data. Data consists of sets of symbols constructed in a systematic, regular fashion using a set of data structures. Different data structures are constructed to represent different aspects of the recorded activity. It is in the relationships between the macro and micro structures where the specific detailed information captured.generated by the business process resides. By following a codified, rigid construction of its data structures, the computer system is able to record multiple recurring instances of similar events. Through the development of fixed transformations using program logic, the computer system is able to make routine, conventional conclusions about those events or observations, and it is able to maintain and retain those observations virtually indefinitely.
Data is maintained and stored in DATA STRUCTURES. The more regular these data structures are, the more easily they are interpreted by a broad audience of software developers. In most situations, the PRESENTATION of the data captured by a system to the end user of that system is in a more directly understandable form than the way that information is stored in the computer.  (This statement is not only trivially true, but in a very deep sense too, since the computer actually stores everything using more and more complex sequences of binary digits. That’s a different subject than our current presentation.)  The data structures within the computer system typically exist in two, simultaneous forms, one intended to support human reasoning (through what is often called a “logical”, “abstract” or “conceptual” model) and one supporting manipulations by the computer. Most software developers today strictly deal with the abstract model of the data for design, coding, and discussion. (There are still some developers working in assembly level code, but even that is at a more abstract level than the actual electro-mechanical machinations of the actual hardware!)
An obvious observation, at least on its face, is that different computer systems will store data representing similar ideas using different structures. We need to keep this in the back of our minds as we progress through the rest of this discussion, but it will be more directly adressed in other entries.
 A final thought concerns sets of data of similar structure, called a POPULATION. A population of data consists of some set of data symbols, all constructed using the same data structure pattern which represents a set of similar ideas. The classification of populations of data structures applies to the DATA portion of systems, represents an analogous classification of sets of observed events external to the computer system, and is affected by and affecting the LOGIC and PRESENTATION portions of the computer system. A more detailed definition of the notion of a “population” will also be treated in separate sections.

Commonalities of Structure

Many computer systems, especially those built in support of business (or other human activity) processes, are constructed using a conventional system of abstract data structures. (When I say they are “conventional” what I mean is that the majority of software developers follow conventional patterns for the construction of data structures to represent their idiosynchratic subject areas.) Whether these structures are called “objects”, “tables”, “records”, or something else, they typically take the form of a heterogenous collection of smaller structures grouped together into regular conglomerations. Instances or examples of the larger collections of data structures will each be said to “represent” individual intances of some real-world conglomerate. Each of the individual component element structures of these conglomerations will each be said to represent the individual attributes or characteristics of the real-world conglomerate object. In order to permit efficient processing by the computer,   instances of similar phenomenon will be represented by the same kind of conglomeration.
Typically, business systems will be based on a data structure called a RECORD.  Records consist of a series of “attribute data structures” all related in some fashion to each other. (A more complex structure called an “object” still has record-like attributes combined together to represent a larger whole, the nuances and variation of object-based representation is a subject for later.)  Each RECORD will stereotypically symbolize one instance of a particular concept. This could be a reference to and certain observed details of a real-world object, or it could be something more ephemereal like observations of an event. For example, one “PERSON” record would represent a single individual person.
RECORDS themselves consist of individually defined data elements or FIELDS. Each RECORD of a particular type will share the same set of FIELDS. Each FIELD will symbolize one kind of fact about the thing symbolized by the RECORD. For example, a NAME field on a PERSON record will record what the represented individual’s name is, at least as it was at the time the record was created. 
The set of all records within a system having the same structure will typically be collected and stored together, often in a data structure called a TABLE. Each TABLE will symbolize the set of KNOWN INSTANCES of whatever type of thing each record represents. TABLES are also described as having ROWS and COLUMNS. Each row of a table is one RECORD. The set of shared element-attribute structures across the set of  rows can be described as the “columns” of the table. Each column represents the set of all instances of a FIELD in the table, in other words, the same field across all records. Tables are a commonly used data structure because they readily support interpretation using relational algebra and set theoretic operations, as well as being easily presented and understood both by human and computer.  

Basic Data Structures and Their Relationships

The nomenclature of “record”, ” table”, “row”, “column” and “fields” describes the construction building blocks of an abstract syntactic medium whose usage permits humans to represent complex concepts within the computer system. By assigning names to various collections and combinations of these generic structures, humans project meaning onto them. Using diagrams called “data models”, a short hand of sorts allows the modeler to describe how the generic tables and fields relate to each other and what these relationships signify in the external world. These models also, by virtue of the typified short hand they use, allows for the generation of computer logic that can be applied to a database to support certain standard operations and manipulations of the data generated by a computer system.

Traditional data modeling results in the creation of a data dictionary which relates each structural element to a particular kind of concept. Every structure will be given a name, and if the developers are diligent, these can be associated with more fully realized text descriptions as well. Some aspects of the data structures are not described, at least typically, within a data model, such as populations or subsets of records with similar structures.

Traditional data dictionary entries record name and description of the set of all structures contained in a table. Using a set of structures to represent a set or collection of similar objects is itself a symbolic action. So not only does each row in a table represent one instance of some type of thing, and each column represents one observed (or derived) fact or attribute of that instance, but the collection of all instances of these row data structures also represents the logical set or population of these things.

The strategy for applying meaning to these data structures begins when the decision is made to treat the entirety of each record as the representation of a member of a population of like things. Being similar, then, a set of fields is conceived to capture various detailed observations regarding the things. These fields are intended to capture details about both how each thing is different from the other things in the collection, but also how different things may share similarities. Much of the business logic of the application system will be consumed by the comparisons between individual things, and the mathematical derived counts (and other metrics) of those sets of things (and of subsets within). Using the computer to compare the bit sequences contained in each field, the computer will indicate whether these contents are the same or different between different instances. Humans will then interpret the results of these comparisons by projecting the conclusion out of the computer and into the conceptual world.

For example, let’s say that we have defined the computer sequence “10101010” to represent a reference to a specific person, “Julie Smith”. If we take two different instances of bit sequences and compare them in the computer, the computer will tell us if they are the same or not. As humans, we would then interpret the purely electro-mechanical result which the computer calculated that “10101010” and “10101010” are the same as an indication that the two instances of these sequences represent the same specific person. Likewise, we would interpret a computer result indicating that two bit sequences were not the same as an indication that different people were being referred to.  This type of projection of meaning from mechanical result to logical inference is fundamental to the way humans use computers.

The specific number of fields and their bit sequence representations (data types)  that are developed within a computer application is entirely dependent on the complexity of the problem domain and the attributes of the objects required to reason over that domain. However, no matter how simple or complex, it is the projection of meaning onto the representation of these attributes in the computer and the projection of an interpretation onto the results of the computer comparisons of the physical representations which makes the computer the powerful engine that it is in our society.

How Row Subsets Represent Subpopulations
How Row Subsets Represent Subpopulations

 

A Long Time Ago…

I just came across this pearl of insight that I wrote a long time ago. I think it still stands:

The problem of understanding historical data and its meaning is both one of determining the user’s understanding and acceptance of the data and determining the flexibility of the supporting software. If a record, as understood by the user community, represents a particular concept in a particular way, the desire to re-use the structure implies that a change in the user culture will be required. If the system itself has built-in constraints as well supporting the accepted meaning, then the problem is in the system’s ability to accomodate new meaning, not just in the user’s willingness to accept new meaning. Where both aspects to the historical data problem exist, it should be easier in the long run not to change the meaning of a structure, but rather to implement a new structure with the desired meaning.

Howe, Geoffrey A. and Dr. Geof Goldbogen. “The Integration Analysis Filter: A Software Engineering Technique for Integrating Old and New.” Proceedings of the Fourth International Conference, Expert Systems in Production and Operations Management, May 14-16, 1990.

%d bloggers like this: