Data Integration Musings, Circa 1991

I recently stumbled over this very old text. It is really just notes and musings, but thought it was interesting to see some of my earliest thoughts on the data integration problem. Presented as is.

Mechanical Symbol Systems

To what extent can knowledge be thought of as sentences in an internal language of thought?
Should knowledge by seen as an essentially biological, or essentially social, phenomenon?
Can a machine be said to have intentional states, or are all meanings of internal machine representations essentially rooted in human interpretations of them?

Robot Communities

How can robots and humans share knowledge?
Can artificial reasoners act as vehicles for knowledge transfer between humans? (yes, they already are – see work on training systems)

Human Symbol Systems
Structures: Concepts, Facts and Process
Human Culture
Communication Among Individuals

Discourse

The level of discourse among humans is very complex. Researchers in the natural language processing field would tell you that human discourse is very hard to capture in computer systems. Humans of course have no problem following the subject changes and shifting contexts of discourse.
Language is the means through which humans pass information to one another. Historically, verbal communication has been the primary means of conveying information. Through verbal communication, parents teach their children, conveying not just facts, but also concepts and world view. Through socialization, children learn the locally acceptable way in which to exist in the world. Through continual human contact, all persons reinforce their understanding of the world. Culture is a locally defined set of concepts, facts and processes.

Myth

One of the most important transmission devices for human communication is myth. Myth is story-telling, and therefore is largely verbal in nature.

Ritual

Ritual also is used to communicate knowledge and reiterate beliefs among individuals. Ritual is performance, and can be used to teach process.

Information Systems Structures: Concepts, Facts and Process

The conceptual level of a standard information system may be stored in a database’s data dictionary. In some cases, the data dictionary is fairly simplistic, and may actually be hidden within the processes which maintain the database, inaccessible to outside review except by skilled programmers. More sophisticated data dictionaries, such as IBM’s Repository, and other CASE tools, make explicit the machine-level representation of the data contained in the system. The concepts stored in such devices are largely elementary, and idiosynchratic.
They are elementary in that a single concept in a data dictionary will generally refer to a small item of data called variously a “column” or a “field”. What is expressed by a single entry in a data dictionary is a mapping from an application-specific concept, for instance “PART_NUMBER”, to a machine-dependent, computable format (numeric, 12 decimal digits).
A “fact” in a database sense is a single instance or example of a data dictionary concept coupled with a single value.

Communication Across Information Systems, Custom Approaches

Information systems typically have no provision either to generate or understand discursive communication. Typically, information shared between two information systems must be rigidly defined long before transmission begins. This takes human intervention to define transmission carriers, as well as format, and periodicity.

Networks

The ISO OSI seven layers of communication was an initial attempt at defining the medium of computer communication. All computers which required communications services faced the same problems. Much of the work in networking today is geared toward building this ability to communicate. For humans, communication is through the various senses, taking advantage of the natural characteristics of the environment and the physical body. The majority of computers do not share the same senses.
Distributed systems are those in which all individual systems are connected via a network of transmission lines, and in which some level of pre-defined communication has been developed. The development of distributed database systems represents the first steps toward homogenation of mechanical symbol systems.

Electronic Data Interchange

EDI takes the communication process a step farther by introducing a rudimentary level of discourse among individual enterprises. Typically discourse is restricted to payments and orders of material, and typically these interchanges are just as static as earlier developments. The difference here is that human intervention is slowly developing a cultural definition of the information format and content that may be allowed to be transferred.
As standards are developed describing the exact nature and structure of the information that any company may submit or recieve, more of a culture of discourse can be recognized in the process overall. The discourse is of course carried out by humans at this point, as they define a syntax and semantics for the proper transmission of information in the domain of supply, payment, and delivery (commerce).
Although it is ridiculous to talk of an “EDI culture” as a machine-based, self-defining, self-reinforcing collection of symbols in its own right, it is a step in that direction. What EDI, and especially the development of standards for EDI transmissions, represents is an initial attempt to define societal-like communication among computers. In effect, EDI is extending the means of human discourse into the realm of high-speed transaction processing. The standards being developed for the format and type of transactions allowed represent a formalization and agreement among the society of business enterprises on the future language of commerce.

Raising Consciousness in Mechanical Symbol Systems

In order to partake of the richness and flexibility of human symbol systems, machines must be given control of their own senses. They must become aware of their environment. They must become aware of their own “bodies”. This is the mind-body problem.
mission lines, and in which some level of pre-defined communication has been developed. The development of distributed database systems represents the first steps toward homogenation of mechanical symbol systems.

Electronic Data Interchange

EDI takes the communication process a step farther by introducing a rudimentary level… (Author note: transcript cuts off right here)

Root Causes of the Data Integration Problem

The Fundamental Phenomenon – Human Behavior

4/24/2005

Writing over a century ago, Emile Durkheim and Marcel Mauss recognized and documented the true root cause of today’s data integration woes. (Primitive Classification, 1903, page 5-6 as quoted by Mary Douglas in Natural Symbols, page 61-62)

At the bottom of our conception of class there is the idea of circumscription with fixed and definite outlines. 

Given that this concept of classification is the basis of logic, social discourse, religion and ritual, it should not be a surprise that it also comes into play when software developers write software. They make assumptions and assertions in the design, data and code of their systems that rely on a fixed vision of the problem. Applications may be written for maximum flexibility in some ways, and still there is an intent on the part of the developers to define the breadth and width of the system,  in other words, to bound and fix in place the concepts and relations supportable by the application.

The highly successful ERP products like SAP, JD Edwards, and ORACLE Financials allow tremendous flexibility to configure for different business practices. The breadth of businesses that can make these products work for them is very large. However, it is a common understanding in the ERP professional community (of installers) that there are some things in each product that just can’t be changed or accomplished. In these areas, the business is said to have to change to accommodate the tool. The whole industry of “change management” was born from the need to change the PRACTICE of business due to the ultimate limitations of these systems which were imposed by the conceptual boundaries their authors had to place upon them. (This is a different subject which should be pressed and researched). No matter how flexible the business system is, it is ultimately, and fundamentally, a fixed and bounded symbolic system.

 So how does this relate to my claim that Durkheim and Mauss have unwittingly predicted the current crisis of data integration? Because they go on to point out that: 

It would be impossible to exaggerate, in fact, that state of indistinction from which the human mind developed. Even today a considerable part of our popular literature, our myths, and our religions is based on a fundamental confusion of all images and ideas. They are not separated from each other, as it were, with any clarity. 

This “conceptual stew” is present in every aspect of life. The individual human mind is particularly adept at working within this broad confusion, picking and choosing what to believe is true based on internal processes. Groups of individuals, in order to communicate, will add structure and formality to certain portions thru discussion and negotiation. But this “social” activity is not always accompanied by strong enforcement by the community.

 As Mary Douglas (Natural Symbols, page 62) continues from Durkheim and Mauss, individuals in modern society (and increasingly this encompasses the global community) are presented with many different conceptual mileaus during the course of a single day. Within each person, she indicates,

 A classification system can be coherently organized for a small part of experience, and for the rest it can leave the discrete items jangling in disorder. Or it can be highly coherent in the ordering it offers for the whole of experience, but the individuals for whom it is available may enjoy access to another competing and different system, equally coherent in itself, from which they feel free to select segments here and there eclectically, not worrying about the overall lack of coherence. Then there will be conflicts, contradictions and uncoordinated areas of classification for these people.

 This not only describes a few individuals, but it is my contention that this describes the whole of human experience. Nowhere in the modern world especially, except perhaps when alone with oneself, will the individual find a single, coherent, non-contradictory and comprehensive classification of the world. Instead, the individual is faced with dozens or hundreds of partial, conflicting conceptions of the world. Being the adaptable human being her ancestors evolved her to be, however, this utter muddle is rarely a problem in a healthy person. The brain is a reasoning engine built especially to handle this confusion, in fact it thrives on it – the source of much that we call “creative” or “humorous” or “brilliant” is derived from this ever-changing juxtaposition and jostling of different, partial conceptions. Human society expands from the breadth and complexity created by these different classification systems. Communication between strangers depends on the human capacity to process and understand commonalities and fill in the blanks in the signal.

The very thing which defines us as human, our ability to communicate across fuzzy boundaries, is also that thing that creates and exacerbates the Data Integration Problem in our software. Our software “circumscribes with fixed and definite outlines” some small aspect of our experience. In doing so, it denies the fuzziness of our larger reality, and imposes barriers between systems.

The Folk Model – What We Really Build Software From

The anthropological notion of a “folk model” can be a useful paradigm to consider when analyzing the implementation of software applications. Folk models are the proto-scientific conceptualizations of a group of people which they use to describe, understand and interact some aspect of their collective experience.

When writing software, especially but not only within the Agile approach, it is the through the elicitation and joint “discovery” of the user’s folk model that a common set of requirements for the software is defined. Ultimately, it is the closeness of fit between the folk model and the operation and symbology of the software that will determine its success or failure.

Different groups of people faced with the same or similar problems may develop largely similar folk models, and from these, different software development teams may create largely similar software applications. This is one reason why the software development process works best as a hand-crafted enterprise.

But what at first appears to be minor discrepancies between what the software model presents and what the folk model expects can grow so large that it can cause the failure of the software for those users. Especially if the folk model was flawed or in a state of flux at the time the software tried to codify it (and really, when is a folk model not in flux?).

How Meaning Attaches to Data Structures: A Summary

What follows is a high level summary of how humans attach meaning to various kinds of data structures within a computer. It will serve as a good baseline account, though certainly not an exhaustive one, providing a model upon which more detailed dicussion can begin. 

 Background Terminology

Computer systems provide functionality to support the performance and record of business processes. They do that through three inter-related features: DATA, LOGIC, and PRESENTATION. The presentation consists of information displays permitting both an information visualization aspect and an information capture aspect. The logic consists of several aspects, much of it having to do with support of the presentation and manipulation of displays, but also a lot of it having to do with creation, transformation and storage of data. Data consists of sets of symbols constructed in a systematic, regular fashion using a set of data structures. Different data structures are constructed to represent different aspects of the recorded activity. It is in the relationships between the macro and micro structures where the specific detailed information captured.generated by the business process resides. By following a codified, rigid construction of its data structures, the computer system is able to record multiple recurring instances of similar events. Through the development of fixed transformations using program logic, the computer system is able to make routine, conventional conclusions about those events or observations, and it is able to maintain and retain those observations virtually indefinitely.
Data is maintained and stored in DATA STRUCTURES. The more regular these data structures are, the more easily they are interpreted by a broad audience of software developers. In most situations, the PRESENTATION of the data captured by a system to the end user of that system is in a more directly understandable form than the way that information is stored in the computer.  (This statement is not only trivially true, but in a very deep sense too, since the computer actually stores everything using more and more complex sequences of binary digits. That’s a different subject than our current presentation.)  The data structures within the computer system typically exist in two, simultaneous forms, one intended to support human reasoning (through what is often called a “logical”, “abstract” or “conceptual” model) and one supporting manipulations by the computer. Most software developers today strictly deal with the abstract model of the data for design, coding, and discussion. (There are still some developers working in assembly level code, but even that is at a more abstract level than the actual electro-mechanical machinations of the actual hardware!)
An obvious observation, at least on its face, is that different computer systems will store data representing similar ideas using different structures. We need to keep this in the back of our minds as we progress through the rest of this discussion, but it will be more directly adressed in other entries.
 A final thought concerns sets of data of similar structure, called a POPULATION. A population of data consists of some set of data symbols, all constructed using the same data structure pattern which represents a set of similar ideas. The classification of populations of data structures applies to the DATA portion of systems, represents an analogous classification of sets of observed events external to the computer system, and is affected by and affecting the LOGIC and PRESENTATION portions of the computer system. A more detailed definition of the notion of a “population” will also be treated in separate sections.

Commonalities of Structure

Many computer systems, especially those built in support of business (or other human activity) processes, are constructed using a conventional system of abstract data structures. (When I say they are “conventional” what I mean is that the majority of software developers follow conventional patterns for the construction of data structures to represent their idiosynchratic subject areas.) Whether these structures are called “objects”, “tables”, “records”, or something else, they typically take the form of a heterogenous collection of smaller structures grouped together into regular conglomerations. Instances or examples of the larger collections of data structures will each be said to “represent” individual intances of some real-world conglomerate. Each of the individual component element structures of these conglomerations will each be said to represent the individual attributes or characteristics of the real-world conglomerate object. In order to permit efficient processing by the computer,   instances of similar phenomenon will be represented by the same kind of conglomeration.
Typically, business systems will be based on a data structure called a RECORD.  Records consist of a series of “attribute data structures” all related in some fashion to each other. (A more complex structure called an “object” still has record-like attributes combined together to represent a larger whole, the nuances and variation of object-based representation is a subject for later.)  Each RECORD will stereotypically symbolize one instance of a particular concept. This could be a reference to and certain observed details of a real-world object, or it could be something more ephemereal like observations of an event. For example, one “PERSON” record would represent a single individual person.
RECORDS themselves consist of individually defined data elements or FIELDS. Each RECORD of a particular type will share the same set of FIELDS. Each FIELD will symbolize one kind of fact about the thing symbolized by the RECORD. For example, a NAME field on a PERSON record will record what the represented individual’s name is, at least as it was at the time the record was created. 
The set of all records within a system having the same structure will typically be collected and stored together, often in a data structure called a TABLE. Each TABLE will symbolize the set of KNOWN INSTANCES of whatever type of thing each record represents. TABLES are also described as having ROWS and COLUMNS. Each row of a table is one RECORD. The set of shared element-attribute structures across the set of  rows can be described as the “columns” of the table. Each column represents the set of all instances of a FIELD in the table, in other words, the same field across all records. Tables are a commonly used data structure because they readily support interpretation using relational algebra and set theoretic operations, as well as being easily presented and understood both by human and computer.  

Basic Data Structures and Their Relationships

The nomenclature of “record”, ” table”, “row”, “column” and “fields” describes the construction building blocks of an abstract syntactic medium whose usage permits humans to represent complex concepts within the computer system. By assigning names to various collections and combinations of these generic structures, humans project meaning onto them. Using diagrams called “data models”, a short hand of sorts allows the modeler to describe how the generic tables and fields relate to each other and what these relationships signify in the external world. These models also, by virtue of the typified short hand they use, allows for the generation of computer logic that can be applied to a database to support certain standard operations and manipulations of the data generated by a computer system.

Traditional data modeling results in the creation of a data dictionary which relates each structural element to a particular kind of concept. Every structure will be given a name, and if the developers are diligent, these can be associated with more fully realized text descriptions as well. Some aspects of the data structures are not described, at least typically, within a data model, such as populations or subsets of records with similar structures.

Traditional data dictionary entries record name and description of the set of all structures contained in a table. Using a set of structures to represent a set or collection of similar objects is itself a symbolic action. So not only does each row in a table represent one instance of some type of thing, and each column represents one observed (or derived) fact or attribute of that instance, but the collection of all instances of these row data structures also represents the logical set or population of these things.

The strategy for applying meaning to these data structures begins when the decision is made to treat the entirety of each record as the representation of a member of a population of like things. Being similar, then, a set of fields is conceived to capture various detailed observations regarding the things. These fields are intended to capture details about both how each thing is different from the other things in the collection, but also how different things may share similarities. Much of the business logic of the application system will be consumed by the comparisons between individual things, and the mathematical derived counts (and other metrics) of those sets of things (and of subsets within). Using the computer to compare the bit sequences contained in each field, the computer will indicate whether these contents are the same or different between different instances. Humans will then interpret the results of these comparisons by projecting the conclusion out of the computer and into the conceptual world.

For example, let’s say that we have defined the computer sequence “10101010” to represent a reference to a specific person, “Julie Smith”. If we take two different instances of bit sequences and compare them in the computer, the computer will tell us if they are the same or not. As humans, we would then interpret the purely electro-mechanical result which the computer calculated that “10101010” and “10101010” are the same as an indication that the two instances of these sequences represent the same specific person. Likewise, we would interpret a computer result indicating that two bit sequences were not the same as an indication that different people were being referred to.  This type of projection of meaning from mechanical result to logical inference is fundamental to the way humans use computers.

The specific number of fields and their bit sequence representations (data types)  that are developed within a computer application is entirely dependent on the complexity of the problem domain and the attributes of the objects required to reason over that domain. However, no matter how simple or complex, it is the projection of meaning onto the representation of these attributes in the computer and the projection of an interpretation onto the results of the computer comparisons of the physical representations which makes the computer the powerful engine that it is in our society.

How Row Subsets Represent Subpopulations
How Row Subsets Represent Subpopulations

 

A Long Time Ago…

I just came across this pearl of insight that I wrote a long time ago. I think it still stands:

The problem of understanding historical data and its meaning is both one of determining the user’s understanding and acceptance of the data and determining the flexibility of the supporting software. If a record, as understood by the user community, represents a particular concept in a particular way, the desire to re-use the structure implies that a change in the user culture will be required. If the system itself has built-in constraints as well supporting the accepted meaning, then the problem is in the system’s ability to accomodate new meaning, not just in the user’s willingness to accept new meaning. Where both aspects to the historical data problem exist, it should be easier in the long run not to change the meaning of a structure, but rather to implement a new structure with the desired meaning.

Howe, Geoffrey A. and Dr. Geof Goldbogen. “The Integration Analysis Filter: A Software Engineering Technique for Integrating Old and New.” Proceedings of the Fourth International Conference, Expert Systems in Production and Operations Management, May 14-16, 1990.

Example Interaction Between Parent and Child Context

In a previous post, I described in general some of the relationships that could exist between and across a large organization’s sub-contexts. What follows is a short description of some actual observations of how the need for regional autonomy in the examination and collection of taxes affected the use of software data structures at the IRS.

Effect of Context on Systems and Integration Projects

July 15, 2005

Contexts lay claim to individual elements of a syntactic medium. A data structure (syntactic medium) used in more than one context by definition must contain meaningful symbols for each context. Some substructures of the data structure may be purposefully “reserved” for local definition by child contexts. In the larger, shared context, these data structures may have no meaning (see the idea of “traveller” symbols). When used by a child context, the meaning may be idiosyncratic and opaque to the broader context.

One way this might occur is through the agreement across different organizational groups that a certain structure be set aside for such uses. Two examples would include the automated systems at the IRS used respectively for tax examinations and tax collections.

Within the broad context defined by the practitioners of “Tax Examination” which the examination application supports, several child contexts have been purposefully developed corresponding to “regions” of the country. Similar organizational structure have also been defined for “Tax Collection” which the collection application supports. In both systems, portions of the syntactic media have been set aside with the express purpose of allowing the regional contexts to project additional, local meaning into the systems.

While all regions are contained in the larger “Examination” or “Collection” contexts, it was recognized that the sheer size of the respective activities was too great for the IRS central offices to be able to control and react to events on the ground in sufficient time. Hence, recognizing that the smaller regional authorities were in better position to diagnose and adjust their practices, the central authorities each ceded some control. What this allowed was that the regional centers could define customized codes to help them track these local issues, and that each application system would capture and store these local codes without disrupting the overall corporate effort.

Relying on the context defined and controlled by the central authorities would not be practical, and could even stifle innovation in the field. This led directly to the evolution of regional contexts. 

Even though each region shares the same application, and that 80 to 90% – even 95% – of the time, uses it in the same way, each region was permitted to set some of its own business rules. In support of these regional differences in practice, portions of the syntactic medium presented by each of the applications were defined as reserved for use by each region. Often this type of approach would be limited to classification elements or other informational symbols, as opposed to functional markers that would effect the operation of the application.

This strategy permits the activities across the regions to be rolled up into the larger context nearly seamlessly. If each region had been permitted to modify the functionality of the system, the ability to integrate would be quickly eroded, causing the regions to diverge and the regional contexts to share less and less with time. Eventually, such divergence could lead to the need for new bridging contexts, or in the worst case into the collapse of the unified activity of the broader context.

By permitting some regional variation in the meaning and usage of portions of the application systems, the IRS actually strengthened the overall viability of these applications, and mitigated the risk of cultural (and application system) divergence.

Brass Tacks and Comparability

So I thought I should try to explain “comparability” very simply. Reading my previous posts, which were derived from larger texts, I spend a lot of time saying a lot of generalities, and I think the main point is getting missed. So here’s me getting down to brass tacks on the subject.

A computer CPU is a very basic electrical device. Send it a stream of electrons and a command to “add”, and it returns another stream of electrons representing a purely “mechanical” (i.e., unintelligent) electrical result. That CPU doesn’t know anything about semantics, or whether the switches and gates it opens and closes should appropriately be applied to those particular data streams. It just does what it was designed to do given that particular sequence of electron streams. If the streams are comparable before they get to the CPU, then the output will be meaningful. If they are not comparable, then the output (and being a CPU, there will be some output) will not be meaningful.

So the job of the software is to manipulate each symbol before presenting it to the CPU. In particular, the software needs to take each symbol and replace it with one that MEANS the same as the original symbol, but which will present itself to the CPU as COMPARABLE to the other symbols.

Comparability has to be put into the computer, through the software, by a human being. In particular, it is the human who understands when one data stream is not comparable to another, and it is the human being who writes the code to change one stream so that it becomes comparable to the other.

So what really are we talking about? Let me make a non-computer example to show the point.

2 + 00000010 = IV

If I take a pencil and write the above string of characters on a piece of paper, and show it to another computer programmer, after a few moments, I would expect that person to agree that this is a correct mathematical statement

 two plus two equals four

Part of the success of the person in understanding the original statement is that they are able to parse each symbol in the string, interpret the MEANING of each symbol, then translate each into COMPARABLE numeric ideas.

If the computer CPU could experience each symbol as I’ve written it (let’s agree that each of the symbols depicted here would have similar diversity of structure in the computer as they do here on the page), then we can immediately grasp what comparability is. The CPU does not know what the symbols mean, it cannot make the interpretation just by looking at the symbols as they are presented and come to the same conclusion as the human. 

If we look at what I, the human did, to provide you, the reader, with a more readable version of the equation, I replaced each symbol with another one that meant the same, but which appeared as mutually comparable symbols:

  • 2   –>  two
  • +  –>  plus
  • 00000010  –>  two
  • =  –>  equals
  • IV  –>  four

Before the CPU can compare the symbol “2” to the symbol “00000010”, they must both be replaced with two other symbols, each with the standard interpretation of “two”. These new symbols must be structured to flow through the CPU in such a way that their very structure is modified by the CPU to create a third symbol whose standard interpretation has the meaning “four”. The “plus” symbol must be translated into the CPU’s “ADD” instruction, and the “equals” symbol is represented by the stream of electricity leaving the CPU with the resulting symbol.

Is MDM An Attempt to Reach “Consensus Gentium”?

Consensus gentium

An ancient criterion of truth, the consensus gentium (Latin for agreement of the peoples), states “that which is universal among men carries the weight of truth” (Ferm, 64). A number of consensus theories of truth are based on variations of this principle. In some criteria the notion of universal consent is taken strictly, while others qualify the terms of consensus in various ways. There are versions of consensus theory in which the specific population weighing in on a given question, the proportion of the population required for consent, and the period of time needed to declare consensus vary from the classical norm.*

* “Consensus theory of truth”, Wikipedia entry, November 8, 2008.

The Data Thesaurus

October 25, 2005

 So much of IT’s best practices are taken for granted that no one ever asks if there might be a better way. An example of this is in the area of data standards, enterprise data modeling, and Master Data Management (MDM). The core idea of these initiatives is to try to create a single data dictionary in which every concept important to the enterprise is recorded once, with a single standardized name and definition.

The ideal promoted by this approach is that everyone who works with data in the organization will be much more productive if they all follow one naming convention, and if every data item is documented only once. Sounds logical and practical, and yet when we look around for examples of organizations who have managed to successfully create such a document, complete it for ALL of their systems, even commercial software applications, and then who have kept it maintained and complete for more than a year or two, we find very few. In fact in my experience, which has included a number of valiant efforts, I have found no examples.

When one digs into the anecdotal reasons why such success seems so rare, some mixture of the following statements are often heard:

  1. The company lost its will, the sponsor left and so they cut the budget.
  2. It took too long, the business has redirected the staff to focus on “tactical” efforts with short return on investment cycles.
  3. Even after all that work, no one used it, so it was not maintained.
  4. We were fine until the merger, then we just haven’t been able to keep up with the document and the systems integration/consolidation activities.
  5. Our division and their division just never agreed.
  6. We got our part done, but that other group wouldn’t talk to us.

With ultimate failure at the enterprise level the more common experience, it’s surprising that no one involved in the performance and practice of data standardization has questioned what might really be going on. Lots of enterprises have had successes within smaller efforts. Major lines of business may successfully establish their own data dictionaries for specific projects. Yet very few, if any, have succeeded in translating these tactical successes into truly enterprise-altering programs.

What’s going on here is that the search for the “consensus gentium” as the Romans called it, the universal agreement on the facts and nature of the world by a group of individuals, is a never-ending effort. Staying abreast of the changes in the world that affect this consensus is increasingly impossible, if it ever was possible.

 The point here is that IT and the enterprise needs to stop trying to create a single universal dictionary. It must be recognized that such a comprehensive endeavor is an impossible task for all but the most extravagantly financed IT organizations. It can’t be done because the different contexts of the enterprise are constantly morphing and changing. Keeping abreast of changes costs a tremendous amount in both time and effort, and dollars. Proving an appropriate return on investment for such an ongoing endeavour is problematic, and suffers from the problem of diminishing returns.

 A better approach must be out there. One that takes advantage of the tactical point solutions that most enterprises seem to succeed with, while taking into account the practical limitations imposed by the constant press of change that occurs in any “living” enterprise. This blog attempts to document first-principles affecting the entire endeavor, and will build a case based on the human factors which create the problem in the first place.

 A better approach?

Why not build data dictionaries for individual systems or even small groups (as is often the full extent attempted and completed in most organizations). But instead of trying to extend these point solutions into a universal solution, take a different approach, namely the creation of a “data thesaurus” in which portions of each context are related to each other as synonyms, but only as needed for some particular solution. This thesaurus would track the movement of information through the organization by mapping semantics through and across changes in the “syntactics” of the data carrying this information. The thesaurus would need to track the context of a definition, and that definition would be less abstract and more detailed than those created by the current state of the practice. Links across contexts within the organization would be filled in only as practicality required, as the by-product of data integration projects or system consolidation efforts.

 What’s wrong with the data dictionary of today:

  1. obtuse naming conventions (including local standards and ISO)
  2. abstract data structures that have lost connection with actual data structures
  3. only one name for a concept, when different contexts may have their own colloquialisms – making it hard for practitioners to find “their data”, and even causing the introduction of additional entries for synonyms and aliases as if they were separate things
  4. abstracted or generalized definitions reflecting the “least common denominator” and losing the specificity and nuance present in the original contexts
  5. loss of variations and special cases
  6. detachment from modern software development practices like Agile, XP and even SOA

A Parable for Enterprise Data Standardization (as practiced today)

The enterprise data standard goal of choosing “one term for one concept with one definition” would be the same thing as if the United Nations convened an international standards body whose charter would be to review all human languages and then select the “one best” term for every unique concept in the world. Selection, of course, would be fairly determined to ensure that the term that “best captures” the concept, no matter what the original language was in which the idea was first expressed, would be the term selected. Besides the absurd nature of such a task, consider the practical impossibility of such a task.

First, getting sufficient representation of the world’s languages to make the process fair would require a lot of time. Once started, think of the months of argument, the years and decades that would pass before a useful body of terms would be established and agreed upon. Consider also that while these eggheads were deliberating, life around them would continue. How many new words or concepts would be coined in every language before the first missives would come out of this body? Once an initial (partial) standard was chosen, then the proselytizing would begin. Consider the difficult task of convincing the entire world to stop using the terms of their own language. How would the sale be made? Appealing to some future when “everyone will speak the same language” thus eliminating all barriers to communication most likely. As a person in this environment, how do you learn all of those terms – and remember to use them?

The absurdity of this scenario is fairly clear. Then why do so many data standardization efforts approach their very similar problem in the same way? The example above may be extreme, and some will say that I’ve exaggerated the issue, but that’s just the point I’m trying to make. When one talks with the practitioners of data standardization efforts, they almost always believe that the end goal they are striving for is nothing less than the complete standardization of the enterprise. They may realize intellectually that the job may never be finished, but they still believe that the approach is sound, and that if they can just stay at it long enough, they’ll eventually attain the return on investment and justify their long effort.

If the notion of the UN attempting a global standardization effort seems absurd, than why is the best practice of data standardization the very same approach? If we create a continuum for the application of this approach (see figure) starting at the very smallest project (perhaps the definition of the data supporting a small application system used by a subset of a larger enterprise), and ending at this global UN standardization effort, one has to wonder where along this scale does the practical success of the small effort turn into the absurd impossibility of the global effort? If we choose a point on this continuum and say “here and no further” then no doubt arguments will ensue. Probably, there will be individuals who find the parable above to not be ridiculous. Likewise, there will be others who believe that trying any standardization is a waste of time. Others might try to rationally put an end point on the chart at the point representing their current employer. These folks will find, however, that their current employer merges with another enterprise in a few months, which then raises the question is the point of absurdity further out now, at the ends of the combined organization?

Where Is The Threshold of Absurdity in Data Standardization?

Where Is The Threshold of Absurdity in Data Standardization?

Myself, I believe in being practical, as much as possible. The point of absurdity for me is reached whenever the standardization effort becomes divorced from other initiatives of the enterprise and becomes its own goal. When the data standardization focuses on the particular problem at hand, then the return on the effort can be justified. When data standardization is performed for its own sake, no matter how noble or worthy the sentiment expressed behind the effort, then it is eventually going to overextend its reach and fail.

If we all agree that at SOME point on the continuum, attempting data standardization is an absurd endeavor, then we must recognize that there is a limit to the approach of trying to define data standards. The smaller the context, the more the likelihood of success, and the more utility of the standard to that context. Once we have agreed to this premise, the next question that should leap to mind is: Why don’t our data dictionaries, tools, methods, and best practices record the context within which they are defined? Since we agree we must work within some bounds or face an absurdly huge task, why isn’t it clear from our data dictionaries that they are meaningful only within a specific context?

The XML thought leaders have recognized the importance of context, and while I don’t believe their solution will ultimately solve the problems presented by the common multi-context environments we find ourselves working in, it is at least an attempt. This construct is the “namespace” used to unambiguously tie an XML tag to a validating schema.

Data standards proponents, and many data modelers have not recognized the importance and inevitability of context to their work. They come from a background where all data must be rationalized into a single, comprehensive model, resulting in the loss of variation, ideosyncracy and colloquialism from their environments. These last simply become the “burden of legacy systems” which are anathema to the state of the practice.

Comparability: How Software Works

Back in 1990, I was working on a contract with NASA building a prototype database integration application. This was the dawn of the Microsoft Windows era, as Windows 3.0 had just been released (or was about to be). Oracle was still basically a start-up relational database vendor trying to reach critical mindshare. The following things did not yet exist which we take for granted today (and even think of as kind of out dated):

  • ODBC – allowing standardized access to databases from the desktop
  • Microsoft Access and similar personal data management utilities
  • Java (in fact most of the current web software stack was still just the twinkles in the eyes of their subsequent inventors)
  • Message-based engines, although EDI techniques existed
  • SOA and XML data formats
  • Screen-scrapers, user simulators, ETL utilities…

The point is, it was still largely a research project just to connect different databases that an enterprise might be using. Not only did the data representational difficulties that we face today exist back in 1990, but there was also a complete lack of infrastructure to support remote connection to databases: from network communication protocols, to query interfaces, to security and session continuity functions, even to standardized query languages (SQL was not the dominant language for accessing data back then), and more.

In this environment, NASA had asked us to prototype a generic capability that would permit them to take user search criteria, and to query three different database applications. Then, using the returned results from the three databases, our tool was to generate a single, unified query result.

While generally a successful prototype, during a critical review, it became clear to NASA and to us that maintaining such an application would be horribly expensive, so the research effort was ended, and the final report I wrote was delivered, then put into the NASA archives. It is just as well too, because within five years, much of the functional capabilities we’d prototyped had started to become available in more robust, standards-based commercial products.

What follows is a handful of excerpts from the final report, which while now out of context, still expresses some important ideas about how software symbols actually work. The gist of the excerpt describes how software establishes the comparability and sometimes the equivalence of meaning of the symbols it manipulates.

In a nutshell, software works with memory addresses with particular patterns of voltage (or magnetic field direction) representing various concepts from the human world. Software is constantly having to compare such “structures” together in order to establish either equivalence of meaning, or to alter meaning through the alteration of the pattern through heavily constrained manipulations. The key operation for the computer, therefore, is to establish whether or not two symbols are “comparable“. If they are not comparability, quite literally, then the computer cannot reliably compare them and produce a meaningful result.

Without further ado, here are the important excerpts from the research study’s final report, which I wrote and delivered to NASA in November 1990.

“Database Integration Graphical Interface Tools, Future Directions and Development Plan”, Geoff Howe, November 1990

2.2 The Comparability of Fields

There are many kinds of comparisons that can be made among fields. In databases, the simplest level of comparability is at the data type level. If two fields have the same simple data type (e.g., integer, character, fixed string, real number), then they can be compared to each other by a computer. This level of comparability is called “basal comparability”. Thus, if fields A and B are both integers, they can be combined, compared and related in any way appropriate for two integers.

However, two elements meeting the qualification for basal comparability may still be incomparable at the next level, that of the syntactic level. The syntactic level of comparability is that level in which the internal structure of a field becomes important. Examples of internal formats which might matter and might be important at this level include date formats, identification code formats, and string formats. In order to compare two fields in different formats, one or the other of these fields would have to be converted into the other format, or else both would have to be converted into a third format. The only meaningful comparisons that can be made among the fields of a database or databases must be made at the syntactic level.

As an example, suppose A is a field representing a date in Julian format, and suppose B is a field representing a date in Gregorian format. Assuming that both fields are stored as integers, comparing these dates would be meaningless because they lack the same syntactic structure. In order to compare these dates one or the other of these dates would have to be converted into the other format, or else both would have to be converted into a third format.

Unfortunately, having the same syntactic structure is not a guarantee that two fields can be compared meaningfully by a computer process. Rather, syntactic comparability is the minimum requirement for meaningful comparison by computer process. Another form of comparability must be incorporated as well, that of semantic comparability. Semantic comparability is based on the equivalence of the meanings attached to the contents of some pair of data items. The semantics of data items are not readily available to computer processes directly; a separate description in some form must be used to allow the computer to understand the semantic equivalence of concepts. Once such representation is in place, the computer should be able to reason over the semantic equivalence of concepts.

As an example of semantic comparability consider the PCASS fields, ITEM PART NUMBER from the FMEA PARTS table of the PCASFME subsystem, and CRIT_LRU_PART_# from the CRITICAI LRU table of the PCASCLRU subsystem. Under certain circumstances, both of these fields will hold the part numbers of “line replaceable units” or LRUs. Hence, these fields are semantically comparable. Given a list of the contents of ITEM PART NUMBER, and a similar list for CRIT LRU PART #, the assumption can be made that some of the same “line replaceable units” will be referenced in both lists.

Semantic comparability is useful when integrating data from different databases because it can be used to indicate the equivalence of concepts. Yet, semantic comparability does not imply syntactic comparability, and thus both must be present in order to satisfactorily integrate the values of fields from different databases. A definition of the equivalence of fields across databases can now be offered. Two fields are equivalent if they share the same base type; if their internal syntactic structure is the same; if their representational domains are the same; and if they represent the same concept in all contexts.

2.3 Heterogeneous Data Dictionary Architecture

 The approach which seems to have the most documentary support in the research for solving the integration of heterogeneous distributed databases uses a two-tiered data dictionary to support the construction of location-independent queries. The single data dictionary, used by both the single-site database management system, and the homogenous distributed environment, is split in two across the physical-conceptual boundary. This results in a two-level dictionary where one level describes in detail the physical fields of each integrated database, and the second level describes the general concepts stored across systems. For each unique concept represented by the physical level., there would be an entry in the conceptual level data dictionary describing that concept. Figure 2 shows the basic architecture of the two level data dictionary.

As an example of the difference between the conceptual and physical data dictionary levels, consider again the field PCASFME.FMEA PARTS.ITEM PART NUMBER. This is the full name of the actual field in the PCASS database. The physical level of the data dictionary would have this full name, plus the details of how this field is represented (character string, twelve places long). The conceptual level of the data dictionary would contain a description of the contents of the field, and a conceptual field name, “line replaceable unit part number”. Other fields in other tables of PCASS or in other databases may also have the same meaning. This fact poses the problem of mapping the concept to the physical field, which will be described below. Notice, however, how much easier it would be for a user to be able to recall the concept “line replaceable unit part number”, as opposed to the formal field name. This ease of recall is one of the major benefits of the two-level data dictionary being proposed. Two important relationships exist between the conceptual and physical data dictionaries. One of the relationships between fields of the conceptual level data dictionary and fields of the physical level data dictionary can be characterized as one-to-many. That is, one concept in the conceptual data dictionary could have many physical implementations. Identification of this type of relationship would be a matter of identifying and recording the semantic equivalences across system boundaries among fields at the physical level. All physical fields sharing the same meaning are examples of this one-to-many relationship.

Within the PCASS system, the concept of a line replaceable unit part number” occurs in a number of places. It has already been mentioned that both the ITEM PART NUMBER field of the FMEA_PARTS table, and the CRIT LRU PART # field of the CRITICAI_LRU table, represent this concept. The relationship between the concept and these two fields is, therefore, one-to-many.

The second type of relationship which may also be present, depending on the nature of the existing databases, relates several different concepts to a single field. This relationship is characterized as “many-to-one”. Systems which have followed strict database design rules should result in a situation where every field of the database represents one and only one concept. In practical implementations, however, it is often the case that this rule has not been thoroughly implemented, for a variety of reasons. Thus it is more than likely, especially in large database systems, that some field or set of fields may have more than one meaning under various circumstances. Often, these differences in meaning will be indicated by the values of other associated fields.

As an example of this type of relationship, consider the case of the ITEM PART NUMBER field of the PCASS table FMEA PARTS in the FMEA dataset one-more time. This field can have many meanings depending on the value of the PART TYPE field in the same table. If PART TYPE is set to “LRU”, the ITEM PART NUMBER field contains a line replaceable unit part number. If PART TYFE is set to “SRU”, the ITEM PART NUMBER field actually contains a shop replaceable unit part number. Storing both kinds of part numbers in the same structure is convenient. However, in order to use the ITEM PART NUMBER field properly, the user must know how to read and set the PART TYPE field to disambiguate the meaning of any particular instance of the record. Thus, the PART TYPE field in the physical database must hold either an “SRU” or “LRU” flag to indicate the particular meaning desired at any one time.

In the heterogeneous environment, it may be possible to find a different database in which the same two concepts which have been stored in one filed in one database, are stored in separate fields. It may in fact be possible that in one or more databases, only one of the two concepts has been stored. This is certainly the case among the separate data sets which make up the PCASS system. For example, in the PCASCLRU data set, only the “line replaceable unit part number” concept is stored (in the field, CRIT_LRU_PART_#). For this reason, the conceptual level of the data dictionary must include both concepts. Then there must be some appropriate construct within the data definition language of the data dictionary system which could express the constraints under which any particular field had any particular meaning. In order to be useful in raising the level of data location transparency, these conditional semantics must be entered into the data dictionary using this construct.

It is obvious now that the relationship between entries in the conceptual data dictionary and the physical data dictionary is truly many to many (see Figure 3). To implement such a relationship, using relational techniques, a third major structure (in addition to the set of tables supporting the conceptual data dictionary and the set of tables supporting the physical data dictionary) must be developed to mediate this relationship. This structure is described in the next section.

2.3.1 Conceptual – Physical Data Mapping

As an approach to implement this mapping from conceptual to physical structures, a table must be developed which relates every concept with the fields which represent it, and every field with the concepts it represents. This table will consist of tautological statements of the semantic equivalence of physical fields to concepts. A tautology is a logical statement that is true in all contexts and at all times. In thiis approach, the tautologies take the following form (please note that the “==” operator means “is semantically equivalent to”, not “is equal to”):

 normalized field f == field a from location A

 The normalized field f of the above example corresponds directly to an entry in the conceptual data dictionary. We call the field, f, normalized to indicate that it is a standard form. As will be described later, the comparison of values from different databases will be supported by normalizing these values into the representation described in the conceptual data dictionary for the normalized field.

Conditional semantics must now be added to the structure to support discussion. Given a general representation for a tautology, conditional semantics may be represented by adding logical operations to the right side of the equivalence. Assume that a new database, D, has a field, d1, which is equivalent to the normalized field, f, but only when certain other fields have specific values. Logically, we could represent this in the following manner:

normalized field f == field d1 from location D iff
field d2 from location D = VALUE1 AND
field d3 from location D = VALUE2 AND …
field dn from location D opn VALUEn

 In more general terms, the logical statement of the tautology would be as follows:

 R == P iff  E

where R is the normalized field representation, P is the physical field, and E is the set of equivalence constraints which apply to the relation. In our part number example, the following tautologies would be stored in the mapping:

Line Replaceable Unit Part Number == PCASFME.FMEA.PARTS.ITEM_PART_NUMBER iff PCASFME.FMEA.PARTS.PART_TYPE = “LRU”

Shop Replaceable Unit Part Number == PCASFME.FMEA.PARTS.ITEM_PART_NUMBER iff PCASFME.FMEA.PARTS.PART_TYPE = “SRU”

Line Replaceable Unit Part Number == PCASCLRU.CRITICAL_LRU_CRIT_LRU_PART_#

The condition statements are similar to condition statements in the SQL query language. In fact, this similarity is no accident, since these conditions wilt be added to any physical query in which ITEM PART NUMBER is included.

From a user’s point of view, implementing this feature allows the user to create a query over the concept of a line replaceable unit part number without having to know the conditions under which any particular field represents that concept. In addition, by representing the general – concept of a line replaceable unit part number, something the user would be very familiar with, this conceptual mapping technique has also hidden the details of the naming conventions used in each of the physical databases.

2.4.2 Integrating Data Translation Functions Into the Data Dictionary

In the simplest case, the integration of data translation functions into the data dictionary would be a matter of attaching to the data mapping tautologies described above a field which would store an indication of the type of translation which must occur to transform a result from its Location-specific form into the normalized form. This approach can be simplified further by allowing translations at the basal level to be identified by the source and target data types involved, and not recording any further information about the translation. It may not be unreasonable to assume that in certain well-defined domains, most of the translation functions required would be either identity functions or simple basal translation functions.

It is now possible to define completely the data structure required to store any arbitrary physical-conceptual field mapping tautology. The data structure would consist of the following parts:

  • concept field – a single, unique concept which the physical projection represents
  • normalized – a reference to the conceptual data dictionary entry used to represent the concept
  • physical projection – the field or set of fields from the physical data dictionary which under the conditions specified in the equivalence constraints represent the concept
  • equivalence constraints – the conditions under which the physical projection can be said to represent the concept
  • translation function – the function which must be performed on the physical projection in order to transform it into the normalized format of the normalized field

The logical statement of the tautology would be as follows:

R = Ft (P) iff E

where R is the normalized field representation, Ft is the translation function over the physical projection, P, and E is the set of equivalence constraints which apply to the relation. The exact implementation of this data structure would depend on the environment in which the system were to be developed, and would have to be specified in a physical design document. Note that instead of the “==” sign, which was defined above as “is semantically equivalent to”, has been replaced by “=” which means “is equivalent to”, and is a stronger statement. The “=” implies that not only is the left side semantically equivalent to the right, but it is also syntactically equivalent.

What’s in a Name: Not That Much, Actually

The referenced paper is seminal. The comments that appear here are largely unaltered from when I first wrote them back in 1989. I follow this older writing with some additional conclusions, looking back over twenty years of experience working with data.

September 23, 1989:

When parsing a record-based system’s data, the software developer is faced with all of the problems of data structure semantics described by W. Kent (in William Kent, “Limitations of Record Based Information Models”, ACM Transactions on Database Systems 4(1), March 1979. Also John Mylopolous and Michael Brodie (eds), Readings in Artificial Intelligence and Databases, Morgan Kaufman, San Mateo, California, 1989. [20 pp]).

Field naming problems can be handled by naming all fields with a field number, then providing synonyms for all fields. I gave each field a “name” similar to the name of the original system which was possibly meaningless. This name was to allow for maintenance and information mapping between systems. Then, using synonyms I could give a more semantically significant name to the field. The record is just a place keeper – the concept represented is buried in the code supporting the use of the record, or perhaps by agreement (explicit or implicit) among the designers and users of the system. When this agreement is verbal, or worse, implied by training, that’s when the trouble arises: idiosyncratic usage enters the picture, along with the possibly disasterous loss of meaning accompanying the departure of those whose concept is being represented.

November 1, 2009:

This note was just one of several ideas I was toying with as I worked on a thesis paper for my Masters. The project I was working on was to integrate and add expert system capabilities (using Prolog) to an existing business application built on top of COBOL fixed record structures. What it describes is the idea I used to get around the very badly named columns of the COBOL records in order to improve the effectiveness and readability of the Prolog code. The basic trick was to put into the Prolog knowledgebase multiple names for the same data structures and attach to these Prolog structures logic statements that permitted the statement (in nearly human-language terms) of logical constraints.

In later years, I have come to recognize that this problem of naming conventions within code, while important to an extent, is not as important as some practitioners think. The fact of the matter is that the computer could care less what the column name of a table is, or the variable name within a program, etc. For all the computer cares, so long as the programming code references the right data structure at the right moment consistently, the actual references might as well be unique, semantically meaningless numbers.

Naming conventions are for the humans who have to write and maintain the code, or, more generally, who have to directly interact with the data structures. And while there can often be contentious, protracted debate amongst software developers on the “right” naming convention for various situations, in my mind, it is not usually worth the amount of attention it gets during development.

If left to my own devices, then the naming convention I try to impose is as richly semantic as possible. Column names and table names are as close to expressing the intended content, down to including qualifying adjectives, and role names to an appropriate, context-specific noun. The context I select the name from is defined by the context of the problem domain for which the software is being written. I also try to be very consistent in the use of names and name parts from one end to the other of whatever system I’m working on.

If the system already has a naming convention, so long as it can be written down in a set of repeatable rules, I’ll use whatever it is. Oftentimes I find I have to rationalize and standardize terms used previously, due to the fact that at different times, different developers may have used different conventions.

I have participated in efforts at making a universal naming convention, and these have all ultimately hit a wall and been stopped (the reasons for this have been to this point the primary subject of this blog – even if I haven’t explicitly described the scenario yet). Namely, the cross-context politics, long initial duration, required ongoing maintenance activities and ultimately the diminishing returns of such efforts cause them to sink from their own weight.

But even when I have had complete control over the data structure development, and I have had time to craft the “perfect” name for each column, even when I’ve checked and double checked and triple checked that I have consistently applied the same naming convention from one end of the system to the next, once my software has gone into use, it hasn’t taken long for the user community to start redefining the meaning of some aspect of the data structure. Or, the requirement changes and the programming team must change the usage of one of my finely-crafted data structures so that it supports a new meaning, not reflected in that finely crafted name.

This can be frustrating, and it can also pose a long term hazard to the maintenance of the system, as either the original meaning or the new meaning becomes a minority of the usage. But it is not the end of the world, and it does not always break the software if the code is changed to handle the new meaning correctly.

However, it does mean that the actual name of the field no longer reflects the contents it holds. But if the code is working properly, the name no longer matters to the operation of the system. Plus, the maintenance problem such a change presents is also no big deal, so long as the revised meaning is captured in an appropriate dictionary and made available to the programming team for future reference.

Why is this the case? The real truth is that the data structure stores symbols which have a meaning within a context defined by the USERS of the software. The data structures merely represent SYNTAX of the symbols, consisting of the data type of the symbol, and the manipulations of the symbol performed by the code. So long as the manipulations are applied appropriately to the correct part of the syntax, no matter HOW it is named, then the software will manage the MEANING intended by the USERS, despite of, not because of, the naming convention of the data structure.

Hence, what’s in a name used on a data structure? From the computer’s point of view, not so much. From the human’s point of view, since the meaning can change over time, the name shouldn’t be trusted until the code has been reviewed to confirm the content. So there again, not so much…

%d bloggers like this: