Data Integration Musings, Circa 1991

I recently stumbled over this very old text. It is really just notes and musings, but I thought it was interesting to see some of my earliest thoughts on the data integration problem. Presented as is.

Mechanical Symbol Systems

To what extent can knowledge be thought of as sentences in an internal language of thought?
Should knowledge be seen as an essentially biological, or essentially social, phenomenon?
Can a machine be said to have intentional states, or are all meanings of internal machine representations essentially rooted in human interpretations of them?

Robot Communities

How can robots and humans share knowledge?
Can artificial reasoners act as vehicles for knowledge transfer between humans? (yes, they already are – see work on training systems)

Human Symbol Systems
Structures: Concepts, Facts and Process
Human Culture
Communication Among Individuals

Discourse

The level of discourse among humans is very complex. Researchers in the natural language processing field would tell you that human discourse is very hard to capture in computer systems. Humans of course have no problem following the subject changes and shifting contexts of discourse.
Language is the means through which humans pass information to one another. Historically, verbal communication has been the primary means of conveying information. Through verbal communication, parents teach their children, conveying not just facts, but also concepts and world view. Through socialization, children learn the locally acceptable way in which to exist in the world. Through continual human contact, all persons reinforce their understanding of the world. Culture is a locally defined set of concepts, facts and processes.

Myth

One of the most important transmission devices for human communication is myth. Myth is story-telling, and therefore is largely verbal in nature.

Ritual

Ritual also is used to communicate knowledge and reiterate beliefs among individuals. Ritual is performance, and can be used to teach process.

Information Systems Structures: Concepts, Facts and Process

The conceptual level of a standard information system may be stored in a database’s data dictionary. In some cases, the data dictionary is fairly simplistic, and may actually be hidden within the processes which maintain the database, inaccessible to outside review except by skilled programmers. More sophisticated data dictionaries, such as IBM’s Repository, and other CASE tools, make explicit the machine-level representation of the data contained in the system. The concepts stored in such devices are largely elementary, and idiosyncratic.
They are elementary in that a single concept in a data dictionary will generally refer to a small item of data called variously a “column” or a “field”. What is expressed by a single entry in a data dictionary is a mapping from an application-specific concept, for instance “PART_NUMBER”, to a machine-dependent, computable format (numeric, 12 decimal digits).
A “fact” in a database sense is a single instance or example of a data dictionary concept coupled with a single value.
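The two definitions above can be sketched in a few lines. This is only an illustration under my own assumptions: the class names `DictionaryEntry` and `Fact` and the digit-checking rule are hypothetical, not taken from IBM’s Repository or any CASE tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DictionaryEntry:
    """One data dictionary concept: an application name mapped to a machine format."""
    name: str       # application-specific concept, e.g. "PART_NUMBER"
    data_type: str  # machine-dependent type; only "numeric" is illustrated here
    digits: int     # maximum number of decimal digits

    def accepts(self, value: str) -> bool:
        # A value fits if it is made only of ASCII decimal digits and is not too long.
        return value.isascii() and value.isdigit() and len(value) <= self.digits


@dataclass(frozen=True)
class Fact:
    """A single instance of a dictionary concept coupled with a single value."""
    concept: DictionaryEntry
    value: str

    def __post_init__(self):
        # Reject values that do not fit the machine format of the concept.
        if not self.concept.accepts(self.value):
            raise ValueError(f"{self.value!r} does not fit {self.concept.name}")


# The example from the text: PART_NUMBER as numeric, 12 decimal digits.
PART_NUMBER = DictionaryEntry(name="PART_NUMBER", data_type="numeric", digits=12)
fact = Fact(concept=PART_NUMBER, value="000123456789")
```

Note how little the entry actually says: it fixes a name and a format, and nothing about what a part number means to the business.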

Communication Across Information Systems, Custom Approaches

Information systems typically have no provision either to generate or understand discursive communication. Typically, information shared between two information systems must be rigidly defined long before transmission begins. This takes human intervention to define transmission carriers, as well as format, and periodicity.

Networks

The ISO OSI seven-layer model of communication was an initial attempt at defining the medium of computer communication. All computers which required communications services faced the same problems. Much of the work in networking today is geared toward building this ability to communicate. For humans, communication is through the various senses, taking advantage of the natural characteristics of the environment and the physical body. The majority of computers do not share the same senses.
Distributed systems are those in which all individual systems are connected via a network of transmission lines, and in which some level of pre-defined communication has been developed. The development of distributed database systems represents the first steps toward homogenization of mechanical symbol systems.

Electronic Data Interchange

EDI takes the communication process a step farther by introducing a rudimentary level of discourse among individual enterprises. Typically discourse is restricted to payments and orders of material, and typically these interchanges are just as static as earlier developments. The difference here is that human intervention is slowly developing a cultural definition of the information format and content that may be allowed to be transferred.
As standards are developed describing the exact nature and structure of the information that any company may submit or receive, more of a culture of discourse can be recognized in the process overall. The discourse is of course carried out by humans at this point, as they define a syntax and semantics for the proper transmission of information in the domain of supply, payment, and delivery (commerce).
Although it is ridiculous to talk of an “EDI culture” as a machine-based, self-defining, self-reinforcing collection of symbols in its own right, it is a step in that direction. What EDI, and especially the development of standards for EDI transmissions, represents is an initial attempt to define societal-like communication among computers. In effect, EDI is extending the means of human discourse into the realm of high-speed transaction processing. The standards being developed for the format and type of transactions allowed represent a formalization and agreement among the society of business enterprises on the future language of commerce.

Raising Consciousness in Mechanical Symbol Systems

In order to partake of the richness and flexibility of human symbol systems, machines must be given control of their own senses. They must become aware of their environment. They must become aware of their own “bodies”. This is the mind-body problem.
(Author note: transcript cuts off right here)

Root Causes of the Data Integration Problem

The Fundamental Phenomenon – Human Behavior

4/24/2005

Writing over a century ago, Emile Durkheim and Marcel Mauss recognized and documented the true root cause of today’s data integration woes. (Primitive Classification, 1903, pages 5-6, as quoted by Mary Douglas in Natural Symbols, pages 61-62)

At the bottom of our conception of class there is the idea of circumscription with fixed and definite outlines. 

Given that this concept of classification is the basis of logic, social discourse, religion and ritual, it should not be a surprise that it also comes into play when software developers write software. They make assumptions and assertions in the design, data and code of their systems that rely on a fixed vision of the problem. Applications may be written for maximum flexibility in some ways, and still there is an intent on the part of the developers to define the breadth and width of the system, in other words, to bound and fix in place the concepts and relations supportable by the application.

The highly successful ERP products like SAP, JD Edwards, and ORACLE Financials allow tremendous flexibility to configure for different business practices. The breadth of businesses that can make these products work for them is very large. However, it is a common understanding in the ERP professional community (of installers) that there are some things in each product that just can’t be changed or accomplished. In these areas, the business is said to have to change to accommodate the tool. The whole industry of “change management” was born from the need to change the PRACTICE of business due to the ultimate limitations of these systems which were imposed by the conceptual boundaries their authors had to place upon them. (This is a different subject which should be pressed and researched). No matter how flexible the business system is, it is ultimately, and fundamentally, a fixed and bounded symbolic system.

 So how does this relate to my claim that Durkheim and Mauss have unwittingly predicted the current crisis of data integration? Because they go on to point out that: 

It would be impossible to exaggerate, in fact, that state of indistinction from which the human mind developed. Even today a considerable part of our popular literature, our myths, and our religions is based on a fundamental confusion of all images and ideas. They are not separated from each other, as it were, with any clarity. 

This “conceptual stew” is present in every aspect of life. The individual human mind is particularly adept at working within this broad confusion, picking and choosing what to believe is true based on internal processes. Groups of individuals, in order to communicate, will add structure and formality to certain portions thru discussion and negotiation. But this “social” activity is not always accompanied by strong enforcement by the community.

As Mary Douglas (Natural Symbols, page 62) continues from Durkheim and Mauss, individuals in modern society (and increasingly this encompasses the global community) are presented with many different conceptual milieus during the course of a single day. Within each person, she indicates,

 A classification system can be coherently organized for a small part of experience, and for the rest it can leave the discrete items jangling in disorder. Or it can be highly coherent in the ordering it offers for the whole of experience, but the individuals for whom it is available may enjoy access to another competing and different system, equally coherent in itself, from which they feel free to select segments here and there eclectically, not worrying about the overall lack of coherence. Then there will be conflicts, contradictions and uncoordinated areas of classification for these people.

 This not only describes a few individuals, but it is my contention that this describes the whole of human experience. Nowhere in the modern world especially, except perhaps when alone with oneself, will the individual find a single, coherent, non-contradictory and comprehensive classification of the world. Instead, the individual is faced with dozens or hundreds of partial, conflicting conceptions of the world. Being the adaptable human being her ancestors evolved her to be, however, this utter muddle is rarely a problem in a healthy person. The brain is a reasoning engine built especially to handle this confusion, in fact it thrives on it – the source of much that we call “creative” or “humorous” or “brilliant” is derived from this ever-changing juxtaposition and jostling of different, partial conceptions. Human society expands from the breadth and complexity created by these different classification systems. Communication between strangers depends on the human capacity to process and understand commonalities and fill in the blanks in the signal.

The very thing which defines us as human, our ability to communicate across fuzzy boundaries, is also that thing that creates and exacerbates the Data Integration Problem in our software. Our software “circumscribes with fixed and definite outlines” some small aspect of our experience. In doing so, it denies the fuzziness of our larger reality, and imposes barriers between systems.

Just What Is Meaning? A Lay Perspective

7/20/2005

The Origin of Symbols, Code and Meaning

Memories are NOT CODED. They are ANALOG recordings, not unlike phonograph records and the old photographs before the invention of digital cameras. There is some evidence that memories are stored in a manner similar to holograms within the medium of the brain. Memories may include recordings of coded information; this would be how symbols are recognized.

Only when communication between brains is needed does CODE come into play. One brain must create appropriate SYMBOLS which represent the information. These symbols must be physicalized in some manner because the only input mechanisms available to the other brain are the five senses of the body. Information is packaged and lumped, nuances and unimportant details are necessarily removed, symbols are selected and generated. If the other brain is receptive, then the symbols are sensed by the body, evoking the memory centers of the second brain. Communication is completed if the second brain understands the code and “remembers” the meaning in its own analog memory.

The Origin of Language

The brain records sensory inputs as memory. The mind constructs an internal symbol system describing the sensory information in ways the human body can communicate or relate the information. Details of the input which the mind cannot put a name to may be remembered or memorable, but cannot be communicated. Have you ever experienced something that you were unable to describe to someone else who had not experienced it?

Two people who have experienced the same or similar types of events can have a conversation about it and begin to form a language. Language is a shortcut to memory. It is the human capacity for the invention of vocabulary that sets humans apart from other creatures (and from computers). If two people share a new experience, they’ll be able to talk about it by recognizing the same features in the sensory record and describing it in terms that evoke the same memory in the other person. Eventually, they’ll form a unique vocabulary of shorthand symbolic terms and phrases to permit efficient communications. This is how strangers who meet at 12-Step Meetings are able to express and understand each other.

 But if only one of the two persons has experienced the events, there is no referent memory in one of the two. Think of the old saw “a picture is worth a thousand words”. Have you ever heard a new musical piece and tried to explain it to someone who hasn’t heard it? It takes a lot of explanation and yet is ultimately a failure.

 Consider another example: Wine tasting connoisseurs

 These people have an intense sensitivity to subtle features of taste and smell making their experience of wine very rich with information. More importantly, they have been able to attach vocabulary to these differences in unique ways that allows them to communicate with other wine experts. Of course, their success at communicating is predicated on the existence of other individuals with similar talents and experiences. When they try to explain to someone without the sensitivity of taste, their words merely confuse or sound hilariously out of place.

 This is one example of how “context” arises in human communication.

 What does this suggest for our major theme? 

  1. The features that are recognized in the sensory record are dependent first on the individuals whose senses recorded them
  2. The features that are chosen for communication are dependent on the interests and needs of the individuals doing the communicating. Other features that at first do not seem to contribute to the remembrance of the experience are often ignored or discounted.
  3. The vocabulary describing and naming these features is dependent on both the individual who sensed and on the people to whom they try to explain the sensation. Thru trial and error, the person who is trying to communicate will hit upon terms that find resonance in their audience.

Context and Chomsky’s Colorless Green Ideas

Language is code. The speaker chooses the terms, sequence and intonations of their speech with the hope that the listener shares enough of the same human experience to recognize the intended meaning. Conversation is a negotiation as much as anything else. In conversation, the participants can adjust the selection of terms and details until they all reach an understanding of what is being said. This is the practical meaning of “context”, then.

Many years ago, in an effort to make a point about how syntax is different from semantics, Noam Chomsky once proposed the following sentence as an example of a grammatically correct sentence that had no discernible meaning:

Colorless green ideas sleep furiously.

In the context within which Chomsky was writing this sentence, reflective of common cultural experience of these terms among a broad community of American society, he made the claim that the sentence had no meaning. Since that time, other scholars have suggested that there may be contexts in which this construction of terms may actually be meaningful.

Here’s a quote from the English-language version of Wikipedia from August 1, 2005:

This phrase can have legitimate meaning to English-Spanish bilinguals, for whom there are double-entendres about the word “green” (meaning “newly-formed”) and “sleep” (used as a verb of non-experience). An equivalent sentence [in the context understood by these English-Spanish bilinguals] would be “Newly formed, bland ideas are unexpressible in an infuriating way.”

This little example provides an excellent case study of the role context plays in communication. Never mind the fact that the sentence was first defined in a context for which it held no meaning. Since the moment of its invention, other contexts have either been recognized or constructed around the sentence in which it holds meaning.

The notion of “context” as that milieu which drives the interpretation of a sentence such as this is the same notion that explains how the meaning of any coded message must be interpreted. This would include messages encoded in the data structures of computer systems. Data within a computer system is constructed within, and in order to support, the recording and transmittal of information important to a specific context. This context is the tacit agreement between the software developers and the business community on what the “typical interpretation” of those computer symbols should be.
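To make the point about data structures concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the two contexts, the `STATUS` field, and the meanings attached to the value `01` are hypothetical and do not come from any real system.

```python
# Hypothetical contexts: each one supplies its own "typical interpretation"
# of the same stored symbols. None of these names come from a real system.
ORDER_ENTRY_CONTEXT = {("STATUS", "01"): "order received"}
WAREHOUSE_CONTEXT = {("STATUS", "01"): "item out of stock"}


def interpret(context: dict, field: str, value: str) -> str:
    """Return the meaning a context assigns to a stored value.

    The stored data alone carries no meaning; a pairing the context does
    not know is reported rather than guessed at.
    """
    return context.get((field, value), "no meaning in this context")


# The same stored symbols, read within two different contexts.
in_order_entry = interpret(ORDER_ENTRY_CONTEXT, "STATUS", "01")
in_warehouse = interpret(WAREHOUSE_CONTEXT, "STATUS", "01")
```

The bytes in the database are identical in both cases. Only the tacit agreement around them differs, which is why moving the value from one system to the other without moving the context changes what it says.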

The importance of context to the understanding of the data integration problem cannot be overstated (which is why I keep coming back to it on this blog). While many theorists recognize the role context plays, and many pundits have written about the failures of computer systems when context has been ignored or mishandled, practitioners continue to develop and deploy applications with little explicit attention to context.

All computer applications written in business today are written from some point of view. This point of view establishes the context of the system. Most developers would agree with these statements. The trick is to define a system which allows the context of the system to change and evolve over time, as the business community learns and invents it. It must be a balancing act between excluding the software equivalent of Chomsky’s meaningless statement, and allowing the software to adapt as the context shifts to allow real meaning to be applied to those structures.

The Syntactics of Speech: What a Language Permits You to Say Is Less Than What You Know

I found this article intensely interesting. It corroborates and validates some of my own ideas about how language and symbols are used in communication. Namely, it suggests that even though a language does not contain structures and syntactic rules allowing for precise designation of a concept, that does not mean that such a concept cannot be communicated and understood by someone who uses that language. It just may take a lot more time to convey the thought. It may also be difficult to confirm the listener’s understanding because the language they have available to respond is the same one as the original message (which we said could not directly convey the meaning).

NY Times article

Types of Information Flow

In a previous post a week or so ago, I riffed on an example of communication between two mountain hikers suggested by Barwise and Seligman (authors of a theory of “information flow”). I made the initial distinction between information flowing within a shared context (in the example, this was the context of Morse Code and flashlight signals), and information flowing from observations of physical phenomena.
Both types of information movement are covered by Barwise and Seligman’s theory. I propose a further classification of various examples of information flow which will become important as we discuss the operations of individuals across and within bridging contexts.

Types of Flows

Symbols are created within a context for various reasons. There’s a difference between generic information flow and symbolic communication.
Let’s consider a single event whereby information has flowed and been recognized by a person. There are three possible scenarios which may have occurred.

1. Observation/Perception: the person experiences some physical sensation; the conditions of some physical perception lead the person within the context of that perception (and his mental state) to recognize the sensation as significant. In this case, the person recognizes that something has occurred that was important enough to become consciously aware of its occurrence. This is new information, but is not necessarily symbolic information.

2. Inference/Deduction: A person within the mental state corresponding to a particular context applies a set of “rules of thumb” over a set of observations (of the first type, likely, but not necessarily exclusively). Drawing on logical inference defined by his current context, he draws a conclusion which follows from these observations to generate new information. This is new information in the sense that without the context to define the rules of inference, those particular perceptions would not have resulted in the “knowledge” of the inference conclusion. They would remain (or they would dissipate) uninterpreted and unrelated forever.

3. Interpretation/Translation: This is the only type of information flow that happens using exclusively symbolic mechanisms. In this type of flow, the person receiving the flow recognizes not only the physical event, but also that the observed phenomenon is symbolic: in other words, that some other person has applied additional meaning to the phenomenon (created a symbol or symbols from the physical media by attaching an additional concept to it). In this type of flow, the perceiving person doesn’t simply register the fact of the physical event, but also recognizes that the physical phenomenon satisfies some context-driven rules of material selection and construction indicating that some other person intentionally constructed it. From this knowledge, the perceiver concludes, assuming they are familiar with the encoding paradigm of the sender’s context, that there is an intended, additional message (meaning) associated with the event. The perceiving party is said to share the context of the sending party if they are also able to interpret/translate the perceived physical sign to recognize the concepts placed there by the sender. In this scenario, the person receiving the message is NOT creating new information. All of the information of this flow was first realized and generated by the message’s sender. (This will be an important detail later as we apply this trichotomy to the operation of software.)

In all three types of information flows, as described by Barwise and Seligman, the flow is dependent on the regularities of the physical world. This regularity requirement applies from the regularity of physical phenomena, to the reliability of the perceptual apparatus of the perceiver, all the way to the consistency of the encoding paradigm defined by the sender’s context.
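The difference between the first and third types of flow can be sketched with the flashlight example. This is a rough illustration under my own assumptions: the function names are hypothetical, the codebook is a small subset of Morse Code, and counting flashes is only a stand-in for "registering the physical event."

```python
# A small, partial Morse codebook stands in for the shared context.
MORSE = {"E": ".", "H": "....", "L": ".-..", "O": "---", "P": ".--.", "S": "..."}
REVERSE = {code: letter for letter, code in MORSE.items()}


def send(message: str) -> str:
    """Sender: encode concepts into physical signals (flash patterns)."""
    return " ".join(MORSE[letter] for letter in message)


def observe(signal: str) -> int:
    """Observation only: a perceiver without the codebook registers that
    flashes occurred, represented here by counting them."""
    return sum(1 for ch in signal if ch in ".-")


def interpret(signal: str) -> str:
    """Interpretation: a perceiver sharing the codebook recovers the
    sender's message; nothing beyond what was sent is produced."""
    return "".join(REVERSE[code] for code in signal.split())


signal = send("HELP")
flashes_seen = observe(signal)   # what a hiker without the context gets
message_read = interpret(signal) # what a hiker sharing the context gets
```

The decoder depends on the consistency of the codebook in exactly the way the paragraph above describes: if sender and receiver held different tables, the flashes would still be observed but the interpretation would fail or mislead.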

Peirce’s Modes of Relationship

According to a terrific survey book on semiotics by Daniel Chandler that I’m reading now, Charles Peirce defined types of signs by whether they were symbolic, iconic, or indexical. If I understand Chandler’s summary, the first two examples of information flow I’ve described are at minimum dependent on Peirce’s indexical signs, alternatively called “natural signs”, because these are the natural perception of reality independent of context. Both the iconic and symbolic signs are only recognizable within a context, making both fall under my “interpretation” type of information flow.

For the most part, I will treat the iconic and symbolic signs as the same sort of thing for now.

Bridging Contexts

If it’s true that every human grouping can form its own context, how can communication occur between different groups? If one group defines a set of symbols using some set of concepts and a syntactic medium that is different from those of another group, as a practical matter, how can the chasm be spanned? The answer is through the development of bridging contexts.

The following figure depicts several common strategies, each with its particular benefits and drawbacks.

[Figure: Three Types of Bridging Contexts Within One Corporate Organization]

There are three basic forms of bridging contexts. First and perhaps the most common in the real world is the creation of a specific, point-to-point bridging context through discussions/negotiations between the representatives of the two specific contexts. Most organizations take this approach because it simplifies, focuses and shortens the discussion, leading to faster turn-around. All application and data interfaces that are custom-built as point-to-point connections, no matter what the actual transmission protocol or language used, fall into this category.

The second form of bridging context occurs when two groups rely on a pre-existing, parent context to act as the bridge. The parent context may push a common context down onto the previously individual contexts, or the two contexts may appeal to the parent to resolve the conflict. In either case, the result can be that the child contexts become absorbed by the parent context, thus eventually what began as a bridging context becomes the entire context. These forms of bridging contexts are often common in such situations as corporate mergers, enterprise architecture initiatives, and business process reengineering projects.

The third form of bridging context is found whenever an organization selects a third-party standard as a communications protocol. In these cases, the organization creates a bridging context between itself and the external standard, including mapping its symbols into those of the standard. Theoretically, once completed, the organization can use such a bridging context to communicate with other organizations that have likewise built bridges to the standard. In practice, however, it is not uncommon that organizations will bias their bridging context to their own point of view. When this happens, the external standard devolves into mere syntax, and other organizations must create new, subtle bridging contexts (a la form number one) in order to communicate successfully with this organization. This was a common occurrence in the heyday of Electronic Data Interchange (EDI), and still occurs today even with more modern, XML-based standards.
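The first and third forms can be contrasted in a small sketch. All of the names here are hypothetical: the field vocabularies for organizations A and B, the standard's field names, and the helper functions are my own illustration, not drawn from any EDI or XML standard.

```python
# Hypothetical field vocabularies for two organizations and one external standard.
A_TO_STANDARD = {"PART_NO": "ItemID", "QTY": "Quantity"}
B_TO_STANDARD = {"item_code": "ItemID", "amount": "Quantity"}

# Form one: a point-to-point bridge negotiated directly between A and B.
A_TO_B_DIRECT = {"PART_NO": "item_code", "QTY": "amount"}


def translate(record: dict, mapping: dict) -> dict:
    """Rename a record's fields according to a bridging context."""
    return {mapping[field]: value for field, value in record.items()}


def invert(mapping: dict) -> dict:
    """Reverse a one-to-one mapping so a bridge can be crossed both ways."""
    return {target: source for source, target in mapping.items()}


def a_to_b_via_standard(record: dict) -> dict:
    """Form three: A maps into the standard, then B maps out of it."""
    in_standard = translate(record, A_TO_STANDARD)
    return translate(in_standard, invert(B_TO_STANDARD))


record_from_a = {"PART_NO": "000123456789", "QTY": 5}
direct = translate(record_from_a, A_TO_B_DIRECT)
via_standard = a_to_b_via_standard(record_from_a)
```

Both routes produce the same record here only because A and B mapped their symbols to the standard with the same understanding of it. The sketch renames fields and nothing more, so if A quietly used `Quantity` to mean cases while B meant single units, the code would run identically and the standard would have devolved into mere syntax, as described above.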

While proponents of standards bodies decry other approaches, it must be stated that the third form of bridging context is also the most complicated to develop, as well as requiring the longest amount of time to establish, and is often the hardest to maintain. The reason for this is that it requires so many more people to define, and for most situations, the key to its success is also its biggest drawback, namely that the context is defined externally to the organization. Thus, the interplay among the membership of the standards body creates the external context. The organization’s business activity establishes the local context. The humans involved in establishing the bridging context must be able to translate from the local context to the external standard. There is always a risk that these individuals will misunderstand the external standard and translate their local context to it incorrectly. In addition, the bridging context must be maintained constantly as changes occur both in the standard and in the internal organization. At least within the local context, it is more likely that a change will be noticed.

In addition to EDI and XML protocols, other examples of the third form of bridging context would include Semantic Web approaches, but also such mundane approaches as the use of ERP systems, or any other packaged application where a fixed syntactic medium is presented.
