Data Integration Musings, Circa 1991

I recently stumbled over this very old text. It is really just notes and musings, but I thought it was interesting to see some of my earliest thoughts on the data integration problem. Presented as is.

Mechanical Symbol Systems

To what extent can knowledge be thought of as sentences in an internal language of thought?
Should knowledge be seen as an essentially biological, or essentially social, phenomenon?
Can a machine be said to have intentional states, or are all meanings of internal machine representations essentially rooted in human interpretations of them?

Robot Communities

How can robots and humans share knowledge?
Can artificial reasoners act as vehicles for knowledge transfer between humans? (yes, they already are – see work on training systems)

Human Symbol Systems
Structures: Concepts, Facts and Process
Human Culture
Communication Among Individuals

Discourse

The level of discourse among humans is very complex. Researchers in the natural language processing field would tell you that human discourse is very hard to capture in computer systems. Humans of course have no problem following the subject changes and shifting contexts of discourse.
Language is the means through which humans pass information to one another. Historically, verbal communication has been the primary means of conveying information. Through verbal communication, parents teach their children, conveying not just facts, but also concepts and world view. Through socialization, children learn the locally acceptable way in which to exist in the world. Through continual human contact, all persons reinforce their understanding of the world. Culture is a locally defined set of concepts, facts and processes.

Myth

One of the most important transmission devices for human communication is myth. Myth is story-telling, and therefore is largely verbal in nature.

Ritual

Ritual also is used to communicate knowledge and reiterate beliefs among individuals. Ritual is performance, and can be used to teach process.

Information Systems Structures: Concepts, Facts and Process

The conceptual level of a standard information system may be stored in a database’s data dictionary. In some cases, the data dictionary is fairly simplistic, and may actually be hidden within the processes which maintain the database, inaccessible to outside review except by skilled programmers. More sophisticated data dictionaries, such as IBM’s Repository, and other CASE tools, make explicit the machine-level representation of the data contained in the system. The concepts stored in such devices are largely elementary and idiosyncratic.
They are elementary in that a single concept in a data dictionary will generally refer to a small item of data called variously a “column” or a “field”. What is expressed by a single entry in a data dictionary is a mapping from an application-specific concept, for instance “PART_NUMBER”, to a machine-dependent, computable format (numeric, 12 decimal digits).
A “fact” in a database sense is a single instance or example of a data dictionary concept coupled with a single value.
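The distinction between a dictionary concept and a fact can be sketched in a few lines of Python. This is only an illustration: the names `DictionaryEntry`, `Fact` and `conforms` are hypothetical and do not come from any real data dictionary product, and only the numeric format from the “PART_NUMBER” example is handled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DictionaryEntry:
    """One data dictionary concept: an application name mapped to a machine format."""
    name: str    # application-specific concept, e.g. "PART_NUMBER"
    kind: str    # machine format family; only "numeric" is handled in this sketch
    digits: int  # maximum number of decimal digits


@dataclass(frozen=True)
class Fact:
    """A single instance of a dictionary concept coupled with a single value."""
    concept: DictionaryEntry
    value: str


def conforms(fact: Fact) -> bool:
    """Check that the value fits the machine format declared by its concept."""
    entry = fact.concept
    if entry.kind != "numeric":
        return False  # other formats are outside this illustration
    value = fact.value
    return value.isascii() and value.isdigit() and len(value) <= entry.digits


# Assumed example data, following the PART_NUMBER mapping described above.
part_number = DictionaryEntry(name="PART_NUMBER", kind="numeric", digits=12)
good = Fact(part_number, "000123456789")
bad = Fact(part_number, "12-A")
```

Note how little the entry says: it fixes a name and a computable format, and nothing about what a part number means to the people who use it.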

Communication Across Information Systems, Custom Approaches

Information systems typically have no provision either to generate or understand discursive communication. Typically, information shared between two information systems must be rigidly defined long before transmission begins. This takes human intervention to define transmission carriers, as well as format, and periodicity.

Networks

The ISO OSI seven-layer model of communication was an initial attempt at defining the medium of computer communication. All computers which required communications services faced the same problems. Much of the work in networking today is geared toward building this ability to communicate. For humans, communication is through the various senses, taking advantage of the natural characteristics of the environment and the physical body. The majority of computers do not share the same senses.
Distributed systems are those in which all individual systems are connected via a network of transmission lines, and in which some level of pre-defined communication has been developed. The development of distributed database systems represents the first steps toward homogenization of mechanical symbol systems.

Electronic Data Interchange

EDI takes the communication process a step farther by introducing a rudimentary level of discourse among individual enterprises. Typically discourse is restricted to payments and orders of material, and typically these interchanges are just as static as earlier developments. The difference here is that human intervention is slowly developing a cultural definition of the information format and content that may be allowed to be transferred.
As standards are developed describing the exact nature and structure of the information that any company may submit or receive, more of a culture of discourse can be recognized in the process overall. The discourse is of course carried out by humans at this point, as they define a syntax and semantics for the proper transmission of information in the domain of supply, payment, and delivery (commerce).
Although it is ridiculous to talk of an “EDI culture” as a machine-based, self-defining, self-reinforcing collection of symbols in its own right, it is a step in that direction. What EDI, and especially the development of standards for EDI transmissions, represents is an initial attempt to define societal-like communication among computers. In effect, EDI is extending the means of human discourse into the realm of high-speed transaction processing. The standards being developed for the format and type of transactions allowed represent a formalization and agreement among the society of business enterprises on the future language of commerce.

Raising Consciousness in Mechanical Symbol Systems

In order to partake of the richness and flexibility of human symbol systems, machines must be given control of their own senses. They must become aware of their environment. They must become aware of their own “bodies”. This is the mind-body problem.
(Author note: transcript cuts off right here)

The Folk Model – What We Really Build Software From

The anthropological notion of a “folk model” can be a useful paradigm to consider when analyzing the implementation of software applications. Folk models are the proto-scientific conceptualizations of a group of people which they use to describe, understand and interact with some aspect of their collective experience.

When writing software, especially but not only within the Agile approach, it is through the elicitation and joint “discovery” of the user’s folk model that a common set of requirements for the software is defined. Ultimately, it is the closeness of fit between the folk model and the operation and symbology of the software that will determine its success or failure.

Different groups of people faced with the same or similar problems may develop largely similar folk models, and from these, different software development teams may create largely similar software applications. This is one reason why the software development process works best as a hand-crafted enterprise.

But what at first appear to be minor discrepancies between what the software model presents and what the folk model expects can grow so large that they cause the failure of the software for those users, especially if the folk model was flawed or in a state of flux at the time the software tried to codify it (and really, when is a folk model not in flux?).

A Long Time Ago…

I just came across this pearl of insight that I wrote a long time ago. I think it still stands:

The problem of understanding historical data and its meaning is both one of determining the user’s understanding and acceptance of the data and determining the flexibility of the supporting software. If a record, as understood by the user community, represents a particular concept in a particular way, the desire to re-use the structure implies that a change in the user culture will be required. If the system itself has built-in constraints as well supporting the accepted meaning, then the problem is in the system’s ability to accommodate new meaning, not just in the user’s willingness to accept new meaning. Where both aspects to the historical data problem exist, it should be easier in the long run not to change the meaning of a structure, but rather to implement a new structure with the desired meaning.

Howe, Geoffrey A. and Dr. Geof Goldbogen. “The Integration Analysis Filter: A Software Engineering Technique for Integrating Old and New.” Proceedings of the Fourth International Conference, Expert Systems in Production and Operations Management, May 14-16, 1990.

Context Shifting Is Easy

Today’s discussion asks that you perform a thought experiment.

Imagine that you are sitting in a room with a bunch of other people. All of your chairs face to the front of the room where there is a large desk. A young woman walks in with a stack of papers and places them on the desk. She picks up a piece of chalk from the desk, then, still standing, she turns to face all of you, smiles and begins to speak.

Right here I’m going to pause the narrative and ask that you consider the situation. Imagine it in your head for a moment. What is the context I’ve described?

So what do I mean by context? Well if I were to say that our story so far is a very familiar context for most of us, one we all remember from childhood: an elementary school classroom, then here are some of the things you might expect to happen.

Having now stated a context, you, dear reader, should have an image of yourself sitting quietly at your desk while your teacher imparts some lesson. You also already know many of the basic ground rules of being in a classroom:

  • Pay attention to the teacher
  • Take notes
  • Don’t speak unless the teacher calls on you
  • Raise your hand if you have a question or comment and the teacher will call on you

Do you recognize this context? Feels familiar and comfortable, right? Great! Let’s hold this thought now and count slowly to twenty while we let the memories of this context play about in our heads.

Really, start counting, or you won’t get the total effect:

1, 2, 3, 4, 5

6, 7, 8, 9, 10

11, 12, 13, 14, 15

16, 17, 18, 19, 20

Now let me throw you a little curve ball and tell you that you’ve been thinking about this in the wrong way. The situation I described is not really a classroom and that woman is not a teacher. She’s an actress, presenting a one-woman show about a famous teacher. The desk is a set, the papers just props. You are not in a classroom, you are in a theater made to appear as a classroom. This is just a play and you are a member of the audience. In fact, so there’s no doubt in your mind about this, you suddenly remember you put your ticket stub in your front pocket.

Did you feel that grinding sensation in your head as you read these last few sentences? That shifting from the classroom to the theater context – you should actually be able to feel it happen in your mind. The fact that even this little bit of information has allowed you to sense a shift in context is not a trivial matter. Usually, when you switch contexts like this, it is not nearly so palpable or apparent. We humans are switching contexts all of the time, sometimes in the same sentence. It is one of our particular talents to recognize and adjust our conceptualizations at will when the context changes.

We have just completely switched contexts and you didn’t even need to lift a finger, did you? Just by my saying “this is a play” your expectations have completely changed. Now that we’re in the “performance context”, what has happened to our mutual expectations? First of all, the roles have shifted: instead of a teacher, our woman is an actress, and you, dear reader, are not a student but a member of an audience. As a member of the audience (especially an audience witnessing a play about a teacher) here are some of the different expectations you may now have:

  • If you raise your hand, you may get an usher, but the actress will not respond to you
  • While you will still sit quietly and listen, the expectation is that at the end of the performance, you will clap your hands
  • The actress will provide the audience (hopefully) with an entertainment

So, shifting contexts is easy. And thus, I end this little monologue by pointing out that really, dear reader, we aren’t in a theater either. Instead, we’re sharing a context called “reading a blog entry”. I hope you enjoyed this little exercise!

Example Interaction Between Parent and Child Context

In a previous post, I described in general some of the relationships that could exist between and across a large organization’s sub-contexts. What follows is a short description of some actual observations of how the need for regional autonomy in the examination and collection of taxes affected the use of software data structures at the IRS.

Effect of Context on Systems and Integration Projects

July 15, 2005

Contexts lay claim to individual elements of a syntactic medium. A data structure (syntactic medium) used in more than one context by definition must contain meaningful symbols for each context. Some substructures of the data structure may be purposefully “reserved” for local definition by child contexts. In the larger, shared context, these data structures may have no meaning (see the idea of “traveller” symbols). When used by a child context, the meaning may be idiosyncratic and opaque to the broader context.

One way this might occur is through the agreement across different organizational groups that a certain structure be set aside for such uses. Two examples would include the automated systems at the IRS used respectively for tax examinations and tax collections.

Within the broad context defined by the practitioners of “Tax Examination” which the examination application supports, several child contexts have been purposefully developed corresponding to “regions” of the country. Similar organizational structures have also been defined for “Tax Collection” which the collection application supports. In both systems, portions of the syntactic media have been set aside with the express purpose of allowing the regional contexts to project additional, local meaning into the systems.

While all regions are contained in the larger “Examination” or “Collection” contexts, it was recognized that the sheer size of the respective activities was too great for the IRS central offices to be able to control and react to events on the ground in sufficient time. Hence, recognizing that the smaller regional authorities were in better position to diagnose and adjust their practices, the central authorities each ceded some control. What this allowed was that the regional centers could define customized codes to help them track these local issues, and that each application system would capture and store these local codes without disrupting the overall corporate effort.

Relying on the context defined and controlled by the central authorities would not be practical, and could even stifle innovation in the field. This led directly to the evolution of regional contexts. 

Even though each region shares the same application and uses it in the same way 80 to 90% (even 95%) of the time, each region was permitted to set some of its own business rules. In support of these regional differences in practice, portions of the syntactic medium presented by each of the applications were defined as reserved for use by each region. Often this type of approach would be limited to classification elements or other informational symbols, as opposed to functional markers that would affect the operation of the application.
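To make the idea of a reserved portion of the syntactic medium concrete, here is a small Python sketch. Everything in it is assumed for illustration: the record layout, the region names and the code tables are hypothetical and are not taken from the actual IRS applications. It shows a shared field whose meaning is the same everywhere, next to a reserved field that only the owning region can interpret.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class CaseRecord:
    case_id: str
    region: str
    status: str      # shared symbol: same meaning in every region
    local_code: str  # reserved symbol: meaning defined by each region


# Hypothetical regional code tables; the same code means different things.
REGIONAL_CODES = {
    "NORTHEAST": {"A1": "seasonal business"},
    "SOUTHWEST": {"A1": "cross-border filer"},
}


def local_meaning(record: CaseRecord) -> str:
    """Interpret the reserved field inside the record's own regional context."""
    return REGIONAL_CODES.get(record.region, {}).get(record.local_code, "undefined")


def roll_up(records: list[CaseRecord]) -> Counter:
    """Central roll-up: counts by shared status and never reads local_code."""
    return Counter(r.status for r in records)


records = [
    CaseRecord("1", "NORTHEAST", "OPEN", "A1"),
    CaseRecord("2", "SOUTHWEST", "OPEN", "A1"),
    CaseRecord("3", "SOUTHWEST", "CLOSED", ""),
]
```

Because `roll_up` reads only the shared symbol, the regional codes can change freely without disturbing the central view. In this sketch the reserved field is purely informational, which matches the restriction to classification elements described above.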

This strategy permits the activities across the regions to be rolled up into the larger context nearly seamlessly. If each region had been permitted to modify the functionality of the system, the ability to integrate would be quickly eroded, causing the regions to diverge and the regional contexts to share less and less with time. Eventually, such divergence could lead to the need for new bridging contexts, or in the worst case to the collapse of the unified activity of the broader context.

By permitting some regional variation in the meaning and usage of portions of the application systems, the IRS actually strengthened the overall viability of these applications, and mitigated the risk of cultural (and application system) divergence.

Overlapping Context and Fuzzy Edges

Parent-Child Context Relationships: Intersection/Union

3/1/2005

The following figures depict some notional ideas for how to graphically describe some of the interesting relationships among contexts as they occur in a large, formal organization. The idea occurred to me that there must be some way of describing the similarities and differences in the concepts and discourse of the various subgroups of an organization (any organization). In the diagram, each oval represents a defined organizational group established by the business to allocate and accomplish all of the work necessary for the business to function. Each oval within another oval represents a specific group of individuals working in that business, until we reach the largest oval representing all employees in all groups. Even this largest oval exists in a larger context, that of the culture at large.

The discussion which follows touches on some incomplete ideas about how the concepts, signs and symbols within a given context relate to those of both smaller child and larger parent contexts.

Graphical depiction of Parent Child Contexts

Above: A Bird's Eye View of Nested Contexts; Below: Cross Section View of Nested Contexts

“Inheritance” of concepts flows from the broadest context down to the lowest context. This is not like the inheritance of properties in an object oriented paradigm, so the term may need to be changed. The idea really is that in the absence of an explicit statement of a concept in a lower level context, the members of the community may defer to the definition of that concept from one of the broader contexts that exist above them. In other words, the larger community of humans may have defined the concept and the more detailed context may neglect to reiterate the concept, preferring instead to use the larger context’s definition.

On the other hand, any concept defined in a broader context may be re-defined at a more detailed level. This may or may not be intentional, or even noticed by either members of the larger context or the more insular context. When noticed, it still doesn’t typically cause a problem in normal human discourse, as the humans are able to translate between each context, and hold in their minds each definition.
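The deferral and re-definition just described can be sketched as a lookup that walks up a chain of parent contexts. This is a deliberately simplified model in Python with hypothetical names (`Context`, `resolve`) and invented example definitions. It treats deferral as automatic, which real communities do not guarantee, and it ignores concepts borrowed from outside the hierarchy.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Context:
    name: str
    parent: Optional["Context"] = None
    definitions: dict[str, str] = field(default_factory=dict)

    def resolve(self, concept: str) -> Optional[tuple[str, str]]:
        """Return (defining context, definition), deferring upward when absent."""
        ctx: Optional[Context] = self
        while ctx is not None:
            if concept in ctx.definitions:
                return ctx.name, ctx.definitions[concept]
            ctx = ctx.parent  # not defined locally: defer to the broader context
        return None           # no context in the lineage defines the concept


# Invented example lineage: culture at large > company > billing team.
culture = Context("culture", definitions={
    "customer": "anyone who buys something",
    "region": "a geographic area",
})
company = Context("company", culture, {"customer": "a holder of an open account"})
team = Context("billing team", company)
```

Here `team.resolve("customer")` finds the company's re-definition before it ever reaches the cultural one, while `team.resolve("region")` falls all the way through to the broadest context. Returning the name of the defining context alongside the definition keeps the lineage of each concept visible.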

Contexts at different levels that do not share the same lineage may define a concept in different ways. If their members do not interact under normal circumstances, then there is still not a problem of communication or data integration. However, problems arise out of this layering and locality-driven conceptualization when the information must be shared, either tete-a-tete through direct interface (as happens in workflow integration problems) or through some roll-up to a common conceptual, parent context (as happens in reporting and business intelligence problems). This is the origin of the “single version of the truth” goal that many organizations now take as a given, best practice.

“Inheritance” of concepts flows down. What this means is that concepts defined in the parent’s broader context may still hold meaning in the more narrow child context. Exceptions/replacements are not limited to replacing concepts from the immediate parent, but can happen with any concept above. Each context layer, almost by definition, will define concepts that are uniquely its own, as well. This is one of the sources of intra-organization argument and confusion, as the same terms (syntactic medium) may be used to refer to two slightly (or even grossly) divergent ideas within the same corporate context.

Not every symbol will be meaningful in every child context; the process of transference of concepts can filter out concepts as well as borrow them. At each contextual layer, shared structure may be given different meanings. Lack of specificity/explicitness of definition at a layer does not imply automatic inheritance from above, as it can also reflect a vagueness of thought or lack of agreement about a fringe aspect.

The vacuum created, however, tends to favor the wholesale borrowing of the concept from the parent context.

Each context layer is complete in its own right. The sizes shown in the diagram suggest a size of content but this is just an artifact of the notation. A child context may define an infinite number of concepts over time, just as its parent context does. Theoretically, each context could be depicted or described in full without reference to the broader parent contexts.

Not every concept defined within any particular layer will wind up represented within some application software used by the humans participating in that context. However, if the humans in that context have acquired software to support their activities, the concepts within that system will naturally conform to the context, although they may force the context to be changed to reflect limitations and capabilities that the software imposes.

The reality is of course much more complicated than the diagram suggests. Since the context at each level is defined by the humans who inhabit and communicate within it, new members may introduce or adapt concepts from other contexts that are unrelated to the hierarchy of autonomy and control. Rather than attempt to trace the origin point of concepts across all contexts, it is recommended that these few concepts be considered either of local origin, or as part of a bridging context between the context and the context of origin. The choice should be based only on the value to be gained from either point of view.

Bridging contexts are new contexts established to bridge between some subset of concepts from each of two different contexts. These are established when new information communication between the two contexts is required. The bridging context can be recognized by the relative sparseness of the conceptual inventory, and by the fact that the lineage of the concepts is limited to two (or perhaps a handful at most) otherwise disjoint contexts.

Most transaction oriented interfaces, as well as any data interface between two functionally disparate systems (of any type) are defined within a bridging context limited to just the mediating symbols.
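A bridging context of this kind can be pictured as a sparse mapping between the symbols of two otherwise disjoint contexts. The Python sketch below is an illustration under assumed names: the ordering and warehouse inventories and the `translate` helper are hypothetical and do not describe any particular interface.

```python
# Hypothetical symbol inventories for two otherwise disjoint contexts.
ORDERING = {"cust_no", "sku", "qty", "sales_rep", "promo_code"}
WAREHOUSE = {"account", "item_id", "units", "bin", "picker"}

# The bridging context: only the mediating symbols, each traced to both sides.
BRIDGE = {"cust_no": "account", "sku": "item_id", "qty": "units"}


def translate(order: dict[str, object]) -> dict[str, object]:
    """Carry an ordering record across the bridge, dropping unbridged symbols."""
    return {BRIDGE[key]: value for key, value in order.items() if key in BRIDGE}


# Assumed example record in the ordering context.
order = {"cust_no": 7, "sku": "X9", "qty": 3, "promo_code": "SPRING"}
shipped = translate(order)
```

The sparseness shows up directly: `BRIDGE` holds three symbols while each full context holds more, and anything without a counterpart (such as `promo_code`) simply does not cross.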

The Nature and Experience of Semiosphere Boundaries

I have been having an interesting discussion with Sentence First blogger Stan Carey regarding semiosphere boundaries, and I posted the following comment on his site. I thought I’d repeat it here then elaborate on it.

I’m no expert on Lotman (author of many semiotics papers and coiner of the term “semiosphere”), having only begun to read his work, and I also recognize and agree that there is no such thing as a fixed and easily recognized boundary between semiospheres. Your comment about the boundary really being some sort of “permeable membrane” is one I agree with. I don’t think from what I have read that Lotman would disagree with you on that point, as he describes the boundary in the following way:

Insofar as the space of the semiosphere has an abstract character, its boundary cannot be visualized by means of concrete imagination. Just as in mathematics the border represents a multiplicity of points, belonging simultaneously to both the internal and external space, the semiotic border is represented by the sum of bilingual translatable “filters”, passing through which the text is translated into another language … situated outside the given semiosphere. (“On the Semiosphere”, Juri Lotman)

I do like his biosphere analogy, and it brings to mind another possible analogy that might be useful, namely that of an “ecosystem”. I’ll be looking into that soon. My notion (and as always it is a layperson’s notion) is that the problem of description of a particular ecosystem presents the same puzzle as the identification and description of a semiosphere.

What’s in the ecosystem and what’s outside of it? If we’re talking about a salt marsh ecosystem, for example, where does the geographic border lie? Which creatures are part of the system and which ones are strangers to it (just travelling through)?

If a predator in the woods abutting the salt marsh happens to occasionally eat a creature from the salt marsh when they stray too far from home, does that make the predator part of the salt marsh ecosystem or not? What if they primarily eat forest critters? What if they primarily eat salt marsh critters? What if they eat equal amounts of forest and salt marsh critters?

What we see in this example is that the predator is an edge creature relative to the defined forest and salt marsh ecosystems. When we make this story about a particular individual creature, then whether the predator is in one or another ecosystem is dependent on how that ecosystem has been defined generally.

To the creature, the distinction is meaningless. It lives in both places, walks ground that is sometimes wooded and solid and sometimes muddy and loose. It eats what it can catch from either place. From the predator’s individual point of view, the world consists of bits of both ecosystems. In fact, from their point of view they probably would not recognize that they lived on the margins of two very different environments.

Now add to this the two individual perspectives of a salt marsh prey creature and a forest prey creature. Their typical experience, understanding and adaptation is of the more frequently encountered predators in their milieu. In fact they may have evolved special protections or strategies for foiling these common dangers.

If our predator is mostly a forest feeder, then the forest prey may be well adapted to avoid it, while the salt marsh prey may not. The salt marsh prey in this case may not understand or recognize the danger at all. Or else, if the individual salt marsh creature had spent some time with his pals at the edge of the forest, he may ultimately recognize the predator, although it might take a few moments to react.

Look, an individual creature does not typically experience a disjointed reality. The transition from forest to salt marsh is gradual (but recognizable). Our predator may have a worldview that includes elements of both the forest and the salt marsh. By virtue of this combined perception, the predator may experience what would be considered neither salt marsh nor forest, but the combination and unification of this edge reality.

To turn this back into a discussion of semantics, then…

If we equate our edge creature to a person with knowledge of two different domains (yourself, for example), then we get the same questions: which domain is that person a member of? If he primarily communicates in American vernacular but occasionally uses Irish idioms, is he more American? If the reverse is true, perhaps he is more Irish?

In my mind the distinction is not so important to the individual, but is certainly more important to the people who share more of the “core” and less of the “periphery” (as Lotman described it) of various spheres. But these distinctions are relative, and what is “core” to one person would be “periphery” to another.

Such an edge person can “digest” and understand many aspects of the “core” of each of the semiospheres they experience. But by virtue of their experiences at the edge between, they may not be fully aware of all aspects of those cores. Their experience of the semiosphere (as we saw with our predator example) is also not disjointed, but forms a seamless continuum. It also does not lack for complexity or meaning, even though it does not represent either core. In fact, the experience of the boundary will be exactly the same in form (but not in content) as the experience of someone else in the center of a semiosphere.

I also think that in the case of the semiosphere, as with our ecosystem example, the “boundary” or “permeable membrane” is generated only by the existence of individual creatures who bridge it and cross freely between the domains. In the case of human communication, however, I think we all are “bridging” these gaps all the time, so much so that we don’t usually experience the shift until we are reminded of it by an unfamiliar word. The mere fact of a term’s unfamiliarity proves the case of a boundary condition for the individual.
