Why Comparability Is Critical To Solving The Data Integration Problem

At its most basic, the task of data integration from multiple source systems is one of recognizing the EQUIVALENCY and diagnosing the CONFLICTS among sets of symbols (the data) stored in each system’s data structures (syntactic media). Data integration is accomplished when the conflicts have been eliminated through TRANSFORMATION into new COMMON SYMBOLS which are COMPARABLE at both the syntactic and semantic levels.

The end result of data integration should be that SEMANTICALLY EQUIVALENT (or at least COMPARABLE) data structures become SYNTACTICALLY EQUIVALENT (COMPARABLE) as well. When this result is achieved, the data structures are considered COMPARABLY EQUIVALENT, and the data from the different source systems can be collapsed, combined or integrated correctly.

Structural Comparability

The issue can be characterized as one of the COMPARABILITY of data between systems.

  • Syntactic Comparability is defined by the DATA TYPE and internal DATA STRUCTURE
  • Semantic Comparability is defined by the CONCEPT or MEANING projected onto the data structure by the users of the source system
  • Two data items are COMPARABLE if they share both SYNTACTIC and SEMANTIC COMPARABILITY
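To make the distinction concrete, here is a minimal sketch (with entirely hypothetical type and concept labels) of comparability as a pair of independent checks:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DataItem:
    """A hypothetical description of one data item in a source system."""
    data_type: str   # syntactic: physical type, e.g. "int" or "varchar(9)"
    structure: str   # syntactic: internal layout, e.g. "scalar" or "record"
    concept: str     # semantic: meaning projected onto it by the system's users

def syntactically_comparable(a: DataItem, b: DataItem) -> bool:
    return a.data_type == b.data_type and a.structure == b.structure

def semantically_comparable(a: DataItem, b: DataItem) -> bool:
    return a.concept == b.concept

def comparable(a: DataItem, b: DataItem) -> bool:
    # Comparable only when both the syntactic and semantic levels agree.
    return syntactically_comparable(a, b) and semantically_comparable(a, b)

mileboard = DataItem("int", "scalar", "station")
splc = DataItem("varchar(9)", "scalar", "station")
print(comparable(mileboard, splc))  # False: semantically alike, syntactically not
```

Only when both checks pass can data from the two sources be collapsed or combined directly; otherwise a transformation must come first.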

Typical Conflicts

Typical conflicts occur between and among the data structures originating from different sources.

  • Syntactic Conflicts:
    • Data Type Conflicts
    • Structural Conflicts
    • Key Conflicts
  • Semantic Conflicts:
    • Scale Conflicts
    • Abstraction/Formula Conflicts
    • Domain Conflicts
  • Symbol Conflicts:
    • Naming Conflicts (Synonyms, Homonyms, Antonyms)

Syntactic Conflicts

  • Data Type Conflicts – The same concept projected onto different physical representations. Example: different codes for the same set of options
  • Structural Conflicts – For example, the same concept (referent) represented in one database by only a single attribute in one data source, but as a complete record of attributes in another source.
  • Key Conflicts – Two systems using different unique keys for the same concept.
    • As an example, on a freight rail project I once worked on, one set of systems represented a “station” by the nearest Mileboard number, while another set used an industry-standard designator called a “SPLC”, a code assigned to every reported station on all rail lines in North America.
    • In this example, the two keys conflicted syntactically (Mileboard was an integer, SPLC was a string) and semantically (a Mileboard is only meaningful within the context of a single railroad, being the distance from the origin of the line, while a SPLC is a universal designator across North American railroads).

Semantic Conflicts

  • Scale Conflicts
    • Same data structure but representing different units. For example, corporate revenue represented as currency in both systems, but one using US Dollars and the other Canadian Dollars.
  • Abstraction/Formula Conflicts
    • Same data structure and “symbol”, but two different formulas used to calculate values.
  • Domain Conflicts
    • Similar symbols and data structure, but two different sets of valid values or ranges of values.
    • For example, references to Customers in two systems each have assigned numeric identifiers, but the same customer has different assigned identifiers in each system.
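Both kinds of semantic conflict are typically resolved by explicit transformation rules. The sketch below normalizes a scale conflict (currency) and a domain conflict (customer identifiers); the exchange rate and all identifier values are placeholders, not real data:

```python
# Resolving a scale conflict: normalize revenue to a single currency.
# The exchange rate here is a placeholder, not a real quote.
CAD_PER_USD = 1.35

def to_usd(amount: float, currency: str) -> float:
    if currency == "USD":
        return amount
    if currency == "CAD":
        return amount / CAD_PER_USD
    raise ValueError(f"unknown currency: {currency}")

# Resolving a domain conflict: map each system's customer identifier
# onto a shared master identifier (values invented for illustration).
system1_to_master = {1001: "CUST-A"}
system2_to_master = {77: "CUST-A"}

rev1 = to_usd(500_000.0, "USD")
rev2 = to_usd(675_000.0, "CAD")
assert system1_to_master[1001] == system2_to_master[77]  # the same customer
print(round(rev1 + rev2, 2))  # combined, comparable revenue in USD
```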

Data Integration

The data integration specification documents how the symbols in two (or more) systems are similar and how they are different. The specification describes how the conflicts identified (under the rough categories described above) can be resolved to produce and combine comparable data symbols from each system. From a practical point of view, researching and documenting the conflicts and similarities between symbols in two different systems is the same activity as defining the data integration specification which would be used to automate the integration.
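Such a specification can itself be captured as a data structure that drives the automation. A minimal, hypothetical example (all field and transform names are invented):

```python
# A minimal, hypothetical integration specification: for each target
# field, where it comes from in each source system and which transform
# resolves the documented conflict.
spec = {
    "station_key": {
        "source_a": {"field": "mileboard", "transform": "mileboard_to_common"},
        "source_b": {"field": "splc", "transform": "splc_to_common"},
        "conflict": "key conflict (int vs. string, local vs. universal scope)",
    },
    "revenue_usd": {
        "source_a": {"field": "revenue", "transform": "identity"},   # already USD
        "source_b": {"field": "revenue", "transform": "cad_to_usd"},
        "conflict": "scale conflict (currency units)",
    },
}

# The documentation of conflicts and their resolutions doubles as the
# machine-readable mapping used to run the integration.
for target, entry in spec.items():
    print(target, "<-", entry["conflict"])
```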


Functions On Symbols

Data integration is a complex problem with many facets. From a semiotic point of view, quite a lot of human cognitive and communicative processing capability is involved in the resolution. This post enters the discussion at a point where a number of necessary terms and concepts have not yet been described on this site. Stay tuned, as I will begin to flesh out these related ideas.

You may also find one of my permanent pages on functions to be helpful.

A Symbol Is Constructed

Recall that we are building tautologies showing equivalence of symbols. Recall that symbols are made up of both signs and concepts.

If we consider a symbol as an OBJECT, we can diagram it using a Unified Modeling Language (UML) notation. Here is a UML Class diagram of the “Symbol” class.

UML Diagram of the "Symbol" Object


The figure above depicts how a symbol is constructed from both a set of “signs” and a set of “concepts”. The sign is the arrangement of physical properties and/or objects following an “encoding paradigm” defined by the members of a context. The “concept” is the meaning which that same set of people (context) has projected onto the symbol. When meaning is projected onto a physical sign, a symbol is constructed.

Functions Impact Both Structure and Meaning

Symbols within running software are constructed from physical arrangements of electronic components and the electrical and magnetic (and optical) properties of physical matter at various locations (this will be explained in more depth later). The particular arrangement and convention of construction of the sign portion of the symbol defines the syntactic media of the symbol.

Within a context, especially within the software used by that context, the same concept may be projected onto many different symbols of different physical media. To understand what happens, let’s follow an example. Let’s begin with a computer user who wants to create a symbol within a particular piece of software.

Using a mechanical device, the human user selects a button representing the desired symbol and presses it. This event is recognized by the device, which generates a new instance of the symbol using its own syntactic medium: a pulse of current on a closed electrical circuit on a particular wire. When the symbol is placed in long-term storage, it may appear as a particular arrangement of microscopic magnetic fields of various polarities at a particular location on a semi-metallic substrate. When the symbol is in the computer’s memory, it may appear as a set of voltages on various microscopic wires. Finally, when the symbol is projected onto the computer monitor for human presentation, it forms a pattern of phosphorescence against a contrasting background, allowing the user to perceive it visually.

Note that throughout the last paragraph, I did not mention anything about what the symbol means! The question arises: in this sequence of events, how does the meaning of the symbol get carried from the human, through all of the various physical representations within the computer, and then back out to the human again?

First of all, let’s be clear that at any particular moment, the symbol that the human user wanted to create through his actions actually becomes several symbols: one symbol for each different syntactic representation (syntactic media) required for it to exist in each of the environments described. Some of these symbols have very short lives, while others have longer lives.

So the meaning projected onto the computer’s keyboard by the human:

  • becomes a symbol in the keyboard,
  • is then transformed into a different symbol in the running hardware and operating system,
  • is transformed into a symbol for storage on the computer’s hard drive, and
  • is also transformed into an image which the human perceives as the shape of the symbol he selected on the keyboard.

But the symbol is not actually “transforming” in the computer, at least in the conventional notion of a thing changing morphology. Instead, the primary operation of the computer is to create a series of new symbols in each of the required syntactic media described, and to discard each of the old symbols in turn.

It does this trick by applying various “functions” to the symbols. These functions may affect both the structure (syntactic media) of the symbol, but possibly also the meaning itself. Most of the time, as the symbol is copied and transferred from one form to another, the meaning does not change. Most of the functions built into the hardware making up the “human-computer interface” (HCI) are “identity” functions, transferring the originally projected concept from one syntactic media form to another. If this were not so, if the symbol printed on the key I press is not the symbol I see on the screen after the computer has “transformed” it from keyboard to wire to hard drive to wire to monitor screen, then I would expect that the computer was broken or faulty, and I would cease to use it.

Sometimes, it is necessary or desirable for the computer to apply a function (or a set of functions called a “derivation”) which actually alters the meaning of one symbol (concept), creating a new symbol with a different meaning (and possibly a different structure, too).
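One way to sketch this distinction is to treat a derivation as a pipeline of functions applied to a symbol, where some functions merely change the syntactic medium (identity in meaning) and others alter the meaning itself. The functions below are illustrative stand-ins, not an account of real hardware:

```python
# Sketch: a "derivation" as a pipeline of functions applied to a symbol.
# Identity functions change only the syntactic medium; others change meaning.

def keyboard_to_scancode(ch: str) -> int:
    # Identity in meaning: the same concept, carried by a new syntactic medium.
    return ord(ch)

def scancode_to_char(code: int) -> str:
    # The inverse identity transfer, back to a character form.
    return chr(code)

def to_uppercase(ch: str) -> str:
    # Meaning-altering: produces a (subtly) different symbol.
    return ch.upper()

def derive(symbol, *functions):
    """Apply each function in turn, discarding the old symbol at each step."""
    for f in functions:
        symbol = f(symbol)
    return symbol

print(derive("a", keyboard_to_scancode, scancode_to_char))                # identity overall
print(derive("a", keyboard_to_scancode, scancode_to_char, to_uppercase))  # a new symbol
```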

Tension and Intention: Shifting Meaning in Software

If a software system is designed for one particular purpose, the data structures will have one set of intended meanings. These will be the meanings which the software developers anticipated would be needed by the user community for which the software was built. This set of intended meanings and the structure and supported relationships make up the “domain” of the software application.

When the software is actually put to use, the user community may actually redefine the meaning of certain parts of the syntactic media defined by the developers. This often happens at the edges of a system, where there may exist sets of symbols whose content are not critical to the operating logic of the application, but which are of the right syntactic media to support the new meaning. The meaning that the user community projects onto the software’s syntactic media forms the context within which the application is used. (See “Packaged Apps Built in Domains But Used in Contexts“)

Software typically has two equally important components. One is the capture, storage, retrieval and presentation of symbols meaningful to a human community. The second is a set of symbol transformation processes (i.e., programming logic) which are built in to systematically change both the structure and possibly the meaning of one set of symbols into another set of symbols.

For a simplistic example, perhaps the software reads a symbol representing a letter of the alphabet and transforms it into an integer using some regular logic (as opposed to picking something at random). This sort of transformation occurs a lot in encryption applications, and is a kind of transformation which preserves the meaning of the original symbol while completely changing its sign (syntactic medium).
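A Caesar-style shift is a minimal sketch of such a transformation: the sign changes completely, yet because the mapping follows a regular, invertible rule, the original meaning remains recoverable:

```python
def encode(ch: str, shift: int = 3) -> int:
    # Map a letter onto an integer by regular logic (a Caesar-style shift),
    # changing the sign entirely while keeping the meaning recoverable.
    return (ord(ch.lower()) - ord("a") + shift) % 26

def decode(n: int, shift: int = 3) -> str:
    # The inverse rule restores the original sign, and with it the meaning.
    return chr((n - shift) % 26 + ord("a"))

assert decode(encode("h")) == "h"  # the original symbol is recoverable
print([encode(c) for c in "hello"])  # [10, 7, 14, 14, 17]
```

Had the integers been picked at random instead, the sign would still have changed, but the meaning would have been destroyed rather than preserved.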

When we push data (symbols) from a different source or context into the software application, especially data defined in a context entirely removed from that in which the software was defined and currently used, there are a number of possible ways to interpret what has happened to the original meaning of the symbols in the application.

What are some of the ways of re-interpretation?

  1. The meaning of the original context has expanded to a new, broader, possibly more abstract level, encompassing the meanings of both the original and the new contexts.
  2. Possibly, the mere fact that the original data and the new data have been able to be mixed into the same syntactic media may indicate that the data from the two contexts are actually the same. How might you tell?
  3. Might it also imply that the syntactic medium is more broadly useful, or that the transformation logic are somewhat generically applicable (and hence more semantically benign)?
  4. Are the data from the two contexts cohabiting the same space easily? Are they therefore examples of special cases of a larger, or broader, symbolic phenomenon, or merely a happy coincidence made possible by loose or incomplete software development practices?
  5. How do the combined populations of data symbols fare as maintenance of the software for one of the contexts using it is applied? Does the other context’s data begin to be corrupted? Or is it easy to make software changes to the shared structures? Do changes in the logic and structure supporting one context force additional changes to be made to disambiguate the symbols from the other context?

These questions come to mind (or should) whenever a community starts thinking about using existing applications in new contexts.

Bridge Contexts: Meaning in the Edgeless Boundary

Previously, I’ve written about the idea of the “edgeless boundary” between semiospheres for someone with knowledge of more than one context. This boundary is “edgeless” because to the person perceiving it, there is little or no obvious boundary.

In software systems, especially in situations where different software applications are in use, the boundary between them, by contrast, can be quite stark and apparent. I’ll describe the reasons for this in other postings at a later time. The nutshell explanation is that each software system must be constrained to a well-defined subset of concepts in order to operate consistently. The subset of reality about which a particular application system can capture data (symbols) is limited by design to those regularly observable conditions and events that are of importance to the performance of some business function.

Often (in an ideal scenario), an organization will select only one application to support one set of business functions at a time. A portfolio of applications will thus be constructed through the acquisition/development of different applications for different sets of business functions. As mentioned elsewhere on this site, sometimes an organization will have acquired more than one application of a particular type (see ERP page). 

In any case, information contained in one application oftentimes needs to be replicated into another application within the organization.  When this happens, regardless of the method by which the information is moved from one application to another, a special kind of context must be created/defined in order for the information to flow. This context is called a “bridging context” or simply a “bridge context”.

As described previously, an application system represents a mechanized perception of reality. If we anthropomorphize the application, briefly, we might say that the application forms a semiosphere consisting of the meaning projected onto its syntactic media by the human developers and its current user community, forming symbols (data) which carry the specifically intended meaning of the context.

Two applications, therefore, would present two different semiospheres. The communication of information from one semiosphere to the other occurs when the symbols of one application are deconstructed and transformed into the symbols of the other application, with or without commensurate changes in meaning. This transformation may be effected by human intervention (as through, for example, the interpretation of outputs from one system and the re-coding/data entry into the other), or by automated transformation processes of any type (i.e., other software).

“Meaning” in a Bridging Context

Bridging Contexts have unique features among the genus of contexts overall. They exist primarily to facilitate the movement of information from one context to another. The meaning contained within any Bridging Context is limited to that of the information passing across the bridge. Some of the concepts and facts of the original contexts will be interpretable (and hence will have meaning) within the bridging context only if they are used or transformed during this flow.  Additional information may exist within the bridge context, but will generally be limited to information required to perform or manage the process of transformation.

Hence, I would consider that the knowledge held or communicated by an individual (or system) operating within a bridging context which is otherwise unrelated to either of the original contexts, or to the process of transference, would exist outside of the bridging context, possibly in a third context. As described previously, the individual may or may not perceive the separation of knowledge in this manner.

Special symbols called “travellers” may flow through untouched by transformation and unrecognized within the bridging context. These symbols represent information important in the origin context which may be returned unmodified to the origin context by additional processes. During the course of their trip across the bridging context(s) and through the target context, travellers typically will have no interpretation, and will simply be passed along in an unmodified syntactic form until returned to their origin, where they can then be interpreted again. By this definition, a traveller is a symbol that flows across a bridge context but which only has meaning in the originating context.
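A bridge transformation with travellers can be sketched as follows; the field names, the station mapping, and the unit conversion are all invented for illustration:

```python
# Hypothetical bridge-context transformation: fields the bridge understands
# are transformed; "traveller" fields flow through unmodified and
# uninterpreted, to be read again only back in the origin context.
FIELD_TRANSFORMS = {
    "station": lambda mileboard: f"STN-{mileboard:04d}",  # invented key mapping
    "weight_tons": lambda t: t * 0.907,                   # short tons -> metric tonnes
}
TRAVELLERS = {"origin_audit_tag"}  # meaningful only in the origin context

def cross_bridge(record: dict) -> dict:
    out = {}
    for field, value in record.items():
        if field in FIELD_TRANSFORMS:
            out[field] = FIELD_TRANSFORMS[field](value)   # interpreted and transformed
        elif field in TRAVELLERS:
            out[field] = value  # passed along in unmodified syntactic form
    return out

msg = {"station": 125, "weight_tons": 100.0, "origin_audit_tag": "A#9921"}
print(cross_bridge(msg))
```

Note that the bridge neither inspects nor understands `origin_audit_tag`; it merely guarantees the sign survives the crossing intact.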

Given a path P from context A to context B, the subset of concepts of A that are required to fulfill the information flow over path P are meaningful within the bridging context surrounding P. Likewise, the subset of concepts of B which are evoked or generated by the information flowing through path P, is also part of the content of the bridge context.  Finally, the path P may generate or use information in the course of events which are neither part of context A nor B. This information is also contained within the bridge context.

Bridge contexts may contain more than one path, and paths may transfer meaning in any direction between the bridged contexts. For that matter, it is possible that any particular bridging context may connect more than two other contexts (for example, when an automated system called an “Operational Data Store” is constructed, or a messaging interface such as those underlying Service Oriented Architecture (SOA) components are built).

An application system itself can represent a special case of a bridging context. An application system marries the context defined by the data modeller to the context defined by the user interface designer. This is almost a trivial distinction, as the two are generally so closely linked that their divergence should not be considered a sign of separate contexts. In this usage, an application user interface can be thought of as existing in the end user’s context, and the application itself acts to bridge that end user context to the context defining the database.

The Context Continuum

So my previous post about the “Origins of a Context” was grossly simplistic. That is, however, a good way to get a basic idea out there. Obviously there are many complex factors and layers of influence that affect the extent and content of a context.

One way to look at context is as a continuum from the very small to the very large. This “size” measurement is a reflection of the number of people who share the context, not necessarily the size of the population of concepts and symbols within it.

As I’ve said in other places, a context is defined by its membership first, and its content second.

Hence, by my definition, the smallest context is defined by a single human being. That person would create contexts of a private nature: mementos of their life and personal mnemonics. If the person were artistic, they might create art and artifacts of personal importance. These personal symbols would remain private until the person shares them with someone else.

As soon as they have been shared, even if only with one other person, these artifacts take on additional meaning and become community symbols. Once they have been placed into a larger community, further refinement and re-enforcement of the symbol becomes a community activity. For the original “artist”, their conception can take on a life of its own, and they may lose control over it.

As more and more people become aware of a symbol, the broader the context becomes. But in addition, the symbol itself will begin to change its meaning, either becoming much more generic and broad, or tightening up to some exclusively minimized idea. As soon as this happens (and it happens almost immediately after it begins to be shared) correct interpretation of the symbol must, by definition, take into account which context’s version of the symbol is being considered. Other writers have referred to this issue as one of identifying the “situational” meaning of the symbol, while others talk about the symbol’s “frame”. In my mind these are the same thing as what I’m calling “context”.

So what does this continuum of contexts look like? I’ve drawn a first draft diagram of the smooth transition from personal symbol to the “semiosphere”. It identifies the types and relative sizes of contexts and presents some of the names of their various features. It also shows where in the continuum various types of study and research fall.

I make no claims of absolute accuracy here, and invite comments from experts in these fields (and any others who want to project onto my template).


Continuum of Context from Single Person to Semiosphere



Different Contexts Use Different Signs

The following is an excerpt from one of my permanent pages.

Photo of an Actual Stop Sign In Its Normal Context


In the Context defined for “driving a car in the United States,” a particularly shaped, painted metal plate attached to a wooden post which has been planted in the ground at the intersection of two roads and facing toward oncoming vehicles represents the concept of a command to the oncoming motorist to “stop” their vehicle when they reach the intersection.

However, a similarly colored and shaped object, say a computer bitmap of a drawing of a “stop sign”, not only is represented by a different Syntactic Medium, it exists in an entirely different context (perhaps one that is not obviously recognized by the casual observer).


Cartoon Drawing of a Stop Sign

If this computer bitmap “stop sign” were to be displayed on a large computer monitor, and this computer monitor was used to replace the wood and metal Stop Sign, even if placed in the same position and orientation as the more typical structure, it is not certain that every driver would recognize the validity of the new Syntactic Medium, which could lead to accidents! This example should give the reader a clear understanding of how a Context constrains and defines the physical structures that are permitted to represent the concepts it contains.

Software Applications As Perception

“The agent has a scheme of individuation whereby it carves the world up into manageable pieces.”
K. Devlin, “Situation Theory and Situation Semantics”, whitepaper, Stanford University, 2004.

A software application creates and stores repeated examples of symbols defined within the context of a particular human endeavor, representing a perceived conceptual reality, and encoded into signs using electro-magnetic syntactic media. While the software may be linked through automated sensors to an external environment, it is dependent on human perception and translation to capture and create these symbols. Business applications are almost entirely dependent on human perception to recognize events and observations. That said, while the original “perceptions” are made by human agents, the software, by virtue of the automation of the capture of these perceptions, can be said to “perceive” such events (although this should be considered a metaphor).

Application design is in large part the crystallization of a particular set of perceptions of the world for purposes of providing a regular, repeatable mechanism to record a set of like events and occurrences (data). In essence, the things important to be perceived (concepts) either for their regularity or their utility by some human endeavor (context) will determine the data structures (signs) that will be established, and therefore the data (symbols) that can be recorded by the software system.

The aspects important to the recognition and/or use of these repeated events (e.g., the inferences and conclusions to be derived from their occurrence) determine the features or qualities and relationships that the application will record.

Good application design anticipates the questions that might be usefully asked about a situation, but it also limits the information to be collected to certain essentials. This is done purposefully because of the fundamental requirement that the attributes collected must be perceived and then encoded into the symbology within the limited power of automated perceptual systems (relative to human perceptual channels).

In other words, because a human is often the PERCEIVER for an application, the application is dependent on the mental and physical activity of that person to capture (encode) the events. In this role, while the human may perceive a wealth of information, the limits of practicality imposed by the human-computer interface (HCI) guarantee that the application will record only a tiny subset of the possible information.

This does not pose any particular problem, per se (except in creating a brittleness in the software in the face of future contextual change), but just illustrates further how the context of the application is more significantly constrained than either the perceived reality or even the boundaries formed from the limits of human discourse of the event. This inequality can be represented by this naive formulation:

Μ(Ac) << Μ(Hc)

The meaning contained in the Application A defined by the context c is much less than the meaning (information) contained in the Human H perception of the context.

It is important also to note that:

Μ(Ac) ⊆ Μ(Hc)

The meaning contained in the Application A is a subset of the meaning contained in the Human H.

No aspect of the application will contain more information than what the human can perceive. This is not to imply that humans will necessarily be consciously aware of the information within the application. There are whole classes of applications which are intended to collect information otherwise imperceptible to the human directly. In this manner, applications may act as augmentations of human perceptual abilities. But these applications do not of themselves create new conceptions of reality a posteriori to their development; rather, they are designed explicitly to search for and recognize (perceive) specific events beyond the perception of natural human senses. Even in these situations, the software can only recognize and record symbols representing the subset of possible measurements/features that their human designers have defined for them.

Hence, while software applications may be said to perceive the world, they are limited to the perceptions chosen a priori by their human designers.
