Brass Tacks and Comparability

So I thought I should try to explain “comparability” very simply. Rereading my previous posts, which were derived from larger texts, I see that I spend a lot of time on generalities, and I think the main point is getting missed. So here’s me getting down to brass tacks on the subject.

A computer CPU is a very basic electrical device. Send it a stream of electrons and a command to “add”, and it returns another stream of electrons representing a purely “mechanical” (i.e., unintelligent) electrical result. That CPU doesn’t know anything about semantics, or whether the switches and gates it opens and closes should appropriately be applied to those particular data streams. It just does what it was designed to do given that particular sequence of electron streams. If the streams are comparable before they get to the CPU, then the output will be meaningful. If they are not comparable, then the output (and being a CPU, there will be some output) will not be meaningful.

So the job of the software is to manipulate each symbol before presenting it to the CPU. In particular, the software needs to take each symbol and replace it with one that MEANS the same as the original symbol, but which will present itself to the CPU as COMPARABLE to the other symbols.

Comparability has to be put into the computer, through the software, by a human being. In particular, it is the human who understands when one data stream is not comparable to another, and it is the human being who writes the code to change one stream so that it becomes comparable to the other.

So what really are we talking about? Let me make a non-computer example to show the point.

2 + 00000010 = IV

If I take a pencil and write the above string of characters on a piece of paper and show it to another computer programmer, I would expect that person to agree, after a few moments, that this is a correct mathematical statement:

 two plus two equals four

Part of the success of the person in understanding the original statement is that they are able to parse each symbol in the string, interpret the MEANING of each symbol, then translate each into COMPARABLE numeric ideas.

If the computer CPU could experience each symbol as I’ve written it (let’s agree that each of the symbols depicted here would have similar diversity of structure in the computer as they do here on the page), then we can immediately grasp what comparability is. The CPU does not know what the symbols mean, so it cannot make the interpretation just by looking at the symbols as they are presented and come to the same conclusion as the human.

If we look at what I, the human, did to provide you, the reader, with a more readable version of the equation, I replaced each symbol with another one that meant the same thing, but which appeared alongside the others as a mutually comparable symbol:

  • 2   –>  two
  • +  –>  plus
  • 00000010  –>  two
  • =  –>  equals
  • IV  –>  four

Before the CPU can compare the symbol “2” to the symbol “00000010”, they must both be replaced with two other symbols, each with the standard interpretation of “two”. These new symbols must be structured to flow through the CPU in such a way that their very structure is modified by the CPU to create a third symbol whose standard interpretation has the meaning “four”. The “plus” symbol must be translated into the CPU’s “ADD” instruction, and the “equals” symbol is represented by the stream of electricity leaving the CPU with the resulting symbol.
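The replacement step described above can be sketched in a few lines of Python. This is only an illustration: the function names (`from_decimal`, `from_binary`, `from_roman`) are hypothetical, and the Roman numeral handling is a minimal version that assumes well-formed input.

```python
# Each source symbol uses a different notation for a number. Before the
# CPU's ADD can do anything meaningful, every symbol is replaced by a
# plain integer, which is the one form the adder can treat as comparable.

ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}

def from_decimal(symbol: str) -> int:
    return int(symbol, 10)

def from_binary(symbol: str) -> int:
    return int(symbol, 2)

def from_roman(symbol: str) -> int:
    total = 0
    for i, ch in enumerate(symbol):
        value = ROMAN_VALUES[ch]
        # Subtractive notation: a smaller numeral before a larger one is subtracted.
        if i + 1 < len(symbol) and value < ROMAN_VALUES[symbol[i + 1]]:
            total -= value
        else:
            total += value
    return total

left = from_decimal("2")          # "2"        -> two
right = from_binary("00000010")   # "00000010" -> two
expected = from_roman("IV")       # "IV"       -> four

# Only now is "+" a meaningful operation on the operands.
print(left + right == expected)
```

The human knowledge lives in the choice of which conversion function to apply to which symbol; the addition itself stays purely mechanical.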


Is MDM An Attempt to Reach “Consensus Gentium”?

Consensus gentium

An ancient criterion of truth, the consensus gentium (Latin for agreement of the peoples), states “that which is universal among men carries the weight of truth” (Ferm, 64). A number of consensus theories of truth are based on variations of this principle. In some criteria the notion of universal consent is taken strictly, while others qualify the terms of consensus in various ways. There are versions of consensus theory in which the specific population weighing in on a given question, the proportion of the population required for consent, and the period of time needed to declare consensus vary from the classical norm.*

* “Consensus theory of truth”, Wikipedia entry, November 8, 2008.

The Data Thesaurus

October 25, 2005

So many of IT’s best practices are taken for granted that no one ever asks if there might be a better way. An example of this is in the area of data standards, enterprise data modeling, and Master Data Management (MDM). The core idea of these initiatives is to try to create a single data dictionary in which every concept important to the enterprise is recorded once, with a single standardized name and definition.

The ideal promoted by this approach is that everyone who works with data in the organization will be much more productive if they all follow one naming convention, and if every data item is documented only once. Sounds logical and practical, and yet when we look around for examples of organizations who have managed to successfully create such a document, complete it for ALL of their systems, even commercial software applications, and then who have kept it maintained and complete for more than a year or two, we find very few. In fact in my experience, which has included a number of valiant efforts, I have found no examples.

When one digs into the anecdotal reasons why such success seems so rare, some mixture of the following statements is often heard:

  1. The company lost its will, the sponsor left and so they cut the budget.
  2. It took too long, the business has redirected the staff to focus on “tactical” efforts with short return on investment cycles.
  3. Even after all that work, no one used it, so it was not maintained.
  4. We were fine until the merger, then we just haven’t been able to keep up with the document and the systems integration/consolidation activities.
  5. Our division and their division just never agreed.
  6. We got our part done, but that other group wouldn’t talk to us.

With ultimate failure at the enterprise level the more common experience, it’s surprising that no one involved in the performance and practice of data standardization has questioned what might really be going on. Lots of enterprises have had successes within smaller efforts. Major lines of business may successfully establish their own data dictionaries for specific projects. Yet very few, if any, have succeeded in translating these tactical successes into truly enterprise-altering programs.

What’s going on here is that the search for the “consensus gentium” as the Romans called it, the universal agreement on the facts and nature of the world by a group of individuals, is a never-ending effort. Staying abreast of the changes in the world that affect this consensus is increasingly impossible, if it ever was possible.

The point here is that IT and the enterprise need to stop trying to create a single universal dictionary. It must be recognized that such a comprehensive endeavor is an impossible task for all but the most extravagantly financed IT organizations. It can’t be done because the different contexts of the enterprise are constantly morphing and changing. Keeping abreast of changes costs a tremendous amount in time, effort, and dollars. Proving an appropriate return on investment for such an ongoing endeavor is problematic, and suffers from the problem of diminishing returns.

 A better approach must be out there. One that takes advantage of the tactical point solutions that most enterprises seem to succeed with, while taking into account the practical limitations imposed by the constant press of change that occurs in any “living” enterprise. This blog attempts to document first-principles affecting the entire endeavor, and will build a case based on the human factors which create the problem in the first place.

 A better approach?

Why not build data dictionaries for individual systems or even small groups (as is often the full extent attempted and completed in most organizations)? But instead of trying to extend these point solutions into a universal solution, take a different approach, namely the creation of a “data thesaurus” in which portions of each context are related to each other as synonyms, but only as needed for some particular solution. This thesaurus would track the movement of information through the organization by mapping semantics through and across changes in the “syntactics” of the data carrying this information. The thesaurus would need to track the context of a definition, and that definition would be less abstract and more detailed than those created by the current state of the practice. Links across contexts within the organization would be filled in only as practicality required, as the by-product of data integration projects or system consolidation efforts.
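To make the idea a little more concrete, here is a minimal sketch of what a thesaurus entry and a synonym link might look like. Everything in it is an assumption made for illustration: the class name `DataThesaurus`, the contexts “billing” and “shipping”, and the field names are hypothetical, not taken from any real system.

```python
from collections import defaultdict

class DataThesaurus:
    """Context-qualified definitions plus synonym links recorded only as needed."""

    def __init__(self):
        self.definitions = {}           # (context, term) -> definition text
        self.links = defaultdict(set)   # (context, term) -> linked (context, term) entries

    def define(self, context, term, definition):
        # Each context keeps its own detailed definition; nothing is generalized away.
        self.definitions[(context, term)] = definition

    def link(self, entry_a, entry_b):
        # A synonym link is recorded in both directions.
        self.links[entry_a].add(entry_b)
        self.links[entry_b].add(entry_a)

    def synonyms(self, context, term):
        return sorted(self.links.get((context, term), set()))


thesaurus = DataThesaurus()
thesaurus.define("billing", "CUST_NO", "Account holder who receives invoices")
thesaurus.define("shipping", "CONSIGNEE_ID", "Party who receives the shipment")

# This link would be added as a by-product of an integration project
# that actually needed to relate the two systems.
thesaurus.link(("billing", "CUST_NO"), ("shipping", "CONSIGNEE_ID"))

print(thesaurus.synonyms("billing", "CUST_NO"))
```

Note that a term with no recorded links simply returns an empty list. Unlike a universal dictionary, the thesaurus does not have to be complete to be useful.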

 What’s wrong with the data dictionary of today:

  1. obtuse naming conventions (including local standards and ISO)
  2. abstract data structures that have lost connection with actual data structures
  3. only one name for a concept, when different contexts may have their own colloquialisms – making it hard for practitioners to find “their data”, and even causing the introduction of additional entries for synonyms and aliases as if they were separate things
  4. abstracted or generalized definitions reflecting the “least common denominator” and losing the specificity and nuance present in the original contexts
  5. loss of variations and special cases
  6. detachment from modern software development practices like Agile, XP and even SOA

A Parable for Enterprise Data Standardization (as practiced today)

The enterprise data standard goal of choosing “one term for one concept with one definition” would be the same thing as if the United Nations convened an international standards body whose charter would be to review all human languages and then select the “one best” term for every unique concept in the world. Selection, of course, would be fairly determined to ensure that the term that “best captures” the concept, no matter what the original language was in which the idea was first expressed, would be the term selected. Besides the absurd nature of such a task, consider the practical impossibility of such a task.

First, getting sufficient representation of the world’s languages to make the process fair would require a lot of time. Once started, think of the months of argument, the years and decades that would pass before a useful body of terms would be established and agreed upon. Consider also that while these eggheads were deliberating, life around them would continue. How many new words or concepts would be coined in every language before the first missives would come out of this body? Once an initial (partial) standard was chosen, then the proselytizing would begin. Consider the difficult task of convincing the entire world to stop using the terms of their own language. How would the sale be made? Appealing to some future when “everyone will speak the same language” thus eliminating all barriers to communication most likely. As a person in this environment, how do you learn all of those terms – and remember to use them?

The absurdity of this scenario is fairly clear. Then why do so many data standardization efforts approach their very similar problem in the same way? The example above may be extreme, and some will say that I’ve exaggerated the issue, but that’s just the point I’m trying to make. When one talks with the practitioners of data standardization efforts, they almost always believe that the end goal they are striving for is nothing less than the complete standardization of the enterprise. They may realize intellectually that the job may never be finished, but they still believe that the approach is sound, and that if they can just stay at it long enough, they’ll eventually attain the return on investment and justify their long effort.

If the notion of the UN attempting a global standardization effort seems absurd, then why is the best practice of data standardization the very same approach? If we create a continuum for the application of this approach (see figure) starting at the very smallest project (perhaps the definition of the data supporting a small application system used by a subset of a larger enterprise), and ending at this global UN standardization effort, one has to wonder where along this scale the practical success of the small effort turns into the absurd impossibility of the global effort. If we choose a point on this continuum and say “here and no further” then no doubt arguments will ensue. Probably, there will be individuals who do not find the parable above ridiculous. Likewise, there will be others who believe that trying any standardization is a waste of time. Others might try to rationally put an end point on the chart at the point representing their current employer. These folks will find, however, that their current employer merges with another enterprise in a few months, which then raises the question: is the point of absurdity further out now, at the ends of the combined organization?

Where Is The Threshold of Absurdity in Data Standardization?

Myself, I believe in being practical, as much as possible. The point of absurdity for me is reached whenever the standardization effort becomes divorced from other initiatives of the enterprise and becomes its own goal. When the data standardization focuses on the particular problem at hand, then the return on the effort can be justified. When data standardization is performed for its own sake, no matter how noble or worthy the sentiment expressed behind the effort, then it is eventually going to overextend its reach and fail.

If we all agree that at SOME point on the continuum, attempting data standardization is an absurd endeavor, then we must recognize that there is a limit to the approach of trying to define data standards. The smaller the context, the more the likelihood of success, and the more utility of the standard to that context. Once we have agreed to this premise, the next question that should leap to mind is: Why don’t our data dictionaries, tools, methods, and best practices record the context within which they are defined? Since we agree we must work within some bounds or face an absurdly huge task, why isn’t it clear from our data dictionaries that they are meaningful only within a specific context?

The XML thought leaders have recognized the importance of context, and while I don’t believe their solution will ultimately solve the problems presented by the common multi-context environments we find ourselves working in, it is at least an attempt. This construct is the “namespace” used to unambiguously tie an XML tag to a validating schema.

Data standards proponents and many data modelers have not recognized the importance and inevitability of context to their work. They come from a background where all data must be rationalized into a single, comprehensive model, resulting in the loss of variation, idiosyncrasy and colloquialism from their environments. These last simply become the “burden of legacy systems” which are anathema to the state of the practice.

Why Comparability Is Critical To Solving The Data Integration Problem

At its most basic, the task of data integration from multiple source systems is one of recognizing the EQUIVALENCY and diagnosing the CONFLICTS among sets of symbols (the data) stored in each system’s data structures (syntactic media). Data integration is accomplished when the conflicts have been eliminated through TRANSFORMATION into new COMMON SYMBOLS which are COMPARABLE at both the syntactic and semantic levels.

The end result of data integration should be that SEMANTICALLY EQUIVALENT (or at least COMPARABLE) data structures become SYNTACTICALLY EQUIVALENT (COMPARABLE) as well. When this result is achieved, the data structures are considered COMPARABLY EQUIVALENT, and the data from the different source systems can be collapsed, combined or integrated correctly.

Structural Comparability

The issue can be characterized as one of the COMPARABILITY of data between systems.

  • Syntactic Comparability is defined by the DATA TYPE and internal DATA STRUCTURE
  • Semantic Comparability is defined by the CONCEPT or MEANING projected onto the data structure by the users of the source system
  • Two data items are COMPARABLE if they share both SYNTACTIC and SEMANTIC COMPARABILITY

Typical Conflicts

Typical conflicts occur between and among the data structures originating from different sources.

  • Syntactic Conflicts:
    • Data Type Conflicts
    • Structural Conflicts
    • Key Conflicts
  • Semantic Conflicts:
    • Scale Conflicts
    • Abstraction/Formula Conflicts
    • Domain Conflicts
  • Symbol Conflicts:
    • Naming Conflicts (Synonyms, Homonyms, Antonyms)

Syntactic Conflicts

  • Data Type Conflicts – The same concept projected onto different physical representations. Example: different codes for the same set of options
  • Structural Conflicts – For example, the same concept (referent) represented by only a single attribute in one data source, but as a complete record of attributes in another source.
  • Key Conflicts – Two systems using different unique keys for the same concept.
    • As an example, from a freight rail project I once worked on, one set of systems represented a “station” by using the nearest Mileboard number to the station, while another set used an industry standard designator called a “SPLC” which was a code assigned to every reported station on all rail lines in North America.
    • In this example, the two different keys conflicted syntactically (e.g., Mileboard was an integer, SPLC was a string), and semantically (e.g., Mileboards are only meaningful within the context of a single railroad, being the distance from the origin of the line, while SPLCs are universal designators within the context of North American railroads).
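The rail example can be sketched as a cross-reference table. This is a minimal illustration under stated assumptions: the railroad labels and the SPLC strings below are made-up placeholders, not real designators, and `to_splc` is a hypothetical helper.

```python
from typing import Optional

# Hypothetical cross-reference. A Mileboard is only meaningful together with
# its railroad, so the local key is the pair (railroad, mileboard). The SPLC
# values are placeholder strings, not real codes.
STATION_XREF = {
    ("RR-A", 112): "PLACEHOLDER-001",
    ("RR-B", 112): "PLACEHOLDER-002",  # same Mileboard number, different station
}

def to_splc(railroad: str, mileboard: int) -> Optional[str]:
    """Translate a railroad-local integer key into the industry-wide string key."""
    return STATION_XREF.get((railroad, mileboard))

# The two Mileboards are equal as integers, yet they name different stations.
print(to_splc("RR-A", 112) == to_splc("RR-B", 112))
```

The syntactic conflict (integer versus string) is resolved by the lookup itself. The semantic conflict is resolved by adding the railroad to the key, which restores the context the bare Mileboard number had lost.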

Semantic Conflicts

  • Scale Conflicts
    • Same data structure but representing different units. For example, corporate revenue represented as currency, but one system using US Dollars and the other using Canadian Dollars.
  • Abstraction/Formula Conflicts
    • Same data structure and “symbol”, but two different formulas used to calculate values.
  • Domain Conflicts
    • Similar symbols and data structure, but two different sets of valid values or ranges of values.
    • For example, references to Customers in two systems each have assigned numeric identifiers, but the same customer has different assigned identifiers in each system.
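The scale conflict in the list above lends itself to a short sketch. The exchange rate here is a hypothetical fixed value chosen to keep the arithmetic simple, and `to_usd` is an illustrative helper; a real integration would need a dated rate source.

```python
from decimal import Decimal

# Hypothetical fixed rate, for illustration only.
USD_PER_CAD = Decimal("0.80")

def to_usd(amount: Decimal, currency: str) -> Decimal:
    """Convert a revenue figure to the common unit (US Dollars)."""
    if currency == "USD":
        return amount
    if currency == "CAD":
        return amount * USD_PER_CAD
    raise ValueError(f"no conversion defined for {currency!r}")

# Both systems store the same number in the same data structure...
revenue_system_a = (Decimal("1000"), "USD")
revenue_system_b = (Decimal("1000"), "CAD")

# ...but only after conversion can the values be compared or summed.
a_usd = to_usd(*revenue_system_a)
b_usd = to_usd(*revenue_system_b)
print(a_usd == b_usd)
print(a_usd + b_usd)
```

Nothing in the stored values reveals the conflict. The currency has to be supplied from knowledge of each source system, which is exactly the semantic information the data structure does not carry.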

Data Integration

The data integration specification documents how the symbols in two (or more) systems are similar and how they are different. The specification describes how the conflicts identified (under the rough categories described above) can be resolved to produce and combine comparable data symbols from each system. From a practical point of view, researching and documenting/describing the conflicts and similarities between symbols in two different systems is the same activity as defining the data integration specification which would be used to automate the integration.

What is “Comparability”?

What is “comparability”? Basically it is a relationship between two things. If two things are “comparable”, in general parlance, then they are similar in some aspects. They share common features or functions. They are not “equal” necessarily, as there may be important differences between them. In fact, it may be that the interesting aspect of the comparison made between the two objects is in their difference, more than in their similarity. However, a test for equivalence is a very common comparison to make for things that are comparable.

Typically, the comparison will be made with respect to some common constraint, from a particular point of view, or within a particular context. Any two things can be compared, although the meaningfulness and utility of the comparison is not always guaranteed. The most meaningful/useful comparisons will occur within a context where the two things are strongly similar.

For a simple example, consider comparing ants and humans. In order to do this meaningfully, a context for comparison must be established, and a set of common properties must be recognized. Comparing the “wing span” property of ants and humans would be a meaningless comparison, since humans have no wings, and most ants do not either. Comparing the anatomy of an example of each type of creature might form a context where the property “number of limbs” could generate a meaningful result.

Comparing the “strength” property of a human versus an ant may also be meaningless or at least misleading. The absolute strength of the human will be much higher than the absolute strength of the ant. However, comparing the “strength relative to weight” of each creature can tell us something much more interesting. Relative strength is the weight of the objects each creature can pick up divided by the body mass (weight) of the creature.

Hence, while comparing absolute strength between ants and humans is meaningful, it is not terribly useful. Once the relative strength has been calculated, a meaningful and potentially useful comparison can be made, giving us an “apples to apples” comparison. By adjusting the strength property of each creature, we have created a comparison which is both meaningful and useful.

In this example, it is useful from the standpoint that the comparison is more understandable.

We have effected this improvement in the meaningfulness by establishing the context of comparison through the application of functions to the values of the creatures’ native properties. In other words, we have applied similar “conversion” functions in similar ways to the ant and human “strength” and “weight” properties to derive two new properties which are more comparable than each of the original values on their own.
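As a worked illustration of that conversion, here is a small sketch. The numbers are placeholders chosen to make the arithmetic easy to follow, not measurements, and `relative_strength` is a hypothetical name for the conversion function.

```python
def relative_strength(lifted_mg: int, body_mg: int) -> float:
    """Weight a creature can lift divided by its own body weight."""
    return lifted_mg / body_mg

# Placeholder values, both expressed in milligrams so the units match.
human = {"lifted_mg": 80_000_000, "body_mg": 80_000_000}
ant = {"lifted_mg": 30, "body_mg": 3}

# Absolute comparison: the human wins by an enormous margin.
print(human["lifted_mg"] > ant["lifted_mg"])

# The same conversion function applied to both creatures gives two new
# values that can be compared "apples to apples".
human_relative = relative_strength(**human)
ant_relative = relative_strength(**ant)
print(human_relative, ant_relative)
```

With these placeholder values the ordering reverses once the conversion is applied, which is the kind of insight the absolute comparison hides.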

The approach we took was to find where two things are analogous – where their similarities lie – and then to translate their analogous properties into meaningful and useful new values which can be compared.

The idea expressed by the term “comparability” implies that there will be similarities between the things compared. It also presupposes the expectation if not the a priori knowledge that there will be some differences, and that the differences between analogous properties can provide insight and knowledge.

Comparability: How Software Works

Back in 1990, I was working on a contract with NASA building a prototype database integration application. This was the dawn of the Microsoft Windows era, as Windows 3.0 had just been released (or was about to be). Oracle was still basically a start-up relational database vendor trying to reach critical mindshare. The following things, which we take for granted today (and even think of as kind of outdated), did not yet exist:

  • ODBC – allowing standardized access to databases from the desktop
  • Microsoft Access and similar personal data management utilities
  • Java (in fact most of the current web software stack was still just the twinkles in the eyes of their subsequent inventors)
  • Message-based engines, although EDI techniques existed
  • SOA and XML data formats
  • Screen-scrapers, user simulators, ETL utilities…

The point is, it was still largely a research project just to connect different databases that an enterprise might be using. Not only did the data representational difficulties that we face today exist back in 1990, but there was also a complete lack of infrastructure to support remote connection to databases: from network communication protocols, to query interfaces, to security and session continuity functions, even to standardized query languages (SQL was not the dominant language for accessing data back then), and more.

In this environment, NASA had asked us to prototype a generic capability that would permit them to take user search criteria, and to query three different database applications. Then, using the returned results from the three databases, our tool was to generate a single, unified query result.

While generally a successful prototype, during a critical review, it became clear to NASA and to us that maintaining such an application would be horribly expensive, so the research effort was ended, and the final report I wrote was delivered, then put into the NASA archives. It is just as well too, because within five years, much of the functional capabilities we’d prototyped had started to become available in more robust, standards-based commercial products.

What follows is a handful of excerpts from the final report, which, while now out of context, still express some important ideas about how software symbols actually work. The gist of the excerpts is a description of how software establishes the comparability, and sometimes the equivalence of meaning, of the symbols it manipulates.

In a nutshell, software works with memory addresses holding particular patterns of voltage (or magnetic field direction) that represent various concepts from the human world. Software is constantly having to compare such “structures” in order to establish either equivalence of meaning, or to alter meaning through heavily constrained manipulations of the pattern. The key operation for the computer, therefore, is to establish whether or not two symbols are “comparable”. If they are not comparable, quite literally, then the computer cannot reliably compare them and produce a meaningful result.

Without further ado, here are the important excerpts from the research study’s final report, which I wrote and delivered to NASA in November 1990.

“Database Integration Graphical Interface Tools, Future Directions and Development Plan”, Geoff Howe, November 1990

2.2 The Comparability of Fields

There are many kinds of comparisons that can be made among fields. In databases, the simplest level of comparability is at the data type level. If two fields have the same simple data type (e.g., integer, character, fixed string, real number), then they can be compared to each other by a computer. This level of comparability is called “basal comparability”. Thus, if fields A and B are both integers, they can be combined, compared and related in any way appropriate for two integers.

However, two elements meeting the qualification for basal comparability may still be incomparable at the next level, that of the syntactic level. The syntactic level of comparability is that level in which the internal structure of a field becomes important. Examples of internal formats which might matter and might be important at this level include date formats, identification code formats, and string formats. In order to compare two fields in different formats, one or the other of these fields would have to be converted into the other format, or else both would have to be converted into a third format. The only meaningful comparisons that can be made among the fields of a database or databases must be made at the syntactic level.

As an example, suppose A is a field representing a date in Julian format, and suppose B is a field representing a date in Gregorian format. Assuming that both fields are stored as integers, comparing these dates would be meaningless because they lack the same syntactic structure. In order to compare these dates one or the other of these dates would have to be converted into the other format, or else both would have to be converted into a third format.
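The date example can be sketched as follows. The report excerpt does not spell out the exact layouts, so this sketch assumes “Julian format” means an ordinal date stored as the integer YYYYDDD (year plus day of year) and “Gregorian format” means YYYYMMDD. The function names are hypothetical.

```python
from datetime import date, datetime

def from_julian_int(value: int) -> date:
    """Assumed layout: YYYYDDD, where DDD is the day of the year."""
    return datetime.strptime(f"{value:07d}", "%Y%j").date()

def from_gregorian_int(value: int) -> date:
    """Assumed layout: YYYYMMDD."""
    return datetime.strptime(f"{value:08d}", "%Y%m%d").date()

field_a = 1990305    # day 305 of 1990
field_b = 19901101   # 1 November 1990

# Basal comparability holds (both are integers), so the computer will
# happily compare them, but the result carries no meaning.
print(field_a == field_b)

# Converting both fields into a third, common format makes the comparison meaningful.
print(from_julian_int(field_a) == from_gregorian_int(field_b))
```

Here both integers are converted into a third representation (a `date` object), which is one of the two options the report describes; converting one field into the other's layout would work equally well.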

Unfortunately, having the same syntactic structure is not a guarantee that two fields can be compared meaningfully by a computer process. Rather, syntactic comparability is the minimum requirement for meaningful comparison by computer process. Another form of comparability must be incorporated as well, that of semantic comparability. Semantic comparability is based on the equivalence of the meanings attached to the contents of some pair of data items. The semantics of data items are not readily available to computer processes directly; a separate description in some form must be used to allow the computer to understand the semantic equivalence of concepts. Once such representation is in place, the computer should be able to reason over the semantic equivalence of concepts.

As an example of semantic comparability consider the PCASS fields, ITEM PART NUMBER from the FMEA PARTS table of the PCASFME subsystem, and CRIT_LRU_PART_# from the CRITICAL LRU table of the PCASCLRU subsystem. Under certain circumstances, both of these fields will hold the part numbers of “line replaceable units” or LRUs. Hence, these fields are semantically comparable. Given a list of the contents of ITEM PART NUMBER, and a similar list for CRIT LRU PART #, the assumption can be made that some of the same “line replaceable units” will be referenced in both lists.

Semantic comparability is useful when integrating data from different databases because it can be used to indicate the equivalence of concepts. Yet, semantic comparability does not imply syntactic comparability, and thus both must be present in order to satisfactorily integrate the values of fields from different databases. A definition of the equivalence of fields across databases can now be offered. Two fields are equivalent if they share the same base type; if their internal syntactic structure is the same; if their representational domains are the same; and if they represent the same concept in all contexts.
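The layered definition above can be expressed as a set of small predicates. This is a sketch only: the `FieldDescriptor` class and its attribute names are hypothetical, the example values are invented, and a single `concept` label stands in for “the same concept in all contexts”, which a real system would have to establish through a richer description.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FieldDescriptor:
    name: str
    base_type: str   # basal level, e.g. "integer" or "character"
    syntax: str      # syntactic level: the internal format of the value
    domain: str      # label for the representational domain
    concept: str     # semantic level: the meaning attached to the field

def basal_comparable(a: FieldDescriptor, b: FieldDescriptor) -> bool:
    return a.base_type == b.base_type

def syntactically_comparable(a: FieldDescriptor, b: FieldDescriptor) -> bool:
    return basal_comparable(a, b) and a.syntax == b.syntax

def semantically_comparable(a: FieldDescriptor, b: FieldDescriptor) -> bool:
    return a.concept == b.concept

def equivalent(a: FieldDescriptor, b: FieldDescriptor) -> bool:
    # All four conditions of the definition must hold together.
    return (syntactically_comparable(a, b)
            and a.domain == b.domain
            and semantically_comparable(a, b))

field_a = FieldDescriptor("A", "integer", "YYYYDDD", "calendar dates", "event date")
field_b = FieldDescriptor("B", "integer", "YYYYMMDD", "calendar dates", "event date")

print(basal_comparable(field_a, field_b))
print(semantically_comparable(field_a, field_b))
print(equivalent(field_a, field_b))
```

The two example fields pass the basal and semantic tests but fail the syntactic one, so they are not equivalent until one of them is converted.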

2.3 Heterogeneous Data Dictionary Architecture

The approach which seems to have the most documentary support in the research for solving the integration of heterogeneous distributed databases uses a two-tiered data dictionary to support the construction of location-independent queries. The single data dictionary, used by both the single-site database management system and the homogeneous distributed environment, is split in two across the physical-conceptual boundary. This results in a two-level dictionary where one level describes in detail the physical fields of each integrated database, and the second level describes the general concepts stored across systems. For each unique concept represented by the physical level, there would be an entry in the conceptual level data dictionary describing that concept. Figure 2 shows the basic architecture of the two level data dictionary.

As an example of the difference between the conceptual and physical data dictionary levels, consider again the field PCASFME.FMEA PARTS.ITEM PART NUMBER. This is the full name of the actual field in the PCASS database. The physical level of the data dictionary would have this full name, plus the details of how this field is represented (character string, twelve places long). The conceptual level of the data dictionary would contain a description of the contents of the field, and a conceptual field name, “line replaceable unit part number”. Other fields in other tables of PCASS or in other databases may also have the same meaning. This fact poses the problem of mapping the concept to the physical field, which will be described below. Notice, however, how much easier it would be for a user to be able to recall the concept “line replaceable unit part number”, as opposed to the formal field name. This ease of recall is one of the major benefits of the two-level data dictionary being proposed.

Two important relationships exist between the conceptual and physical data dictionaries. One of the relationships between fields of the conceptual level data dictionary and fields of the physical level data dictionary can be characterized as one-to-many. That is, one concept in the conceptual data dictionary could have many physical implementations. Identification of this type of relationship would be a matter of identifying and recording the semantic equivalences across system boundaries among fields at the physical level. All physical fields sharing the same meaning are examples of this one-to-many relationship.

Within the PCASS system, the concept of a “line replaceable unit part number” occurs in a number of places. It has already been mentioned that both the ITEM PART NUMBER field of the FMEA_PARTS table and the CRIT LRU PART # field of the CRITICAL_LRU table represent this concept. The relationship between the concept and these two fields is, therefore, one-to-many.
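The two levels and the one-to-many relationship might be sketched as follows. This is only an illustration in Python, not the structure of any actual system: the class names, the description text, and the decision to leave the unknown storage details of CRIT_LRU_PART_# empty are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PhysicalField:
    """Physical level: where a field lives and how it is represented."""
    dataset: str
    table: str
    field: str
    data_type: Optional[str] = None   # None = not stated in the text
    length: Optional[int] = None      # None = not stated in the text


@dataclass(frozen=True)
class Concept:
    """Conceptual level: a name and description a user can recall."""
    name: str
    description: str


lru_part_number = Concept(
    name="line replaceable unit part number",
    description="Placeholder description, invented for this sketch.",
)

# One concept, many physical implementations (one-to-many).
implementations = {
    lru_part_number: [
        # Character string, twelve places long, as stated in the text.
        PhysicalField("PCASFME", "FMEA_PARTS", "ITEM_PART_NUMBER", "char", 12),
        # Representation details are not given in the text, so left unset.
        PhysicalField("PCASCLRU", "CRITICAL_LRU", "CRIT_LRU_PART_#"),
    ]
}
```

A user-facing tool would look up the concept by its name and only then consult the physical entries, which is where the ease of recall comes from.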

The second type of relationship which may also be present, depending on the nature of the existing databases, relates several different concepts to a single field. This relationship is characterized as “many-to-one”. Systems which have followed strict database design rules should result in a situation where every field of the database represents one and only one concept. In practical implementations, however, it is often the case that this rule has not been thoroughly implemented, for a variety of reasons. Thus it is more than likely, especially in large database systems, that some field or set of fields may have more than one meaning under various circumstances. Often, these differences in meaning will be indicated by the values of other associated fields.

As an example of this type of relationship, consider the case of the ITEM PART NUMBER field of the PCASS table FMEA PARTS in the FMEA dataset one more time. This field can have many meanings depending on the value of the PART TYPE field in the same table. If PART TYPE is set to “LRU”, the ITEM PART NUMBER field contains a line replaceable unit part number. If PART TYPE is set to “SRU”, the ITEM PART NUMBER field actually contains a shop replaceable unit part number. Storing both kinds of part numbers in the same structure is convenient. However, in order to use the ITEM PART NUMBER field properly, the user must know how to read and set the PART TYPE field to disambiguate the meaning of any particular instance of the record. Thus, the PART TYPE field in the physical database must hold either an “SRU” or “LRU” flag to indicate the particular meaning desired at any one time.
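To make the many-to-one case concrete, here is a small Python sketch of the disambiguation a user (or a program) has to perform. The record layout as a dictionary and the sample part numbers are invented for the example; only the “LRU” and “SRU” flag values come from the description above.

```python
def concept_of_item_part_number(record: dict) -> str:
    """Return the concept that ITEM_PART_NUMBER represents in this record.

    The meaning is not carried by the field itself; it is indicated by
    the value of the associated PART_TYPE field.
    """
    part_type = record["PART_TYPE"]
    if part_type == "LRU":
        return "line replaceable unit part number"
    if part_type == "SRU":
        return "shop replaceable unit part number"
    raise ValueError(f"unrecognized PART_TYPE: {part_type!r}")


# Invented sample records, for illustration only.
records = [
    {"ITEM_PART_NUMBER": "AB-123", "PART_TYPE": "LRU"},
    {"ITEM_PART_NUMBER": "CD-456", "PART_TYPE": "SRU"},
]
meanings = [concept_of_item_part_number(r) for r in records]
```

The same physical column yields two different concepts, which is exactly the knowledge the data dictionary has to capture so that users do not have to carry it in their heads.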

In the heterogeneous environment, it may be possible to find a different database in which the same two concepts which have been stored in one field in one database are stored in separate fields. It may in fact be possible that in one or more databases, only one of the two concepts has been stored. This is certainly the case among the separate data sets which make up the PCASS system. For example, in the PCASCLRU data set, only the “line replaceable unit part number” concept is stored (in the field CRIT_LRU_PART_#). For this reason, the conceptual level of the data dictionary must include both concepts. Then there must be some appropriate construct within the data definition language of the data dictionary system which could express the constraints under which any particular field had any particular meaning. In order to be useful in raising the level of data location transparency, these conditional semantics must be entered into the data dictionary using this construct.

It is obvious now that the relationship between entries in the conceptual data dictionary and the physical data dictionary is truly many to many (see Figure 3). To implement such a relationship, using relational techniques, a third major structure (in addition to the set of tables supporting the conceptual data dictionary and the set of tables supporting the physical data dictionary) must be developed to mediate this relationship. This structure is described in the next section.

2.3.1 Conceptual – Physical Data Mapping

To implement this mapping from conceptual to physical structures, a table must be developed which relates every concept with the fields which represent it, and every field with the concepts it represents. This table will consist of tautological statements of the semantic equivalence of physical fields to concepts. A tautology is a logical statement that is true in all contexts and at all times. In this approach, the tautologies take the following form (please note that the “==” operator means “is semantically equivalent to”, not “is equal to”):

 normalized field f == field a from location A

 The normalized field f of the above example corresponds directly to an entry in the conceptual data dictionary. We call the field, f, normalized to indicate that it is a standard form. As will be described later, the comparison of values from different databases will be supported by normalizing these values into the representation described in the conceptual data dictionary for the normalized field.

Conditional semantics must now be added to the structure to support discussion. Given a general representation for a tautology, conditional semantics may be represented by adding logical operations to the right side of the equivalence. Assume that a new database, D, has a field, d1, which is equivalent to the normalized field, f, but only when certain other fields have specific values. Logically, we could represent this in the following manner:

normalized field f == field d1 from location D iff
field d2 from location D = VALUE1 AND
field d3 from location D = VALUE2 AND …
field dn from location D opn VALUEn

 In more general terms, the logical statement of the tautology would be as follows:

 R == P iff  E

where R is the normalized field representation, P is the physical field, and E is the set of equivalence constraints which apply to the relation. In our part number example, the following tautologies would be stored in the mapping:

Line Replaceable Unit Part Number == PCASFME.FMEA_PARTS.ITEM_PART_NUMBER iff PCASFME.FMEA_PARTS.PART_TYPE = “LRU”

Shop Replaceable Unit Part Number == PCASFME.FMEA_PARTS.ITEM_PART_NUMBER iff PCASFME.FMEA_PARTS.PART_TYPE = “SRU”

Line Replaceable Unit Part Number == PCASCLRU.CRITICAL_LRU.CRIT_LRU_PART_#
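One way to picture the mapping table is as a list of records of the form R == P iff E, searchable from either side. The Python below is a sketch under assumptions: the class and function names are hypothetical, and the fully qualified field names are written as dataset.table.field using the table names FMEA_PARTS and CRITICAL_LRU mentioned earlier.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Tautology:
    concept: str                                        # R, the normalized field
    physical_field: str                                 # P, the physical field
    constraints: Tuple[Tuple[str, str, str], ...] = ()  # E, as (field, op, value)


MAPPING = [
    Tautology(
        "Line Replaceable Unit Part Number",
        "PCASFME.FMEA_PARTS.ITEM_PART_NUMBER",
        (("PCASFME.FMEA_PARTS.PART_TYPE", "=", "LRU"),),
    ),
    Tautology(
        "Shop Replaceable Unit Part Number",
        "PCASFME.FMEA_PARTS.ITEM_PART_NUMBER",
        (("PCASFME.FMEA_PARTS.PART_TYPE", "=", "SRU"),),
    ),
    Tautology(
        "Line Replaceable Unit Part Number",
        "PCASCLRU.CRITICAL_LRU.CRIT_LRU_PART_#",
    ),
]


def fields_for(concept: str) -> list:
    """Concept -> every physical field that can represent it."""
    return [t.physical_field for t in MAPPING if t.concept == concept]


def concepts_for(physical_field: str) -> list:
    """Physical field -> every concept it can represent."""
    return [t.concept for t in MAPPING if t.physical_field == physical_field]
```

Reading the list in both directions shows the many-to-many character of the relationship: one concept reaches two fields, and one field reaches two concepts.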

The condition statements are similar to condition statements in the SQL query language. In fact, this similarity is no accident, since these conditions will be added to any physical query in which ITEM PART NUMBER is included.

From a user’s point of view, implementing this feature allows the user to create a query over the concept of a line replaceable unit part number without having to know the conditions under which any particular field represents that concept. In addition, by representing the general concept of a line replaceable unit part number, something the user would be very familiar with, this conceptual mapping technique has also hidden the details of the naming conventions used in each of the physical databases.
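A rough sketch of how a conceptual query might be expanded into physical queries, with the stored conditions appended as WHERE clauses, is shown below. Everything about it is an assumption for illustration: the function name, the shape of the mapping, the dataset.table.field naming, and the quoting style of the generated SQL (real dialects differ, and literal values should not be interpolated this way for untrusted input).

```python
# Hypothetical mapping: concept -> list of (qualified field, constraints).
# Each constraint is (column, operator, literal) on the same table.
MAPPING = {
    "line replaceable unit part number": [
        ("PCASFME.FMEA_PARTS.ITEM_PART_NUMBER", [("PART_TYPE", "=", "LRU")]),
        ("PCASCLRU.CRITICAL_LRU.CRIT_LRU_PART_#", []),
    ],
    "shop replaceable unit part number": [
        ("PCASFME.FMEA_PARTS.ITEM_PART_NUMBER", [("PART_TYPE", "=", "SRU")]),
    ],
}


def physical_queries(concept, mapping=MAPPING):
    """Expand one conceptual field into one SELECT per physical field."""
    queries = []
    for qualified_field, constraints in mapping[concept]:
        dataset, table, column = qualified_field.split(".")
        sql = f'SELECT "{column}" FROM {dataset}.{table}'
        if constraints:
            # The equivalence constraints become the WHERE clause.
            conditions = [f"\"{col}\" {op} '{val}'" for col, op, val in constraints]
            sql += " WHERE " + " AND ".join(conditions)
        queries.append(sql)
    return queries


for query in physical_queries("line replaceable unit part number"):
    print(query)
```

The user names only the concept; the PART_TYPE condition and the two different column names are supplied by the mapping.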

2.4.2 Integrating Data Translation Functions Into the Data Dictionary

In the simplest case, the integration of data translation functions into the data dictionary would be a matter of attaching to the data mapping tautologies described above a field which would store an indication of the type of translation which must occur to transform a result from its location-specific form into the normalized form. This approach can be simplified further by allowing translations at the basal level to be identified by the source and target data types involved, and not recording any further information about the translation. It may not be unreasonable to assume that in certain well-defined domains, most of the translation functions required would be either identity functions or simple basal translation functions.

It is now possible to define completely the data structure required to store any arbitrary physical-conceptual field mapping tautology. The data structure would consist of the following parts:

  • concept field – a single, unique concept which the physical projection represents
  • normalized field – a reference to the conceptual data dictionary entry used to represent the concept
  • physical projection – the field or set of fields from the physical data dictionary which under the conditions specified in the equivalence constraints represent the concept
  • equivalence constraints – the conditions under which the physical projection can be said to represent the concept
  • translation function – the function which must be performed on the physical projection in order to transform it into the normalized format of the normalized field

The logical statement of the tautology would be as follows:

R = Ft (P) iff E

where R is the normalized field representation, Ft is the translation function over the physical projection, P, and E is the set of equivalence constraints which apply to the relation. The exact implementation of this data structure would depend on the environment in which the system were to be developed, and would have to be specified in a physical design document. Note that the “==” sign, which was defined above as “is semantically equivalent to”, has been replaced by “=”, which means “is equivalent to” and is a stronger statement. The “=” implies that the left side is not only semantically equivalent to the right, but also syntactically equivalent.
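As a minimal sketch of R = Ft(P) iff E, the five parts listed above could be held in one record, with the translation function applied only when the equivalence constraints hold. The class name, the dictionary-based record, and the two sample translation functions are assumptions; the text does not say what the real translations would be.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


def identity(value: str) -> str:
    """Translation for values already in normalized form."""
    return value


def strip_and_upper(value: str) -> str:
    """A made-up example of a simple translation into a normalized form."""
    return value.strip().upper()


@dataclass
class FieldMapping:
    concept: str                      # concept field
    normalized_field: str             # R: conceptual dictionary entry
    physical_projection: str          # P: physical field name
    equivalence_constraints: Dict[str, str] = field(default_factory=dict)  # E
    translation: Callable[[str], str] = identity                            # Ft

    def normalize(self, record: dict) -> Optional[str]:
        """Return Ft(P) if every constraint in E holds for the record, else None."""
        for name, required in self.equivalence_constraints.items():
            if record.get(name) != required:
                return None
        return self.translation(record[self.physical_projection])


lru_mapping = FieldMapping(
    concept="line replaceable unit part number",
    normalized_field="LRU_PART_NUMBER",   # hypothetical normalized name
    physical_projection="ITEM_PART_NUMBER",
    equivalence_constraints={"PART_TYPE": "LRU"},
    translation=strip_and_upper,
)
```

Values from different databases become comparable only after each has passed through its own mapping into the same normalized representation.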

Unmanage Master Data Management

Master Data Management is a discipline which tries to create, maintain and manage a single, standardized conceptual information model of all of an enterprise’s data structures. It takes as its goal that all IT systems eventually will be unified under a single semantic description, so that information from all corners of the business can be understood and managed as a whole.

In my opinion, while I agree with the ultimate goal of information interoperability across the enterprise, I disagree with the approach usually taken to get there. A strategy that I might call:

  • Data Management with Multiple Masters
  • Uncontrolled/Unmanaged Master Data Management
  • Associative Search on an Uncontrolled Vocabulary
  • Emergent Data Management (added 2015)
  • Master-less Data Management (added 2015)

takes a different approach. The basic strategy is to permit multiple vocabularies to exist in the enterprise (one for each major context that can be identified). Then we build a cross reference of the semantics only describing the edges between these contexts (the “bridging” contexts between organizations within the enterprise), where interfaces exist. The interfaces that would be described and captured in this way would include non-automated ones (e.g., human mediated interfaces) as well as the traditionally documented software interfaces.

Instead of requiring that the entire content of each context be documented and standardized, this approach would provide the touchpoints between contexts only. New software (or business) integration tasks which the enterprise takes on would require new interfaces and new extensions of mappings, but would only have to cover the content of the new bridging context.

Information collected and maintained under this strategy would include the categorization of data element structures as follows:

  1. Data structure syntax and basic manipulations
  2. Origin Context and element Role (for example, markers versus non-markers)
  3. Storage types: transient (not stored), temporary (e.g., staging schemas and work tables), permanent (e.g., structures which are intended to provide the longest storage)
  4. “Pass-through” versus “consumed” data elements. Also called “traveller” and “fodder”, these data structures and elements have no meaning and possibly no existence (respectively) in the Target Context.

For data symbols that are just “passing through” one context to another, these would be the traveller symbols (as discussed on one of my permanent pages and in the glossary) whose structure is simply moved unchanged from one context to the next, until it reaches a context which recognizes and uses them. “Fodder” symbols are used to trigger some logic or filter to change the operation of the bridging context software, but once consumed, do not move beyond the bridge.
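The difference between travellers and fodder at a bridge can be shown with a toy example. The Python below is only a sketch: the function, the field names, and the “SEND” flag value are all invented, and a real bridging context would do far more than rename keys.

```python
def bridge(record, traveller_keys, fodder_key, rename):
    """Move one record from an origin context to a target context.

    - The fodder element steers the bridge logic and is consumed here.
    - Traveller elements are copied through unchanged.
    - Remaining elements are mapped onto the target context's vocabulary.
    """
    # Fodder: used to filter, never forwarded.
    if record.get(fodder_key) != "SEND":
        return None

    out = {}
    for key, value in record.items():
        if key == fodder_key:
            continue                      # consumed at the bridge
        if key in traveller_keys:
            out[key] = value              # traveller: same name, same value
        elif key in rename:
            out[rename[key]] = value      # translated to the target's name
    return out


origin_record = {"action": "SEND", "trace_id": "abc", "cust_no": "42"}
target_record = bridge(
    origin_record,
    traveller_keys={"trace_id"},
    fodder_key="action",
    rename={"cust_no": "customer_id"},
)
```

Only the elements that cross the bridge need to be documented: the traveller is recorded as passing through, the fodder as consumed, and the renamed element as a semantic equivalence between the two contexts.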

The problem that I have encountered with MDM efforts is that they don’t try to scope themselves to what is RECOGNIZABLY REQUIRED. Instead, the focus is on the much larger, much riskier effort of the attempted elimination of local contexts within the enterprise. MDM breaks down the moment it becomes divorced from a practical, immediate attempt to capture just what is needed today. The moment it attempts to “bank” standard symbols ahead of their usage, the MDM process becomes speculative and prescriptive. The likelihood of wasting time on symbology which ultimately is wrong and unused is very high, once steps past the interface and into the larger contexts are taken.

Uses of Metamorphic Models in Data Management and Governance

In the Master Data Management arena, Metamorphic Models would allow the capture of the data elements necessary to stitch together an enterprise. By recognizing the information needed to pass as markers or to act as travellers, the scope of the data governance task should be reducible to a practical minimum.

Then the data governance problem can be built up only as needed. The task becomes, properly, just another project-related activity similar to Change Control and Risk Management, instead of the academic exercise into which it often devolves.

The scope of data management should focus on and document 100% of the data being moved across interfaces, whether these interfaces are automated or human-performed. Simple data can just be documented, and the equivalence of syntax and semantics captured. Data elements that act as markers for the processes should be recorded. Also all data elements/structures intended merely to make the trip as travellers should be indicated.

This approach addresses the high-value portion of the enterprise’s data structures, while minimizing work on documenting concepts which only apply within a particular context.

What’s in a Name: Not That Much, Actually

The referenced paper is seminal. The comments that appear here are largely unaltered from when I first wrote them back in 1989. I follow this older writing with some additional conclusions, looking back over twenty years of experience working with data.

September 23, 1989:

When parsing a record-based system’s data, the software developer is faced with all of the problems of data structure semantics described by W. Kent (in William Kent, “Limitations of Record Based Information Models”, ACM Transactions on Database Systems 4(1), March 1979. Also in John Mylopoulos and Michael Brodie (eds), Readings in Artificial Intelligence and Databases, Morgan Kaufmann, San Mateo, California, 1989. [20 pp]).

Field naming problems can be handled by naming all fields with a field number, then providing synonyms for all fields. I gave each field a “name” similar to the name in the original system, which was possibly meaningless. This name was to allow for maintenance and information mapping between systems. Then, using synonyms, I could give a more semantically significant name to the field. The record is just a place keeper – the concept represented is buried in the code supporting the use of the record, or perhaps by agreement (explicit or implicit) among the designers and users of the system. When this agreement is verbal, or worse, implied by training, that’s when the trouble arises: idiosyncratic usage enters the picture, along with the possibly disastrous loss of meaning accompanying the departure of those whose concept is being represented.

November 1, 2009:

This note was just one of several ideas I was toying with as I worked on a thesis paper for my Masters. The project I was working on was to integrate and add expert system capabilities (using Prolog) to an existing business application built on top of COBOL fixed record structures. What it describes is the idea I used to get around the very badly named columns of the COBOL records in order to improve the effectiveness and readability of the Prolog code. The basic trick was to put into the Prolog knowledgebase multiple names for the same data structures and attach to these Prolog structures logic statements that permitted the statement (in nearly human-language terms) of logical constraints.
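The original was Prolog over COBOL records, which I won’t try to reproduce here. As a rough stand-in, the same trick of numbered fields plus synonyms looks something like this in Python. The field numbers, offsets, and synonym names are all made up for the sketch.

```python
# Fixed-width record layout keyed by meaningless but stable field numbers.
# Offsets are (start, end) character positions; all values are hypothetical.
RECORD_LAYOUT = {
    "F001": (0, 12),
    "F002": (12, 15),
}

# Any number of human-friendly names may point at the same field number.
SYNONYMS = {
    "item_part_number": "F001",
    "lru_part_number": "F001",
    "part_type": "F002",
}


def read_field(record_line: str, name: str) -> str:
    """Fetch a field by field number or by any of its synonyms."""
    field_id = SYNONYMS.get(name, name)
    start, end = RECORD_LAYOUT[field_id]
    return record_line[start:end].strip()


line = "AB-123".ljust(12) + "LRU"
part_number = read_field(line, "lru_part_number")
part_type = read_field(line, "F002")
```

The code that does the work only ever sees the field number; the synonyms exist purely for the people reading and writing the logic.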

In later years, I have come to recognize that this problem of naming conventions within code, while important to an extent, is not as important as some practitioners think. The fact of the matter is that the computer couldn’t care less what the column name of a table is, or the variable name within a program, etc. For all the computer cares, so long as the programming code references the right data structure at the right moment consistently, the actual references might as well be unique, semantically meaningless numbers.

Naming conventions are for the humans who have to write and maintain the code, or, more generally, who have to directly interact with the data structures. And while there can often be contentious, protracted debate amongst software developers on the “right” naming convention for various situations, in my mind, it is not usually worth the amount of attention it gets during development.

If left to my own devices, then the naming convention I try to impose is as richly semantic as possible. Column names and table names are as close to expressing the intended content, down to including qualifying adjectives, and role names to an appropriate, context-specific noun. The context I select the name from is defined by the context of the problem domain for which the software is being written. I also try to be very consistent in the use of names and name parts from one end to the other of whatever system I’m working on.

If the system already has a naming convention, so long as it can be written down in a set of repeatable rules, I’ll use whatever it is. Oftentimes I find I have to rationalize and standardize terms used previously, due to the fact that at different times, different developers may have used different conventions.

I have participated in efforts at making a universal naming convention, and these have all ultimately hit a wall and been stopped (the reasons for this have been to this point the primary subject of this blog – even if I haven’t explicitly described the scenario yet). Namely, the cross-context politics, long initial duration, required ongoing maintenance activities and ultimately the diminishing returns of such efforts cause them to sink from their own weight.

But even when I have had complete control over the data structure development, and I have had time to craft the “perfect” name for each column, even when I’ve checked and double checked and triple checked that I have consistently applied the same naming convention from one end of the system to the next, once my software has gone into use, it hasn’t taken long for the user community to start redefining the meaning of some aspect of the data structure. Or, the requirement changes and the programming team must change the usage of one of my finely-crafted data structures so that it supports a new meaning, not reflected in that finely crafted name.

This can be frustrating, and it can also pose a long term hazard to the maintenance of the system, as either the original meaning or the new meaning becomes a minority of the usage. But it is not the end of the world, and it does not always break the software if the code is changed to handle the new meaning correctly.

However, it does mean that the actual name of the field no longer reflects the contents it holds. But if the code is working properly, the name no longer matters to the operation of the system. Plus, the maintenance problem such a change presents is also no big deal, so long as the revised meaning is captured in an appropriate dictionary and made available to the programming team for future reference.

Why is this the case? The real truth is that the data structure stores symbols which have a meaning within a context defined by the USERS of the software. The data structures merely represent the SYNTAX of the symbols, consisting of the data type of the symbol, and the manipulations of the symbol performed by the code. So long as the manipulations are applied appropriately to the correct part of the syntax, no matter HOW it is named, then the software will manage the MEANING intended by the USERS, in spite of, not because of, the naming convention of the data structure.

Hence, what’s in a name used on a data structure? From the computer’s point of view, not so much. From the human’s point of view, since the meaning can change over time, the name shouldn’t be trusted until the code has been reviewed to confirm the content. So there again, not so much…
