Meaning Over Transformation

This entry is probably ahead of the story, but I wanted to start moving into this subject and I’m not yet organized. It should make more sense later on when I’ve explained such things as the “magical” function M() more thoroughly.

Review: The Magical Function “M()”

As a review for those who may not have seen this function previously on this site, I have invented a mysterious and powerful function over all things used as signs by humans. Named the “M()” function, I can apply it to any symbol or set of symbols of any type and it will return what that symbol represents. I call it the “M() function because it takes something which is a symbol and it returns its meaning (that’s all of its meaning).

How Meaning Carries Over Symbol Transformations

When we move information from one data structure to another, we may or may not use a reversible process. By this I mean that sometimes a transformation is a one-way operation because some of the meaning is lost in the transformation. Sometimes this loss is trivial, but sometimes it is crucial. (Alternatively, there can be transformations which actually add meaning through deductive reasoning and projection. SFAT (story for another time))

Whether a transformation loses information or not, there are some interesting conclusions we can illustrate using my magical, mysterious function M(). Imagine a set β of data structure instances (data) in an anchor state. The full meaning of that data can be expressed as M(β). Now imagine a transformation operation T which maps all of the data in β onto a second set of data Δ.

T : β |–> Δ such that for each symbol σ in β, there is a corresponding symbol δ in Δ that represents the same things, and σ <> δ

By definition, since we have defined T to be an identity function over the meaning of β, then we can conclude that if we apply M() before and after the transformation, we will find ourselves with an equivalence of meaning, as follows:

By definition: T(β) = Δ

Hence: M( T(β) ) ≡ M( Δ )

Also, by definition of T(), then M( β )  ≡ M( T(β) )

Finally, we conclude: M( β ) ≡ M( Δ )

Now, obviously this is a trivial example concocted to show the basic idea of M(). Through the manner by which we have defined our scenario, we get an obvious conclusion. There are many instances where our transformation functions will not produce equivalent sets of symbols. When T() does produce an equivalence, we call it a “loss-less” transformation (borrowing a term from information theory) because no information is lost through its operation.

Another relationship we claim can also be defined in this manner is namely that of semantic equivalence.  This should be obvious as well, from reflection, as I was careful above to refer to “equivalence of meaning”, which is really what I mean when I say two things are semantically equivalent. In this situation, we defined T() as an operation over symbols such that one set of symbols were replaced with a different set of symbols, and the individual pairs of symbols were NOT THE SAME (σ <> δ)! In a most practical sense, what is happening is that we are exchanging one kind of data structure (or sign) with another, such that the two symbols are not syntactically equivalent (they have different signs)  but they remain semantically equivalent. (You can see some of my thoughts on semantic and syntactic equivalence by searching entries tagged and/or categorized “equivalence” and “comparability“.)

A quick example might be a data structure holding a person’s name. Let’s say that within β the name is stored as a string of characters in signature order (first name  middle name  last name) such as “John Everett Doe”. This symbol refers to a person by that name, and so if we apply M() to it, we would recognize the meaning of the symbol to be the thought of that person in our head. Now by applying T() to this symbol, we convert it to a symbol in Δ, also constructed from a string data structure, but this time the name components are listed in phone directory order (last name, first name middle name) such as “Doe, John Everett”. Clearly, while the syntactic presentation of the transformed symbol is completely different, the meaning is exactly the same.

T(“John Everett Doe”) = “Doe, John Everett”

M( T(“John Everett Doe”) ) ≡ M( “Doe, John Everett” )

M( “John Everett Doe” ) ≡ M( T(“John Everett Doe”) )

M( “John Everett Doe” ) ≡ M( “Doe, John Everett” )

“John Everett Doe” <> “Doe, John Everett”

When the transformation is loss less, there is a good chance that it is also reversible, that an inverse transformation T ‘ () can be created. As an inverse transformation, we would expect that T ‘ () will convert symbols in Δ back into symbols in β, and that it will also carry the meaning with complete fidelity back onto the symbols of β. Hence, given this expectation, we can make the following statements about T ‘ ():

T ‘ (Δ) = β

M( T ‘ (Δ) ) ≡ M( β )

By definition of T ‘ (), then M( Δ )  ≡ M( T ‘ (Δ) )

And again: M( Δ ) ≡ M( β )

Extending our example a moment, if we apply T ‘ () to our symbol, “Doe, John Everett”, we will get our original symbol “John Everett Doe”.

Meaning Over “Lossy” Transformation

So what happens when our transformation is not loss-less over meaning? Let’s imagine another transformation which transforms all of the symbols σ in β into symbols ε in Ε. Again, we’ll say that σ <> ε, but we’ll also define T ‘ ‘ () as “lossy over meaning” – which just indicates that as the symbols are transformed, some of the meaning of the original symbol is lost in translation. In our evolving notation, this would be stated as follows:

T ‘ ‘ (β) = Ε

M( T ‘ ‘ (β) ) ≡ M( Ε )

However, by the definition of T ‘ ‘ (), then M( β )  !≡ M( T ‘ ‘ (β) )

Therefore: M( β ) !≡ M( Ε )

In this case, while every symbol in β generates a symbol in Ε, the total information content of Ε is less than that in B. Hence, the symbols of the two sets are no longer semantically equivalent. With transformations such as this, the likelihood that there is an inverse transformation that could restore β from Ε becomes more unlikely. Logically, it would seem there could be no circumstances where β could be reconstituted from Ε alone, since otherwise the information would have been carried completely across the transformation. I don’t outright make this conclusion, however, since it depends on the nature of the information lost.

An example of a reversible, lossy transformation would include the substitution of a primary key value for an entire row of other data which in itself does not carry all of the information for which it is a key, but which can be used in an index fashion to recall the full set of data. For example, if we created a key value symbol consisting of a person’s social security number and last name, we could use that as a reference for that person. This reference symbol could be passed as a marker to another context (from β to Ε, say) where it could be interpreted only partially as a reference to a person. But which person and what other attributes are known about that person in the new context Ε if we define the transformation in such a way that all of the symbols for these other attributes stay in β? Not much, making this transformation one where information is “lost” in Ε.  However, due to its construction from β, the key symbol could still be used on the inverse transformation back to β to reconstitute the missing information (presuming β retains it).

An example of a one-way transformation might be one that drops the middle name and last name components from a string containing a name. Hence, T ‘ ‘ ( “John Everett Doe” ) might be defined to result in a new symbol, “John”. Since many other symbols could map to the same target, creating an inverse transformation without using other information becomes impossible.

Advertisements

Brass Tacks and Comparability

So I thought I should try to explain “comparability” very simply. Reading my previous posts, which were derived from larger texts, I spend a lot of time saying a lot of generalities, and I think the main point is getting missed. So here’s me getting down to brass tacks on the subject.

A computer CPU is a very basic electrical device. Send it a stream of electrons and a command to “add”, and it returns another stream of electrons representing a purely “mechanical” (i.e., unintelligent) electrical result. That CPU doesn’t know anything about semantics, or whether the switches and gates it opens and closes should appropriately be applied to those particular data streams. It just does what it was designed to do given that particular sequence of electron streams. If the streams are comparable before they get to the CPU, then the output will be meaningful. If they are not comparable, then the output (and being a CPU, there will be some output) will not be meaningful.

So the job of the software is to manipulate each symbol before presenting it to the CPU. In particular, the software needs to take each symbol and replace it with one that MEANS the same as the original symbol, but which will present itself to the CPU as COMPARABLE to the other symbols.

Comparability has to be put into the computer, through the software, by a human being. In particular, it is the human who understands when one data stream is not comparable to another, and it is the human being who writes the code to change one stream so that it becomes comparable to the other.

So what really are we talking about? Let me make a non-computer example to show the point.

2 + 00000010 = IV

If I take a pencil and write the above string of characters on a piece of paper, and show it to another computer programmer, after a few moments, I would expect that person to agree that this is a correct mathematical statement

 two plus two equals four

Part of the success of the person in understanding the original statement is that they are able to parse each symbol in the string, interpret the MEANING of each symbol, then translate each into COMPARABLE numeric ideas.

If the computer CPU could experience each symbol as I’ve written it (let’s agree that each of the symbols depicted here would have similar diversity of structure in the computer as they do here on the page), then we can immediately grasp what comparability is. The CPU does not know what the symbols mean, it cannot make the interpretation just by looking at the symbols as they are presented and come to the same conclusion as the human. 

If we look at what I, the human did, to provide you, the reader, with a more readable version of the equation, I replaced each symbol with another one that meant the same, but which appeared as mutually comparable symbols:

  • 2   –>  two
  • +  –>  plus
  • 00000010  –>  two
  • =  –>  equals
  • IV  –>  four

Before the CPU can compare the symbol “2” to the symbol “00000010”, they must both be replaced with two other symbols, each with the standard interpretation of “two”. These new symbols must be structured to flow through the CPU in such a way that their very structure is modified by the CPU to create a third symbol whose standard interpretation has the meaning “four”. The “plus” symbol must be translated into the CPU’s “ADD” instruction, and the “equals” symbol is represented by the stream of electricity leaving the CPU with the resulting symbol.

What is “Comparability”?

What is “comparability“? Basically it is a relationship between two things. If two things are “comparable“, in general parlance, then they are similar in some aspects. They share common features or functions. They are not “equal” necessarily, as there may be important differences between them. In fact, it may be that the interesting aspect of the comparison made between the two objects is in their difference, more than in their similarity. However, a test for equivalence is a very common comparison to make for things that are comparable.

Typically, the comparison will be made with respect to some common constraint, from a particular point of view, or within a particular context. Any two things can be compared, although the meaningfulness an dutility of the comparison is not always guaranteed. The most meaningful/useful comparisons will occur within a context where the two things are strongly similar.

For a simple example, consider comparing ants and humans. In order to do this meaningfully, a context for comparison must be established, and a set of common properties must be recognized. Comparing the “wing span” property of ants an dhumans would be a meaningless comparison, since humans have no wings, and most ants do not either. Comparing the anatomy of an example of each type of creature might form a context where the property “number of limbs” could generate a meaningful result.

Comparing the “strength” property of a human versus an ant may also be meainingless or at least misleading. The absolute strength of the human will be much higher than the absolute strength of the ant. However, comparing the “strength relative to weight” of each creature can tell us something much more interesting. The relative strength, where the weight of objects each creature can pick up is divided by the body mass (weight) of the creature.

Hence, while comparing absolute strength between ants and humans is meaningful, it is not terribly useful. Once the relative strength has been calculated, a meaningful and potentially useful comparison can be made, giving us an “apples to apples” comparison. By adjusting the strength property of each creature, we have created a comparison which is both meaningful and useful.

In this example, it is useful from the standpoint that the comparison is more understandable.

We have effected this improvement in the meaningfulness by establishing the context of comparison through the application of functions to the values of the creatures native properties. In other words, we have applied similar “conversion” functions in similar ways to the ant and human “strength” and “weight” properties to derive two new properties which are <em>more</em> comparable than each of the original values on their own.

The approach we took was to find where two things are analogous – where their similarties lie – and then to translate their analogous properties into meaningful and useful new values which can be compared.

The idea expressed by the term “comparability” implies that there will be similarities between the things compared. It also presupposes the expectation if not the a priori knowledge that there will be some differences, and that the differences between analogous properties can provide insight and knowledge.

Comparability: How Software Works

Back in 1990, I was working on a contract with NASA building a prototype database integration application. This was the dawn of the Microsoft Windows era, as Windows 3.0 had just been released (or was about to be). Oracle was still basically a start-up relational database vendor trying to reach critical mindshare. The following things did not yet exist which we take for granted today (and even think of as kind of out dated):

  • ODBC – allowing standardized access to databases from the desktop
  • Microsoft Access and similar personal data management utilities
  • Java (in fact most of the current web software stack was still just the twinkles in the eyes of their subsequent inventors)
  • Message-based engines, although EDI techniques existed
  • SOA and XML data formats
  • Screen-scrapers, user simulators, ETL utilities…

The point is, it was still largely a research project just to connect different databases that an enterprise might be using. Not only did the data representational difficulties that we face today exist back in 1990, but there was also a complete lack of infrastructure to support remote connection to databases: from network communication protocols, to query interfaces, to security and session continuity functions, even to standardized query languages (SQL was not the dominant language for accessing data back then), and more.

In this environment, NASA had asked us to prototype a generic capability that would permit them to take user search criteria, and to query three different database applications. Then, using the returned results from the three databases, our tool was to generate a single, unified query result.

While generally a successful prototype, during a critical review, it became clear to NASA and to us that maintaining such an application would be horribly expensive, so the research effort was ended, and the final report I wrote was delivered, then put into the NASA archives. It is just as well too, because within five years, much of the functional capabilities we’d prototyped had started to become available in more robust, standards-based commercial products.

What follows is a handful of excerpts from the final report, which while now out of context, still expresses some important ideas about how software symbols actually work. The gist of the excerpt describes how software establishes the comparability and sometimes the equivalence of meaning of the symbols it manipulates.

In a nutshell, software works with memory addresses with particular patterns of voltage (or magnetic field direction) representing various concepts from the human world. Software is constantly having to compare such “structures” together in order to establish either equivalence of meaning, or to alter meaning through the alteration of the pattern through heavily constrained manipulations. The key operation for the computer, therefore, is to establish whether or not two symbols are “comparable“. If they are not comparability, quite literally, then the computer cannot reliably compare them and produce a meaningful result.

Without further ado, here are the important excerpts from the research study’s final report, which I wrote and delivered to NASA in November 1990.

“Database Integration Graphical Interface Tools, Future Directions and Development Plan”, Geoff Howe, November 1990

2.2 The Comparability of Fields

There are many kinds of comparisons that can be made among fields. In databases, the simplest level of comparability is at the data type level. If two fields have the same simple data type (e.g., integer, character, fixed string, real number), then they can be compared to each other by a computer. This level of comparability is called “basal comparability”. Thus, if fields A and B are both integers, they can be combined, compared and related in any way appropriate for two integers.

However, two elements meeting the qualification for basal comparability may still be incomparable at the next level, that of the syntactic level. The syntactic level of comparability is that level in which the internal structure of a field becomes important. Examples of internal formats which might matter and might be important at this level include date formats, identification code formats, and string formats. In order to compare two fields in different formats, one or the other of these fields would have to be converted into the other format, or else both would have to be converted into a third format. The only meaningful comparisons that can be made among the fields of a database or databases must be made at the syntactic level.

As an example, suppose A is a field representing a date in Julian format, and suppose B is a field representing a date in Gregorian format. Assuming that both fields are stored as integers, comparing these dates would be meaningless because they lack the same syntactic structure. In order to compare these dates one or the other of these dates would have to be converted into the other format, or else both would have to be converted into a third format.

Unfortunately, having the same syntactic structure is not a guarantee that two fields can be compared meaningfully by a computer process. Rather, syntactic comparability is the minimum requirement for meaningful comparison by computer process. Another form of comparability must be incorporated as well, that of semantic comparability. Semantic comparability is based on the equivalence of the meanings attached to the contents of some pair of data items. The semantics of data items are not readily available to computer processes directly; a separate description in some form must be used to allow the computer to understand the semantic equivalence of concepts. Once such representation is in place, the computer should be able to reason over the semantic equivalence of concepts.

As an example of semantic comparability consider the PCASS fields, ITEM PART NUMBER from the FMEA PARTS table of the PCASFME subsystem, and CRIT_LRU_PART_# from the CRITICAI LRU table of the PCASCLRU subsystem. Under certain circumstances, both of these fields will hold the part numbers of “line replaceable units” or LRUs. Hence, these fields are semantically comparable. Given a list of the contents of ITEM PART NUMBER, and a similar list for CRIT LRU PART #, the assumption can be made that some of the same “line replaceable units” will be referenced in both lists.

Semantic comparability is useful when integrating data from different databases because it can be used to indicate the equivalence of concepts. Yet, semantic comparability does not imply syntactic comparability, and thus both must be present in order to satisfactorily integrate the values of fields from different databases. A definition of the equivalence of fields across databases can now be offered. Two fields are equivalent if they share the same base type; if their internal syntactic structure is the same; if their representational domains are the same; and if they represent the same concept in all contexts.

2.3 Heterogeneous Data Dictionary Architecture

 The approach which seems to have the most documentary support in the research for solving the integration of heterogeneous distributed databases uses a two-tiered data dictionary to support the construction of location-independent queries. The single data dictionary, used by both the single-site database management system, and the homogenous distributed environment, is split in two across the physical-conceptual boundary. This results in a two-level dictionary where one level describes in detail the physical fields of each integrated database, and the second level describes the general concepts stored across systems. For each unique concept represented by the physical level., there would be an entry in the conceptual level data dictionary describing that concept. Figure 2 shows the basic architecture of the two level data dictionary.

As an example of the difference between the conceptual and physical data dictionary levels, consider again the field PCASFME.FMEA PARTS.ITEM PART NUMBER. This is the full name of the actual field in the PCASS database. The physical level of the data dictionary would have this full name, plus the details of how this field is represented (character string, twelve places long). The conceptual level of the data dictionary would contain a description of the contents of the field, and a conceptual field name, “line replaceable unit part number”. Other fields in other tables of PCASS or in other databases may also have the same meaning. This fact poses the problem of mapping the concept to the physical field, which will be described below. Notice, however, how much easier it would be for a user to be able to recall the concept “line replaceable unit part number”, as opposed to the formal field name. This ease of recall is one of the major benefits of the two-level data dictionary being proposed. Two important relationships exist between the conceptual and physical data dictionaries. One of the relationships between fields of the conceptual level data dictionary and fields of the physical level data dictionary can be characterized as one-to-many. That is, one concept in the conceptual data dictionary could have many physical implementations. Identification of this type of relationship would be a matter of identifying and recording the semantic equivalences across system boundaries among fields at the physical level. All physical fields sharing the same meaning are examples of this one-to-many relationship.

Within the PCASS system, the concept of a line replaceable unit part number” occurs in a number of places. It has already been mentioned that both the ITEM PART NUMBER field of the FMEA_PARTS table, and the CRIT LRU PART # field of the CRITICAI_LRU table, represent this concept. The relationship between the concept and these two fields is, therefore, one-to-many.

The second type of relationship which may also be present, depending on the nature of the existing databases, relates several different concepts to a single field. This relationship is characterized as “many-to-one”. Systems which have followed strict database design rules should result in a situation where every field of the database represents one and only one concept. In practical implementations, however, it is often the case that this rule has not been thoroughly implemented, for a variety of reasons. Thus it is more than likely, especially in large database systems, that some field or set of fields may have more than one meaning under various circumstances. Often, these differences in meaning will be indicated by the values of other associated fields.

As an example of this type of relationship, consider the case of the ITEM PART NUMBER field of the PCASS table FMEA PARTS in the FMEA dataset one-more time. This field can have many meanings depending on the value of the PART TYPE field in the same table. If PART TYPE is set to “LRU”, the ITEM PART NUMBER field contains a line replaceable unit part number. If PART TYFE is set to “SRU”, the ITEM PART NUMBER field actually contains a shop replaceable unit part number. Storing both kinds of part numbers in the same structure is convenient. However, in order to use the ITEM PART NUMBER field properly, the user must know how to read and set the PART TYPE field to disambiguate the meaning of any particular instance of the record. Thus, the PART TYPE field in the physical database must hold either an “SRU” or “LRU” flag to indicate the particular meaning desired at any one time.

In the heterogeneous environment, it may be possible to find a different database in which the same two concepts which have been stored in one filed in one database, are stored in separate fields. It may in fact be possible that in one or more databases, only one of the two concepts has been stored. This is certainly the case among the separate data sets which make up the PCASS system. For example, in the PCASCLRU data set, only the “line replaceable unit part number” concept is stored (in the field, CRIT_LRU_PART_#). For this reason, the conceptual level of the data dictionary must include both concepts. Then there must be some appropriate construct within the data definition language of the data dictionary system which could express the constraints under which any particular field had any particular meaning. In order to be useful in raising the level of data location transparency, these conditional semantics must be entered into the data dictionary using this construct.

It is obvious now that the relationship between entries in the conceptual data dictionary and the physical data dictionary is truly many to many (see Figure 3). To implement such a relationship, using relational techniques, a third major structure (in addition to the set of tables supporting the conceptual data dictionary and the set of tables supporting the physical data dictionary) must be developed to mediate this relationship. This structure is described in the next section.

2.3.1 Conceptual – Physical Data Mapping

As an approach to implement this mapping from conceptual to physical structures, a table must be developed which relates every concept with the fields which represent it, and every field with the concepts it represents. This table will consist of tautological statements of the semantic equivalence of physical fields to concepts. A tautology is a logical statement that is true in all contexts and at all times. In thiis approach, the tautologies take the following form (please note that the “==” operator means “is semantically equivalent to”, not “is equal to”):

 normalized field f == field a from location A

 The normalized field f of the above example corresponds directly to an entry in the conceptual data dictionary. We call the field, f, normalized to indicate that it is a standard form. As will be described later, the comparison of values from different databases will be supported by normalizing these values into the representation described in the conceptual data dictionary for the normalized field.

Conditional semantics must now be added to the structure to support discussion. Given a general representation for a tautology, conditional semantics may be represented by adding logical operations to the right side of the equivalence. Assume that a new database, D, has a field, d1, which is equivalent to the normalized field, f, but only when certain other fields have specific values. Logically, we could represent this in the following manner:

normalized field f == field d1 from location D iff
field d2 from location D = VALUE1 AND
field d3 from location D = VALUE2 AND …
field dn from location D opn VALUEn

 In more general terms, the logical statement of the tautology would be as follows:

 R == P iff  E

where R is the normalized field representation, P is the physical field, and E is the set of equivalence constraints which apply to the relation. In our part number example, the following tautologies would be stored in the mapping:

Line Replaceable Unit Part Number == PCASFME.FMEA.PARTS.ITEM_PART_NUMBER iff PCASFME.FMEA.PARTS.PART_TYPE = “LRU”

Shop Replaceable Unit Part Number == PCASFME.FMEA.PARTS.ITEM_PART_NUMBER iff PCASFME.FMEA.PARTS.PART_TYPE = “SRU”

Line Replaceable Unit Part Number == PCASCLRU.CRITICAL_LRU_CRIT_LRU_PART_#

The condition statements are similar to condition statements in the SQL query language. In fact, this similarity is no accident, since these conditions wilt be added to any physical query in which ITEM PART NUMBER is included.

From a user’s point of view, implementing this feature allows the user to create a query over the concept of a line replaceable unit part number without having to know the conditions under which any particular field represents that concept. In addition, by representing the general – concept of a line replaceable unit part number, something the user would be very familiar with, this conceptual mapping technique has also hidden the details of the naming conventions used in each of the physical databases.

2.4.2 Integrating Data Translation Functions Into the Data Dictionary

In the simplest case, the integration of data translation functions into the data dictionary would be a matter of attaching to the data mapping tautologies described above a field which would store an indication of the type of translation which must occur to transform a result from its Location-specific form into the normalized form. This approach can be simplified further by allowing translations at the basal level to be identified by the source and target data types involved, and not recording any further information about the translation. It may not be unreasonable to assume that in certain well-defined domains, most of the translation functions required would be either identity functions or simple basal translation functions.

It is now possible to define completely the data structure required to store any arbitrary physical-conceptual field mapping tautology. The data structure would consist of the following parts:

  • concept field – a single, unique concept which the physical projection represents
  • normalized – a reference to the conceptual data dictionary entry used to represent the concept
  • physical projection – the field or set of fields from the physical data dictionary which under the conditions specified in the equivalence constraints represent the concept
  • equivalence constraints – the conditions under which the physical projection can be said to represent the concept
  • translation function – the function which must be performed on the physical projection in order to transform it into the normalized format of the normalized field

The logical statement of the tautology would be as follows:

R = Ft (P) iff E

where R is the normalized field representation, Ft is the translation function over the physical projection, P, and E is the set of equivalence constraints which apply to the relation. The exact implementation of this data structure would depend on the environment in which the system were to be developed, and would have to be specified in a physical design document. Note that instead of the “==” sign, which was defined above as “is semantically equivalent to”, has been replaced by “=” which means “is equivalent to”, and is a stronger statement. The “=” implies that not only is the left side semantically equivalent to the right, but it is also syntactically equivalent.

%d bloggers like this: