The Common Features of Data Integration Tools

The tools available in the marketplace for data integration are diverse. To say that there was a standard set of required features for data integration tools would be a bit of a stretch. There is little, at the present moment, in the way of recognition that there are common features and problems in the data integration space. This is due to the fact that companies are not buying products for their ability to unify and integrate their data alone, but rather to solve some other class of problem.

On the other hand, there is a lot of commonality, both in functionality and in presentation or user interaction, among tools in very different tool categories. A certain core set of features appears again and again, and a common graphical depiction has also become nearly ubiquitous among the products.

This stereotypical user interface consists of one or more boxes, each with a list of data element names stacked vertically, along with the ability to connect individual columns from one box to individual columns in a second box by drawing lines between the boxes. Some of the common features of data integration tools include: a data dictionary for the schemas of the company’s applications, automated or semi-automated processes for capturing the basic schema information about these applications, and some way of linking or tying data elements from one schema to another.
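The three common features listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names (`Element`, `DataDictionary`), the method names, and the `crm` and `billing` schemas are all hypothetical and do not come from any particular product.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Element:
    """One data element (column) in an application schema."""
    schema: str
    name: str


@dataclass
class DataDictionary:
    """Captured schemas plus the links drawn between their elements."""
    schemas: dict[str, list[str]] = field(default_factory=dict)
    links: list[tuple[Element, Element]] = field(default_factory=list)

    def capture(self, schema: str, columns: list[str]) -> None:
        # Stand-in for automated schema capture (e.g. reading a catalog).
        self.schemas[schema] = list(columns)

    def link(self, source: Element, target: Element) -> None:
        # Equivalent to drawing a line between two boxes in the GUI.
        for element in (source, target):
            if element.name not in self.schemas.get(element.schema, []):
                raise KeyError(f"unknown element {element.schema}.{element.name}")
        self.links.append((source, target))

    def targets_of(self, source: Element) -> list[Element]:
        # All elements that the given source element has been linked to.
        return [t for s, t in self.links if s == source]


# Hypothetical schemas, purely for illustration.
dictionary = DataDictionary()
dictionary.capture("crm", ["cust_id", "cust_name"])
dictionary.capture("billing", ["customer_no", "full_name"])
dictionary.link(Element("crm", "cust_id"), Element("billing", "customer_no"))
```

Each box in the stereotypical interface corresponds to one entry in `schemas`, and each line drawn between boxes corresponds to one pair in `links`.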

Many products tout their architecture as a major benefit, namely that the product presents some sort of semantic “centralized hub and spoke” model. In addition to the typical features described above, the key features of this architecture are: a language or representation for building a common, unified data (or information) model (e.g., a Common Information Model) spanning the data structures of the corporation’s application systems; a technique and notation for relating the application data structures to this unified model; and the nearly universal marketing pitch touting how centralization reduces or removes the redundancies and inefficiencies inherent in any alternative design that does not use the centralized hub approach.
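A minimal sketch of the hub-and-spoke idea, assuming a toy common model with two fields. The application names, field names, and function names are hypothetical. Each application is mapped onto the common model once, and a record is translated between applications by passing through that model. The `mapping_counts` helper shows the arithmetic behind the usual pitch: one mapping per application for the hub, versus one per pair of applications for point-to-point links.

```python
# Each application maps its own field names onto the shared model once.
# All names here are made up for illustration.
TO_COMMON = {
    "crm": {"cust_id": "customer_id", "cust_name": "customer_name"},
    "billing": {"customer_no": "customer_id", "full_name": "customer_name"},
}


def to_common(app, record):
    """Rename an application's fields to the common model's names."""
    mapping = TO_COMMON[app]
    return {mapping[k]: v for k, v in record.items() if k in mapping}


def from_common(app, common_record):
    """Rename common-model fields back to an application's own names."""
    inverse = {common: local for local, common in TO_COMMON[app].items()}
    return {inverse[k]: v for k, v in common_record.items() if k in inverse}


def translate(source_app, target_app, record):
    """Move a record between applications by way of the hub."""
    return from_common(target_app, to_common(source_app, record))


def mapping_counts(n_apps):
    """Return (hub mappings, pairwise point-to-point mappings)."""
    return n_apps, n_apps * (n_apps - 1) // 2


billing_record = translate("crm", "billing", {"cust_id": 7, "cust_name": "Ada"})
```

The counts say nothing about how hard it is to build the common model itself, which is where much of the real effort tends to go.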


re: On the “Significance of Numbers”

Clinton Mah wrote a post (SemanticHacker: Significance of Numbers) regarding probabilities and semantics, in which he used an example of a colleague eating a can of soup without realizing the implications of the message on the label. This was my response:

Context is always the key. In your example, the number had units (mg) so you have a beginning of the context. Then you also have the ingredient listed (sodium) and the fact that this was on a can of soup, so presumably part of the context is “a person eating a can of soup”. In your example, it is the broader context that provides the meaning to the number.

This broader context includes the fact that the government has mandated that the can be labelled, and that the label indicate the amount of sodium. The government mandated this label so that the manufacturer would communicate this measurement to the customer. As a consumer of the soup, aside from having to read the can’s label, I have to recall that there is a context mandating the label. But the trouble here is not that I don’t know there’s a context; it is rather that I’m not completely “read-in” to that context. I don’t remember that the actual daily recommended amount of sodium is 900 mg. Even if I don’t remember the daily recommended amount, I can presume that there must be one that I could compare to the measurement indicated on the label, because the label exists and carries that measurement.

My experience of the context “eating a can of soup in an environment where the government has forced the manufacturer to report a sodium amount to me” may be incomplete. This just goes to show that I can exist and interact with a context, even if I’m not a principal player in defining it.

Presumably, the meaning of the probability associated with a term should tell me (and any software using the number) the relative confidence in a particular “interpretation” of the term. But this interpretation of the number (not the term) is general. It does not tell me the actual meaning of the term in the “term’s context”; it merely associates the term with several other terms, telling me the ones it is most likely associated with, given other instances of the term experienced earlier.

This meaning of the probability applies to every such probability number in the software system.

Did I miss your point? The meaning of words is context dependent. By capturing your “semantic signature” you really are capturing an approximation of that context, but you have not captured the meaning of the term, and the probability numbers have not either.

Yes, the more terms I put into my search, the more likely it is that I will find other documents from the same context. But as you say, the training set must have included enough samples from that context to make a statistically useful estimate (else I won’t find my context correctly).

I don’t know, I think there must be something a bit different between the actual, experiential semantics in my head and that statistical estimation of semantics that you are describing here. At least in magnitude if not in kind.

Context is Knowing Who You Are Talking To

A few years ago I ran across some very interesting research into the origins of language performed by Luc Steels at the Artificial Intelligence Laboratory of Vrije Universiteit Brussel (see “Synthesising the Origins of Language and Meaning Using Co-Evolution, Self-Organisation and Level Formation”, Luc Steels, July 26, 1996). He basically set up some robots and had them play what he called “language games”. Here’s part of the abstract:

The paper reports on experiments in which robotic agents and software agents are set up to originate language and meaning. The experiments test the hypothesis that mechanisms for generating complexity commonly found in biosystems, in particular self-organisation, co-evolution, and level formation, also may explain the spontaneous formation, adaptation, and growth in complexity of language.

Keywords: origins of language, origins of meaning, self-organisation, distributed agents, open systems.

1 Introduction

A good way to test a model of a particular phenomenon is to build simulations or artificial systems that exhibit the same or similar phenomena as one tries to model. This methodology can also be applied to the problem of the origins of language and meaning. Concretely, experiments with robotic agents and software agents could be set up to test whether certain hypothesised mechanisms indeed lead to the formation of language and the creation of new meaning.

Interestingly enough, I ran into this work a couple of years after I had written down an early musing about context (see my earlier post “The origin of a context”). At the time that I ran into Luc Steels’s research, I was struck by how similarly I had framed the issue. While his experiments were about much more than context, it certainly was encouraging to me that the results of the experiments he carried out corroborated my naive expressions.

Apparently, a few years later (1999-2000), the Artificial Intelligence Laboratory at Vrije Universiteit Brussel continued the experiment, including a much richer experimental and linguistic setup than the original work. The introduction to this further research (apparently funded in part by Sony) even depicts a “robot” conversation very much like the conversation I describe in my post (only with better graphics…).

The basic setup of the experiment was as follows: two computers, two digital cameras, and microphones and speakers permitting the software to “talk”. The cameras had some sort of pointing mechanism (laser pointers, I think) and faced a white board on which various shapes of different colors were arrayed randomly. The two software agents took turns pointing their laser pointers at the shapes and then generating various sounds. As the game continued, each agent would try to mimic the sounds it heard the other make while pointing at specific objects. Over time, the two agents were able to replicate the sounds when pointing at the same objects.
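The turn-taking game described above can be approximated with a toy simulation. This is a deliberately simplified sketch, not the mechanism used in the actual experiments: it assumes the hearer simply adopts whatever word the speaker utters, replaces sounds with short random strings, and skips perception entirely. The `Agent` class, the `play` function, and the object names are all hypothetical.

```python
import random


class Agent:
    def __init__(self, name, rng):
        self.name = name
        self.rng = rng
        self.lexicon = {}  # object -> word

    def speak(self, obj):
        # Invent a random word the first time an object has to be named.
        if obj not in self.lexicon:
            self.lexicon[obj] = "".join(
                self.rng.choice("aeioubdgkt") for _ in range(4)
            )
        return self.lexicon[obj]

    def hear(self, obj, word):
        # Report whether we already agreed, then adopt the speaker's word.
        agreed = self.lexicon.get(obj) == word
        self.lexicon[obj] = word
        return agreed


def play(a, b, objects, rounds, rng):
    """Alternate speaker and hearer roles; return one success flag per round."""
    results = []
    for i in range(rounds):
        speaker, hearer = (a, b) if i % 2 == 0 else (b, a)
        obj = rng.choice(objects)  # the object being "pointed at"
        results.append(hearer.hear(obj, speaker.speak(obj)))
    return results


rng = random.Random(0)
objects = ["red square", "blue circle", "green triangle"]
a, b = Agent("a", rng), Agent("b", rng)
history = play(a, b, objects, rounds=200, rng=rng)
```

Under this adoption rule the first mention of each object fails and every later mention succeeds, so agreement arrives much faster than it would with noisy perception and competing words.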

In terms of what I consider to be context, these experiments showed that it was possible for two “autonomous agents” to come to agreement on the “terminology” they would mutually use to refer to the same external perceptions (images of colored objects seen through a digital camera). Once trained, the two agents could “converse” about the objects, even pointing them out to each other and correctly finding the objects referred to when mentioned.

These experiments also showed that if you take software agents who have been trained separately (with other partner agents) and put them together, they will go through a period of renegotiation of terms and pronunciations. The robot experiments show a dramatic, destructive period in which the robots almost start over, generating an entirely new language, but finally the two agents again converge on something they agree on.

I’m not sure if the study continued to research context, per se. The later study included “mobile” agents and permitted interactions with several agents in a consecutive fashion. This showed the slow “evolution” of the language (a convergence of terminology) among a larger group of agents. I suspect that, unless the experimenters explicitly looked for this, they may have missed this detail (I’d be interested in finding out).

What would have been terrific is if the agent kept track of WHO it was talking to as well as what was being talked about. It is that extra piece of information which makes up a context. If an agent were able to learn the terminology of one agent, then learn the language of another, it could act as a translator between the two by keeping track of who it was talking with (and switching contexts…). Under my view, context is just the recognition of who I’m talking to and thus the selection of the correct variant of language and terminology to adapt to that audience.
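The idea of context as “who I’m talking to” can be sketched as an agent that keeps one lexicon per conversation partner. The `ContextualAgent` class, its method names, and the sample words are hypothetical and serve only to illustrate how keeping track of the partner makes translation possible.

```python
class ContextualAgent:
    """Keeps a separate lexicon for each partner it has talked with."""

    def __init__(self):
        self.lexicons = {}  # partner -> {concept: word}

    def learn(self, partner, concept, word):
        # Record the word this particular partner uses for a concept.
        self.lexicons.setdefault(partner, {})[concept] = word

    def word_for(self, partner, concept):
        # Select the variant of the terminology suited to this audience.
        return self.lexicons.get(partner, {}).get(concept)

    def translate(self, word, speaker, listener):
        # Find the concept the speaker means, then switch contexts and
        # re-express it in the listener's terminology (None if unknown).
        for concept, known_word in self.lexicons.get(speaker, {}).items():
            if known_word == word:
                return self.word_for(listener, concept)
        return None


translator = ContextualAgent()
translator.learn("agent_1", "red square", "bada")
translator.learn("agent_2", "red square", "kito")
translated = translator.translate("bada", speaker="agent_1", listener="agent_2")
```

The only extra information compared with a single shared lexicon is the partner key, which is exactly the piece described above as making up a context.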

Human ability to switch contexts so easily is due to our ability to remember how various concepts are described amongst specific communities. I’ve always said, until the computer can have a conversation with me and then come up with its own data structures to represent the concepts we discuss, I’ll be employed… Now I’m getting a little worried…
