Posted to users@jena.apache.org by Paolo Castagna <ca...@googlemail.com> on 2012/04/05 23:37:06 UTC

Pivotal data conversion/integration... (Was: Mapping Ontologies)

Hi Bernie

Bernie Greenberg wrote:
> [...] are you trying to "union" two or more web knowledge databases
> representing parts of the same knowledge? I found this a thankless task.

'thankless' is my new word today. :-)

To understand what you mean, I needed to go to a common place for the English
language (i.e. a dictionary) and read the definition (which, fortunately, uses
words I already know).

I agree with you on the adjective: it is thankless.

RDF itself, in relation to information|data|knowledge integration, does not
IMHO offer particular advantages on a 'semantic' level, in particular if|when
people use different vocabularies|schemas|ontologies. RDF does help with
merging datasets at a sort of 'syntactic' level, which is trivial (and it buys
you time to think about the 'semantics' :-)). If the data you need to merge
uses the same vocabulary|schema|ontology you are almost done. Otherwise, you
are practically on your own. This is just my humble opinion.

By the way, people often disagree on how to model the same thing, or how to map
between two ontologies (or translate between two languages)... or how to refer
to the same thing with the same name (or URI), or even on the notion of "same
thing". Trying to automate these tasks is thankless^2.

In relation to data integration/conversion, one approach I think works very well
is what Wikipedia calls 'pivotal conversion' [1]. Data integration and data
conversion between N different formats (or N different languages) is an N^2
problem: converting directly between every pair needs on the order of N*(N-1)
converters. But it can be reduced to a linear one, roughly 2*N converters,
simply by adopting a core/common language. English for humans, TCP/IP for the
Internet, ? for data.

With a pivotal data conversion/integration approach, it's very cheap to add a
new format to your system, in particular if it is possible to transform from one
format into another without losing information. You only need to convert from/to
the common format. If you do that, you automatically gain the conversion from/to
all the other formats in the system.
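
To make the pattern concrete, here is a minimal Java sketch of a pivot-based
converter. Everything in it (FormatReader, FormatWriter, PivotConverter) is a
hypothetical name invented for this example; the only real API used is Apache
Jena's Model, which plays the role of the common format (package names are
those of current Jena releases).

  import java.io.InputStream;
  import java.io.OutputStream;
  import java.util.HashMap;
  import java.util.Map;

  import org.apache.jena.rdf.model.Model;
  import org.apache.jena.rdf.model.ModelFactory;

  // Hypothetical plug-in interfaces: each format only talks to the pivot (a Jena Model).
  interface FormatReader {
      void read(InputStream in, Model pivot) throws Exception;
  }

  interface FormatWriter {
      void write(Model pivot, OutputStream out) throws Exception;
  }

  // With N formats you register N readers and N writers (2*N components),
  // yet you can convert between any of the N*(N-1) ordered pairs of formats.
  class PivotConverter {
      private final Map<String, FormatReader> readers = new HashMap<String, FormatReader>();
      private final Map<String, FormatWriter> writers = new HashMap<String, FormatWriter>();

      void register(String format, FormatReader reader, FormatWriter writer) {
          readers.put(format, reader);
          writers.put(format, writer);
      }

      void convert(String from, String to, InputStream in, OutputStream out) throws Exception {
          Model pivot = ModelFactory.createDefaultModel(); // the common, in-memory representation
          readers.get(from).read(in, pivot);                // source format -> pivot
          writers.get(to).write(pivot, out);                // pivot -> target format
      }
  }

Adding a new format is then a matter of writing one reader and one writer;
every other format already in the registry immediately becomes reachable
from it.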

Why do more and more people speak English? Because everybody else does, and this
is the easiest way to communicate with everybody else. Unfortunately, human
language is not as precise as other types of communication formats: when you go
back and forth you lose information, and translating from one language to
another is not a precise process.

RDF as well as OWL ontologies can be used in this way, as the core/common data
format. This is easier on a syntactic level and it can become harder and more
imprecise as the expressive power of your language grows. However, you can still
map external OWL ontologies to your own view of the world, your own internal
core ontology. When you do that, your RDF toolbox has tools which allow you to
translate RDF data described with an external ontology into data you can easily
integrate and transform into other ontologies.
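
For example, a SPARQL CONSTRUCT query executed with Jena is often enough to
rewrite data described with an external vocabulary into your internal core one.
A minimal sketch (the ext: and core: vocabularies and the file name are made up
for this example; package names are those of current Jena releases):

  import org.apache.jena.query.QueryExecution;
  import org.apache.jena.query.QueryExecutionFactory;
  import org.apache.jena.rdf.model.Model;
  import org.apache.jena.rdf.model.ModelFactory;

  public class MapToCoreOntology {
      public static void main(String[] args) {
          // Data described with somebody else's ontology
          Model external = ModelFactory.createDefaultModel();
          external.read("external-data.ttl");

          // Rewrite ext:fullName into the property our core ontology expects
          String construct =
              "PREFIX ext:  <http://example.org/external#>\n" +
              "PREFIX core: <http://example.org/core#>\n" +
              "CONSTRUCT { ?s core:name ?name }\n" +
              "WHERE     { ?s ext:fullName ?name }";

          QueryExecution qe = QueryExecutionFactory.create(construct, external);
          try {
              Model internal = qe.execConstruct(); // same data, core ontology terms
              internal.write(System.out, "TURTLE");
          } finally {
              qe.close();
          }
      }
  }

Once the data is expressed in your core ontology, everything else in the system
(rules, SPIN templates, writers for other formats) only needs to know about that
one vocabulary.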

To make things less abstract, here are three IMHO good examples of pivotal data
integration|conversion:

 - Hojoki: Make All Your Cloud Apps Work As One
   http://hojoki.com/

 - Open Services for Lifecycle Collaboration
   http://open-services.net/ and http://eclipse.org/lyo/

 - SIMILE | Babel
   http://service.simile-widgets.org/babel/

Hojoki is really cool and you can measure how fast they keep adding new
services, each time adding more and more value for their users. For them, adding
a new service is easy. A beautiful example of pivotal data/service integration.

The video at the bottom of the http://eclipse.org/lyo/ page could have been done
by Google (promoting RDF without ever mentioning it. ;-)). It made me remember:
http://www.youtube.com/watch?v=TJfrNo3Z-DU ... unfortunately, IMHO, Google
bought them. I do not see the Freebase data dumps growing as massively as they
could (being Google). But then... why share? Let's all give Google more data via
schema.org and maybe they'll give it back... in HTML :-/ Oops, ...

Babel is not 'active' anymore, AFAICT. I did not want to let it die, so I've
'stolen' it and put it on GitHub [2] (it is also using Apache Jena now). It's
much more limited, as I've only spent a few hours on it.

You just have two interfaces to implement to add a new tabular data format:
https://github.com/castagna/babel2/blob/master/apis/src/main/java/org/apache/jena/babel2/BabelReader.java
https://github.com/castagna/babel2/blob/master/apis/src/main/java/org/apache/jena/babel2/BabelWriter.java
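
To give a flavour of what such a plug-in does (this is a deliberately
simplified, hypothetical sketch, not the actual BabelReader/BabelWriter
signatures in the repository), a reader for a new tabular format only has to
parse its input into the common in-memory model:

  import java.io.BufferedReader;
  import java.io.Reader;

  import org.apache.jena.rdf.model.Model;
  import org.apache.jena.rdf.model.Resource;
  import org.apache.jena.vocabulary.RDF;

  // Hypothetical, simplified reader: one RDF resource per CSV row, one property
  // per column. The namespace is made up; the point is only the shape of the plug-in.
  public class SimpleCsvReader {

      private static final String NS = "http://example.org/csv#";

      public void read(Reader input, Model model) throws Exception {
          BufferedReader in = new BufferedReader(input);
          String headerLine = in.readLine();
          if (headerLine == null) return;
          String[] headers = headerLine.split(",");

          String line;
          int row = 0;
          while ((line = in.readLine()) != null) {
              String[] cells = line.split(",");
              Resource item = model.createResource(NS + "row" + (row++));
              item.addProperty(RDF.type, model.createResource(NS + "Row"));
              for (int i = 0; i < headers.length && i < cells.length; i++) {
                  item.addProperty(model.createProperty(NS, headers[i].trim()), cells[i].trim());
              }
          }
      }
  }

The corresponding writer does the inverse walk over the model; once both exist,
the new format can be converted to and from every other format Babel already
knows about.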

The SemanticType.java interface is trying to capture the 'semantic' axis:
https://github.com/castagna/babel2/blob/master/apis/src/main/java/org/apache/jena/babel2/SemanticType.java
Currently, there is only GenericType.java, which implements SemanticType and
represents a sort of 'tabular' data. But nothing stops you from adding more, or
more complex, SemanticTypes: for example, you could represent graph data instead
of tables, or go one level up and represent people, cars, etc., or one level
further up and represent knowledge domains such as "food" or "sport".
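
As a hypothetical illustration of that 'semantic axis' (again, not the actual
SemanticType contract in the repository), a semantic type can be little more
than a marker that readers and writers declare, so the system can tell tabular
data, graph data and whole knowledge domains apart when it pairs them up:

  // Hypothetical sketch: readers/writers declare the kind of data they handle,
  // independently of the serialization format they read or write.
  interface SemanticKind {
      String label();
  }

  final class TabularKind implements SemanticKind {
      public String label() { return "tabular data"; }
  }

  final class GraphKind implements SemanticKind {
      public String label() { return "graph data"; }
  }

  final class DomainKind implements SemanticKind {
      private final String domain; // e.g. "food" or "sport"
      DomainKind(String domain) { this.domain = domain; }
      public String label() { return "knowledge domain: " + domain; }
  }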

To conclude, the pivotal approach to data conversion/integration keeps the cost
of adding new serialization formats or new data formats low and manageable. Each
time you add a new data format the overall value of your integration software
grows (quadratically? adding the (N+1)-th format opens up 2*N new conversion
paths).

This approach can be applied independently of RDF or OWL; there is nothing magic
about RDF or OWL. However, RDF gives you a powerful and flexible data model
which can easily be adopted at the core of such systems, and OWL (as well as
SPARQL, or other tools such as SPIN) gives you powerful ways to transform your
data.

Something that was thankless can become almost pleasant. ;-)

Paolo

 [1] http://en.wikipedia.org/wiki/Data_conversion#Pivotal_conversion
 [2] https://github.com/castagna/babel2/ (feel free to fork it if you find it
useful, and send a pull request if you improve it)

Re: Pivotal data conversion/integration... (Was: Mapping Ontologies)

Posted by Bernie Greenberg <bs...@basistech.com>.
Paolo, thank you so much for these insights and pointers.  Perhaps
"thankless" (which implies that ardent efforts were unrewarded by others)
was not even the right word.  What I was trying to convey was my
discouragement at finding the ubiquity and variety of databases, each
claiming authenticity, breadth, veracity, and ease-of-use, produced by
independent entities, revealing no schema/ontology consistency with the
others (or reason to expect any).  It reminded me of my days as a
schoolboy, long before the internet, when students resorted to libraries to
research papers, and had to read four or five different encyclopaedias and
search shelves for books written perhaps decades and continents apart, none
of which were expected to have any commonality with the others except an
accurate reporting of, say, Napoleon's rout of the Prussians at Jena, the
eye-structure of an owl, or the way to multiply two matrices.   Each
knowledge-source had its own schema, and part of learning to learn in the
old way was to learn to master and union them to acquire such knowledge.
If you were a graduate student, you might even have to read works in
natural languages not your native one.

The web is a very different thing. The combination of Wikipedia and Google
is amassing authority daily, and it scares me, a one-source knowledge-mart
for all except serious researchers/doctoral students.   The crowd-sourced
or aggregated RDF info hubs resemble human, learned know-it-alls in
different towns and different countries who don't even speak the same
language, and aren't concerned about each other.  While Wikipedia seems the
evolutionary product of a forerunner decade of personal sites about single
facets of the world, dbpedia is nothing like it in scope, accomplishment,
or usefulness, representing only the "info-boxes", and (I have found) even
they are wildly inconsistent in schema, reflecting crowd sourcing.  The
dream of an RDF database with all the real "knowledge" in Wikipedia, which
would put to rest all the other putative know-it-all RDF DBs as Wikipedia
has "John's Count Leo Tolstoy Site", remains elusive; there is not enough
free labor to schematize and RDFize that much information, and computer
text understanding has not yet reached the level where *reliable *facts can
be automatically harvested from prose and awarded the authority needed
(yes, I am aware of projects and techniques that try to do that).

The "Utopian RDF vision" is clearly the total and automatic integration of
all available RDF knowledge-sources to produce the SPARQL parallel of
Google, complete with all the flaws of crowd-sourcing and
hidden/mixed-reliability-models and dangers that afflict the world's de
facto search engine.  Clearly, that is what that group Rodrigo named is
working on, and every other explorer in their morions and caravels on the
uncharted sea of crowd-sourced RDF.

Enough of my wasting the time of this working readership.  Thanks, Paolo.
My code continues to work well.

Bernie


Re: Pivotal data conversion/integration... (Was: Mapping Ontologies)

Posted by Rodrigo Jardim <ro...@gmail.com>.
Paolo,
your tips were very useful for me.

Thanks very much

--
Rodrigo


