Posted to dev@stanbol.apache.org by Rupert Westenthaler <ru...@gmail.com> on 2011/06/03 08:10:20 UTC

Generic RDF indexing configuration and indexing of IPTC news Codes (was: How to add an rdf/skos file in entity hub)

Hi

As part of the work on STANBOL-187 [1], I extended the Entityhub
Indexing tool with two new features:

1. The Indexing Utility now creates an OSGi bundle that contains all
the configurations needed to configure instances of the components
necessary to add the indexed dataset to the Entityhub. Together with
the BundleInstaller extension [2] to the Sling Installer, this makes
it easy to install those services by adding this bundle to an OSGi
environment running the Apache Stanbol Entityhub.

2. I added a generic configuration for RDF files [3]. See the
README.md for more details. It also includes useful mappings
("mapping.txt") for many common ontologies (RDF, RDFS, OWL, DC
elements and terms, SKOS ...).

To test the generic RDF configuration I used the IPTC NewsCodes
(http://www.iptc.org/site/NewsCodes/). During that work I discovered
that the SKOS versions include several errors that need to be
corrected. In addition, the use of the "Accept-Language" header makes
it challenging (for normal users) to download the controlled
vocabulary in all available languages.
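
As a rough illustration of the download problem: when a server selects
the language from the "Accept-Language" header, a client has to repeat
the request once per language. The Python sketch below shows that
pattern. The URL, the language list and the "application/rdf+xml"
media type are placeholder assumptions, not values taken from the IPTC
site, and the function names are hypothetical.

```python
import urllib.request

# Placeholder values (assumptions): replace with the real scheme URL
# and the languages the server actually offers.
SCHEME_URL = "http://example.org/newscodes/some-scheme"
LANGUAGES = ["en", "de", "fr"]


def build_request(url, language):
    """Create a request that asks the server for one specific language."""
    return urllib.request.Request(
        url,
        headers={
            "Accept": "application/rdf+xml",  # assumed RDF media type
            "Accept-Language": language,
        },
    )


def build_requests(url, languages):
    """Return one prepared request per language, keyed by language code."""
    return {lang: build_request(url, lang) for lang in languages}


def download_all(url, languages):
    """Fetch the vocabulary once per language; returns {language: bytes}."""
    results = {}
    for lang, request in build_requests(url, languages).items():
        with urllib.request.urlopen(request) as response:
            results[lang] = response.read()
    return results
```

Calling download_all(SCHEME_URL, LANGUAGES) would perform the actual
network requests; the per-language results would still have to be
merged into one file before indexing.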

So if someone is interested in using the IPTC NewsCodes, you can
download this example from [4]. This archive contains:

* README.md with detailed information on the necessary changes to IPTC
SKOS files, the creation of the index and the installation of the
index
* indexing.properties file with the data for IPTC. For all the other
configurations the defaults of the generic RDF indexer are used.
* The corrected SKOS files for the six "Descriptive NewsCodes".

I have not (yet) added this to Stanbol, because I am not sure the
license used by the IPTC [5] is compatible with the Apache license,
and adding this without the SKOS files does not really make sense as
long as the original versions contain the errors mentioned above.

best
Rupert Westenthaler

[1] https://issues.apache.org/jira/browse/STANBOL-187
[2] https://issues.apache.org/jira/browse/STANBOL-140
[3] http://svn.apache.org/repos/asf/incubator/stanbol/trunk/entityhub/indexing/genericrdf/
[4] http://www.salzburgresearch.at/~rwesten/stanbol/stanbol-entityhub-iptc-indexing-config.zip
[5] http://www.iptc.org/goto/ipp

---------- Forwarded message ----------
From: florent andré <fl...@4sengines.com>
Date: Wed, Jun 1, 2011 at 6:38 PM
Subject: Re: How to add an rdf/skos file in entity hub
To: stanbol-dev@incubator.apache.org


Sweeeeeeet ! :) Looking forward to trying this out.

This will be a great tool !

PS: I found this dataset that may interest some of you:
http://www.mpi-inf.mpg.de/yago-naga/yago/

++

On 06/01/2011 04:40 PM, Rupert Westenthaler wrote:
>
> Hi
>
> Based on your Request I have worked the last two days on several
> improvements of the Indexing Tool.
> Most importantly, the Indexing Util now directly creates a Bundle that,
> when installed in the Entityhub, will create all the necessary
> Entityhub components to use the indexed RDF data as a Referenced Site.
>
> I have also created the generic RDF configuration with a lot of
> additional documentation.
>
> I am currently working on some final things. So expect to see the
> stuff in the SVN tomorrow.
>
> best
> Rupert Westenthaler
>
> On Wed, Jun 1, 2011 at 10:32 AM, Olivier Grisel
> <ol...@ensta.org>  wrote:
>>
>> 2011/6/1 Florent André<fl...@apache.org>:
>>>
>>> Hi Rupert,
>>>
>>> Thanks for your valuable answers !
>>>
>>> In fact, if I get it now, indexing in the entity hub is not just
>>> about building an index, but about creating a new (offline) entity hub.
>>>
>>> You said :
>>>>
>>>> The Solr Yard provides better performance especially for big Datasets.
>>>
>>> ...
>>>>
>>>> The Clerezza  is fine for smaller data sets.
>>>
>>> Do you have a "magic number" (a vague one will be fine :) ) that defines
>>> the limit for a big dataset ?
>>
>> The SolrYard implementation should be pretty scalable (tens or
>> hundreds of millions of entities). The ClerezzaYard will suffer from a
>> limitation though. It won't scale to more than a couple of
>> thousand entities as long as the following is not fixed:
>>
>>  https://issues.apache.org/jira/browse/CLEREZZA-466
>>
>> --
>> Olivier
>> http://twitter.com/ogrisel - http://github.com/ogrisel
>>
>
>
>



-- 
| Rupert Westenthaler             rupert.westenthaler@gmail.com
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen