Posted to dev@stanbol.apache.org by Rupert Westenthaler <ru...@gmail.com> on 2011/03/01 15:45:51 UTC

Redesigning the Entityhub Configuration

Based on my own experience and the feedback of some (very) friendly
early adopters, the main issue with the Stanbol Entityhub is its
complex configuration. This is acceptable for trying it out in the
lab, but it is increasingly becoming a show stopper because it keeps
people away from using it.
In short - the goal must be to make the Entityhub as easy to install
as the Stanbol Enhancer. That means providing a runnable jar that can
be used without any additional configuration steps.
The main focus of my work in the coming weeks will be to achieve
exactly that for the Stanbol Entityhub.

Current State:

Currently the configuration of the Stanbol Entityhub is very complex
because it requires configuring a lot of OSGi components and
connecting them to each other. This has the following shortcomings:
 - The user needs to know how to configure components (e.g. to use the
SPARQL Endpoint for the SPARQL Dereferencer and the LOD Endpoint for
the Cool URI Dereferencer when configuring a Referenced Site)
 - The user needs to remember the IDs of the components to correctly
relate them to each other
 - The user needs to understand all the components and their roles to
even know that they have to be connected with each other
 - The user needs to keep track of dependent components when changing
the configuration of a component
 - The Apache Felix Web Console is a cool interface for configuring
single components, but does not really support the management of such
dependencies
 - Some components (e.g. the SolrYard) also require a connection to an
external service or need to point to specific files on the local hard
disk.

To give an example, here are the steps needed to set up geonames.org
as a ReferencedSite by using a local index instead of the remote
SPARQL Endpoint:
 1) download or create a local index of this site
 2) set up a SolrServer or provide the Solr index + configuration
needed to run an EmbeddedSolrServer (described by [1])
 3) configure a SolrYard instance that points to the SolrServer (also
described by [1])
 4) configure a Cache instance and connect it to the configured
SolrYard (no documentation)
 5) create a ReferencedSite instance that uses the cache and also an
EntityDereferencer as fallback if the cache is not available
(described by [2]).

Even to set up the Entityhub with a minimal configuration one needs to
complete the following three steps (as described by [3]):
 1) create and configure a Yard used by the Entityhub to store its data
 2) configure the Entityhub (especially linking it to the Yard
created in step 1)
 3) configure at least one ReferencedSite (because without any
referenced site there will be no Entities to work with)

Getting this right might take the average user - even with very good
documentation - several hours, which is way too much for typical users
who just want to try a new technology.


Planned Changes:

(1) Automatic configuration of the Core Framework

This includes the configuration of the Entityhub, the Yard used by the
Entityhub to store its data and the Jersey Endpoint.

The Entityhub will come with a default configuration. In case no
configuration is present, the default will be set via the OSGi
ConfigAdmin Service or by using default values for all required
properties.
The Yard instance required by the Entityhub also needs to be
instantiated if not available. Here the plan is to use the Yard
implementation with the highest service rank. In case initialization
based on this implementation fails, the implementation with the next
highest rank will be tried, and so on until one succeeds.
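
As a rough illustration of this fallback, here is a minimal,
self-contained sketch. It is not Entityhub code: YardCandidate, its
ranking field, the init callback and the candidate names are all
hypothetical stand-ins. In a real OSGi setup the rank would presumably
come from the "service.ranking" service property.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;
import java.util.function.Supplier;

/** Hypothetical sketch: use the first Yard implementation that initialises, highest rank first. */
class YardFallback {

    /** Stand-in for a registered Yard implementation (names are assumptions, not Entityhub API). */
    static final class YardCandidate {
        final String name;
        final int ranking;            // would map to the OSGi "service.ranking" property
        final Supplier<Object> init;  // creates the Yard; may throw, e.g. if Solr is unavailable

        YardCandidate(String name, int ranking, Supplier<Object> init) {
            this.name = name;
            this.ranking = ranking;
            this.init = init;
        }
    }

    /** Tries candidates in descending rank order and returns the name of the first that works. */
    static Optional<String> selectYard(List<YardCandidate> candidates) {
        List<YardCandidate> sorted = new ArrayList<>(candidates);
        sorted.sort(Comparator.comparingInt((YardCandidate c) -> c.ranking).reversed());
        for (YardCandidate candidate : sorted) {
            try {
                candidate.init.get();
                return Optional.of(candidate.name);
            } catch (RuntimeException e) {
                // Initialisation failed: continue with the next highest rank.
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        List<YardCandidate> candidates = List.of(
            new YardCandidate("inMemoryYard", 0, Object::new),
            new YardCandidate("solrYard", 100, () -> {
                throw new IllegalStateException("no Solr server reachable");
            }));
        // The higher ranked candidate fails, so the lower ranked one is selected.
        System.out.println(selectYard(candidates).orElse("none"));
    }
}
```

Returning an empty Optional when every candidate fails leaves it to the
caller to decide whether the Entityhub should refuse to start or wait
for another Yard implementation to appear.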

(2) Configuration of ReferencedSites

Configuring ReferencedSites is tricky because, depending on the actual
configuration, it requires configuring a lot of different components
and linking them together.
The current idea is to support two options for configuring Referenced Sites:
 a) A configuration file: This should be the best option in cases where
one does not require a local cache with preloaded information.
 b) A bundle (archive) that contains not only the configuration but
also a local index.
In both cases Felix FileInstall can be used to dynamically load
(and initialize) the configuration as soon as the user copies or
updates it within a special directory.

As far as I know, with (a) one can only provide the configuration for a
single component per config file. In that case one would need to define
a new component (e.g. "ReferencedSiteConfig") that is responsible for
creating and configuring all the necessary components (ReferencedSite,
EntityDereferencer, EntitySearcher, Cache and Yard).
For (b) I am thinking about using a BundleActivator that first
initializes the files for the local index and then loads the
configuration from within the bundle and passes it to the ConfigAdmin.
From that point on the initialization would be the same as for (a).
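
To make the idea of such a "ReferencedSiteConfig" component more
concrete, the following sketch expands one site-level configuration
into the configurations of the dependent components and fills in the
IDs that link them. It is only an illustration under assumptions: the
class name, all property keys ("id", "yardId", "cacheId",
"indexLocation", "accessUri", "dereferencerUri") and the derived
component IDs are invented here and are not the real Entityhub
configuration keys. EntitySearcher is left out to keep it short.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: expand one site-level configuration into per-component configurations. */
class ReferencedSiteConfigExpander {

    /**
     * Returns component name -> properties. All property keys and component ids below are
     * invented for illustration; they are not the real Entityhub configuration keys.
     */
    static Map<String, Map<String, String>> expand(Map<String, String> site) {
        String id = site.get("id");
        if (id == null || id.isEmpty()) {
            throw new IllegalArgumentException("site configuration needs an 'id'");
        }
        String yardId = id + "Yard";
        String cacheId = id + "Cache";

        Map<String, Map<String, String>> components = new LinkedHashMap<>();

        Map<String, String> yard = new LinkedHashMap<>();
        yard.put("id", yardId);
        yard.put("indexLocation", site.getOrDefault("indexLocation", "indexes/" + id));
        components.put("Yard", yard);

        Map<String, String> cache = new LinkedHashMap<>();
        cache.put("id", cacheId);
        cache.put("yardId", yardId);   // the link the user currently has to enter by hand
        components.put("Cache", cache);

        Map<String, String> referencedSite = new LinkedHashMap<>();
        referencedSite.put("id", id);
        referencedSite.put("cacheId", cacheId);
        if (site.containsKey("accessUri")) {
            // Optional remote fallback (EntityDereferencer) if the cache is unavailable.
            referencedSite.put("dereferencerUri", site.get("accessUri"));
        }
        components.put("ReferencedSite", referencedSite);
        return components;
    }

    public static void main(String[] args) {
        Map<String, String> site = new LinkedHashMap<>();
        site.put("id", "geonames");
        expand(site).forEach((component, props) -> System.out.println(component + " " + props));
    }
}
```

In an OSGi environment each of the resulting maps would then probably
be handed to the ConfigAdmin, for example via
ConfigurationAdmin.createFactoryConfiguration(...) followed by
Configuration.update(...), so that the user only ever edits the single
site-level file.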

(3) Updates of local caches/indexes

The update of local indexes is another important configuration task
that needs to be done by users. Currently the idea is to use the same
files as described for (2b). However, in that case the initialization
would need to detect existing configurations so that it does not
override them (only the index data would need to be updated, but not
possible changes to the configurations of the components).
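
The "detect and do not override" behaviour could look roughly like the
sketch below. It is a simplified illustration, not an implementation:
the plain map called store is a stand-in for the ConfigAdmin, and the
class, method and property names are hypothetical. With the real
ConfigAdmin the existence check could presumably be done with
ConfigurationAdmin.listConfigurations(...), which returns null if
nothing matches.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: install bundled defaults without overriding what the user changed. */
class SiteConfigInstaller {

    /**
     * 'store' stands in for the ConfigAdmin (pid -> properties); it is an assumption made
     * to keep the example self-contained. Returns true only if the defaults were written.
     */
    static boolean installIfAbsent(Map<String, Map<String, String>> store,
                                   String pid, Map<String, String> bundledDefaults) {
        if (store.containsKey(pid)) {
            return false; // keep the existing (possibly user-edited) configuration
        }
        store.put(pid, new HashMap<>(bundledDefaults));
        return true;
    }

    public static void main(String[] args) {
        Map<String, Map<String, String>> store = new HashMap<>();
        Map<String, String> defaults = Map.of("cacheId", "geonamesCache");

        installIfAbsent(store, "geonames", defaults);       // first install writes the defaults
        store.get("geonames").put("cacheId", "myCache");     // the user edits the configuration
        boolean rewritten = installIfAbsent(store, "geonames", defaults); // index update runs again

        System.out.println(rewritten + " " + store.get("geonames"));
    }
}
```

The index files themselves would still be replaced on every update;
only the component configurations are protected by this check.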


Summary:

When all this is finished it should be possible to double click a
runnable jar containing the Entityhub and immediately start to use it.
Adding new sites will be possible by downloading prepared configurations
and simply copying them into a configuration directory. For changing the
configuration of already installed ReferencedSites the Apache Felix
WebConsole is used.


So that's the plan for now. If someone has any comments, tips,
experience in implementing functionality like that, nice code
examples ... I would be very thankful! Most of this is put
together based on examples I found on sites like
http://www.osgilook.com/ and I am still wondering if that is the way
to go.

best
Rupert Westenthaler

[1] http://wiki.iks-project.eu/index.php/SolrYardConfiguration
[2] http://wiki.iks-project.eu/index.php/ReferencedSiteConfiguration
[3] http://wiki.iks-project.eu/index.php/RickInstallation


-- 
| Rupert Westenthaler             rupert.westenthaler@gmail.com
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen

Re: Redesigning the Entityhub Configuration

Posted by Fabian Christ <ch...@googlemail.com>.
Hi,

2011/3/1 Rupert Westenthaler <ru...@gmail.com>:
> Adding new sites will be possible to download prepared configurations
> and simple copy them into a configuration directory. For changing the
> configuration of already installed ReferencedSites the Apache Felix
> WebConsole is used.

I like the idea of having pre-bundled Entityhub configurations
available for download. If the configuration is really such a complex
task, some experts can do it and we can build up a library of available
configurations for end users.

Best,
 - Fabian

-- 
Fabian

Re: Redesigning the Entityhub Configuration

Posted by Fabian Christ <ch...@googlemail.com>.
Hi,

I just wondered if we also need a more user-friendly UI for
configuring Stanbol. Having the OSGi console is nice for developers,
but it is very technical and overloaded with OSGi details that you
should not have to understand if you just want to configure the
Entityhub.

In my experience, (default) configuration in software products should
follow these rules:

1) provide defaults that are fine for most use cases
2) provide an easy to use configuration interface
3) configure without requiring a system restart
4) have a strategy for updating configurations with new product versions

Just my 2 cents. IMO Rupert is on the right track here ;)

Best,
 - Fabian

-- 
Fabian

Re: Redesigning the Entityhub Configuration

Posted by Olivier Grisel <ol...@ensta.org>.
Hi Rupert and others,

I agree with your analysis: it is very important to have a standalone
launcher that is up and running with the following 4 commands:

svn co http://svn.apache.org/repos/asf/incubator/stanbol/trunk/ stanbol
cd stanbol
mvn install
java -jar launchers/standalone/target/org.apache.stanbol-*.jar -p 8080

And then have a sample curl command to demonstrate the lookup by name,
field, similarity with text context and sparql query (if supported by
the default configuration).

In order to package the default data to be included as part of such a
"standalone" distribution of stanbol, I have already opened a jira
issue:

  https://issues.apache.org/jira/browse/STANBOL-90

I also created a subtask dedicated to the packaging of the index of
the entityhub:

  https://issues.apache.org/jira/browse/STANBOL-92

I also fixed the task for a first packaging of the opennlp models:

  https://issues.apache.org/jira/browse/STANBOL-91

This was implemented by adding a new toplevel component:
org.apache.stanbol.defaultdata:

  http://svn.apache.org/repos/asf/incubator/stanbol/trunk/defaultdata/

As you can see the version number is fixed (not a -SNAPSHOT) to avoid
downloading the data every time when not necessary. Also, currently
this project is just a placeholder with a shell script to download the
data from external websites (the sourceforge server for opennlp models
for instance). A pre-built maven artifact including the data is
currently made available through the nuxeo.org public maven repository
at:

  https://maven.nuxeo.org/nexus/content/repositories/vendor-releases/org/apache/stanbol/org.apache.stanbol.defaultdata/

As for the packaging of the default OSGi configuration I do not really
know. Maybe you should ask on the sling and felix user mailing lists?

One potential solution would be to make the sling launcher deploy some
default OSGi config files into the initial sling/config/ folder
created at the first start of the launcher. I don't know how to do
that though.

The sling launcher documentation mentions the Configuration Admin
Service that manages configuration updates for ManagedService or
ManagedServiceFactory instances but it does not tell how to provide
some default configuration files for such services.

http://sling.apache.org/site/configuration.html

-- 
Olivier