Posted to dev@stanbol.apache.org by Rupert Westenthaler <ru...@gmail.com> on 2015/06/18 10:00:43 UTC

IXA pipe nerc integration with Apache Stanbol

Hi Rodrigo

I took your mail address from the contact section of the GitHub page.
Feel free to forward this to the whole team if you like.

First I want to thank the whole IXA pipes team for providing all
those extensions and especially the high-quality NER models for OpenNLP.
As a German speaker I especially enjoy using the German NER model.

With this message I want to let you know about the integration of the
IXA pipe nerc module with Apache Stanbol (see STANBOL-1422 [1]).

The main contribution of this integration is code that allows your
models for, and extensions to, OpenNLP to be used in applications
running within an OSGI environment (such as Apache Stanbol).

The remainder of this (long) mail will provide details on how to make
IXA pipes nerc work within an OSGI environment.

- - -

OpenNLP as such does support OSGI. However, for extensions to be
usable in OSGI they need to support OSGI as well. As the OpenNLP
documentation on how to achieve this is very sparse, I will provide
all the necessary information in this mail, based on the example of
integrating the `eus.ixa:ixa-pipe-nerc` module with Apache Stanbol.

Typically OpenNLP will load those extensions via the classpath, based
on information provided by the `manifest.properties` contained in the
`{model}.bin` file.

Here is an example taken from the `it-clusters-evalita09.bin` file of
the 1.5.0 release:

    Training-Eventhash=d41d8cd98f00b204e9800998ecf8427e
    Manifest-Version=1.0
    Language=it
    serializer-class-itwikiprecleantokc1000p1txt=eus.ixa.ixa.pipe.nerc.dict.BrownCluster$BrownClusterSerializer

The last line tells OpenNLP to use the class
`eus.ixa.ixa.pipe.nerc.dict.BrownCluster$BrownClusterSerializer`,
which implements the `ArtifactSerializer` interface, to load the
artifact with the name `itwikiprecleantokc1000p1txt` contained in the
model.
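
To make the naming convention concrete, here is a minimal sketch that
reads such a manifest with plain `java.util.Properties` and lists the
artifact-to-serializer mappings. It is not OpenNLP's own loading code;
the class name `ManifestSerializers` and the helper
`serializerClasses` are made up for this illustration, and the
`serializer-class-` prefix is simply taken from the example above.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

/** Illustrative helper (hypothetical name), not part of OpenNLP. */
class ManifestSerializers {

    // Key prefix as it appears in the manifest example above.
    static final String PREFIX = "serializer-class-";

    /** Returns a map of artifact name to serializer class name. */
    static Map<String, String> serializerClasses(String manifestText) {
        Properties manifest = new Properties();
        try {
            manifest.load(new StringReader(manifestText));
        } catch (IOException e) {
            // Not expected when reading from an in-memory string.
            throw new UncheckedIOException(e);
        }
        Map<String, String> result = new TreeMap<>();
        for (String key : manifest.stringPropertyNames()) {
            if (key.startsWith(PREFIX)) {
                // The rest of the key is the artifact name inside the model.
                result.put(key.substring(PREFIX.length()),
                        manifest.getProperty(key));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String manifestText = String.join("\n",
                "Training-Eventhash=d41d8cd98f00b204e9800998ecf8427e",
                "Manifest-Version=1.0",
                "Language=it",
                "serializer-class-itwikiprecleantokc1000p1txt="
                        + "eus.ixa.ixa.pipe.nerc.dict.BrownCluster$BrownClusterSerializer");
        // Print every artifact together with the class expected to load it.
        serializerClasses(manifestText).forEach(
                (artifact, cls) -> System.out.println(artifact + " -> " + cls));
    }
}
```

Note that the class is referenced by its binary name (with `$` for the
nested class), which is also what `Class.getName()` returns.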

When running within OSGI, OpenNLP cannot load such extensions via the
classpath, because in OSGI modules only have access to packages they
explicitly import. In OSGI, extensions are typically handled by
registering them as services with the OSGI ServiceRegistry. This is
also how OpenNLP searches for extensions when it is used within an
OSGI framework.

To use the IXA NERC models within an OSGI environment one needs to do
two things:

1. provide an OSGI bundle
2. register the extensions as OSGI services.

(1) is done by the pom.xml file of the
o.a.stanbol.commons.ixa-pipe-nerc module [2]. The configuration of the
`maven-bundle-plugin` does all the work.

The `Import-Package` section defines ignored, optional and required
dependencies. I used this to ignore `net.sourceforge.argparse4j` and
to make the dependencies on `org.jdom2` and the
`opennlp.tools.cmdline` package optional, as those seem to be required
only for training and not at runtime.

The `Export-Package` section defines the functionality this bundle
provides to other modules. Currently all `eus.ixa.ixa.pipe.nerc`
packages are exported. Note also that the `maven-bundle-plugin` will
copy all classes of exported packages into the bundle. This is how the
code provided by `eus.ixa:ixa-pipe-nerc` gets into the bundle.

The `Private-Package` section defines functionality that stays private
to this bundle. This includes the BundleActivator I implemented to
register your extensions as services (see below).

(2) The registration of the IXA pipes nerc extensions as OSGI services
is done by a BundleActivator. The class implementing the
BundleActivator is configured by the `Bundle-Activator` directive in
the `maven-bundle-plugin` configuration.
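
To tie the four directives together, a `maven-bundle-plugin`
configuration along the lines described above could look roughly like
the fragment below. This is a sketch reconstructed from the
description in this mail, not a copy of the actual pom.xml [2]; the
activator class name is inferred from the path in [3], and the plugin
version and other details are omitted.

```xml
<plugin>
  <groupId>org.apache.felix</groupId>
  <artifactId>maven-bundle-plugin</artifactId>
  <extensions>true</extensions>
  <configuration>
    <instructions>
      <!-- class name inferred from [3]; an assumption of this sketch -->
      <Bundle-Activator>org.apache.stanbol.commons.ixa.pipe.nerc.Activator</Bundle-Activator>
      <Import-Package>
        <!-- "!" ignores a package; resolution:=optional marks it optional -->
        !net.sourceforge.argparse4j.*,
        org.jdom2.*;resolution:=optional,
        opennlp.tools.cmdline.*;resolution:=optional,
        *
      </Import-Package>
      <!-- exported packages are also copied into the bundle -->
      <Export-Package>eus.ixa.ixa.pipe.nerc.*</Export-Package>
      <!-- package of the activator; stays private to the bundle -->
      <Private-Package>org.apache.stanbol.commons.ixa.pipe.nerc</Private-Package>
    </instructions>
  </configuration>
</plugin>
```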

The BundleActivator itself [3] is simple. It has two lifecycle methods,
`start(BundleContext context)` and `stop(BundleContext context)`. The
start method registers all the extensions as OSGI services. The stop
method unregisters them.

In OSGI, services are registered under 1..n names (typically the class
names of the service interfaces) together with a dictionary of
parameters. Service consumers can use these parameters to query for
services. OpenNLP requires the class name of the service
implementation as the value of the `opennlp` parameter.

So a typical ServiceRegistration looks like

        Dictionary<String,Object> prop = new Hashtable<String,Object>();
        prop.put("opennlp", Word2VecClusterSerializer.class.getName());
        registeredServices.add(context.registerService(
                ArtifactSerializer.class.getName(),
                new Word2VecClusterSerializer(),
                prop));

Here `registeredServices` is a set holding references to all
registered services, which is needed to unregister them in the
`stop(BundleContext context)` method.
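
The register-on-start, unregister-on-stop pattern and the lookup by
the `opennlp` parameter can be sketched without an OSGI framework on
the classpath. The example below deliberately uses small stand-in
types instead of the real `org.osgi.framework` classes so that it runs
on its own; `ToyRegistry`, `ToyActivator`, `Serializer` and
`ClusterSerializer` are all hypothetical names, and the real
ServiceRegistry is of course much richer than this.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Toy model of the pattern; NOT the real OSGI API. */
class ToyOsgiSketch {

    /** Stand-in for the handle returned when a service is registered. */
    interface Registration {
        void unregister();
    }

    /** Stand-in for the small part of a service registry used here. */
    static final class ToyRegistry {
        private static final class Entry {
            final String serviceName;
            final Object service;
            final Map<String, Object> props;

            Entry(String serviceName, Object service, Map<String, Object> props) {
                this.serviceName = serviceName;
                this.service = service;
                this.props = props;
            }
        }

        private final List<Entry> entries = new ArrayList<>();

        Registration register(String serviceName, Object service,
                Map<String, Object> props) {
            Entry entry = new Entry(serviceName, service, new HashMap<>(props));
            entries.add(entry);
            // The handle removes exactly this entry again.
            return () -> entries.remove(entry);
        }

        /** Finds a service by interface name and "opennlp" parameter value. */
        Optional<Object> find(String serviceName, String implClassName) {
            return entries.stream()
                    .filter(e -> e.serviceName.equals(serviceName))
                    .filter(e -> implClassName.equals(e.props.get("opennlp")))
                    .map(e -> e.service)
                    .findFirst();
        }
    }

    /** Hypothetical extension interface and implementation. */
    interface Serializer { }

    static final class ClusterSerializer implements Serializer { }

    /** Mirrors the start/stop lifecycle described above. */
    static final class ToyActivator {
        private final List<Registration> registered = new ArrayList<>();

        void start(ToyRegistry registry) {
            Map<String, Object> props = new HashMap<>();
            // The implementation class name is the lookup key.
            props.put("opennlp", ClusterSerializer.class.getName());
            registered.add(registry.register(
                    Serializer.class.getName(), new ClusterSerializer(), props));
        }

        void stop() {
            // Unregister everything that start() registered.
            registered.forEach(Registration::unregister);
            registered.clear();
        }
    }

    public static void main(String[] args) {
        ToyRegistry registry = new ToyRegistry();
        ToyActivator activator = new ToyActivator();
        String iface = Serializer.class.getName();
        String impl = ClusterSerializer.class.getName();

        activator.start(registry);
        System.out.println("after start: " + registry.find(iface, impl).isPresent());
        activator.stop();
        System.out.println("after stop:  " + registry.find(iface, impl).isPresent());
    }
}
```

Keeping the registration handles in the activator, rather than looking
the services up again on shutdown, is what lets `stop` remove exactly
what `start` added.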

With this in place it is now possible to successfully load all the IXA
pipe nerc models of the 1.5.0 release with OpenNLP when running in an
OSGI environment.

If you would like to directly support OSGI in one of your next
releases, feel free to ask. I am sure I can find some time to help out
with bringing this feature directly to the IXA pipe nerc (and possibly
the other) modules.

best
Rupert


[1] https://issues.apache.org/jira/browse/STANBOL-1422
[2] http://svn.apache.org/repos/asf/stanbol/trunk/commons/ixa-pipe-nerc/pom.xml
[3] http://svn.apache.org/repos/asf/stanbol/trunk/commons/ixa-pipe-nerc/src/main/java/org/apache/stanbol/commons/ixa/pipe/nerc/Activator.java

-- 
| Rupert Westenthaler             rupert.westenthaler@gmail.com
| Bodenlehenstraße 11                              ++43-699-11108907
| A-5500 Bischofshofen

Re: IXA pipe nerc integration with Apache Stanbol

Posted by Rodrigo Agerri <ra...@apache.org>.
Hello Rupert,

On Thu, Jun 18, 2015 at 10:00 AM, Rupert Westenthaler
<ru...@gmail.com> wrote:
> Hi Rodrigo
>
> I took your mail address from the contact section of the GitHub page.
> Feel free to forward this to the whole team if you like.

I am the team :)

>
> First I want to thank the whole IXA pipes team for providing all
> those extensions and especially the high-quality NER models for OpenNLP.
> As a German speaker I especially enjoy using the German NER model.

Glad to hear, thanks.

>
> With this message I want to let you know about the integration of the
> IXA pipe nerc module with Apache Stanbol (see STANBOL-1422 [1]).

I am very happy to hear that a nice project such as Stanbol is
interested in using it.

> The main contribution of this integration is code that allows your
> models for, and extensions to, OpenNLP to be used in applications
> running within an OSGI environment (such as Apache Stanbol).

I am also an Apache OpenNLP committer. I develop/use the ixa pipes for
my academic research and our own university projects. I see ixa-pipes
as ready-to-use tools, whereas I see OpenNLP as a library to create
your own NLP projects via its API, which is what I do for some of the
pipes, namely the NER. As you know, ixa-pipe-nerc uses the machine
learning components of OpenNLP (which are very nice IMHO) to provide
customized models built on top of the OpenNLP infrastructure. Beyond
this, whenever it is feasible I try to contribute any specific
development of ixa pipes to OpenNLP, besides contributing in other
ways, of course. For example, the clustering features implemented in
ixa-pipe-nerc, which are responsible for the good performance of the
models, have been contributed to OpenNLP [1], [2], [3]. Other aspects,
such as the ixa-pipe-nerc dictionary features, have not been
contributed because there are already DictionaryFeatures (although
different) implemented in OpenNLP.

To cut a long story short, I have no problem whatsoever in helping
with the OSGI support to make it easier for Stanbol to integrate the
ixa-pipe-nerc models. However, following from what I said, I can
also train NERC models in OpenNLP native mode with a configuration
similar to the ones I distribute in ixa-pipe-nerc; "similar"
because some of the features, namely gazetteers, are not implemented
in OpenNLP.

Please let me know which option would be more interesting for the
Apache Stanbol project.

[1] https://issues.apache.org/jira/browse/OPENNLP-714
[2] https://issues.apache.org/jira/browse/OPENNLP-715
[3] https://issues.apache.org/jira/browse/OPENNLP-716

Best,

Rodrigo