Posted to dev@stanbol.apache.org by Rupert Westenthaler <ru...@gmail.com> on 2012/11/21 16:55:30 UTC

The KeywordLinkingEngine and the Stanbol NLP processing module (STANBOL-740)

Hi all,

After about two months the Stanbol NLP module (STANBOL-733) is ready to
be merged with the trunk. As part of this work the
KeywordLinkingEngine was adapted to make use of the new features
(STANBOL-740). However, as soon as these changes are merged back
into the trunk, they will affect currently used Stanbol configurations.

Currently the minimal Enhancement Chain used for the
KeywordLinkingEngine looks as follows:

    {language-detection}
    {keyword-linking}

where the KeywordLinkingEngine does all of the tokenizing, sentence
detection, POS tagging and chunking, and finally the linking with the
vocabulary.

With the Stanbol NLP module this will change. The KeywordLinkingEngine
in the stanbol-nlp-processing branch is now only concerned with the
linking task. All the text processing steps are done by other
EnhancementEngines. However, this also means that the minimal/typical
Enhancement Chain used for the KeywordLinkingEngine changes quite
dramatically:

    {language-detection}
    {sentence-detection} (optional)
    {tokenizing}
    {phrase-detection} (optional)
    {keyword-linking}

So even though the existing configurations for the trunk version of the
KeywordLinkingEngine still work with the branch version, users will
need to adapt their Enhancement Chain configurations as soon as the
Stanbol NLP processing is re-integrated with the trunk.
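
To make the difference between the two chains concrete, here is a
minimal sketch in plain Java. It is not Stanbol code: the class
ChainCheck and its method are hypothetical, and the step names are
simply the placeholders from the two lists above with the braces
dropped. It reports which non-optional steps are missing in front of
the linking step:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not part of Stanbol.
final class ChainCheck {

    // Steps not marked "(optional)" in the new chain (placeholder names).
    static final List<String> REQUIRED =
            Arrays.asList("language-detection", "tokenizing");

    // Returns the required steps that do not appear before the linking step.
    // If the linking step is absent, the whole chain is inspected.
    static List<String> missingBefore(List<String> chain, String linkingStep) {
        int idx = chain.indexOf(linkingStep);
        List<String> preceding = idx < 0 ? chain : chain.subList(0, idx);
        List<String> missing = new ArrayList<>();
        for (String step : REQUIRED) {
            if (!preceding.contains(step)) {
                missing.add(step);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        List<String> oldChain = Arrays.asList("language-detection", "keyword-linking");
        List<String> newChain = Arrays.asList("language-detection", "sentence-detection",
                "tokenizing", "phrase-detection", "keyword-linking");
        System.out.println(missingBefore(oldChain, "keyword-linking")); // [tokenizing]
        System.out.println(missingBefore(newChain, "keyword-linking")); // []
    }
}
```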

So basically there are two options to deal with that:

(1) Reintegrate the KeywordLinkingEngine and break existing
Enhancement Chains: While this will affect most Stanbol users, it will
be easily recognized because the affected Chains will no longer provide
the expected results. The fix is also relatively easy, because current
chains would only need to be extended by the four new OpenNLP-based
NLP processing engines.

    {possible-other-engines-like-tika}
    {language-detection}
    opennlp-sentence
    opennlp-token
    opennlp-pos
    opennlp-chunker
    {keyword-linking}
    {possible-other-engines-like-refactor}

(2) Change the name (and artifactId) of the KeywordLinkingEngine in
the branch and reintegrate it as a new Engine (e.g. as
EntityLinkingEngine). The KeywordLinkingEngine in the trunk would then
be deprecated and, after the next release of Stanbol, moved to the
/contrib folder. While this would ensure that current configurations
do not become invalid, it would also make it likely that Stanbol
users keep using an outdated engine. In addition, users would
need to adapt all KeywordLinkingEngine configurations to the new
EntityLinkingEngine.
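
The chain fix described under option (1) can be sketched in a few
lines of plain Java. This is only an illustration under assumptions:
ChainMigration and migrate are hypothetical names, a chain is modelled
as a plain ordered list of engine names, and "language-detection" and
"keyword-linking" stand in for whatever names a real configuration
uses. Only the four opennlp-* names are taken from the list above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not part of Stanbol.
final class ChainMigration {

    // The four OpenNLP-based engines named in option (1).
    static final List<String> NLP_ENGINES = Arrays.asList(
            "opennlp-sentence", "opennlp-token", "opennlp-pos", "opennlp-chunker");

    // Returns a copy of the chain in which every NLP engine that is not yet
    // present is inserted directly before the linking engine.
    // A chain without the linking engine is returned unchanged.
    static List<String> migrate(List<String> chain, String linkingEngine) {
        List<String> result = new ArrayList<>(chain);
        int idx = result.indexOf(linkingEngine);
        if (idx < 0) {
            return result;
        }
        List<String> missing = new ArrayList<>();
        for (String engine : NLP_ENGINES) {
            if (!result.contains(engine)) {
                missing.add(engine);
            }
        }
        result.addAll(idx, missing);
        return result;
    }

    public static void main(String[] args) {
        List<String> oldChain = Arrays.asList("language-detection", "keyword-linking");
        // prints [language-detection, opennlp-sentence, opennlp-token,
        //         opennlp-pos, opennlp-chunker, keyword-linking] on one line
        System.out.println(migrate(oldChain, "keyword-linking"));
    }
}
```

Engines placed before or after the linking step (such as the tika or
refactor placeholders above) keep their relative position.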

Personally I see advantages and disadvantages in both solutions and do
not have a clear preference, so I would really appreciate feedback
on this.

best
Rupert


--
| Rupert Westenthaler             rupert.westenthaler@gmail.com
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen

Re: The KeywordLinkingEngine and the Stanbol NLP processing module (STANBOL-740)

Posted by Rupert Westenthaler <ru...@gmail.com>.
Hi all

The refactoring is completed (for now) - see STANBOL-812 [1].
Documentation is already online on the staging server:

* EntityhubLinkingEngine [2]: This is the direct successor of the
KeywordLinkingEngine
* EntityLinkingEngine [3]: This is the "generic" implementation of
entity linking based on the NLP processing API [4]

There will be a second refactoring step to make the EntityLinkingEngine
fully independent of the Stanbol Entityhub. This will not affect
public APIs, Chain configurations or Enhancement results,
so it can be done after reintegration with the trunk.

Thanks for the feedback
best
Rupert

[1] https://issues.apache.org/jira/browse/STANBOL-812
[2] http://stanbol.staging.apache.org/docs/trunk/components/enhancer/engines/entityhublinking
[3] http://stanbol.staging.apache.org/docs/trunk/components/enhancer/engines/entitylinking
[4] http://stanbol.staging.apache.org/docs/trunk/components/enhancer/nlp/


Re: The KeywordLinkingEngine and the Stanbol NLP processing module (STANBOL-740)

Posted by Rupert Westenthaler <ru...@gmail.com>.
On Thu, Nov 22, 2012 at 12:10 PM, Bertrand Delacretaz
<bd...@apache.org> wrote:
>
> Isn't the "hub" part an implementation detail?
>
> EntityLinkingEngine sounds better to me - but no strong opinion,
> whoever does the work decides.

Good point. While refactoring the code I came to the same conclusion.

Currently I have

(1) "EntityLinkingEngine": This is the class implementing the
EnhancementEngine interface; it is registered as an OSGi service, and
(2) "EntityhubLinkingEngine": The OSGi component that gets the
configuration, registers a ServiceTracker for the Entityhub Site and
registers the "EntityLinkingEngine" instance as soon as all the
required services are available.

The goal of this is to make it really easy to implement a
"MyServiceLinkingEngine". Even with my current refactoring we are not yet
there, but it is getting much better.
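
The split between (1) and (2) can be sketched roughly as follows. This
is a simplified illustration in plain Java, not the actual Stanbol or
OSGi code: LinkingEngine, SimpleLinkingEngine and LinkingComponent are
hypothetical names, a plain List stands in for the OSGi service
registry, and the siteAdded/siteRemoved callbacks stand in for what a
ServiceTracker would deliver. Unregistering the engine when the site
goes away is an assumption as well.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the engine interface (role of (1)).
interface LinkingEngine {
    String getName();
}

// The generic engine: it only needs a name and some lookup service.
final class SimpleLinkingEngine implements LinkingEngine {
    private final String name;
    private final Object site; // would be used for entity lookups

    SimpleLinkingEngine(String name, Object site) {
        this.name = name;
        this.site = site;
    }

    @Override
    public String getName() {
        return name;
    }
}

// The component (role of (2)): holds the configuration, waits for the
// required service and registers the engine once it is available.
final class LinkingComponent {
    private final String engineName;            // from the configuration
    private final List<LinkingEngine> registry; // stands in for the service registry
    private Object site;                        // the tracked required service
    private LinkingEngine registered;

    LinkingComponent(String engineName, List<LinkingEngine> registry) {
        this.engineName = engineName;
        this.registry = registry;
    }

    // Called when the required service becomes available.
    void siteAdded(Object newSite) {
        this.site = newSite;
        update();
    }

    // Called when the required service goes away.
    void siteRemoved() {
        this.site = null;
        update();
    }

    private void update() {
        if (site != null && registered == null) {
            registered = new SimpleLinkingEngine(engineName, site);
            registry.add(registered);
        } else if (site == null && registered != null) {
            registry.remove(registered);
            registered = null;
        }
    }

    public static void main(String[] args) {
        List<LinkingEngine> registry = new ArrayList<>();
        LinkingComponent component = new LinkingComponent("my-linking", registry);
        System.out.println(registry.size()); // 0
        component.siteAdded(new Object());
        System.out.println(registry.size()); // 1
    }
}
```

With such a split, a "MyServiceLinkingEngine" would only need its own
component that supplies a different lookup service to the same engine
class.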

best
Rupert




Re: The KeywordLinkingEngine and the Stanbol NLP processing module (STANBOL-740)

Posted by Bertrand Delacretaz <bd...@apache.org>.
On Thu, Nov 22, 2012 at 10:12 AM, Fabian Christ
<ch...@googlemail.com> wrote:
...
> +1 for EntityhubLinkingEngine...

Isn't the "hub" part an implementation detail?

EntityLinkingEngine sounds better to me - but no strong opinion,
whoever does the work decides.

-Bertrand

Re: The KeywordLinkingEngine and the Stanbol NLP processing module (STANBOL-740)

Posted by Fabian Christ <ch...@googlemail.com>.
+1 for option 2) + creating a tag before the merge as Bertrand suggested.

+1 for EntityhubLinkingEngine


-- 
Fabian
http://twitter.com/fctwitt

Re: The KeywordLinkingEngine and the Stanbol NLP processing module (STANBOL-740)

Posted by Rupert Westenthaler <ru...@gmail.com>.
Hi

thanks for the feedback. I think we should go for (2), renaming the
engine. First, because the current name (KeywordExtractionEngine) is
not a good fit anyway: keyword extraction is typically about
finding the central words within a text, whereas the engine is about
linking words with a vocabulary. Second, because there might be some
use cases where it would still make sense to use the old engine in
parallel with the new one - e.g. for extracting product IDs, ISBN
numbers, chemical formulas such as CH3CH2OH ... Third, it is easier to
adapt the documentation - especially the usage scenarios - if there is
a new name for the new engine. And finally, I would also prefer
warnings over errors for users that have not yet adapted to the
new engine.

While Fabian's suggestion would clearly document the change, it would
still break most current Stanbol installations, as most
users currently use the trunk version. However, as soon as we have a
faster release cycle this option will become much more attractive.

I would then suggest using "EntityhubLinkingEngine" as the new name
for the engine, as this name makes it very clear what the engine does.

Thanks for the feedback
best
Rupert


On Thu, Nov 22, 2012 at 12:01 AM, Bertrand Delacretaz
<bd...@apache.org> wrote:
> On Wed, Nov 21, 2012 at 8:46 PM, Fabian Christ
> <ch...@googlemail.com> wrote:
>> ...what about creating a branch from the trunk with the current version
>> (before the merge) that is known to be working? People could switch to that
>> branch to keep the status quo and we should make clear that this branch
>> will not be maintained in the future...
>
> I'd make that just a BEFORE_740 tag then - that makes it clearer that
> this is not supposed to evolve further.
>
> -Bertrand




Re: The KeywordLinkingEngine and the Stanbol NLP processing module (STANBOL-740)

Posted by Bertrand Delacretaz <bd...@apache.org>.
On Wed, Nov 21, 2012 at 8:46 PM, Fabian Christ
<ch...@googlemail.com> wrote:
> ...what about creating a branch from the trunk with the current version
> (before the merge) that is known to be working? People could switch to that
> branch to keep the status quo and we should make clear that this branch
> will not be maintained in the future...

I'd make that just a BEFORE_740 tag then - that makes it clearer that
this is not supposed to evolve further.

-Bertrand

Re: The KeywordLinkingEngine and the Stanbol NLP processing module (STANBOL-740)

Posted by Fabian Christ <ch...@googlemail.com>.
Hi,

what about creating a branch from the trunk with the current version
(before the merge) that is known to be working? People could switch to that
branch to keep the status quo and we should make clear that this branch
will not be maintained in the future. From this branch we could even cut
releases as 0.10.0 components.

Then go with option 1) and make a 1.0.0 release as soon as possible to mark
this milestone.

- Fabian




Re: The KeywordLinkingEngine and the Stanbol NLP processing module (STANBOL-740)

Posted by Bertrand Delacretaz <bd...@apache.org>.
Hi,

On Wed, Nov 21, 2012 at 4:55 PM, Rupert Westenthaler
<ru...@gmail.com> wrote:

>... (2) Change the name (and artifactId) of the KeywordLinkingEngine in
> the branch and reintegrate it as a new Engine (e.g. as
> EntityLinkingEngine). The KeywordLinkingEngine in the trunk would then
> be deprecated and after the next release of Stanbol moved to the
> /contrib folder...

I like that, it's basically backwards compatible and minimizes confusion.

The deprecation of the existing KeywordLinkingEngine could be
expressed by a WARN log message that points users to the new and
improved engine.

-Bertrand
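
The WARN log message suggested above could look roughly like the
following sketch. It uses java.util.logging only to stay
self-contained (its WARNING level plays the role of WARN); the logging
API actually used in the engine may differ. The class
DeprecatedEngineNotice, its methods and the message wording are
hypothetical.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical helper for illustration only; not part of Stanbol.
final class DeprecatedEngineNotice {

    private static final Logger LOG =
            Logger.getLogger(DeprecatedEngineNotice.class.getName());

    // Builds the deprecation message that points users to the new engine.
    static String message(String oldEngine, String newEngine) {
        return "The " + oldEngine + " is deprecated; please switch to the "
                + newEngine + ".";
    }

    // Intended to be called once when the deprecated engine is activated.
    static void warnOnActivation(String oldEngine, String newEngine) {
        LOG.log(Level.WARNING, message(oldEngine, newEngine));
    }

    public static void main(String[] args) {
        warnOnActivation("KeywordLinkingEngine", "EntityLinkingEngine");
    }
}
```

Logging once on activation, rather than on every processed document,
keeps the warning visible without flooding the log.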