Posted to commits@stanbol.apache.org by "Rupert Westenthaler (JIRA)" <ji...@apache.org> on 2011/09/07 12:27:10 UTC

[jira] [Commented] (STANBOL-303) EntityFetch engine

    [ https://issues.apache.org/jira/browse/STANBOL-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13098833#comment-13098833 ] 

Rupert Westenthaler commented on STANBOL-303:
---------------------------------------------

Hi Florent

Yesterday I started a 3rd attempt to implement the TaxonomyLinkingEngine in a modular fashion.
So far this looks much better than the 1st (the current version in SVN) and the 2nd.

The basic idea is similar to what is described in this issue. The main component iterates over the text that has already been analyzed by the TextAnalyzer [1] and looks up Entities via a "Taxonomy" interface. I will provide a default implementation of the Taxonomy interface based on the Entityhub, but one could also provide an implementation based on an in-memory representation (e.g. for smaller Taxonomies).
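
To illustrate the pluggable look-up described above, here is a minimal sketch of an in-memory variant. The names TaxonomySketch, InMemoryTaxonomySketch and lookup are hypothetical and chosen for illustration only; they are not the actual Taxonomy interface, whose signature is not shown in this comment.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical look-up abstraction (assumption, not the real Stanbol interface). */
interface TaxonomySketch {
    /** Returns the labels of all entities that contain the given word. */
    Set<String> lookup(String word);
}

/** In-memory variant, e.g. for smaller taxonomies. */
class InMemoryTaxonomySketch implements TaxonomySketch {
    // lower-cased word -> labels of the entities containing that word
    private final Map<String, Set<String>> index = new HashMap<>();

    InMemoryTaxonomySketch(Collection<String> labels) {
        for (String label : labels) {
            for (String word : label.trim().toLowerCase(Locale.ROOT).split("\\s+")) {
                index.computeIfAbsent(word, k -> new TreeSet<>()).add(label);
            }
        }
    }

    @Override
    public Set<String> lookup(String word) {
        Set<String> hits = index.get(word.toLowerCase(Locale.ROOT));
        return hits == null ? Collections.<String>emptySet() : hits;
    }
}
```

An Entityhub-backed implementation would expose the same method but delegate the look-up to a query instead of a local map.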

The following features will be supported:
  - finding Entities with multiple words (e.g. "Apache Stanbol", "Rupert Westenthaler")
  - excluding Entities with multiple words if only a single word matches (e.g. "Apache Stanbol" and "Apache Sling" for "Apache"; "Rupert Westenthaler" and "Rupert Murdoch" for "Rupert").
  - support for POS (Part-of-Speech) tags: e.g. look up only nouns if users are interested in Named Entities, Concepts ...; look up only verbs, as required for an Engine as described by STANBOL-322. The presence of POS tags in the Analyzed Content is optional. If no POS tags are available, then all words need to be processed.
  - support for Chunks: skip words outside of chunks; skip/process chunks based on type. The presence of Chunk tags in the Analyzed Content is optional. If no Chunks are available, then no words of the text can be skipped.
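
The multi-word and POS rules listed above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code that will be committed: the class LinkingSketch, the Token type and the link method are hypothetical, nouns are assumed to carry Penn-Treebank-style tags starting with "N", and chunk handling is left out.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

class LinkingSketch {
    /** A word with an optional POS tag (null if the text was not POS-tagged). */
    static final class Token {
        final String text;
        final String pos;
        Token(String text, String pos) { this.text = text; this.pos = pos; }
    }

    /** Returns the labels whose words ALL occur consecutively in the token list. */
    static List<String> link(List<Token> tokens, Collection<String> labels) {
        List<String> found = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            String pos = tokens.get(i).pos;
            // Only skip a word if a POS tag is present and it is not a noun;
            // without POS tags every word has to be processed.
            if (pos != null && !pos.startsWith("N")) continue;
            String best = null;
            int bestLen = 0;
            for (String label : labels) {
                String[] words = label.trim().split("\\s+");
                // Prefer the longest label that fully matches at this position.
                if (words.length > bestLen && matchesAt(tokens, i, words)) {
                    best = label;
                    bestLen = words.length;
                }
            }
            if (best != null) {
                found.add(best);
                i += bestLen - 1; // continue after the matched span
            }
        }
        return found;
    }

    /** True if every word of the label matches the tokens starting at 'start'. */
    private static boolean matchesAt(List<Token> tokens, int start, String[] words) {
        if (start + words.length > tokens.size()) return false;
        for (int j = 0; j < words.length; j++) {
            if (!tokens.get(start + j).text.equalsIgnoreCase(words[j])) return false;
        }
        return true;
    }
}
```

Because a label is accepted only if all of its words match, "Rupert" on its own suggests neither "Rupert Westenthaler" nor "Rupert Murdoch", while "Apache Stanbol" in the text is found as a single Entity.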

My current plan is to commit this code within the TaxonomyLinkingEngine bundle, but in the end it will be best to create a separate module out of it, such as an EntityFetch engine.

As soon as I am ready to commit a first version (hopefully in the coming days) I will post an update here.

best
Rupert Westenthaler

[1] http://svn.apache.org/repos/asf/incubator/stanbol/trunk/commons/opennlp/src/main/java/org/apache/stanbol/commons/opennlp/TextAnalyzer.java

> EntityFetch engine
> ------------------
>
>                 Key: STANBOL-303
>                 URL: https://issues.apache.org/jira/browse/STANBOL-303
>             Project: Stanbol
>          Issue Type: Improvement
>          Components: Enhancer
>            Reporter: Florent ANDRE
>
> Hi,
> I extracted the "entity fetching" related code from the taxonomylinking engine and created a new engine based on it.
> I also made query.addSelectedField() configurable via the Felix configuration.
> This engine is runnable in ServiceProperties.ORDERING_EXTRACTION_ENHANCEMENT position.
> I see 2 advantages of such an engine : 
> 1) users can develop an extraction engine without thinking about entity retrieval
> 2) if this engine provides a helpful lib, entity fetching can easily be embedded into a user's engine, limiting code duplication for entity fetching.
> Could it be an interesting engine for trunk ?
> ++
