Posted to dev@stanbol.apache.org by Rupert Westenthaler <ru...@gmail.com> on 2011/08/01 10:55:37 UTC

Fwd: [jira] [Created] (STANBOL-303) EntityFetch engine

Hi

I am writing this on the list because JIRA is not so prominent and
some interested people might otherwise miss this discussion.

There are some comments inline and a longer discussion of the pros
and cons at the end.

---------- Forwarded message ----------
From: Florent ANDRE (JIRA) <ji...@apache.org>
Date: Thu, Jul 28, 2011 at 9:22 PM
Subject: [jira] [Created] (STANBOL-303) EntityFetch engine
To: stanbol-commits@incubator.apache.org


EntityFetch engine
------------------

                Key: STANBOL-303
                URL: https://issues.apache.org/jira/browse/STANBOL-303
            Project: Stanbol
         Issue Type: Improvement
         Components: Enhancer
           Reporter: Florent ANDRE


> Hi,
>
> I extracted the "entity fetching" related code from the TaxonomyLinking engine and created a new engine based on it.

What do you use as input for this Engine? See (1)

> I also made the query.addSelectedField() calls configurable via the Felix configuration.
>

+1
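
For readers following this on the list, something along these lines
is how I imagine such a configuration being read. The configuration
key and the class name are assumptions of mine; only
query.addSelectedField() is taken from Florent's description:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.stanbol.entityhub.servicesapi.query.FieldQuery;
    import org.osgi.service.component.ComponentContext;

    public class SelectedFieldsConfig {

        /** hypothetical configuration key, the name is an assumption */
        public static final String SELECTED_FIELDS =
                "org.apache.stanbol.enhancer.engines.entityfetch.selectedFields";

        private final List<String> selectedFields = new ArrayList<String>();

        /** to be called from the engine's activate(ComponentContext) method */
        public void parse(ComponentContext ctx) {
            Object value = ctx.getProperties().get(SELECTED_FIELDS);
            if (value == null) {
                return; // keep the engine's defaults
            }
            // assuming a comma separated value as entered in the Felix web console
            for (String field : value.toString().split(",")) {
                field = field.trim();
                if (!field.isEmpty()) {
                    selectedFields.add(field);
                }
            }
        }

        /** adds the configured fields to the query before it is executed */
        public void apply(FieldQuery query) {
            for (String field : selectedFields) {
                query.addSelectedField(field);
            }
        }
    }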

> This engine is runnable at the ServiceProperties.ORDERING_EXTRACTION_ENHANCEMENT position.
>
Entity lookup might need additional metadata that is currently not
present in the ContentItem's metadata; see (2)

> I see 2 advantages of such an engine :
> 1) users can develop an extraction engine without thinking about entity retrieval

That is the biggest pro argument for splitting up Text Analysis and
Entity Lookup into two engines and using ContentItem.getMetadata() as
the abstraction layer!

> 2) if this engine provides a helpful lib, entity fetching can easily be embedded into users' engines, limiting code duplication for entity fetching.

I would rather provide only a tailored API / library for entity
fetching, as typically needed by enhancement engines. See (3)

> Could it be an interesting engine for trunk ?
> ++

First let me say that it was not my plan to keep all this
functionality within the TaxonomyLinkingEngine. The idea was to start
by implementing everything in a single class to validate the approach
(performance and result wise) and, if the results are promising, to
refactor the implementation to be more generic.

In fact I have already started making the TextAnalysis part more
generic: basically building a simple API for TextAnalysis based on
OpenNLP that is tailored to the needs of enhancement engines. This
will be part of the org.apache.stanbol.commons.opennlp bundle.
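
Just to illustrate what such an API could look like, here is a
minimal sketch. The class and method names are only assumptions of
mine (nothing of this exists yet); the OpenNLP calls are the regular
ones:

    import opennlp.tools.postag.POSModel;
    import opennlp.tools.postag.POSTaggerME;
    import opennlp.tools.sentdetect.SentenceDetectorME;
    import opennlp.tools.sentdetect.SentenceModel;
    import opennlp.tools.tokenize.TokenizerME;
    import opennlp.tools.tokenize.TokenizerModel;

    /** hypothetical helper as it could live in o.a.stanbol.commons.opennlp */
    public class TextAnalyzer {

        private final SentenceDetectorME sentenceDetector;
        private final TokenizerME tokenizer;
        private final POSTaggerME posTagger;

        public TextAnalyzer(SentenceModel sm, TokenizerModel tm, POSModel pm) {
            this.sentenceDetector = new SentenceDetectorME(sm);
            this.tokenizer = new TokenizerME(tm);
            this.posTagger = new POSTaggerME(pm);
        }

        /** splits the parsed text into sentences */
        public String[] getSentences(String text) {
            return sentenceDetector.sentDetect(text);
        }

        /** tokenizes a single sentence */
        public String[] getTokens(String sentence) {
            return tokenizer.tokenize(sentence);
        }

        /** POS tags the tokens of a sentence */
        public String[] getPosTags(String[] tokens) {
            return posTagger.tag(tokens);
        }
    }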

As mentioned in this issue, the same would make sense for the entity
retrieval part. So a big +1 from my side.


(1) TextAnalyses results:

The amount of data resulting from the text analyses varies a lot.
If you use NER (Named Entity Recognition) you get only a limited
number of results that can easily be converted to an RDF graph and
added to the metadata of the ContentItem.
However if you want to use words, POS tagging and the Chunker, the
amount of resulting information is much higher. Encoding all of this
as RDF and adding it to the metadata may have performance and
usability implications.
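
For the NER case, adding the results to the metadata boils down to a
few triples per detected entity. A rough sketch (the annotation URI
would normally be created via the EnhancementEngineHelper; here it is
simply passed in, and the class itself is made up):

    import org.apache.clerezza.rdf.core.LiteralFactory;
    import org.apache.clerezza.rdf.core.MGraph;
    import org.apache.clerezza.rdf.core.UriRef;
    import org.apache.clerezza.rdf.core.impl.PlainLiteralImpl;
    import org.apache.clerezza.rdf.core.impl.TripleImpl;

    public class NerToRdf {

        private static final String FISE = "http://fise.iks-project.eu/ontology/";
        private static final UriRef RDF_TYPE =
                new UriRef("http://www.w3.org/1999/02/22-rdf-syntax-ns#type");

        /** adds one fise:TextAnnotation for a detected named entity */
        public static void addTextAnnotation(MGraph metadata, UriRef annotation,
                String selectedText, int start, int end) {
            LiteralFactory lf = LiteralFactory.getInstance();
            metadata.add(new TripleImpl(annotation, RDF_TYPE,
                    new UriRef(FISE + "TextAnnotation")));
            metadata.add(new TripleImpl(annotation, new UriRef(FISE + "selected-text"),
                    new PlainLiteralImpl(selectedText)));
            metadata.add(new TripleImpl(annotation, new UriRef(FISE + "start"),
                    lf.createTypedLiteral(Integer.valueOf(start))));
            metadata.add(new TripleImpl(annotation, new UriRef(FISE + "end"),
                    lf.createTypedLiteral(Integer.valueOf(end))));
        }
    }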

Assuming a text with 2000 words one could expect about 20
TextEnhancements when using NER, but 200+ chunks with 500+ words with
interesting POS tags. Performance wise this will make the processing
of the metadata in follow-up engines slower, but it will also require
some functionality - a post-processing engine - that allows filtering
most of such enhancements before the results are sent back to the
user.

If both text analyses and entity lookup are done in the same engine
it is much easier to optimize: e.g. the TaxonomyLinkingEngine
processes the content sentence by sentence. Therefore only the text
analysis results of the current sentence need to be kept in memory,
and TextAnnotations are only created for words/chunks that are linked
to an Entity.
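
As a minimal sketch of this pattern (reusing the TextAnalyzer sketched
above; processSentence() is just a placeholder for the lookup and
linking logic):

    /**
     * Processes the content sentence by sentence so that only the
     * analysis results of the current sentence are kept in memory.
     */
    public void process(String text, TextAnalyzer analyzer) {
        for (String sentence : analyzer.getSentences(text)) {
            String[] tokens = analyzer.getTokens(sentence);
            String[] posTags = analyzer.getPosTags(tokens);
            // look up entities only for the current sentence and create
            // TextAnnotations only for words/chunks linked to an Entity
            processSentence(sentence, tokens, posTags);
            // tokens and posTags can be garbage collected after this point
        }
    }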

(2) Using the Taxonomy to improve TextAnalyses results

First tests (with English language texts) have shown that POS tagging
works very well, but the performance of the Chunker is questionable.
In general, building chunks manually based on POS tags worked much
better in most cases. Based on that I assume that in most cases the
best approach would be to

1. use words and POS tags as input
2. build chunk proposals based on the POS tags (see the sketch below)
3. look up Entities with all nouns of the proposed chunk. All such
nouns would be optional (this was the reason for implementing
STANBOL-297)
4. based on the returned Entities, search for the best match in the
surrounding text (even outside of the proposed chunk)

However to implement step 4 the entity fetching part would need
access to the results of the word tokenizer (2000 TextAnnotations for
a document with 2000 words).
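
A minimal sketch of one possible heuristic for step 2: grouping
consecutive nouns based on their POS tags. The NN* prefix check
assumes the Penn Treebank tag set used by the English OpenNLP models;
the class itself is made up:

    import java.util.ArrayList;
    import java.util.List;

    public class ChunkProposals {

        /** a proposed chunk: the token indexes [start, end) of consecutive nouns */
        public static class Chunk {
            public final int start;
            public final int end;
            public Chunk(int start, int end) {
                this.start = start;
                this.end = end;
            }
        }

        /** groups consecutive nouns (Penn Treebank tags NN, NNS, NNP, NNPS) */
        public static List<Chunk> propose(String[] posTags) {
            List<Chunk> chunks = new ArrayList<Chunk>();
            int start = -1;
            for (int i = 0; i < posTags.length; i++) {
                boolean noun = posTags[i].startsWith("NN");
                if (noun && start < 0) {
                    start = i; // a new chunk proposal starts here
                } else if (!noun && start >= 0) {
                    chunks.add(new Chunk(start, i));
                    start = -1;
                }
            }
            if (start >= 0) { // a chunk at the end of the sentence
                chunks.add(new Chunk(start, posTags.length));
            }
            return chunks;
        }
    }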


(3) APIs for TextAnalyses and EntityFetching tailored to the
requirements of EnhancementEngine Developers

Because of this my conclusion was that it would be best to first
work on APIs that ease the development of Engines that

* need to analyze natural language text
* need to look up entities from the Entityhub

So basically, in the case of the TaxonomyLinking Engine there would
still be only a single Engine, but the amount of code would be
greatly reduced because it could use the tailored APIs for
TextAnalyses and EntityFetching.
In addition the NER engine (enhancer.engines.opennlp.ner) should also
be changed to use the new TextAnalyses API, and the NamedEntityTagging
engine (enhancer.engines.entitytagging) should use the EntityFetching
API.

Basically this would mean that such APIs would support the
development of engines that do both text analyses and entity lookup,
as well as engines that do only one of the two.
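
Purely as an illustration of what the entity fetching side of such an
API could offer (all names here are made up, nothing of this exists
yet):

    import java.util.Collection;
    import java.util.List;
    import org.apache.stanbol.entityhub.servicesapi.model.Representation;

    /** hypothetical API: roughly what engines typically need from the Entityhub */
    public interface EntityFetcher {

        /** configures the fields to be dereferenced for found entities */
        void setSelectedFields(Collection<String> fields);

        /** looks up entities whose labels match any of the (optional) terms */
        List<Representation> findEntities(String language, String... optionalTerms);

        /** dereferences a single entity by its URI */
        Representation getEntity(String uri);
    }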

I have planned to work on the text analyses part in the coming time;
however, because I will be on vacation the whole of August, I would
not expect immediate results ^^

(4) Improving Metadata infrastructure

In my opinion the best solution would be to split up text analyses
and entity fetching into separate engines. However this would require
improving the way metadata are handled by the Enhancer infrastructure.

This would include:

* processing of chunks (e.g. pages, sections, sentences ...) to
reduce the amount of data that has to be handled at a time. This
would also improve the processing of big documents. Have you ever
tried to send a PDF with 80+ pages to the Enhancer?
* filtering of enhancements so that users do not get enhancements
that are only interesting during the enhancement process (unless they
explicitly request such intermediate results).

WDYT
Rupert


-- 
| Rupert Westenthaler             rupert.westenthaler@gmail.com
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen