Posted to dev@stanbol.apache.org by Rupert Westenthaler <ru...@gmail.com> on 2013/09/27 15:51:51 UTC

Re: FST Linking Engine (STANBOL-1128)

Hi all

here is an update on the FST Linking engine.

Over the last weeks I have invested a lot of time in further
improvements to the engine and in contributions to the upstream project
SolrTextTagger [1].
Since STANBOL-1153 the Solr schemas used by the Entityhub are compatible
with the FST linking engine. That means that Entityhub Sites using the
new schema can be used with the FST linking engine. With STANBOL-1155
the dbpedia default data index is also compatible with the FST linking
engine.

The FST linking engine now also supports recalculating FST models
after changes to the SolrCore (e.g. when you add/update/delete an
Entity of an Entityhub ManagedSite). Note, however, that this is an
expensive operation that can take some time. While a new model is
being built the engine keeps using the old one. Rebuilding is done in
lowest-priority background threads.
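
The rebuild behaviour described above (keep serving the old model, build
the new one in a lowest-priority background thread, then switch) can be
sketched as follows. This is only an illustration of the general pattern:
the class ModelHolder and its methods are hypothetical names and are not
taken from the Stanbol or SolrTextTagger code.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

/** Hypothetical holder for a model that is rebuilt in the background. */
final class ModelHolder<M> {

    /** The model currently used to answer requests. */
    private final AtomicReference<M> current;

    /** Single daemon worker thread running at the lowest priority. */
    private final ExecutorService rebuilder =
            Executors.newSingleThreadExecutor(r -> {
                Thread t = new Thread(r, "model-rebuild");
                t.setPriority(Thread.MIN_PRIORITY);
                t.setDaemon(true);
                return t;
            });

    ModelHolder(M initial) {
        current = new AtomicReference<>(initial);
    }

    /** Readers always see a complete model: the old one until the swap. */
    M get() {
        return current.get();
    }

    /** Builds a new model off the request path and swaps it in when done. */
    Future<?> rebuild(Supplier<M> builder) {
        return rebuilder.submit(() -> current.set(builder.get()));
    }
}
```

Callers that need to wait for the new model can call get() on the
returned Future; all other callers simply keep using the old model until
the swap has happened.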

The engine depends on SolrTextTagger 1.2-SNAPSHOT. Until Pull Request
17 [4] is merged, the build requires my branch [5]. I hope that
David Smiley agrees with me that version 1.2 can be released soon and
added to Maven Central. When this is done I will add the FST
linking engine to the default build and the launchers, and will also
add integration tests for it. Unit tests are already present.

Because the engine is now also easier to use for "custom vocabularies"
(vocabularies with typically 10k-500k entities), I have run some
benchmark tests that compare the FST linking engine with the current
EntityLinkingEngine.

While the tests with Freebase (36 million entities) showed about 5
times better performance, the gains for such smaller vocabularies were
in the range of 50-100 times. The reason is that FST linking can be
done fully in memory for vocabularies of that size, while the Solr
query based EntityLinkingEngine sees only minimal performance gains
for smaller vocabularies.

See the detailed test results below:

On Fri, Aug 23, 2013 at 6:18 PM, Rupert Westenthaler
<ru...@gmail.com> wrote:
> Initial Performance Tests:
> ----
>
> I ran a test on my MacBook Pro (Core i7 2.6GHz, SSD), sending
> 5k dbpedia long abstracts with 10 concurrent threads using the Enhancer
> Stress Test Tool [3] to chains that included Language detection,
> OpenNLP Token, Sentence and POS tagging and
>
> (A) FST linking engine configured for Freebase with a Document Cache
> size of 1 million  vs.
> (B) EntityLinking engine also configured for freebase.
>
> with
>
> (A) average of 70ms for FST linking (with 100% CPU)
> (B) average of 390ms for EntityLinking
>
> when doing the test with ProperNoun linking deactivated (basically
> also linking Common Nouns to simulate longer texts) it gives the
> following results:
>
> (A) average of 267ms for FST linking (with 100% CPU)
> (B) average of 1417ms for EntityLinking
>
> In both cases the FST linking engine is about 5 times faster than the
> currently used EntityLinking engine.
>

I made some additional tests with smaller vocabularies, especially those
where all Entities can be cached in the LRU cache for SolrDocuments.

Setup:

* Hardware: exactly the same as for the initial tests.
* 10k dbpedia long abstracts with 10 concurrent threads
* Vocabulary: the dbpedia default data index (~25k entities).
* Label: instead of "rdfs:label", linking was done against
"dbpedia-ont:surfaceForm". This property contains the label of
the Entity as well as all labels of Redirects to that entity.

(A) FST linking with a cache that can hold all Entities (a feasible
config for vocabularies with fewer than 1 million entities)
(B) EntityLinking engine
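
For readers unfamiliar with the term, an LRU cache of the kind mentioned
for SolrDocuments can be sketched with java.util.LinkedHashMap. This is
a generic illustration and not the cache implementation used by the
engine; the class name LruCache is hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical minimal LRU cache; not the engine's actual document cache. */
final class LruCache<K, V> extends LinkedHashMap<K, V> {

    private final int maxEntries;

    LruCache(int maxEntries) {
        // accessOrder = true: iteration order goes from least recently
        // to most recently accessed entry
        super(16, 0.75f, true);
        this.maxEntries = maxEntries;
    }

    /** Called by put(); evicts the least recently used entry when full. */
    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;
    }
}
```

If maxEntries is at least the number of entities in the vocabulary,
nothing is ever evicted and every document is served from memory after
its first load, which is the situation configuration (A) aims for. Note
that LinkedHashMap is not thread safe; with concurrent requests it would
need external synchronization, e.g. Collections.synchronizedMap.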

With ProperNoun Linking configuration:

(A) average of 5ms for FST linking
(B) average of 339ms for EntityLinking

When configured to link all Nouns:

(A) average of 7ms for FST linking
(B) average of 994ms for EntityLinking
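
The speed-up factors implied by the averages above can be checked with a
few lines of code. The class Speedup is only a helper for this
calculation; the numbers are the ones reported in this mail.

```java
import java.util.Locale;

/** Helper for recomputing speed-up factors from the reported averages. */
final class Speedup {

    /** How many times faster FST linking is than the baseline engine. */
    static double factor(double baselineMs, double fstMs) {
        return baselineMs / fstMs;
    }

    public static void main(String[] args) {
        // Freebase (initial tests)
        print("Freebase, ProperNouns", factor(390, 70));
        print("Freebase, all Nouns", factor(1417, 267));
        // dbpedia default data index (~25k entities)
        print("dbpedia default, ProperNouns", factor(339, 5));
        print("dbpedia default, all Nouns", factor(994, 7));
    }

    private static void print(String label, double factor) {
        System.out.printf(Locale.ROOT, "%s: %.1fx%n", label, factor);
    }
}
```

This gives roughly 5.6 and 5.3 for Freebase, and roughly 68 and 142 for
the dbpedia default data index, so the all-Nouns case is even somewhat
above the 50-100 range mentioned earlier.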


best
Rupert

>
>
> [1] https://github.com/OpenSextant/SolrTextTagger/
> [2] http://svn.apache.org/repos/asf/stanbol/trunk/enhancement-engines/lucenefstlinking/README.md
> [3] http://stanbol.apache.org/docs/trunk/utils/enhancerstresstest
[4] https://github.com/OpenSextant/SolrTextTagger/pull/17
[5] https://github.com/westei/SolrTextTagger

>
> --
> | Rupert Westenthaler             rupert.westenthaler@gmail.com
> | Bodenlehenstraße 11                             ++43-699-11108907
> | A-5500 Bischofshofen


