Posted to dev@stanbol.apache.org by Rupert Westenthaler <ru...@gmail.com> on 2012/10/30 12:51:11 UTC
Re: Apache Stanbol ( Disambiguation Engine ) proposal and doubts

Hi Juan,

On Mon, Oct 29, 2012 at 11:08 AM, Juan Vargas <ju...@appstylus.com> wrote:
>
> Hello Rupert,
>
> I'm Juan Vargas, a partner of Jairo Sarabia, the guy that few days ago
> posted some doubts in:
> https://issues.apache.org/jira/browse/STANBOL-723#comment-13483432
>

I am currently working on updating the branch so that it is again
compatible with the trunk. I expect this to be finished today. I will
update STANBOL-723 as soon as I am finished.

> I'm part of the Notedlinks team as a developer, where we are really
> interested in the disambiguation texts, and therefore, on the project you're
> working on. We are working with Stanbol since several weeks, and a few
> months, with similar products (Spotlight, Wikipedia Miner), but the problem
> always comes with disambiguation text. We understand that is complex, but we
> are very interested to have test Stanbol.

Disambiguation is for sure a very demanding area. The
disambiguation-mlt engine represents only a first step of Stanbol into
this big domain.

The intention of the "disambiguation-mlt" engine is to provide
disambiguation support for custom vocabularies. It is intended to
operate

* on "weak" contextual information. We do not expect Stanbol Users
that bring their own vocabularies to have a lot of information
available that can be used for disambiguation - at least not in the
beginning
* in domains with few ambiguities. Situations we want to support
include things like: two persons in the CRM have the same name; or
detecting whether an acronym refers to an internal project or to some
external entity (that is not in the linked vocabulary). In the first
case the disambiguation-mlt engine should recommend the correct person
(e.g. based on the company also mentioned in the text). In the second
case the disambiguation-mlt engine should greatly decrease the
confidence if it thinks the acronym does not represent the internal
entity.
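
To make the two cases above a bit more concrete, here is a minimal
sketch in Python. It is NOT the code of the disambiguation-mlt engine;
the Candidate class, the rerank() function, the "penalty" parameter and
the example URIs are all made up for illustration. It only shows the
idea: candidates whose known context shows up in the text keep their
confidence, the others get it greatly decreased.

```python
import re
from dataclasses import dataclass


@dataclass
class Candidate:
    uri: str           # hypothetical entity URI
    confidence: float  # confidence assigned by the linking step
    context: str       # the little text we know about the entity


def tokens(text):
    """Lower-cased word tokens of a text, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def rerank(text, candidates, penalty=0.2):
    """Scale each confidence by how much of the candidate's context
    occurs in the text. With no overlap the confidence is multiplied
    by `penalty`; with full overlap it is left unchanged."""
    text_tokens = tokens(text)
    reranked = []
    for c in candidates:
        ctx = tokens(c.context)
        overlap = len(ctx & text_tokens) / len(ctx) if ctx else 0.0
        factor = penalty + (1.0 - penalty) * overlap
        reranked.append(Candidate(c.uri, c.confidence * factor, c.context))
    return sorted(reranked, key=lambda c: c.confidence, reverse=True)


# Two persons with the same name, told apart by the company in the text.
text = "John Smith from Acme Corp signed the contract."
candidates = [
    Candidate("urn:example:person:smith-globex", 0.8, "Globex Inc"),
    Candidate("urn:example:person:smith-acme", 0.8, "Acme Corp"),
]
ranked = rerank(text, candidates)
```

In this toy example the "Acme" person ends up first and the other
candidate is pushed down. The acronym case works the same way: if none
of the context of the internal project is found in the text, its
confidence drops.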

It is not expected that the disambiguation-mlt engine performs
especially well on general purpose datasets (such as dbpedia.org) as
the context it can use for disambiguation is rather limited at the
moment. If you want to disambiguate DBpedia entities then you should
expect better results with DBpedia Spotlight (BTW you can use DBpedia
Spotlight via Stanbol - see STANBOL-706).

> So, we wonder, how we can help, either by a budget, or a mention in our
> project, or the way you consider on the set-up an installation of Stanbol on
> our servers. Also, would be great if you can share with us a link to check
> and test the results of a text using the disambiguation software you are
> working on. It will help us to understand better how it will be the result
> once it's available. Do you have a date for the first final version roll
> out?

Currently the "disambiguation-mlt" engine is good for demonstration
purposes and for early adopter testing. Code-wise, the main thing needed
for a "roll out" of this engine is to make it configurable. Currently
most of the parameters you would want to tweak via configuration are
still hard-coded as defaults.

The main thing still pending is the validation of the approach based
on typical datasets. Up to now validation was done mainly on DBpedia,
using MLT with the context as query over the full text of the DBpedia
entities (basically the abstract). I would really like to do a
validation that uses MLT based on URIs (basically disambiguating by
matching the URIs of other suggested Entities in the page against the
URIs referenced by Entities in the KnowledgeBase). I am about to create
a DBpedia 3.8 based index that includes the required information and I
am really eager to see the results.
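
As a rough illustration of the URI based idea (again only a sketch, not
what the engine does internally, where this would be an MLT query
against the index): score each candidate by the overlap between the
URIs it references in the knowledge base and the URIs of the other
entities suggested for the same page. The function names, the Jaccard
measure and the example.org URIs below are assumptions chosen for the
example.

```python
def uri_overlap(candidate_refs, page_uris):
    """Jaccard overlap between the URIs a candidate references and the
    URIs of the other entities suggested in the page."""
    if not candidate_refs or not page_uris:
        return 0.0
    return len(candidate_refs & page_uris) / len(candidate_refs | page_uris)


def rank_by_uri_context(candidates, page_uris):
    """Sort (uri, referenced_uris) pairs by descending overlap score."""
    scored = [(uri, uri_overlap(refs, page_uris)) for uri, refs in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


EX = "http://example.org/entity/"  # made-up namespace

# URIs of the other entities suggested for the page.
page_uris = {EX + "France", EX + "Eiffel_Tower"}

# Two candidates for the mention "Paris" with the URIs they reference.
candidates = [
    (EX + "Paris_Texas", {EX + "Texas", EX + "United_States"}),
    (EX + "Paris", {EX + "France", EX + "Eiffel_Tower", EX + "Seine"}),
]

ranking = rank_by_uri_context(candidates, page_uris)
```

Here the candidate that shares two of its three referenced URIs with
the page comes out on top, while the other one scores zero.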

However, as I mentioned above, DBpedia is not the intended usage
scenario for this engine. Because of that it would be important to
run more tests with domain-specific datasets. Especially
disambiguation of Concepts defined by SKOS thesauri is an interesting
use case. As those typically only define labels, an optional
description and semantic relations to broader, narrower and related
entities, there is only very little context available. So I expect this
to be a very hard setting.

Also other often used schemas (e.g. FOAF, schema.org ...) would be
good to test, and finally it would be cool to have a set of benchmarks
so that we can use the Stanbol Benchmarking tool (STANBOL-138) to
validate disambiguation results (e.g. during integration tests).

Any help - code contributions / validations - is very welcome! If you
plan to use Apache Stanbol a mention on your webpage is very welcome.
If you plan to set up/maintain a server running the disambiguation
engine please announce it on the dev mailing list.

Personally I do plan to invest time in working on disambiguation in
the coming month. However currently my main focus is on NLP processing
(STANBOL-733).
As mentioned earlier I will create Entityhub indexes for DBpedia 3.8.
As part of this I will also try to create a version that works well
with the disambiguation-mlt engine. As soon as this is finished I can
also provide this demo on the http://dev.iks-project.eu server.

best
Rupert

> Thanks a lot for your attention. We hope to hear from you.
>
> Regards,
> Juan.




--
| Rupert Westenthaler             rupert.westenthaler@gmail.com
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen