
Posted to dev@stanbol.apache.org by Rafa Haro <rh...@zaizi.com> on 2013/01/30 12:16:26 UTC

Entity Disambiguation in Stanbol

Dear all,

Lately, as Apache Stanbol integrators, we at Zaizi have been working 
extensively with Enhancement Engines that not only link entities to 
Knowledge Bases (mainly DBpedia) but also disambiguate them. As you 
know, there are currently two engines in Stanbol that can be used for 
disambiguation: disambiguation-mlt [1], developed by Kritarth Anand as 
part of a GSoC project supervised by Rupert, and DBpedia Spotlight [2], 
contributed by Pablo Mendes and Iavor Jelev as part of the Early 
Adopters programme [3] and currently integrated in the trunk within an 
Enhancement Chain called dbpedia-spotlight.

Currently, while the dbpedia-spotlight Enhancement Chain can be used 
"out-of-the-box", even with local installations of DBpedia Spotlight, 
the disambiguation-mlt engine is difficult to configure and get running. 
Also, after spending some days testing this engine, we found that the 
results weren't very good. So, after a couple of discussions with 
Rupert, and with feedback from one of our customers, we concluded that 
disambiguation engines in Stanbol need to go further, and we decided to 
start working on a completely new Disambiguation Framework that would 
also allow disambiguation with custom vocabularies and knowledge bases.

We would like to propose to the list a first draft of a roadmap for 
disambiguation in Stanbol. In our opinion, the high-level tasks that 
need to be done are the following:

- Agree on a disambiguation index model to store entities' surface forms 
and disambiguation contexts independently of the Knowledge Base, which 
also enables disambiguation with custom vocabularies.

- Design and develop tools for building such indexes, including a 
specific one for DBpedia - Wikipedia.

- Maintain disambiguation-mlt as a baseline disambiguation algorithm and 
adapt it to work with the newly designed index. Adapt it to work with 
the latest Enhancer release and merge it into the trunk in SVN.

- Design and develop new disambiguation algorithms based on entity 
co-occurrence, graph representations and statistical models.
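
To make the first and last items a bit more concrete, here is a minimal sketch of what a Knowledge-Base-independent index and a naive context-overlap ranking could look like. It is only an illustration under stated assumptions: the class `SurfaceFormIndex`, its methods and the word-overlap scoring are hypothetical and are not part of Stanbol or of disambiguation-mlt; the DBpedia URIs are just example data.

```python
from collections import defaultdict


class SurfaceFormIndex:
    """Toy in-memory index (hypothetical): surface form -> candidate entities."""

    def __init__(self):
        # lowercased surface form -> {entity URI -> set of context words}
        self._candidates = defaultdict(dict)

    def add(self, surface_form, entity_uri, context_text):
        """Register an entity as a candidate for a surface form, with context words."""
        words = set(context_text.lower().split())
        entry = self._candidates[surface_form.lower()]
        entry.setdefault(entity_uri, set()).update(words)

    def disambiguate(self, surface_form, document_text):
        """Rank candidates by word overlap between their context and the document."""
        doc_words = set(document_text.lower().split())
        candidates = self._candidates.get(surface_form.lower(), {})
        scored = [(len(context & doc_words), uri) for uri, context in candidates.items()]
        # Highest overlap first; ties are broken by URI to keep the order stable.
        return [uri for _, uri in sorted(scored, key=lambda s: (-s[0], s[1]))]


index = SurfaceFormIndex()
index.add("Paris", "http://dbpedia.org/resource/Paris",
          "capital city france seine")
index.add("Paris", "http://dbpedia.org/resource/Paris_Hilton",
          "american media personality hotel heiress")

ranking = index.disambiguate("Paris", "The Seine flows through the capital of France")
print(ranking)
```

Because the index only stores surface forms, entity URIs and context words, the same structure could in principle be filled from a custom vocabulary instead of DBpedia.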


Any comments, feedback or ideas are more than welcome!

Regards

[1] - https://github.com/kritarthanand/Disambiguation-Stanbol
[2] - https://github.com/dbpedia-spotlight/dbpedia-spotlight
[3] - http://blog.iks-project.eu/dbpedia-spotlight-integration-in-apache-stanbol-2/


-- 

------------------------------
This message should be regarded as confidential. If you have received this 
email in error please notify the sender and destroy it immediately. 
Statements of intent shall only become binding when confirmed in hard copy 
by an authorised signatory.

Zaizi Ltd is registered in England and Wales with the registration number 
6440931. The Registered Office is 222 Westbourne Studios, 242 Acklam Road, 
London W10 5JJ, UK.

Re: Entity Disambiguation in Stanbol

Posted by Rupert Westenthaler <ru...@gmail.com>.

On Wed, Jan 30, 2013 at 4:51 PM, Rafa Haro <rh...@zaizi.com> wrote:
> Thanks Rupert for your valuable feedback and contributions!
>
> On 30/01/13 16:03, Rupert Westenthaler wrote:
>
>> As those algorithms will be the main source of requirements for the
>> disambiguation index model, we might need to investigate them while
>> designing the disambiguation index model.
>
>
> It would be great to have a brainstorming session for that. I can point
> to a lot of papers on disambiguation, although most of them focus on
> disambiguation against Wikipedia; it is difficult to find papers
> proposing a more generic approach.
>

I would start with some typical scenarios:

* A SKOS-like controlled vocabulary (such as GEMET [1])
* A knowledge base built from company data (we could e.g. start from
the data model used by SugarCRM [2])
* A scenario where users annotate content with Stanbol and manually
correct the suggestions (the annotate.js [3] use case). Here the input
for disambiguation is the original knowledge base plus the mentions of
its entities in the content the user has manually corrected.

Based on those scenarios it should be possible to evaluate if and how
we could acquire the data required by the algorithms.
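
As a rough illustration of the third scenario, the sketch below collects manually confirmed mentions per entity so that they could later be used as additional disambiguation context. Everything here is an assumption for the sake of the example: `MentionStore` and its methods are hypothetical names, not an existing Stanbol or annotate.js API, and the `example.org` URIs are placeholders.

```python
from collections import defaultdict


class MentionStore:
    """Hypothetical store of user-confirmed mentions, keyed by entity URI."""

    def __init__(self):
        self._mentions = defaultdict(list)

    def record_correction(self, suggested_uri, confirmed_uri, sentence):
        """Store the sentence under the entity the user confirmed.

        Returns True if the user changed the suggestion, False if they accepted it.
        """
        self._mentions[confirmed_uri].append(sentence)
        return suggested_uri != confirmed_uri

    def mentions_for(self, entity_uri):
        """Return a copy of all confirmed mention sentences for an entity."""
        return list(self._mentions.get(entity_uri, []))


store = MentionStore()
changed = store.record_correction(
    suggested_uri="http://example.org/entity/jaguar-animal",
    confirmed_uri="http://example.org/entity/jaguar-cars",
    sentence="Jaguar unveiled a new saloon at the motor show.",
)
print(changed, store.mentions_for("http://example.org/entity/jaguar-cars"))
```

The confirmed sentences play the same role as the full-text mentions of a Wikipedia-based index, which is why this scenario may be a useful test of whether the data needed by the algorithms can be acquired without Wikipedia.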

WDYT
Rupert


[1] http://www.eionet.europa.eu/gemet/about?langcode=en
[2] http://www.sugarcrm.com/company-overview
[3] http://szabyg.github.com/annotate.js/

> Regards



--
| Rupert Westenthaler             rupert.westenthaler@gmail.com
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen

Re: Entity Disambiguation in Stanbol

Posted by Rafa Haro <rh...@zaizi.com>.

Thanks Rupert for your valuable feedback and contributions!

On 30/01/13 16:03, Rupert Westenthaler wrote:
> As those algorithms will be the main source of requirements for the
> disambiguation index model, we might need to investigate them while
> designing the disambiguation index model.

It would be great to have a brainstorming session for that. I can point 
to a lot of papers on disambiguation, although most of them focus on 
disambiguation against Wikipedia; it is difficult to find papers 
proposing a more generic approach.

Regards



Re: Entity Disambiguation in Stanbol

Posted by Rupert Westenthaler <ru...@gmail.com>.

Hi Rafa

Great to see some movement on the disambiguation topic.

On Wed, Jan 30, 2013 at 12:16 PM, Rafa Haro <rh...@zaizi.com> wrote:
> We would like to propose to the list a first draft of a roadmap for
> disambiguation in Stanbol. In our opinion, the high-level tasks that
> need to be done are the following:
>
> - Agree on a disambiguation index model to store entities' surface forms
> and disambiguation contexts independently of the Knowledge Base, which
> also enables disambiguation with custom vocabularies.
>

I would also like such a model to support temporal and spatial
contexts in addition to the lexical context (surface forms), the
full-text context (mentions) and the formal context (relations to/from
the Entity in the knowledge base).
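
A minimal sketch of a record covering these five context types might look like the following. This is only meant to anchor the discussion: the `EntityContext` class, its field names and the choice of a year range and a latitude/longitude pair are assumptions made for illustration, not an agreed model, and the entity below is made-up example data.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class EntityContext:
    """Hypothetical index record with one field per context type."""
    entity_uri: str
    surface_forms: List[str] = field(default_factory=list)     # lexical context
    mentions: List[str] = field(default_factory=list)          # full-text context
    related_entities: List[str] = field(default_factory=list)  # formal context
    time_span: Optional[Tuple[int, int]] = None   # temporal context: (start year, end year)
    location: Optional[Tuple[float, float]] = None  # spatial context: (latitude, longitude)

    def active_in(self, year: int) -> bool:
        """True if the year lies in the temporal context, or if none is known."""
        if self.time_span is None:
            return True
        start, end = self.time_span
        return start <= year <= end


acme = EntityContext(
    entity_uri="http://example.org/entity/acme",
    surface_forms=["ACME", "Acme Corp"],
    mentions=["Acme Corp opened a new plant."],
    related_entities=["http://example.org/entity/acme-founder"],
    time_span=(1990, 2005),
)
print(acme.active_in(2000), acme.active_in(2010))
```

Keeping the temporal and spatial contexts optional means that vocabularies without such data (e.g. a plain SKOS thesaurus) could still use the same record.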

> - Design and develop tools for building such indexes, including a specific
> one for DBpedia - Wikipedia.
>

I would not limit this to DBpedia, but also consider using information
from Freebase and YAGO for that task.

> - Maintain disambiguation-mlt as a baseline disambiguation algorithm and
> adapt it to work with the newly designed index. Adapt it to work with the
> latest Enhancer release and merge it into the trunk in SVN.
>

Updating the disambiguation-mlt branch to the now released versions of
Commons, Enhancer and Entityhub is on my TODO list. I also plan to
work on the few remaining issues to make the engine releasable (mainly
improving the engine's configuration).

> - Design and develop new disambiguation algorithms based on entity
> co-occurrence, graph representations and statistical models.
>

As those algorithms will be the main source of requirements for the
disambiguation index model, we might need to investigate them while
designing the disambiguation index model.

Thanks Rafa for taking up this important topic!

best
Rupert

>
> Any comments, feedback or ideas are more than welcome!
>
> Regards
>
> [1] - https://github.com/kritarthanand/Disambiguation-Stanbol
> [2] - https://github.com/dbpedia-spotlight/dbpedia-spotlight
> [3] - http://blog.iks-project.eu/dbpedia-spotlight-integration-in-apache-stanbol-2/
>

--
| Rupert Westenthaler             rupert.westenthaler@gmail.com
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen