Posted to dev@stanbol.apache.org by Rafa Haro <rh...@zaizi.com> on 2013/10/03 18:51:50 UTC

GSoC Projects and Entity Disambiguation Roadmap

Hi fellas,

With http://svn.apache.org/r1528907 the source code of the GSoC projects
has been committed to a new branch that we have called "disambiguation".
As you might know, there were two proposals for Stanbol this year, both
related to disambiguation engines. Dileepa Jayakody has developed an
Entity Disambiguation Engine using FOAF Correlation (STANBOL-1161) and
Antonio Perez a Graph-Based Freebase Disambiguation Engine
(STANBOL-1156). AFAIK, the results of both projects will be published by
Google next week, but according to the mentors both have been completed
successfully. I would like to congratulate both Antonio and Dileepa
again for the good work. Please feel free to test both solutions. To do
so properly, you need to go through the README documents, because both
projects use some external resources that need to be built.

Because both projects have several features in common, we have been
discussing on the Stanbol IRC channel a roadmap to refactor both
projects and continue improving disambiguation in Stanbol. The proposed
actions are summarized as follows:

1. Create an API that makes it easy to extract disambiguation features
from the context (ContentItem + Annotations). This might include a
better API to deal with Annotations and the results of previous engines.

2. Provide a framework for session (local) disambiguation. The framework
should allow configuring disambiguation features from custom sites and
plugging in algorithms that use those features.

3. Provide a framework for knowledge-based disambiguation algorithms. We
have identified three types: text-based (e.g. Solr MLT), graph-based and
machine-learning (ML) based. ML-based approaches are more complex to
generalize, so we would set them aside for now. For both text-based and
graph-based approaches, we would need to create a framework that eases
the storage and management of knowledge bases (KBs). Typically,
text-based approaches need to store textual content and evidence for the
entities. For example, Wikilinks is a dataset of documents with mentions
of Freebase entities that can be used as disambiguation evidence.
Graph-based approaches need graph databases to store the relationships
between the entities and to provide efficient ways to manipulate the
graph and plug in graph-based algorithms.
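
To make the graph-based option in point 3 a bit more concrete, here is a
minimal sketch in Java. It is only an illustration under stated
assumptions: an in-memory adjacency map stands in for a graph database,
and every name in it (`GraphDisambiguationSketch`, `pickBest`, the entity
ids) is hypothetical and not part of Stanbol or of either GSoC engine.
For each mention it simply picks the candidate with the most direct links
to the candidates of the other mentions.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch; not Stanbol API. */
class GraphDisambiguationSketch {

    /**
     * For every mention, choose the candidate entity that has the most
     * edges to candidates of the other mentions.
     *
     * @param candidates mention -> candidate entity ids
     * @param edges      entity id -> ids of directly related entities
     */
    static Map<String, String> pickBest(Map<String, List<String>> candidates,
                                        Map<String, Set<String>> edges) {
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> mention : candidates.entrySet()) {
            String best = null;
            int bestScore = -1;
            for (String candidate : mention.getValue()) {
                int score = 0;
                Set<String> neighbours = edges.getOrDefault(candidate, Set.of());
                for (Map.Entry<String, List<String>> other : candidates.entrySet()) {
                    if (other.getKey().equals(mention.getKey())) {
                        continue; // do not score a mention against itself
                    }
                    for (String otherCandidate : other.getValue()) {
                        if (neighbours.contains(otherCandidate)) {
                            score++;
                        }
                    }
                }
                // Strictly greater: the first candidate wins on ties.
                if (score > bestScore) {
                    bestScore = score;
                    best = candidate;
                }
            }
            if (best != null) {
                result.put(mention.getKey(), best);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, List<String>> candidates = new LinkedHashMap<>();
        candidates.put("Paris", List.of("Paris_Texas", "Paris_France"));
        candidates.put("Seine", List.of("Seine_River"));
        Map<String, Set<String>> edges = Map.of(
            "Paris_France", Set.of("Seine_River"),
            "Seine_River", Set.of("Paris_France"));
        System.out.println(pickBest(candidates, edges));
    }
}
```

A real engine would of course query the graph store instead of a map and
would likely use a richer score than a plain edge count.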

Looking forward to your feedback.

Cheers,

Rafa Haro

-- 

------------------------------
This message should be regarded as confidential. If you have received this 
email in error please notify the sender and destroy it immediately. 
Statements of intent shall only become binding when confirmed in hard copy 
by an authorised signatory.

Zaizi Ltd is registered in England and Wales with the registration number 
6440931. The Registered Office is Brook House, 229 Shepherds Bush Road, 
London W6 7AN. 

Re: GSoC Projects and Entity Disambiguation Roadmap

Posted by Antonio David Perez Morales <ap...@zaizi.com>.
Hi Rupert

After reading the comments in STANBOL-1183, I agree with the structure
you propose for the Disambiguation API.
As you say, we can start by implementing the Entity Disambiguation Context
part of the API based on the Entityhub, and then extend it with the
other implementations described in the issue.

Regards


On Mon, Oct 21, 2013 at 3:04 PM, Rupert Westenthaler <
rupert.westenthaler@gmail.com> wrote:


Re: GSoC Projects and Entity Disambiguation Roadmap

Posted by Rupert Westenthaler <ru...@gmail.com>.
Hi all,

To get this started, I created STANBOL-1183 [1] and added a first
suggestion for a disambiguation API.

* the `Entity Disambiguation Context` is tailored towards the "Session
(local) disambiguation" usage scenario.
* the `DisambiguationData` resembles the class of the same name in the
Disambiguation MLT engine, which was already reused by this year's
GSoC projects.
* the `DisambiguationContext` tries to abstract the building of
contexts from the algorithm used for disambiguation. While some
engines will come with both context and algorithm, the hope is that
others might be able to reuse context or algorithm implementations.
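
The split between context building and algorithm could look roughly like
the following Java sketch. The names `ContextBuilder` and `Disambiguator`
and their signatures are assumptions made purely for illustration; they
are not taken from STANBOL-1183 or from existing Stanbol code. The
context here is just a set of lower-cased words, and the algorithm picks
the candidate whose keywords overlap most with it.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch; names and signatures are not Stanbol API. */
class ContextAlgorithmSketch {

    /** Builds a context (here: a set of lower-cased words) from text. */
    interface ContextBuilder {
        Set<String> build(String text);
    }

    /** Picks one candidate id for a context, however that context was built. */
    interface Disambiguator {
        String choose(Set<String> context, Map<String, Set<String>> candidateKeywords);
    }

    /** Context builder that splits on non-word characters. */
    static final ContextBuilder WORDS = text ->
        new HashSet<>(Arrays.asList(text.toLowerCase(Locale.ROOT).split("\\W+")));

    /** Algorithm that maximizes the keyword overlap with the context. */
    static final Disambiguator OVERLAP = (context, candidateKeywords) -> {
        String best = null;
        int bestOverlap = -1;
        for (Map.Entry<String, Set<String>> c : candidateKeywords.entrySet()) {
            Set<String> shared = new HashSet<>(c.getValue());
            shared.retainAll(context);
            if (shared.size() > bestOverlap) {
                bestOverlap = shared.size();
                best = c.getKey();
            }
        }
        return best;
    };

    public static void main(String[] args) {
        Set<String> context = WORDS.build("Paris lies on the river Seine in France");
        Map<String, Set<String>> candidates = Map.of(
            "Paris_France", Set.of("france", "seine", "capital"),
            "Paris_Texas", Set.of("texas", "usa"));
        System.out.println(OVERLAP.choose(context, candidates));
    }
}
```

Because the two interfaces only meet at the context type, either side
could be swapped out without touching the other, which is the kind of
reuse the proposal hopes for.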

It would be great to get some feedback on this proposal!

Next steps: Assuming some positive feedback, I would like to start with
the Entity Disambiguation Context part of the API. Most likely I will
start with an Entityhub Representation based implementation of
`EntityContext` and an Entityhub Site based implementation of the
`EntityContextProvider`. I will also create SolrYard indexes for
datasets such as geonames.org and DBpedia for testing.
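
As a rough idea of the shape such a provider might take, here is a small
Java sketch. It is not the proposed API: the nested types `Context`,
`Provider` and `InMemoryProvider`, their methods, and the example id
`urn:example:paris` are all hypothetical, and a plain in-memory map
stands in for the Entityhub Site that a real implementation would query.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch; names and signatures are not Stanbol API. */
class EntityContextProviderSketch {

    /** Stand-in for an entity context: field name -> values. */
    static final class Context {
        private final String entityId;
        private final Map<String, List<String>> fields;

        Context(String entityId, Map<String, List<String>> fields) {
            this.entityId = entityId;
            this.fields = Collections.unmodifiableMap(new HashMap<>(fields));
        }

        String getEntityId() {
            return entityId;
        }

        /** Values of a field, or an empty list if the entity lacks it. */
        List<String> getValues(String field) {
            return fields.getOrDefault(field, Collections.emptyList());
        }
    }

    /** Looks up the context of an entity by id. */
    interface Provider {
        Optional<Context> getContext(String entityId);
    }

    /** In-memory provider; a real one would query an Entityhub Site. */
    static final class InMemoryProvider implements Provider {
        private final Map<String, Context> store = new HashMap<>();

        void add(Context context) {
            store.put(context.getEntityId(), context);
        }

        @Override
        public Optional<Context> getContext(String entityId) {
            return Optional.ofNullable(store.get(entityId));
        }
    }

    public static void main(String[] args) {
        InMemoryProvider provider = new InMemoryProvider();
        provider.add(new Context("urn:example:paris",
            Map.of("label", List.of("Paris"))));
        System.out.println(provider.getContext("urn:example:paris")
            .map(c -> c.getValues("label"))
            .orElse(List.of()));
    }
}
```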

best
Rupert


[1] https://issues.apache.org/jira/browse/STANBOL-1183

On Fri, Oct 4, 2013 at 8:57 AM, Antonio David Perez Morales
<ap...@zaizi.com> wrote:



-- 
| Rupert Westenthaler             rupert.westenthaler@gmail.com
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen

Re: GSoC Projects and Entity Disambiguation Roadmap

Posted by Antonio David Perez Morales <ap...@zaizi.com>.
Hi all

Thanks for the support of the community (especially Rupert and Rafa)
during the project.

I agree with all the conclusions from the discussion on the Stanbol IRC
channel, so we can define a definitive roadmap (for the time being) and
start developing these topics.

Regards


On Thu, Oct 3, 2013 at 7:16 PM, Dileepa Jayakody
<di...@gmail.com> wrote:


Re: GSoC Projects and Entity Disambiguation Roadmap

Posted by Dileepa Jayakody <di...@gmail.com>.
On Thu, Oct 3, 2013 at 10:21 PM, Rafa Haro <rh...@zaizi.com> wrote:

> Hi fellas,
>
> With http://svn.apache.org/r1528907 the GSoC projects source code has
> been committed in a new branch that we have called "disambiguation". As you
> might know, this year, there were two proposals for Stanbol, both related
> to disambiguation engines. Dileepa Jayakody has developed an Entity
> Disambiguation Engine using FOAF Correlation (STANBOL-1161) and Antonio
> Perez a Graph-Based Freebase Disambiguation Engine (STANBOL-1156). AFAIK,
> the results of both projects will be published by Google next week, but
> according to the mentors they have successfully accomplished them. I would
> like to congratulate both Antonio and Dileepa again for the good work.


Thanks all for the support and guidance given throughout the project; it
was a great experience working with the Stanbol community.


> Please feel free to test both solutions. In order to do it properly, you
> need to go through the README documents because both projects use some
> external resources that need to be built.
>
> Because both projects have several features in common, we have been
> discussing on the Stanbol IRC channel a roadmap to refactor both projects
> and continue improving the disambiguation stuff in Stanbol. The summary of
> the proposed actions is the following:
>
> 1. Create an API that would make it easy to extract disambiguation
> features from the context (ContentItem + Annotations). This might include a
> better API to deal with Annotations and the results of previous engines.
>

+1. Abstractions such as EntityAnnotation and TextAnnotation are used for
various purposes in disambiguation, so creating Java classes and an API for
them will be extremely useful.
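
As a rough illustration of what such Java classes might look like, here
is a small sketch. The nested class names, the chosen fields (selected
text, start/end offsets, a confidence value) and the
`rankedSuggestions` helper are assumptions for the sake of the example,
not an existing or agreed Stanbol API.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch; names and fields are not Stanbol API. */
class AnnotationSketch {

    /** A suggested entity for a mention, with a confidence value. */
    static final class EntityAnnotation {
        final String entityUri;
        final double confidence;

        EntityAnnotation(String entityUri, double confidence) {
            this.entityUri = entityUri;
            this.confidence = confidence;
        }
    }

    /** A mention in the text plus the entities suggested for it. */
    static final class TextAnnotation {
        final String selectedText;
        final int start;
        final int end;
        private final List<EntityAnnotation> suggestions = new ArrayList<>();

        TextAnnotation(String selectedText, int start, int end) {
            this.selectedText = selectedText;
            this.start = start;
            this.end = end;
        }

        void addSuggestion(EntityAnnotation suggestion) {
            suggestions.add(suggestion);
        }

        /** Copy of the suggestions, sorted by descending confidence. */
        List<EntityAnnotation> rankedSuggestions() {
            List<EntityAnnotation> ranked = new ArrayList<>(suggestions);
            ranked.sort(Comparator
                .comparingDouble((EntityAnnotation s) -> s.confidence)
                .reversed());
            return ranked;
        }
    }

    public static void main(String[] args) {
        TextAnnotation mention = new TextAnnotation("Paris", 0, 5);
        mention.addSuggestion(new EntityAnnotation("urn:example:paris-texas", 0.3));
        mention.addSuggestion(new EntityAnnotation("urn:example:paris-france", 0.9));
        System.out.println(mention.rankedSuggestions().get(0).entityUri);
    }
}
```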

>
> 2. Provide a framework for Session (local) disambiguation. The framework
> should allow configuring disambiguation features from custom sites and
> plugging in algorithms that use those features.

Can you please give some more details on this point?
I guess it is a framework to plug in custom vocabularies and configure
disambiguation from those vocabularies? Please correct me if I have got the
idea wrong.


> 3. Provide a Framework for Knowledge Based Disambiguation Algorithms. We
> have identified three types: Text Based (e.g. Solr MLT), Graph based and
> Machine Learning based. ML based are more complex to generalize, so we
> would discard them for now. For both text and graph based, we would need to
> create a framework for easing KB storage/management. Typically, text based
> approaches would need to store textual content and evidence for the
> entities. For example Wikilinks is a dataset of documents with mentions of
> Freebase entities that can be used as disambiguation evidence. Graph based
> approaches would need to use Graph databases in order to store the
> relationships between the entities and provide efficient ways to manipulate
> the graph and plug in graph based algorithms.

+1.

> Looking forward to your feedback.
>
> Cheers,
>
> Rafa Haro

Thanks,
Dileepa
