Posted to dev@stanbol.apache.org by Rupert Westenthaler <ru...@gmail.com> on 2012/03/09 14:03:46 UTC

Re: DBpedia Spotlight in Stanbol

Hi Pablo, all

I tried to have a more detailed look at dbpedia spotlight. I was not
as successful as hoped with the source (maybe because I have never
used Scala myself) but I got a relatively nice overview by studying the
developer documentation.

I will split this in two parts: first some general points about
integration possibilities between Stanbol and dbpedia spotlight, and
second some more specific points/ideas.

General points:

1) Integration of dbpedia spotlight as Enhancement Engine within
Apache Stanbol should be relatively easy, especially when implemented
on the Web Service (RESTful) layer. I would suggest starting with
exactly this.
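
As a rough illustration of what integration on the REST layer could
look like, here is a minimal Python sketch that builds an annotation
request and parses the response into plain records that an Enhancement
Engine could map to its own annotations. The endpoint URL, the `text`
and `confidence` parameter names, and the JSON field names (`Resources`,
`@URI`, `@surfaceForm`, `@offset`) are assumptions to be checked against
the DBpedia Spotlight documentation; no network call is made here.

```python
import json
from urllib.parse import urlencode

# Assumed endpoint; verify against the DBpedia Spotlight documentation.
SPOTLIGHT_ANNOTATE_URL = "http://spotlight.dbpedia.org/rest/annotate"


def build_annotate_url(text, confidence=0.2, base_url=SPOTLIGHT_ANNOTATE_URL):
    """Build the GET URL for an annotation request (parameter names assumed)."""
    return base_url + "?" + urlencode({"text": text, "confidence": confidence})


def parse_annotations(payload):
    """Convert an (assumed) JSON response body into simple dictionaries."""
    data = json.loads(payload)
    return [
        {
            "uri": res["@URI"],
            "surface_form": res["@surfaceForm"],
            "offset": int(res["@offset"]),
        }
        for res in data.get("Resources", [])
    ]
```

An engine built this way only depends on the HTTP interface, so it would
not need to load any Spotlight index or Scala code inside Stanbol.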

2) An integration on the level of "Spotter", "Candidate Selection",
"Disambiguating" and "Filtering" would be really nice, as this would
allow EnhancementEngines to share intermediate results, e.g. to allow
dbpedia spotlight to disambiguate TextAnnotations extracted by other
engines. Allowing engines to further process Entities spotted by
dbpedia spotlight would also be an interesting use case.
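
To make the stage-level integration more concrete, the following is a
minimal Python sketch in which each stage reads and writes the same
list of annotations, so a stage can be swapped or fed with results from
another engine. The `Annotation` class, the function names, the toy
dictionary and the prior scores are all hypothetical; they are neither
the Stanbol nor the dbpedia spotlight API.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Annotation:
    """A spotted text span plus whatever later stages have added to it."""
    surface_form: str
    offset: int
    candidates: list = field(default_factory=list)  # set by candidate selection
    entity: Optional[str] = None                    # set by disambiguation


def spot(text, dictionary):
    """Spotting: find the first occurrence of each dictionary term (naive)."""
    found = []
    lowered = text.lower()
    for term in dictionary:
        start = lowered.find(term)
        if start >= 0:
            found.append(Annotation(text[start:start + len(term)], start))
    return sorted(found, key=lambda a: a.offset)


def select_candidates(annotations, index):
    """Candidate selection: attach all entities known for a surface form."""
    for ann in annotations:
        ann.candidates = list(index.get(ann.surface_form.lower(), []))
    return annotations


def disambiguate(annotations, prior):
    """Disambiguation: pick the candidate with the highest toy prior score."""
    for ann in annotations:
        if ann.candidates:
            ann.entity = max(ann.candidates, key=lambda c: prior.get(c, 0.0))
    return annotations


def filter_linked(annotations):
    """Filtering: keep only annotations that were linked to an entity."""
    return [ann for ann in annotations if ann.entity is not None]
```

Because `select_candidates` and `disambiguate` only need a list of
`Annotation` objects, spans produced by a different engine can be passed
in directly, which is the sharing of intermediate results described
above.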

3) It would be really interesting to use functionality of the dbpedia
spotlight Data Generation Workflow for the Entityhub Indexing tools
for DBpedia. Especially the correct processing of Redirects and
Disambiguation would be very helpful for building better Entityhub
indexes for DBpedia (for example, this would allow adding
abbreviations as alternate labels).
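
The redirect part of this can be sketched in a few lines of Python. The
sketch assumes the redirects are already available as a simple mapping
from source title to target title (a hypothetical input format, not the
actual output of the Data Generation Workflow) and collects every
redirect source as an alternate label of its final target.
Disambiguation pages are not handled here.

```python
from collections import defaultdict


def resolve_redirect(title, redirects):
    """Follow a redirect chain to its final target, guarding against cycles."""
    seen = set()
    while title in redirects and title not in seen:
        seen.add(title)
        title = redirects[title]
    return title


def alternate_labels(redirects):
    """Group redirect source titles under the entity they finally point to."""
    labels = defaultdict(set)
    for source in redirects:
        target = resolve_redirect(source, redirects)
        if target != source:
            # Titles use underscores; labels read better with spaces.
            labels[target].add(source.replace("_", " "))
    return dict(labels)
```

With redirects such as `UN -> United_Nations`, the abbreviation "UN"
ends up as an alternate label of the `United_Nations` entity, which is
the effect described in point 3.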

Specific points:

* I have seen that you directly use Lucene. Have you considered using
Solr instead, or do you need some features that are not available in
Solr?

Stanbol provides a lot of utilities to easily deploy and manage
SolrCores (see [1] for details). So if it were possible to use (not
index) dbpedia spotlight via Solr, this would allow users to very
easily download and deploy a local dbpedia spotlight version.

* LingPipe Spotter: I like the Idea of doing Entity Spotting by a
predefined list of terms, but I do not like the license of this
dependency and also the memory footprint.

I was thinking that given a Solr index with the Surface Forms one
could implement a similar functionality by using
http://wiki.apache.org/solr/TermsComponent. I was already trying to
use this for the KeywordLinkingEngine, but it did not work out because
the KeywordLinkingEngine does not distinguish between (spotting +
candidate selection) and disambiguation (BTW: this is clearly a big
advantage of the design used by dbpedia spotlight over the
KeywordLinkingEngine).

The Idea would be to store the labels (Surface Forms) and the type in
a single SolrField such as

paris|Place
paris hilton|Person

The TermsComponent supports prefix searches and returns matching
labels. So a search for "paris" would return the two examples above.
The Results could then be post-processed to retrieve the actual type
of the spotted Entity. If needed one could even encode the local name
of the Resource URL as suffix to such labels.
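
A minimal Python sketch of this encoding follows. It does not talk to
Solr; it only mimics a prefix lookup over a sorted term dictionary
(which is roughly what a TermsComponent request with a prefix would
return) and then splits each term back into label and type. The helper
names and the choice of `|` as separator handling are illustrative
assumptions.

```python
from bisect import bisect_left


def encode(label, entity_type):
    """Store label and type in one term, e.g. 'paris|Place'."""
    return f"{label.lower()}|{entity_type}"


def prefix_terms(sorted_terms, prefix):
    """Return all terms starting with prefix from a sorted term list."""
    start = bisect_left(sorted_terms, prefix)
    matches = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break  # sorted input: no later term can match
        matches.append(term)
    return matches


def decode(term):
    """Split a stored term back into (label, type) at the last separator."""
    label, _, entity_type = term.rpartition("|")
    return label, entity_type
```

Note that a prefix search for "paris" also returns longer labels such
as "paris hilton", so the spotter still has to compare the returned
labels with the following tokens of the text to find the longest match.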

I had not nearly enough time to go into everything that looked
interesting but I hope that this helps at least to start the
discussion about the next steps.

best
Rupert Westenthaler

On Wed, Feb 29, 2012 at 6:41 AM, Rupert Westenthaler
<ru...@gmail.com> wrote:
> Hi Pablo, all
>
> I am personally really interested in DBpedia Spotlight and a nice
> integration in Stanbol would be really great. While the Stanbol
> Enhancer and Entityhub can already be used to link entities with
> dbpedia such components are not optimized to be used with DBpedia. In
> short - there are still a lot of reasons why one would want a
> dedicated Entity-Linking engine for DBpedia!
>
> Regarding the validation part: It would be really great if you would
> work on Benchmarking. Bertrand Delacretaz has implemented this really
> important feature more than a year ago but it was not really used up
> to now, because nobody contributed test data. For interested people
> you can try Benchmarking under "/benchmark" (e.g. on the demo server
> [1])
>
> I have already checked out the Spotlight source code and I will try to
> have a detailed look at it. I plan to provide detailed feedback on
> technical details with a focus on potential integration paths and
> synergies.
>
> Thx for the really nice proposal
>
> regards
> Rupert
>
>
> [1] http://dev.iks-project.eu:8081/benchmark/
>
> On Mon, Feb 27, 2012 at 12:30 PM, Pablo Mendes <pa...@gmail.com> wrote:
>> Hi all,
>> We are interested in joining the Early Adopters Programme (EAP) as a way to
>> seed a long lasting collaboration with the Stanbol community.
>>
>> We are the creators of DBpedia Spotlight, a Java/Scala Open Source
>> Enhancement Engine (Apache V2 license) that is complementary to Stanbol.
>> DBpedia Spotlight has the ambitious goal to annotate any of the 3.5M
>> entities from all 320 classes in the DBpedia Ontology. At the core of our
>> proposal is the idea of remaining generic and configurable for many use
>> cases. Besides the open source code, we also provide a freely available
>> REST service that has been used to annotate cultural goods [1], generate
>> RDFa annotations in Wordpress [2], and enhance the content in Wikipedia
>> through a MediaWiki toolbar [3], among others [4].
>>
>> [1] http://dme.ait.ac.at/annotation
>> [2] http://aksw.org/Projects/RDFaCE
>> [3] http://pedia.sztaki.hu/
>> [4] More at: http://wiki.dbpedia.org/spotlight/knownuses
>>
>> We have a demo interface that lets you tweak some parameters and see how
>> the system works in practice:
>> http://spotlight.dbpedia.org/demo
>>
>> As a first step through the EAP, shall our proposal be selected, our
>> intention is to provide Stanbol enhancement engines based on the different
>> strategies that DBpedia Spotlight uses for term recognition and
>> disambiguation (more technical details below). For the validation part, one
>> idea is to provide a benchmark comparing the performance (esp. accuracy) of
>> the different enhancement engines in different annotated corpora that we
>> have already collected. Would this be interesting for IKS/Stanbol? Is there
>> another type of validation that would be more appealing to the community?
>>
>> Looking forward to discussing possibilities with you.
>>
>> Best regards,
>> Pablo
>>
>> For the More Technical Folks
>>
>> Our content enhancement is performed in 4 stages:
>> - Spotting recognizes terms in some input text. It can be done via
>> substring matches in a dictionary, or with more sophisticated approaches
>> such as NER and keyphrase extraction.
>> - Candidate mapping matches the "spotted" terms with their possible
>> interpretations (entity identifiers). This can also be done with a
>> dictionary (hashmap), but offers the possibility to do fancier matching
>> with name variations - acronyms, approximate matching, etc.
>> - Disambiguation ranks the "candidates" given the context (e.g. words
>> around the spotted phrase). This can also be done in many ways, locally,
>> globally, with different scoring functions, etc.
>> - Linking decides which of the spots to keep, given that after the previous
>> steps we have more information about confidence, topical pertinence, etc.
>>
>> Other potentially interesting more technical details
>> - Our Web service uses Jersey (JAX-RS)
>> - The Web Service is CORS-enabled, and we have both pure JS and jQuery
>> clients. We also have Java, Scala and PHP clients.
>> - Users can provide SPARQL queries to blacklist/whitelist results
>> (currently in the Linking step only, but work in progress for other steps).
>
>
>
> --
> | Rupert Westenthaler             rupert.westenthaler@gmail.com
> | Bodenlehenstraße 11                             ++43-699-11108907
> | A-5500 Bischofshofen

-- 
| Rupert Westenthaler             rupert.westenthaler@gmail.com
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen

Re: DBpedia Spotlight in Stanbol

Posted by Pablo Mendes <pa...@gmail.com>.

Hi Rupert, all

On Fri, Mar 9, 2012 at 2:03 PM, Rupert Westenthaler <
rupert.westenthaler@gmail.com> wrote:

> Hi Pablo, all
>
> I tried to have a more detailed look at dbpedia spotlight. I was not
> as successful as hoped with the source (maybe because I have never
> used Scala myself) but I got a relatively nice overview by studying the
> developer documentation.
>

I'm sorry to hear it wasn't as smooth as we would have liked. But I'm glad
that the documentation helps.
We recommend using IntelliJ for its Scala plugin, as many people have
problems with Scala support in other IDEs.
Our Jenkins tells us that the build is stable, so did you manage to at
least build with "mvn install"?
Let me know if you want to try again and perhaps I could help somehow.


>
> I will split this in two parts: first some general points about
> integration possibilities between Stanbol and dbpedia spotlight, and
> second some more specific points/ideas.
>
>
> General points:
>
> 1) Integration of dbpedia spotlight as Enhancement Engine within
> Apache Stanbol should be relatively easy, especially when implemented
> on the Web Service (RESTful) layer. I would suggest starting with
> exactly this.
>

Right. That was my plan too.


>
> 2) An integration on the level of "Spotter", "Candidate Selection",
> "Disambiguating" and "Filtering" would be really nice, as this would
> allow EnhancementEngines to share intermediate results, e.g. to allow
> dbpedia spotlight to disambiguate TextAnnotations extracted by other
> engines. Allowing engines to further process Entities spotted by
> dbpedia spotlight would also be an interesting use case.
>

Exactly!


>
> 3) It would be really interesting to use functionality of the dbpedia
> spotlight Data Generation Workflow for the Entityhub Indexing tools
> for DBpedia. Especially the correct processing of Redirects and
> Disambiguation would be very helpful for building better Entityhub
> indexes for DBpedia (for example, this would allow adding
> abbreviations as alternate labels).
>
>
OK. I've been cleaning up that workflow, and would be glad to find ways to
incorporate it into the Entityhub.


>
> Specific points:
>
> * I have seen that you directly use Lucene. Have you considered using
> Solr instead, or do you need some features that are not available in
> Solr?
>

We actually hacked pretty deep into Lucene in order to implement our
Inverse Candidate Frequency metric (
http://wiki.dbpedia.org/spotlight/isemantics2011).
Since we learned Lucene while creating this metric, the implementation
turned out to be less than optimal. We have a few directions we'd like to
explore in order to substitute that component.
One of them is using Solr. We are also taking into consideration key-value
stores or even regular SQL databases.
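
For readers who have not followed the link: the name suggests an
analogue of inverse document frequency computed over the candidates of
one surface form rather than over all documents. The Python sketch
below illustrates that reading, assuming ICF(w) = log(|candidates| /
number of candidates whose context contains w); the exact definition
should be taken from the linked paper, and the function name and toy
data are hypothetical.

```python
import math


def icf(word, candidate_contexts):
    """Inverse Candidate Frequency of a word for one surface form.

    candidate_contexts maps each candidate entity to the set of words
    seen in its context. Words shared by all candidates score 0.
    """
    n = sum(1 for words in candidate_contexts.values() if word in words)
    if n == 0:
        return 0.0  # word unseen for every candidate: no evidence either way
    return math.log(len(candidate_contexts) / n)
```

For the surface form "Washington", a word like "president" that occurs
with only one of three candidates gets a higher weight than "city",
which occurs with two of them, so it contributes more to
disambiguation.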


> Stanbol provides a lot of utilities to easily deploy and manage
> SolrCores (see [1] for details). So if it were possible to use (not
> index) dbpedia spotlight via Solr, this would allow users to very
> easily download and deploy a local dbpedia spotlight version.
>
>
Ah, that sounds great! We need something like that. We've considered using
Maven and even CKAN for distributing our data.
We'd have to see the pros and cons. For example, we have other data besides
the lucene/solr index.
Anyways, for the next release I plan to segment our gigantic index into
many cohesive bunches "people, places, organizations", "sports, music", etc.
That way people should be able to compose their index with more ease.


>
> * LingPipe Spotter: I like the Idea of doing Entity Spotting by a
> predefined list of terms, but I do not like the license of this
> dependency and also the memory footprint.
>

We're on the same page. We've implemented 5-6 other spotters after that
one, and it's within our plans to completely deprecate that one.


>
> I was thinking that given a Solr index with the Surface Forms one
> could implement a similar functionality by using
> http://wiki.apache.org/solr/TermsComponent. I was already trying to
> use this for the KeywordLinkingEngine, but it did not work out because
> the KeywordLinkingEngine does not distinguish between (spotting +
> candidate selection) and disambiguation (BTW: this is clearly a big
> advantage of the design used by dbpedia spotlight over the
> KeywordLinkingEngine).
>

As part of our participation in the EAP we could test several of these
approaches to find what works best.


>
> The Idea would be to store the labels (Surface Forms) and the type in
> a single SolrField such as
>
> paris|Place
> paris hilton|Person
>
> The TermsComponent supports prefix searches and returns matching
> labels. So a search for "paris" would return the two examples above.
> The Results could then be post-processed to retrieve the actual type
> of the spotted Entity. If needed one could even encode the local name
> of the Resource URL as suffix to such labels.
>

That idea sounds interesting. We've debated a bit over what our index should
look like.
We've considered "stashing" data into one field, using multiple fields and
even payloads.
Unfortunately these discussions lost priority to the other more researchy
ones, but if there are more people interested in talking these through, it
may be a good time to revive the discussions.

