
Posted to dev@stanbol.apache.org by Rajan Shah <ra...@gmail.com> on 2015/06/01 17:27:22 UTC

Stanbol NER questions

Hi,

I have a couple of questions pertaining to Stanbol NER.

Thanks in advance,
Rajan


*1. Same As:*

How can I configure "Same As" within the Stanbol framework?

For ex.

JP Morgan Chase is the same as J.P. Morgan

*2. Entity Recognition:*

Suppose that the entity Person has four properties.

a. Name
b. Title
c. Address
d. Company

When NER runs, it only returns the property that matched. Suppose I want to
retrieve all properties associated with the entity in the enhancer's front-end
or graph (without an additional second query) - is that possible?

*3. Entity co-mention:*

From the documentation, it's not crystal clear how this engine works.
Is it possible to provide a quick concrete example in a couple of lines?

Does it require the two entities to live in the same Solr index or namespace?

*4. Sentence Detection:*

Is it possible to provide an example configuration or a pointer that
describes the key features? Also, within a sentence, if there are two usages
of the same word, say

a. Same person detection

1. Mr. Smith - first sentence
2. Smith's - following sentence

Is it possible to recognize that both sentences refer to the same person
using Stanbol?

b. Sentence pattern detection based on language grammar

Does it allow detecting sentences based on language grammar?

Re: Stanbol NER questions

Posted by Rupert Westenthaler <ru...@gmail.com>.

On Tue, Jun 2, 2015 at 2:55 PM, Rajan Shah <ra...@gmail.com> wrote:
> Hi,
>
> As I am new to stanbol I would like to try out various features. I would
> really appreciate, if someone can provide answers to Stanbol NER questions
> in this mail thread.
>
> Some other questions are as follows:
>
> 1. Geonames:
> If there is a place extracted from original text, how can I use geonames to
> obtain additional information using geonames index?

Dereferencing is used to add additional data about extracted entities.

If you want to use the Entityhub for dereferencing you will need to
create an index of Geonames [1].

[1] https://svn.apache.org/repos/asf/stanbol/trunk/entityhub/indexing/geonames/README.md

>
> 2. Referring to other solr core indexes:
> Suppose, my entities are scattered across multiple indexes such as
>
> Entity1 exists within Index1
> Entity2 exists within Index2
>
> Is it possible to refer Entity2 from Index2 while performing NER for
> Entity1 and vice versa.
>

Information integration in RDF is done via URIs, so referring is not
an issue. The Entityhub has the /sites endpoint that allows you to
access information stored in any referenced site. If you manage to
set up an Entityhub Dereference engine on this endpoint, you could use
LDPath to dereference information for Entity2 if Entity1 is extracted
from the processed document. However, if you need to recursively
dereference information, this is not supported by any existing
engine.

But as I said, RDF uses URIs, so you could also retrieve this
additional information later on, on the client side.
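
A minimal sketch of that client-side step, in Java. It only builds the
lookup URL for an entity URI. The base URL http://localhost:8080, the path
/entityhub/sites/entity with its id parameter, and the example.org entity
URI are assumptions for illustration and should be checked against your own
Stanbol installation.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class EntityLookupUrlSketch {

    // Builds the URL a client could request to fetch the data of one entity.
    // The path and the "id" parameter are assumptions, see the note above.
    static String entityUrl(String stanbolBase, String entityUri) {
        try {
            return stanbolBase + "/entityhub/sites/entity?id="
                    + URLEncoder.encode(entityUri, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is a mandatory charset on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical URI of Entity2, taken from an enhancement result.
        String entity2 = "http://example.org/entity/Entity2";
        System.out.println(entityUrl("http://localhost:8080", entity2));
    }
}
```

The client would then issue a plain HTTP GET on that URL and ask for an RDF
serialization via the Accept header.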

best
Rupert

> Thanks in advance.
>
> With best regards,
> Rajan
>

-- 
| Rupert Westenthaler             rupert.westenthaler@gmail.com
| Bodenlehenstraße 11                              ++43-699-11108907
| A-5500 Bischofshofen
| REDLINK.CO ..........................................................................
| http://redlink.co/

Re: Stanbol NER questions

Posted by Rajan Shah <ra...@gmail.com>.

Hi,

As I am new to Stanbol, I would like to try out its various features. I would
really appreciate it if someone could answer the Stanbol NER questions
in this mail thread.

Some other questions are as follows:

1. Geonames:
If a place is extracted from the original text, how can I use the Geonames
index to obtain additional information about it?

2. Referring to other Solr core indexes:
Suppose my entities are scattered across multiple indexes, such as

Entity1 exists within Index1
Entity2 exists within Index2

Is it possible to refer to Entity2 from Index2 while performing NER for
Entity1, and vice versa?

Thanks in advance.

With best regards,
Rajan


Re: Stanbol NER questions

Posted by Rajan Shah <ra...@gmail.com>.

Hi Rupert,

Thanks a lot for the detailed answer.

Is there any plan from Cristian to have something ready soon? Or is it even
on the Stanbol roadmap for the coming quarter? How can someone vote for the
feature request?

Suppose that, in the meantime, I want to develop a custom enhancer that
captures a very small subset of the feature request, where two entities are
associated by a simple relationship.

For ex.
Apple buys Metaio

What is the best way to approach this in the current framework? Is it
possible to provide some snippet/reference?

With best regards,
Rajan

On Mon, Jun 8, 2015 at 3:46 AM, Rupert Westenthaler <
rupert.westenthaler@gmail.com> wrote:

> Hi Rajan,
>
> regarding dereferencing:
>
>
> For small and medium-sized datasets, using the SolrYard for both is the
> way to go. For big datasets (e.g. DBpedia) you can still use the
> SolrYard, but the SolrCore will be much bigger than the TripleStore.
> This is because Solr stores documents (stored fields) while the
> TripleStore stores a graph. So, for example, if your dataset contains
> 200k entities typed as dbpedia:Person, the SolrYard would store the URI
> "dbpedia:Person" 200k times, while the TripleStore stores it just a
> single time. So although Solr does (by default) compress stored fields,
> it will still be less efficient for storage if your dataset contains
> a lot of URI values. If your dataset uses mainly literal values, this
> does not apply.
>
> On the other hand: Solr is amazingly fast for dereferencing ^^
>
>
> regarding Entity co-mention
>
> >> >
> >> > *3. Entity co-mention:*
> >> >
> >> > From the documentation, it's not crystal clear that how this engine
> >> works?
> >> > Is it possible to provide a quick concrete example in couple lines?
> >> >
> >> > Does it require two entities live in same solr index or namespace?
> >>
> >> IMO the example
> >>
> >>     ... Barack Obama gave a talk to members of the Labor Union ...
> >> Obama specially mentioned ...
> >>
> >> describes it well. Because "Barack Obama" has already been mentioned,
> >> the later "Obama" is treated as a co-mention. The engine builds an
> >> index over the mentions of previous fise:TextAnnotations. It only works
> >> on data already present in the ContentItem. It does not require the CV
> >> to be in any specific storage (e.g. the Entityhub).
> >>
> >>
> > Is there any plan to extend it to capture the relation such as
> > "Researcher1" and "Researcher2" are two different entities and they're
> > mentioned in a research paper published by both of them?
>
> This is more about putting three entities (researcher1, researcher2, the
> research paper) in context with each other. Cristian Petroaca is doing
> some work on this, but there is nothing ready to be used ATM. You might
> be interested in STANBOL-1121 and maybe
> http://markmail.org/message/3fqdprc7nsjgaz3t for more background
> information.
>
> best
> Rupert
>
>
> On Tue, Jun 2, 2015 at 6:06 PM, Rajan Shah <ra...@gmail.com> wrote:
> > Cool. Thanks a lot for the quick reply.
> >
> > Yes, it works very well.
> >
> > With best regards,
> > Rajan
> >
> > On Tue, Jun 2, 2015 at 10:57 AM, ajs6f@virginia.edu <aj...@virginia.edu>
> > wrote:
> >
> >> On Jun 2, 2015, at 10:54 AM, Rajan Shah <ra...@gmail.com> wrote:
> >>
> >> > In this case, is it fair to assume that one needs to have both of
> these
> >> > yards?
> >> >
> >> > a. Solr yard for fast search
> >> > b. Clerezza yard for dereference
> >> >
> >> > Is this the optimal way to use stanbol NER and leverage full
> potential?
> >>
> >> If your entity definitions are relatively simple (no bnodes, no
> "internal
> >> structure", just predicates with simple values) you can dereference them
> >> perfectly well from a SolrYard.
> >>
> >>
> >> ---
> >> A. Soroka
> >> The University of Virginia Library
> >>
> >>
> >>
>
>
>

Re: Stanbol NER questions

Posted by Rupert Westenthaler <ru...@gmail.com>.

Hi Rajan,

regarding dereferencing:

For small and medium-sized datasets, using the SolrYard for both is the
way to go. For big datasets (e.g. DBpedia) you can still use the
SolrYard, but the SolrCore will be much bigger than the TripleStore.
This is because Solr stores documents (stored fields) while the
TripleStore stores a graph. So, for example, if your dataset contains
200k entities typed as dbpedia:Person, the SolrYard would store the URI
"dbpedia:Person" 200k times, while the TripleStore stores it just a
single time. So although Solr does (by default) compress stored fields,
it will still be less efficient for storage if your dataset contains
a lot of URI values. If your dataset uses mainly literal values, this
does not apply.
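
A rough back-of-the-envelope sketch of this difference, in Java. It only
counts the raw bytes of the repeated value and ignores Solr's compression as
well as whatever per-reference overhead a triple store has, so treat the
numbers as an illustration, not a measurement. The class and method names
are made up for this example.

```java
import java.nio.charset.StandardCharsets;

class UriStorageSketch {

    // Raw bytes needed when every document stores its own copy of the value.
    static long bytesWhenRepeated(String value, long documents) {
        return documents * value.getBytes(StandardCharsets.UTF_8).length;
    }

    // Raw bytes needed when the value is stored once and only referenced.
    static long bytesWhenStoredOnce(String value) {
        return value.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String type = "dbpedia:Person";
        long entities = 200_000L;
        System.out.println("repeated per document: " + bytesWhenRepeated(type, entities) + " bytes");
        System.out.println("stored a single time:  " + bytesWhenStoredOnce(type) + " bytes");
    }
}
```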

On the other hand: Solr is amazingly fast for dereferencing ^^

regarding Entity co-mention

>> >
>> > *3. Entity co-mention:*
>> >
>> > From the documentation, it's not crystal clear that how this engine
>> works?
>> > Is it possible to provide a quick concrete example in couple lines?
>> >
>> > Does it require two entities live in same solr index or namespace?
>>
>> IMO the example
>>
>>     ... Barack Obama gave a talk to members of the Labor Union ...
>> Obama specially mentioned ...
>>
>> describes it well. Because "Barack Obama" has already been mentioned,
>> the later "Obama" is treated as a co-mention. The engine builds an
>> index over the mentions of previous fise:TextAnnotations. It only works
>> on data already present in the ContentItem. It does not require the CV
>> to be in any specific storage (e.g. the Entityhub).
>>
>>
> Is there any plan to extend it to capture the relation such as
> "Researcher1" and "Researcher2" are two different entities and they're
> mentioned in a research paper published by both of them?

This is more about putting three entities (researcher1, researcher2, the
research paper) in context with each other. Cristian Petroaca is doing
some work on this, but there is nothing ready to be used ATM. You might
be interested in STANBOL-1121 and maybe
http://markmail.org/message/3fqdprc7nsjgaz3t for more background
information.

best
Rupert

On Tue, Jun 2, 2015 at 6:06 PM, Rajan Shah <ra...@gmail.com> wrote:
> Cool. Thanks a lot for the quick reply.
>
> Yes, it works very well.
>
> With best regards,
> Rajan
>
> On Tue, Jun 2, 2015 at 10:57 AM, ajs6f@virginia.edu <aj...@virginia.edu>
> wrote:
>
>> On Jun 2, 2015, at 10:54 AM, Rajan Shah <ra...@gmail.com> wrote:
>>
>> > In this case, is it fair to assume that one needs to have both of these
>> > yards?
>> >
>> > a. Solr yard for fast search
>> > b. Clerezza yard for dereference
>> >
>> > Is this the optimal way to use stanbol NER and leverage full potential?
>>
>> If your entity definitions are relatively simple (no bnodes, no "internal
>> structure", just predicates with simple values) you can dereference them
>> perfectly well from a SolrYard.
>>
>>
>> ---
>> A. Soroka
>> The University of Virginia Library
>>
>>
>>


Re: Stanbol NER questions

Posted by Rajan Shah <ra...@gmail.com>.

Cool. Thanks a lot for the quick reply.

Yes, it works very well.

With best regards,
Rajan

On Tue, Jun 2, 2015 at 10:57 AM, ajs6f@virginia.edu <aj...@virginia.edu>
wrote:

> On Jun 2, 2015, at 10:54 AM, Rajan Shah <ra...@gmail.com> wrote:
>
> > In this case, is it fair to assume that one needs to have both of these
> > yards?
> >
> > a. Solr yard for fast search
> > b. Clerezza yard for dereference
> >
> > Is this the optimal way to use stanbol NER and leverage full potential?
>
> If your entity definitions are relatively simple (no bnodes, no "internal
> structure", just predicates with simple values) you can dereference them
> perfectly well from a SolrYard.
>
>
> ---
> A. Soroka
> The University of Virginia Library
>
>
>

Re: Stanbol NER questions

Posted by "ajs6f@virginia.edu" <aj...@virginia.edu>.

On Jun 2, 2015, at 10:54 AM, Rajan Shah <ra...@gmail.com> wrote:

> In this case, is it fair to assume that one needs to have both of these
> yards?
> 
> a. Solr yard for fast search
> b. Clerezza yard for dereference
> 
> Is this the optimal way to use stanbol NER and leverage full potential?

If your entity definitions are relatively simple (no bnodes, no "internal structure", just predicates with simple values) you can dereference them perfectly well from a SolrYard.

---
A. Soroka
The University of Virginia Library

Re: Stanbol NER questions

Posted by Rajan Shah <ra...@gmail.com>.

Hi Rupert,

Thanks a lot for your detailed answer.

Some quick follow-up questions.

On Tue, Jun 2, 2015 at 10:32 AM, Rupert Westenthaler <
rupert.westenthaler@gmail.com> wrote:

> Hi Rajan,
>
> On Mon, Jun 1, 2015 at 5:27 PM, Rajan Shah <ra...@gmail.com> wrote:
> >
> > *1. Same As:*
> >
> > How can I configure "Same As" within the stanbol framework?
> >
> > For ex.
> >
> > JP Morgan Chase is same as J.P. Morgan
> >
>
> If you manage this in your own Controlled Vocabulary (CV), you would
> just add a preferred label and 0..n alternate labels. If you have an
> existing CV that uses owl:sameAs relations, you can convert them while
> indexing the CV with the Entityhub indexing tool (similar to what is
> done in the sHealth example), or collect them while dereferencing and
> process them on the client side.
>
>
> > *2. Entity Recognition:*
> >
> > Suppose that, entity person has four properties.
> >
> > a. Name
> > b. Title
> > c. Address
> > d. Company
> >
> > When NER performs, it only brings one with the match. Suppose, I want to
> > retrieve all properties associated with entity to enhancer's front-end or
> > Graph (without additional second query) - is it possible?
>
> Have a look at the Dereference Engines
>
> http://stanbol.apache.org/docs/trunk/components/enhancer/engines/list#dereference-entities
>
>
In this case, is it fair to assume that one needs to have both of these
yards?

a. Solr yard for fast search
b. Clerezza yard for dereference

Is this the optimal way to use Stanbol NER and leverage its full potential? In
a Referenced Site, I see that there is a searcher implementation. Could
someone provide some pointers on the real benefits of using such an
implementation?


> >
> > *3. Entity co-mention:*
> >
> > From the documentation, it's not crystal clear that how this engine
> works?
> > Is it possible to provide a quick concrete example in couple lines?
> >
> > Does it require two entities live in same solr index or namespace?
>
> IMO the example
>
>     ... Barack Obama gave a talk to members of the Labor Union ...
> Obama specially mentioned ...
>
> describes it well. Because "Barack Obama" has already been mentioned,
> the later "Obama" is treated as a co-mention. The engine builds an
> index over the mentions of previous fise:TextAnnotations. It only works
> on data already present in the ContentItem. It does not require the CV
> to be in any specific storage (e.g. the Entityhub).
>
>
Is there any plan to extend it to capture the relation such as
"Researcher1" and "Researcher2" are two different entities and they're
mentioned in a research paper published by both of them?


> >
> > *4. Sentence Detection:*
> >
> > Is it possible to provide an example configuration or a pointer which
> > describes key features? Also, within a sentence if there are two usage of
> > same word say
> >
> > a. Same person detection
> >
> > 1. Mr. Smith - first sentence
> > 2. Smith's - following sentence
> >
> > Is it possible to recognize that both sentences are from the same person
> > using Stanbol?
>
> There is currently no such engine. Cristian was working to extend the
> Stanbol NLP API to support dependency trees and co-reference. He also
> extended the Stanbol Stanford NLP integration to support those
> features. However, there is no engine supporting those features on the
> Stanbol side.
>
> >
> > b. Sentence pattern detection based on language grammar
> >
>
> no
>
> > Does it allow to detect sentences based on language grammar?
>
> no
>
> best
> Rupert
>

Re: Stanbol NER questions

Posted by Rupert Westenthaler <ru...@gmail.com>.

Hi Rajan,

On Mon, Jun 1, 2015 at 5:27 PM, Rajan Shah <ra...@gmail.com> wrote:
>
> *1. Same As:*
>
> How can I configure "Same As" within the stanbol framework?
>
> For ex.
>
> JP Morgan Chase is same as J.P. Morgan
>

If you manage this in your own Controlled Vocabulary (CV), you would
just add a preferred label and 0..n alternate labels. If you have an
existing CV that uses owl:sameAs relations, you can convert them while
indexing the CV with the Entityhub indexing tool (similar to what is
done in the sHealth example), or collect them while dereferencing and
process them on the client side.
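
A minimal sketch of the preferred/alternate label idea, in Java. This is not
Stanbol code; the class, the method names and the urn:example URI are
hypothetical and only illustrate how several labels can resolve to a single
entity.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class LabelIndexSketch {

    // Maps a normalized label to the URI of the entity it names.
    private final Map<String, String> labelToEntity = new HashMap<>();

    // Registers one entity under its preferred label and all alternate labels.
    public void addEntity(String uri, String preferredLabel, List<String> alternateLabels) {
        labelToEntity.put(normalize(preferredLabel), uri);
        for (String alternate : alternateLabels) {
            labelToEntity.put(normalize(alternate), uri);
        }
    }

    // Returns the entity URI for a mention, or null if no label matches.
    public String lookup(String mention) {
        return labelToEntity.get(normalize(mention));
    }

    // Case-insensitive comparison, ignoring surrounding whitespace.
    private static String normalize(String label) {
        return label.trim().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        LabelIndexSketch index = new LabelIndexSketch();
        index.addEntity("urn:example:jpmorgan", "JP Morgan Chase",
                Arrays.asList("J.P. Morgan"));
        // Both spellings resolve to the same entity URI.
        System.out.println(index.lookup("JP Morgan Chase"));
        System.out.println(index.lookup("J.P. Morgan"));
    }
}
```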

> *2. Entity Recognition:*
>
> Suppose that, entity person has four properties.
>
> a. Name
> b. Title
> c. Address
> d. Company
>
> When NER performs, it only brings one with the match. Suppose, I want to
> retrieve all properties associated with entity to enhancer's front-end or
> Graph (without additional second query) - is it possible?

Have a look at the Dereference Engines
http://stanbol.apache.org/docs/trunk/components/enhancer/engines/list#dereference-entities

>
> *3. Entity co-mention:*
>
> From the documentation, it's not crystal clear that how this engine works?
> Is it possible to provide a quick concrete example in couple lines?
>
> Does it require two entities live in same solr index or namespace?

IMO the example

    ... Barack Obama gave a talk to members of the Labor Union ...
Obama specially mentioned ...

describes it well. Because "Barack Obama" has already been mentioned,
the later "Obama" is treated as a co-mention. The engine builds an
index over the mentions of previous fise:TextAnnotations. It only works
on data already present in the ContentItem. It does not require the CV
to be in any specific storage (e.g. the Entityhub).
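
A minimal sketch of the idea in Java, not the engine's actual
implementation. The matching rule used here (a later single-token mention
that equals one token of an earlier multi-token mention) is a simplifying
assumption, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class CoMentionSketch {

    // Returns the earlier, longer mention that the later mention refers to,
    // or null if none of the previous mentions contains it as a token.
    static String resolve(List<String> previousMentions, String mention) {
        for (String previous : previousMentions) {
            List<String> tokens = Arrays.asList(previous.split("\\s+"));
            if (tokens.size() > 1 && tokens.contains(mention)) {
                return previous;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Mentions already found earlier in the same text.
        List<String> seen = new ArrayList<>();
        seen.add("Barack Obama");
        seen.add("Labor Union");
        // The later, shorter mention is linked back to the earlier one.
        System.out.println(resolve(seen, "Obama"));
    }
}
```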

>
> *4. Sentence Detection:*
>
> Is it possible to provide an example configuration or a pointer which
> describes key features? Also, within a sentence if there are two usage of
> same word say
>
> a. Same person detection
>
> 1. Mr. Smith - first sentence
> 2. Smith's - following sentence
>
> Is it possible to recognize that both sentences are from the same person
> using Stanbol?

There is currently no such engine. Cristian was working to extend the
Stanbol NLP API to support dependency trees and co-reference. He also
extended the Stanbol Stanford NLP integration to support those
features. However, there is no engine supporting those features on the
Stanbol side.

>
> b. Sentence pattern detection based on language grammar
>

no

> Does it allow to detect sentences based on language grammar?

no

best
Rupert
