Posted to dev@stanbol.apache.org by Cristian Petroaca <cr...@gmail.com> on 2014/02/04 09:50:53 UTC

Re: Named entity coref resolution based on dbpedia categories and rdf:type

Back with a more detailed description of the steps for making this kind of
coreference work.

I will be using references to the following text in the steps below in
order to make things clearer : "Microsoft posted its 2013 earnings. The
software company made a huge profit."

1. For every noun phrase in the text which has:
    a. a determiner pos which implies a reference to an entity local to the
text (such as "the, this, these"), but not "another, every", etc., which
imply a reference to an entity outside of the text.
    b. at least one other noun aside from the main required noun which
further describes it. For example I will not count "The company" as a
legitimate candidate since this could create a lot of false positives by
considering the double meaning of some words, such as "in the company of
good people".
"The software company" is a good candidate since we also have "software".

2. Match the nouns in the noun phrase against the contents of the dbpedia
categories of each named entity found prior to the location of the noun
phrase in the text.
The dbpedia categories are in the following format (for Microsoft, for
example): "Software companies of the United States".
So we try to match "software company" with that.
First, as you can see, the main noun in the dbpedia category has a plural
form, and it's the same for all categories which I saw. I don't know if
there's an easier way to do this, but I thought of applying a lemmatizer on
the category and the noun phrase in order for them to have a common
denominator. This also works if the noun phrase itself has a plural form.

Second, I'll need to use for comparison only the words in the category
which are themselves nouns, and not prepositions or determiners such as "of
the". This means that I need to pos tag the categories' contents as well.
I was thinking of running the pos tagger and lemmatizer on the dbpedia
categories when building the dbpedia backed entity hub and storing the
results for later use - I don't know how feasible this is at the moment.

After this I can compare each noun in the noun phrase with the equivalent
nouns in the categories and, based on the number of matches, create a
confidence level.

3. Match the noun of the noun phrase with the rdf:type from dbpedia of the
named entity. If this matches, increase the confidence level.

4. If there are multiple named entities which can match a certain noun
phrase then link the noun phrase with the closest named entity prior to it
in the text.
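
A minimal, self-contained sketch of what the matching in steps 2-4 could look like (plain
Java, no Stanbol APIs; the lemmatize() helper is only a stand-in for a real lemmatizer and
the 0.25 type bonus is an arbitrary illustrative value):

import java.util.*;

public class CategoryMatchSketch {

    /** Placeholder lemmatizer; a real implementation would reuse the lemmas
     *  produced by the NLP processing chain. */
    static String lemmatize(String word) {
        String w = word.toLowerCase();
        if (w.endsWith("ies")) return w.substring(0, w.length() - 3) + "y";
        if (w.endsWith("s")) return w.substring(0, w.length() - 1);
        return w;
    }

    /** Step 2: fraction of noun-phrase nouns found (as lemmas) in the category label,
     *  plus a bonus (step 3) if the head noun matches the rdf:type label. */
    static double confidence(List<String> phraseNouns, String categoryLabel, String typeLabel) {
        Set<String> categoryLemmas = new HashSet<>();
        for (String token : categoryLabel.split("\\s+")) {
            categoryLemmas.add(lemmatize(token)); // ideally only the pos-tagged noun tokens
        }
        int matches = 0;
        for (String noun : phraseNouns) {
            if (categoryLemmas.contains(lemmatize(noun))) matches++;
        }
        double conf = (double) matches / phraseNouns.size();
        String headNoun = phraseNouns.get(phraseNouns.size() - 1); // assume the last noun is the head
        if (lemmatize(headNoun).equals(lemmatize(typeLabel))) {
            conf = Math.min(1.0, conf + 0.25);
        }
        return conf;
    }

    public static void main(String[] args) {
        // "The software company" against the Microsoft category and type from the example above;
        // step 4 would then pick the closest preceding named entity above some threshold.
        double c = confidence(Arrays.asList("software", "company"),
                "Software companies of the United States", "Company");
        System.out.println("confidence = " + c);
    }
}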

What do you think?

Cristian

2014-01-31 Cristian Petroaca <cr...@gmail.com>:

> Hi Rafa,
>
> I don't yet have a concrete heuristic but I'm working on it. I'll provide
> it here so that you guys can give me feedback on it.
>
> What are "locality" features?
>
> I looked at Bart and other coref tools such as ArkRef and CherryPicker and
> they don't provide such a coreference.
>
> Cristian
>
>
> 2014-01-30 Rafa Haro <rh...@apache.org>:
>
> Hi Cristian,
>>
>> Without having more details about your concrete heuristic, in my honest
>> opinion, such an approach could produce a lot of false positives. I don't know
>> if you are planning to use some "locality" features to detect such
>> coreferences, but you need to take into account that it is quite usual for
>> coreferenced mentions to occur even in different paragraphs. Although I'm
>> not an expert in Natural Language Understanding, I would say it is quite
>> difficult to get decent precision/recall rates for coreferencing using
>> fixed rules. Maybe you can give a try to other tools like BART (
>> http://www.bart-coref.org/).
>>
>> Cheers,
>> Rafa Haro
>>
>> El 30/01/14 10:33, Cristian Petroaca escribió:
>>
>>  Hi,
>>>
>>> One of the necessary steps for implementing the Event extraction Engine
>>> feature : https://issues.apache.org/jira/browse/STANBOL-1121 is to have
>>> coreference resolution in the given text. This is provided now via the
>>> stanford-nlp project, but as far as I saw this module is performing mostly
>>> pronominal (He, She) or nominal (Barack Obama and Mr. Obama) coreference
>>> resolution.
>>>
>>> In order to get more coreferences from the text I thought of creating some
>>> logic that would detect this kind of coreference:
>>> "Apple reaches new profit heights. The software company just announced
>>> its 2013 earnings."
>>> Here "The software company" obviously refers to "Apple".
>>> So I'd like to detect coreferences of Named Entities where the noun phrase
>>> matches the rdf:type of the Named Entity, in this case "company", and also
>>> has attributes which can be found in the dbpedia categories of the named
>>> entity, in this case "software".
>>>
>>> The detection of coreferences such as "The software company" in the text
>>> would also be done by either using the new Pos Tag Based Phrase
>>> extraction
>>> Engine (noun phrases) or by using a dependency tree of the sentence and
>>> picking up only subjects or objects.
>>>
>>> At this point I'd like to know if this kind of logic would be useful as a
>>> separate Enhancement Engine (in case the precision and recall are good
>>> enough) in Stanbol?
>>>
>>> Thanks,
>>> Cristian
>>>
>>>
>>
>

Re: Named entity coref resolution based on dbpedia categories and rdf:type

Posted by Rupert Westenthaler <ru...@gmail.com>.
Hi Cristian,

I see several possible solutions:

1. The indexing tool does support LDPath. That means you can import
all the required RDF files and use LDPath to append the labels of the
Yago Types directly to the dbpedia entities.  This would prevent
additional lookups to retrieve the types, but also increase the size
of the index a lot.
2. You could also index the Yago Types and use an additional Entityhub
lookup to retrieve them. In this case you should first collect all
types referenced by Entities in the processed text and in a second
step retrieve the labels. While this means additional lookups, it will
only load the labels for a type once. In addition you could use a
cache for types.
3. Your engine could use LDPath to retrieve the types. This would
require indexing the data as in option (2) and using an LDPath
statement similar to (1). It would be the slowest solution (as it
requires an additional lookup for every extracted entity) but requires
the least code.
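
For illustration, an LDPath mapping along the lines of (1)/(3) might look roughly like the
following (the field name is arbitrary, and this simple rdf:type path would collect the labels
of all types, not only the Yago classes, so an additional filter would be needed in practice):

    typeLabels = <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>/<http://www.w3.org/2000/01/rdf-schema#label> :: xsd:string ;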


regarding:

On Wed, May 7, 2014 at 9:02 AM, Cristian Petroaca
<cr...@gmail.com> wrote:
> I can get the labels from one of the yago downloads here :
> http://www.mpi-inf.mpg.de/yago-naga/yago/download/yago/yagoMultilingualClassLabels.txt.
> I'll need another yago download file to map the yago wordnet classes to
> dbpedia uris. That could be done via a script maybe.

I hope there is also an RDF file with those labels. In that case you
just need to add it to the resource/rdfdata directory.

best
Rupert



-- 
| Rupert Westenthaler             rupert.westenthaler@gmail.com
| Bodenlehenstraße 11                              ++43-699-11108907
| A-5500 Bischofshofen
| REDLINK.CO ..........................................................................
| http://redlink.co/

Re: Named entity coref resolution based on dbpedia categories and rdf:type

Posted by Cristian Petroaca <cr...@gmail.com>.
Hi Rupert,

Looking into the yago_types.nt file which assigns yago classes to dbpedia
entities, I realized that there are no yago class labels present; I just
have the class uri, like <http://dbpedia/..something../President1829302/.
I also need the class labels so that I can compare them to the noun token's
string from the text.

I can get the labels from one of the yago downloads here :
http://www.mpi-inf.mpg.de/yago-naga/yago/download/yago/yagoMultilingualClassLabels.txt.
I'll need another yago download file to map the yago wordnet classes to
dbpedia uris. That could be done via a script maybe.
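
A rough sketch of what such a join script could look like (plain Java; the tab-separated
layout of the two input files, their names, and the column positions are assumptions that
would need to be checked against the actual YAGO downloads):

import java.io.*;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.*;

public class YagoLabelJoin {
    public static void main(String[] args) throws IOException {
        // yago class -> label (assumed: one mapping per line, tab separated)
        Map<String, String> labels = new HashMap<>();
        for (String line : Files.readAllLines(Paths.get("yagoClassLabels.tsv"), StandardCharsets.UTF_8)) {
            String[] cols = line.split("\t");
            if (cols.length >= 2) {
                labels.put(cols[0], cols[1]);
            }
        }
        // yago class -> dbpedia class uri (assumed layout); join and write rdfs:label
        // triples so the result could be dropped into the indexing rdfdata directory
        try (BufferedWriter out = Files.newBufferedWriter(
                Paths.get("dbpedia_yago_class_labels.nt"), StandardCharsets.UTF_8)) {
            for (String line : Files.readAllLines(Paths.get("yagoDBpediaClasses.tsv"), StandardCharsets.UTF_8)) {
                String[] cols = line.split("\t");
                if (cols.length >= 2 && labels.containsKey(cols[0])) {
                    out.write("<" + cols[1] + "> <http://www.w3.org/2000/01/rdf-schema#label> \""
                            + labels.get(cols[0]).replace("\"", "\\\"") + "\" .");
                    out.newLine();
                }
            }
        }
    }
}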

Once I have the dbpedia_yago_class_uri -> label file is it possible to
integrate this data in the dbpedia index and later be able to query the
labels from the 'dbpedia' Site? As you can see the file won't refer to the
actual dbpedia entity but to the yago class as being the subject in the
triple. So how would that work in the dbpedia indexing process? What should
I change in the mappings.txt file? Briefly looking through the mappings.txt
file this looks similar to how the skos categories are indexed.

Other than that, I saw that someone will be working on integrating YAGO as
part of Gsoc 2014. So maybe waiting for that is an option too but I don't
know what the extent of the integration will be.

Thanks,
Cristi


2014-04-29 11:06 GMT+03:00 Cristian Petroaca <cr...@gmail.com>:

> Ok, I think I kind of figured it out. If I want to use the dbpedia data
> index I need to use the SiteManager to get the Site with id = "dbpedia".
> Then I can query the Site directly.
>
> I have some additional questions though :
> 1. In my particular case I want to be able to also get the yago class of
> the given entity. These properties come with yago-types.nt file from
> dbpedia and this file is not present in the entityhub dbpedia data fetch
> scripts here :
> https://svn.apache.org/repos/asf/stanbol/trunk/entityhub/indexing/dbpedia/dbpedia-3.8/fetch_data_en_int.sh.
> Also this file comes with dbpedia 3.9. This means that I need to rebuild
> the dbpedia index data with 3.9 and the new yago-types.nt file. Is this
> correct?
>
> 2. I also need to be able to get some specific dbpedia properties from the
> index, such as dbpedia-owl:locationCity and others for a given entity. At
> the moment these are not available when doing a query on the dbpedia Site.
> I suppose I need to place them in
> https://svn.apache.org/repos/asf/stanbol/trunk/entityhub/indexing/dbpedia/src/main/resources/indexing/config/mappings.txt
> and do a rebuild of the dbpedia index?
>
> Thanks.
> Cristian
>
>
> 2014-04-28 16:58 GMT+03:00 Cristian Petroaca <cr...@gmail.com>
> :
>
>> Hi,
>>
>> I've started to implement the dbpedia properties logic and I'd like to
>> get some feedback on some things that I am doing :
>> I want to get a NER from the text and search for it in the dbpedia data
>> so that I can get certain dbpedia properties.
>> The way I'm trying to do this is by getting the NER_ANNOTATION chunk's
>> text and searching for that in the Entityhub (which from what I saw is by default
>> configured with dbpedia data). I haven't yet performed a query to actually
>> get the data, but before I continue I'd like to ask if this is the way to go?
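
For reference, a rough sketch of such a lookup. The Entityhub class, package and method
names below (SiteManager, Site, FieldQuery, TextConstraint, findEntities) are written from
memory as assumptions and would need to be checked against the current servicesapi; the
@Reference injection assumes the engine runs as an OSGi component:

import org.apache.felix.scr.annotations.Reference;
import org.apache.stanbol.entityhub.servicesapi.model.Entity;
import org.apache.stanbol.entityhub.servicesapi.query.FieldQuery;
import org.apache.stanbol.entityhub.servicesapi.query.QueryResultList;
import org.apache.stanbol.entityhub.servicesapi.query.TextConstraint;
import org.apache.stanbol.entityhub.servicesapi.site.Site;
import org.apache.stanbol.entityhub.servicesapi.site.SiteManager;

public class DbpediaLookupSketch {

    @Reference
    private SiteManager siteManager; // injected by the OSGi runtime

    /** Looks up the NER surface form (e.g. "Microsoft") on the 'dbpedia' Site
     *  and returns the matching entities, so their properties can be inspected. */
    QueryResultList<Entity> lookup(String nerText) {
        Site dbpedia = siteManager.getSite("dbpedia");
        FieldQuery query = dbpedia.getQueryFactory().createFieldQuery();
        query.setConstraint("http://www.w3.org/2000/01/rdf-schema#label",
                new TextConstraint(nerText));
        query.setLimit(5);
        return dbpedia.findEntities(query);
    }
}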
>>
>> Thanks,
>> Cristian
>>
>>
>> 2014-03-28 15:12 GMT+02:00 Cristian Petroaca <cristian.petroaca@gmail.com
>> >:
>>
>>> Examples :
>>>
>>> 1. Group membership :
>>>     a. Spatial membership :
>>>
>>>         "Microsoft announced its 2013 earnings. <coref>The Redmond-based
>>> company</coref> made huge profits."
>>>
>>>     b. Organisational membership :
>>>
>>>        "Mick Jagger started a new solo album. <coref>The Rolling Stones
>>> singer</coref> did not say what the theme will be."
>>>
>>> 2. Functional membership :
>>>
>>>    "Allianz announced its 2013 earnings. <coref>The financial services
>>> company</coref> made a huge profit."
>>>
>>> 3.  If no matches were found for the current NER with the rules from above,
>>> then if the yago:class which matched has more than 2 nouns we also
>>> consider this a good co-reference, but with a lower confidence maybe.
>>>
>>>    "Boris Becker will take part in a demonstrative tennis match.
>>> <coref>The former tennis player</coref> will play again after 10 years."
>>>
>>>
>>> 2014-03-28 12:22 GMT+02:00 Rupert Westenthaler <
>>> rupert.westenthaler@gmail.com>:
>>>
>>>> Hi Cristian, all
>>>>
>>>> Looks good to me, but I am not sure if I got everything. If you could
>>>> provide example texts where those rules apply it would make it much
>>>> easier to understand.
>>>>
>>>> Instead of using dbpedia properties you should define your own domain
>>>> model (ontology). You can then align the dbpedia properties to your
>>>> model. This will allow applying this approach also to knowledge
>>>> bases other than dbpedia.
>>>>
>>>> For people new to this thread: The above message adds to the
>>>> suggestion first made by Cristian on 4th February. Also the following
>>>> 4 messages (until 7th Feb) provide additional context.
>>>>
>>>> best
>>>> Rupert
>>>>
>>>>
>>>> On Fri, Mar 28, 2014 at 9:23 AM, Cristian Petroaca
>>>> <cr...@gmail.com> wrote:
>>>> > Hi guys,
>>>> >
>>>> > After Rupert's last suggestions related to this enhancement engine I
>>>> > devised a more comprehensive algorithm for matching the noun phrases
>>>> > against the NER properties. Please take a look and let me know what you
>>>> > think. Thanks.
>>>> >
>>>> > The following rules will be applied to every noun phrase in order to find
>>>> > co-references:
>>>> >
>>>> > 1. For each NER prior to the current noun phrase in the text match the
>>>> > yago:class label to the contents of the noun phrase.
>>>> >
>>>> > For the NERs which have a yago:class which matches, apply:
>>>> >
>>>> > 2. Group membership rules :
>>>> >
>>>> >     a. spatial membership : the NER is part of a Location. If the noun
>>>> > phrase contains a LOCATION or a demonym then check any location properties
>>>> > of the matching NER.
>>>> >
>>>> >     If the matching NER is a :
>>>> >     - person, match against :birthPlace, :region, :nationality
>>>> >     - organisation, match against :foundationPlace, :locationCity,
>>>> > :location, :hometown
>>>> >     - place, match against :country, :subdivisionName, :location
>>>> >
>>>> >     Ex: The Italian President, The Redmond-based company
>>>> >
>>>> >     b. organisational membership : the NER is part of an Organisation. If
>>>> > the noun phrase contains an ORGANISATION then check the following
>>>> > properties of the matching NER:
>>>> >
>>>> >     If the matching NER is :
>>>> >     - person, match against :occupation, :associatedActs
>>>> >     - organisation ?
>>>> >     - location ?
>>>> >
>>>> > Ex: The Microsoft executive, The Pink Floyd singer
>>>> >
>>>> > 3. Functional description rule: the noun phrase describes what the NER does
>>>> > conceptually.
>>>> > If there are no NERs in the noun phrase then match the following properties
>>>> > of the matching NER to the contents of the noun phrase (aside from the
>>>> > nouns which are part of the yago:class) :
>>>> >
>>>> >    If the NER is a:
>>>> >    - person ?
>>>> >    - organisation, match against :service, :industry, :genre
>>>> >    - location ?
>>>> >
>>>> > Ex: The software company.
>>>> >
>>>> > 4. If no matches were found for the current NER with rules 2 or 3 then, if
>>>> > the yago:class which matched has more than 2 nouns, we also consider
>>>> > this a good co-reference but with a lower confidence maybe.
>>>> >
>>>> > Ex: The former tennis player, the theoretical physicist.
>>>> >
>>>> > 5. Based on the number of nouns which matched we create a confidence level.
>>>> > The number of matched nouns cannot be lower than 2 and we must have a
>>>> > yago:class match.
>>>> >
>>>> > For all NERs which got to this point, select the closest ones in the text
>>>> > to the noun phrase which matched against the same properties (yago:class
>>>> > and dbpedia) and mark them as co-references.
>>>> >
>>>> > Note: all noun phrases need to be lemmatized before all of this in case
>>>> > there are any plurals.
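
To make rule 2a above concrete, a minimal plain-Java sketch of the per-type property
selection; the property local names are the ones listed in the rules, and resolving them
against the dbpedia-owl namespace (as well as the person/organisation/place type detection)
is left out:

import java.util.*;

public class SpatialMembershipProperties {

    // dbpedia properties to check when the noun phrase contains a LOCATION or a demonym,
    // keyed by the type of the matching NER (rule 2a)
    static final Map<String, List<String>> SPATIAL_PROPS = new HashMap<>();
    static {
        SPATIAL_PROPS.put("person", Arrays.asList("birthPlace", "region", "nationality"));
        SPATIAL_PROPS.put("organisation",
                Arrays.asList("foundationPlace", "locationCity", "location", "hometown"));
        SPATIAL_PROPS.put("place", Arrays.asList("country", "subdivisionName", "location"));
    }

    /** Returns the property local names whose values should be compared against
     *  the LOCATION / demonym found in the noun phrase. */
    static List<String> propertiesFor(String nerType) {
        return SPATIAL_PROPS.getOrDefault(nerType, Collections.emptyList());
    }

    public static void main(String[] args) {
        // "The Redmond-based company" -> the matching NER "Microsoft" is an organisation
        System.out.println(propertiesFor("organisation"));
    }
}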
>>>> >
>>>> >
>>>> > 2014-03-25 20:50 GMT+02:00 Cristian Petroaca <cristian.petroaca@gmail.com>:
>>>> >
>>>> >> That worked. Thanks.
>>>> >>
>>>> >> So, there are no exceptions during the startup of the launcher.
>>>> >> The component tab in the felix console shows 6 WeightedChains the first
>>>> >> time, including the default one but after my changes and a restart there
>>>> >> are only 5 - the default one is missing altogether.
>>>> >>
>>>> >>
>>>> >> 2014-03-24 20:18 GMT+02:00 Rupert Westenthaler <
>>>> >> rupert.westenthaler@gmail.com>:
>>>> >>
>>>> >> Hi Cristian,
>>>> >>>
>>>> >>> I do see the same problem since last Friday. The solution as mentioned
>>>> >>> by [1] works for me.
>>>> >>>
>>>> >>>     mvn -Djsse.enableSNIExtension=false {goals}
>>>> >>>
>>>> >>> No idea why https connections to github do currently cause this. I
>>>> >>> could not find anything related via Google. So I suggest to use the
>>>> >>> system property for now. If this persists for longer we can adapt the
>>>> >>> build files accordingly.
>>>> >>>
>>>> >>> best
>>>> >>> Rupert
>>>> >>>
>>>> >>>
>>>> >>>
>>>> >>>
>>>> >>> [1]
>>>> >>>
>>>> http://stackoverflow.com/questions/7615645/ssl-handshake-alert-unrecognized-name-error-since-upgrade-to-java-1-7-0
>>>> >>>
>>>> >>> On Mon, Mar 24, 2014 at 7:01 PM, Cristian Petroaca
>>>> >>> <cr...@gmail.com> wrote:
>>>> >>> > I did a clean on the whole project and now I wanted to do another "mvn
>>>> >>> > clean install" but I am getting this :
>>>> >>> >
>>>> >>> > "[INFO]
>>>> >>> > ------------------------------------------------------------------------
>>>> >>> > [ERROR] Failed to execute goal
>>>> >>> > org.apache.maven.plugins:maven-antrun-plugin:1.6:run (download) on project
>>>> >>> > org.apache.stanbol.data.opennlp.lang.es: An Ant BuildException has occured:
>>>> >>> > The following error occurred while executing this line:
>>>> >>> > [ERROR] C:\Data\Projects\Stanbol\main\data\opennlp\lang\es\download_models.xml:33:
>>>> >>> > Failed to copy
>>>> >>> > https://github.com/utcompling/OpenNLP-Models/raw/58ef0c60031403e66e47ae35edaf58d3478b67af/models/es/opennlp-es-maxent-pos-es.bin
>>>> >>> > to C:\Data\Projects\Stanbol\main\data\opennlp\lang\es\downloads\resources\org\apache\stanbol\data\opennlp\es-pos-maxent.bin
>>>> >>> > due to javax.net.ssl.SSLProtocolException handshake alert : unrecognized_name"
>>>> >>> >
>>>> >>> >
>>>> >>> >
>>>> >>> > 2014-03-20 11:25 GMT+02:00 Rupert Westenthaler <
>>>> >>> > rupert.westenthaler@gmail.com>:
>>>> >>> >
>>>> >>> >> Hi Cristian,
>>>> >>> >>
>>>> >>> >> On Thu, Mar 20, 2014 at 10:00 AM, Cristian Petroaca
>>>> >>> >> <cr...@gmail.com> wrote:
>>>> >>> >> >
>>>> >>> >>
>>>> >>>
>>>> stanbol.enhancer.chain.weighted.chain=["tika;optional","langdetect","opennlp-sentence","opennlp-token","opennlp-pos","opennlp-ner","dbpediaLinking","entityhubExtraction","dbpedia-dereference","pos-chunker"]
>>>> >>> >> > service.ranking=I"-2147483648"
>>>> >>> >> > stanbol.enhancer.chain.name="default"
>>>> >>> >>
>>>> >>> >> Does look fine to me. Do you see any exception during the startup of
>>>> >>> >> the launcher. Can you check the status of this component in the
>>>> >>> >> component tab of the felix web console [1] (search for
>>>> >>> >> "org.apache.stanbol.enhancer.chain.weighted.impl.WeightedChain"). If
>>>> >>> >> you have multiple you can find the correct one by comparing the
>>>> >>> >> "Properties" with those in the configuration file.
>>>> >>> >>
>>>> >>> >> I guess that the according service is in the 'unsatisfied' state as
>>>> >>> >> you do not see it in the web interface. But if this is the case you
>>>> >>> >> should also see the according exception in the log. You can also
>>>> >>> >> manually stop/start the component. In this case the exception should be
>>>> >>> >> re-thrown and you do not need to search the log for it.
>>>> >>> >>
>>>> >>> >> best
>>>> >>> >> Rupert
>>>> >>> >>
>>>> >>> >>
>>>> >>> >> [1] http://localhost:8080/system/console/components
>>>> >>> >>
>>>> >>> >> >
>>>> >>> >> >
>>>> >>> >> >
>>>> >>> >> > 2014-03-20 7:39 GMT+02:00 Rupert Westenthaler <
>>>> >>> >> rupert.westenthaler@gmail.com
>>>> >>> >> >>:
>>>> >>> >> >
>>>> >>> >> >> Hi Cristian,
>>>> >>> >> >>
>>>> >>> >> >> you can not send attachments to the list. Please copy the contents
>>>> >>> >> >> directly to the mail
>>>> >>> >> >>
>>>> >>> >> >> thx
>>>> >>> >> >> Rupert
>>>> >>> >> >>
>>>> >>> >> >> On Wed, Mar 19, 2014 at 9:20 PM, Cristian Petroaca
>>>> >>> >> >> <cr...@gmail.com> wrote:
>>>> >>> >> >> > The config attached.
>>>> >>> >> >> >
>>>> >>> >> >> >
>>>> >>> >> >> > 2014-03-19 9:09 GMT+02:00 Rupert Westenthaler
>>>> >>> >> >> > <ru...@gmail.com>:
>>>> >>> >> >> >
>>>> >>> >> >> >> Hi Cristian,
>>>> >>> >> >> >>
>>>> >>> >> >> >> can you provide the contents of the chain after your modifications?
>>>> >>> >> >> >> Would be interesting to test why the chain is no longer active after
>>>> >>> >> >> >> the restart.
>>>> >>> >> >> >>
>>>> >>> >> >> >> You can find the config file in the 'stanbol/fileinstall' folder.
>>>> >>> >> >> >>
>>>> >>> >> >> >> best
>>>> >>> >> >> >> Rupert
>>>> >>> >> >> >>
>>>> >>> >> >> >> On Tue, Mar 18, 2014 at 8:24 PM, Cristian Petroaca
>>>> >>> >> >> >> <cr...@gmail.com> wrote:
>>>> >>> >> >> >> > Related to the default chain selection rules : before restart I had a
>>>> >>> >> >> >> > chain with the name 'default' as in I could access it via
>>>> >>> >> >> >> > enhancer/chain/default.
>>>> >>> >> >> >> > Then I just added another engine to the 'default' chain. I assumed that
>>>> >>> >> >> >> > after the restart the chain with the 'default' name would be persisted.
>>>> >>> >> >> >> > So the first rule should have been applied after the restart as well. But
>>>> >>> >> >> >> > instead I cannot reach it via enhancer/chain/default anymore so it's gone.
>>>> >>> >> >> >> > Anyway, this is not a big deal, it's not blocking me in any way, I just
>>>> >>> >> >> >> > wanted to understand where the problem is.
>>>> >>> >> >> >> >
>>>> >>> >> >> >> >
>>>> >>> >> >> >> > 2014-03-18 7:15 GMT+02:00 Rupert Westenthaler
>>>> >>> >> >> >> > <rupert.westenthaler@gmail.com>:
>>>> >>> >> >> >> >
>>>> >>> >> >> >> >> Hi Cristian
>>>> >>> >> >> >> >>
>>>> >>> >> >> >> >> On Mon, Mar 17, 2014 at 9:43 PM, Cristian Petroaca
>>>> >>> >> >> >> >> <cr...@gmail.com> wrote:
>>>> >>> >> >> >> >> > 1. Updated to the latest code and it's gone. Cool
>>>> >>> >> >> >> >> >
>>>> >>> >> >> >> >> > 2. I start the stable launcher -> create a new instance of the
>>>> >>> >> >> >> >> > PosChunkerEngine -> add it to the default chain. At this point
>>>> >>> >> >> >> >> > everything looks good and works ok.
>>>> >>> >> >> >> >> > After I restart the server the default chain is gone and instead I see
>>>> >>> >> >> >> >> > this in the enhancement chains page : all-active (default, id: 149,
>>>> >>> >> >> >> >> > ranking: 0, impl: AllActiveEnginesChain ). all-active did not contain
>>>> >>> >> >> >> >> > the 'default' word before the restart.
>>>> >>> >> >> >> >> >
>>>> >>> >> >> >> >>
>>>> >>> >> >> >> >> Please note the default chain selection rules as described at [1]. You
>>>> >>> >> >> >> >> can also access chains under '/enhancer/chain/{chain-name}'
>>>> >>> >> >> >> >>
>>>> >>> >> >> >> >> best
>>>> >>> >> >> >> >> Rupert
>>>> >>> >> >> >> >>
>>>> >>> >> >> >> >> [1]
>>>> >>> >> >> >> >> http://stanbol.staging.apache.org/docs/trunk/components/enhancer/chains/#default-chain
>>>> >>> >> >> >> >>
>>>> >>> >> >> >> >> > It looks like the config files are exactly what I need. Thanks.
>>>> >>> >> >> >> >> >
>>>> >>> >> >> >> >> >
>>>> >>> >> >> >> >> > 2014-03-17 9:26 GMT+02:00 Rupert Westenthaler <
>>>> >>> >> >> >> >> > rupert.westenthaler@gmail.com>:
>>>> >>> >> >> >> >> >
>>>> >>> >> >> >> >> >> On Sat, Mar 15, 2014 at 8:34 PM, Cristian Petroaca
>>>> >>> >> >> >> >> >> <cr...@gmail.com> wrote:
>>>> >>> >> >> >> >> >> > Thanks Rupert.
>>>> >>> >> >> >> >> >> >
>>>> >>> >> >> >> >> >> > A couple more questions/issues :
>>>> >>> >> >> >> >> >> >
>>>> >>> >> >> >> >> >> > 1. Whenever I start the stanbol server I'm seeing this in the console
>>>> >>> >> >> >> >> >> > output :
>>>> >>> >> >> >> >> >> >
>>>> >>> >> >> >> >> >>
>>>> >>> >> >> >> >> >> This should be fixed with STANBOL-1278 [1] [2]
>>>> >>> >> >> >> >> >>
>>>> >>> >> >> >> >> >> > 2. Whenever I restart the server the Weighted Chains get messed up. I
>>>> >>> >> >> >> >> >> > usually use the 'default' chain and add my engine to it so there are 11
>>>> >>> >> >> >> >> >> > engines in it. After the restart this chain now contains around 23
>>>> >>> >> >> >> >> >> > engines in total.
>>>> >>> >> >> >> >> >>
>>>> >>> >> >> >> >> >> I was not able to replicate this. What I tried was
>>>> >>> >> >> >> >> >>
>>>> >>> >> >> >> >> >> (1) start up the stable launcher
>>>> >>> >> >> >> >> >> (2) add an additional engine to the default chain
>>>> >>> >> >> >> >> >> (3) restart the launcher
>>>> >>> >> >> >> >> >>
>>>> >>> >> >> >> >> >> The default chain was not changed after (2) and (3). So I would need
>>>> >>> >> >> >> >> >> further information for knowing why this is happening.
>>>> >>> >> >> >> >> >>
>>>> >>> >> >> >> >> >> Generally it is better to create your own chain instance instead of
>>>> >>> >> >> >> >> >> modifying one that is provided by the default configuration. I would also
>>>> >>> >> >> >> >> >> recommend that you keep your test configuration in text files and
>>>> >>> >> >> >> >> >> copy those to the 'stanbol/fileinstall' folder. Doing so prevents you
>>>> >>> >> >> >> >> >> from manually entering the configuration after a software update. The
>>>> >>> >> >> >> >> >> production-mode section [3] provides information on how to do that.
>>>> >>> >> >> >> >> >>
>>>> >>> >> >> >> >> >> best
>>>> >>> >> >> >> >> >> Rupert
>>>> >>> >> >> >> >> >>
>>>> >>> >> >> >> >> >> [1] https://issues.apache.org/jira/browse/STANBOL-1278
>>>> >>> >> >> >> >> >> [2] http://svn.apache.org/r1576623
>>>> >>> >> >> >> >> >> [3] http://stanbol.apache.org/docs/trunk/production-mode
>>>> >>> >> >> >> >> >>
>>>> >>> >> >> >> >> >> > ERROR: Bundle org.apache.stanbol.enhancer.engine.topic.web [153]: Error
>>>> >>> >> >> >> >> >> > starting
>>>> >>> >> >> >> >> >> > slinginstall:c:\Data\Projects\Stanbol\main\launchers\stable\target\stanbol\startup\35\org.apache.stanbol.enhancer.engine.topic.web-1.0.0-SNAPSHOT.jar
>>>> >>> >> >> >> >> >> > (org.osgi.framework.BundleException: Unresolved constraint in bundle
>>>> >>> >> >> >> >> >> > org.apache.stanbol.enhancer.engine.topic.web [153]: Unable to resolve
>>>> >>> >> >> >> >> >> > 153.0: missing requirement [153.0] package; (&(package=javax.ws.rs)(version>=0.0.0)(!(version>=2.0.0))))
>>>> >>> >> >> >> >> >> > org.osgi.framework.BundleException: Unresolved constraint in bundle
>>>> >>> >> >> >> >> >> > org.apache.stanbol.enhancer.engine.topic.web [153]: Unable to resolve
>>>> >>> >> >> >> >> >> > 153.0: missing requirement [153.0] package; (&(package=javax.ws.rs)(version>=0.0.0)(!(version>=2.0.0)))
>>>> >>> >> >> >> >> >> >         at org.apache.felix.framework.Felix.resolveBundle(Felix.java:3443)
>>>> >>> >> >> >> >> >> >         at org.apache.felix.framework.Felix.startBundle(Felix.java:1727)
>>>> >>> >> >> >> >> >> >         at org.apache.felix.framework.Felix.setActiveStartLevel(Felix.java:1156)
>>>> >>> >> >> >> >> >> >         at org.apache.felix.framework.StartLevelImpl.run(StartLevelImpl.java:264)
>>>> >>> >> >> >> >> >> >         at java.lang.Thread.run(Unknown Source)
>>>> >>> >> >> >> >> >> >
>>>> >>> >> >> >> >> >> > Despite this the server starts fine and I can use the enhancer fine. Do
>>>> >>> >> >> >> >> >> > you guys see this as well?
>>>> >>> >> >> >> >> >> >
>>>> >>> >> >> >> >> >> >
>>>> >>> >> >> >> >> >> > 2. Whenever I restart the server the Weighted Chains get messed up. I
>>>> >>> >> >> >> >> >> > usually use the 'default' chain and add my engine to it so there are 11
>>>> >>> >> >> >> >> >> > engines in it. After the restart this chain now contains around 23
>>>> >>> >> >> >> >> >> > engines in total.
>>>> >>> >> >> >> >> >> >
>>>> >>> >> >> >> >> >> >
>>>> >>> >> >> >> >> >> >
>>>> >>> >> >> >> >> >> >
>>>> >>> >> >> >> >> >> > 2014-03-11 9:47 GMT+02:00 Rupert Westenthaler <
>>>> >>> >> >> >> >> >> > rupert.westenthaler@gmail.com>:
>>>> >>> >> >> >> >> >> >
>>>> >>> >> >> >> >> >> >> Hi Cristian,
>>>> >>> >> >> >> >> >> >>
>>>> >>> >> >> >> >> >> >> NER Annotations are typically available as both
>>>> >>> >> >> >> >> >> >> NlpAnnotations.NER_ANNOTATION and fise:TextAnnotation [1] in the
>>>> >>> >> >> >> >> >> >> enhancement metadata. As you are already accessing the AnalyzedText I
>>>> >>> >> >> >> >> >> >> would prefer using the NlpAnnotations.NER_ANNOTATION.
>>>> >>> >> >> >> >> >> >>
>>>> >>> >> >> >> >> >> >> best
>>>> >>> >> >> >> >> >> >> Rupert
>>>> >>> >> >> >> >> >> >>
>>>> >>> >> >> >> >> >> >> [1]
>>>> >>> >> >> >> >> >> >> http://stanbol.apache.org/docs/trunk/components/enhancer/enhancementstructure.html#fisetextannotation
>>>> >>> >> >> >> >> >> >>
>>>> >>> >> >> >> >> >> >> On Mon, Mar 10, 2014 at 10:07 PM, Cristian Petroaca
>>>> >>> >> >> >> >> >> >> <cr...@gmail.com> wrote:
>>>> >>> >> >> >> >> >> >> > Thanks.
>>>> >>> >> >> >> >> >> >> > I assume I should get the Named entities using the same but with
>>>> >>> >> >> >> >> >> >> > NlpAnnotations.NER_ANNOTATION?
>>>> >>> >> >> >> >> >> >> >
>>>> >>> >> >> >> >> >> >> >
>>>> >>> >> >> >> >> >> >> > 2014-03-10 13:29 GMT+02:00 Rupert Westenthaler <
>>>> >>> >> >> >> >> >> >> > rupert.westenthaler@gmail.com>:
>>>> >>> >> >> >> >> >> >> >
>>>> >>> >> >> >> >> >> >> >> Hallo Cristian,
>>>> >>> >> >> >> >> >> >> >>
>>>> >>> >> >> >> >> >> >> >> NounPhrases are not added to the RDF enhancement results. You need to
>>>> >>> >> >> >> >> >> >> >> use the AnalyzedText ContentPart [1]
>>>> >>> >> >> >> >> >> >> >>
>>>> >>> >> >> >> >> >> >> >> here is some demo code you can use in the computeEnhancement method
>>>> >>> >> >> >> >> >> >> >>
>>>> >>> >> >> >> >> >> >> >>     AnalysedText at = NlpEngineHelper.getAnalysedText(this, ci, true);
>>>> >>> >> >> >> >> >> >> >>     Iterator<? extends Section> sections = at.getSentences();
>>>> >>> >> >> >> >> >> >> >>     if(!sections.hasNext()){ //process as single sentence
>>>> >>> >> >> >> >> >> >> >>         sections = Collections.singleton(at).iterator();
>>>> >>> >> >> >> >> >> >> >>     }
>>>> >>> >> >> >> >> >> >> >>
>>>> >>> >> >> >> >> >> >> >>     while(sections.hasNext()){
>>>> >>> >> >> >> >> >> >> >>         Section section = sections.next();
>>>> >>> >> >> >> >> >> >> >>         Iterator<Span> chunks =
>>>> >>> >> >> >> >> >> >> >>             section.getEnclosed(EnumSet.of(SpanTypeEnum.Chunk));
>>>> >>> >> >> >> >> >> >> >>         while(chunks.hasNext()){
>>>> >>> >> >> >> >> >> >> >>             Span chunk = chunks.next();
>>>> >>> >> >> >> >> >> >> >>             Value<PhraseTag> phrase =
>>>> >>> >> >> >> >> >> >> >>                 chunk.getAnnotation(NlpAnnotations.PHRASE_ANNOTATION);
>>>> >>> >> >> >> >> >> >> >>             if(phrase.value().getCategory() == LexicalCategory.Noun){
>>>> >>> >> >> >> >> >> >> >>                 log.info(" - NounPhrase [{},{}] {}", new Object[]{
>>>> >>> >> >> >> >> >> >> >>                         chunk.getStart(),chunk.getEnd(),chunk.getSpan()});
>>>> >>> >> >> >> >> >> >> >>             }
>>>> >>> >> >> >> >> >> >> >>         }
>>>> >>> >> >> >> >> >> >> >>     }
>>>> >>> >> >> >> >> >> >> >>
>>>> >>> >> >> >> >> >> >> >> hope this helps
>>>> >>> >> >> >> >> >> >> >>
>>>> >>> >> >> >> >> >> >> >> best
>>>> >>> >> >> >> >> >> >> >> Rupert
>>>> >>> >> >> >> >> >> >> >>
>>>> >>> >> >> >> >> >> >> >> [1]
>>>> >>> >> >> >> >> >> >> >> http://stanbol.apache.org/docs/trunk/components/enhancer/nlp/analyzedtext
>>>> >>> >> >> >> >> >> >> >>
>>>> >>> >> >> >> >> >> >> >> On Sun, Mar 9, 2014 at 6:07 PM, Cristian Petroaca
>>>> >>> >> >> >> >> >> >> >> <cr...@gmail.com> wrote:
>>>> >>> >> >> >> >> >> >> >> > I started to implement the engine and I'm having problems with getting
>>>> >>> >> >> >> >> >> >> >> > results for noun phrases. I modified the "default" weighted chain to also
>>>> >>> >> >> >> >> >> >> >> > include the PosChunkerEngine and ran a sample text : "Angela Merkel visited
>>>> >>> >> >> >> >> >> >> >> > China. The German chancellor met with various people". I expected that the
>>>> >>> >> >> >> >> >> >> >> > RDF XML output would contain some info about the noun phrases but I cannot
>>>> >>> >> >> >> >> >> >> >> > see any.
>>>> >>> >> >> >> >> >> >> >> > Could you point me to the correct way to generate the noun phrases?
>>>> >>> >> >> >> >> >> >> >> >
>>>> >>> >> >> >> >> >> >> >> > Thanks,
>>>> >>> >> >> >> >> >> >> >> > Cristian
>>>> >>> >> >> >> >> >> >> >> >
>>>> >>> >> >> >> >> >> >> >> >
>>>> >>> >> >> >> >> >> >> >> > 2014-02-09 14:15 GMT+02:00 Cristian Petroaca <
>>>> >>> >> >> >> >> >> >> >> > cristian.petroaca@gmail.com>:
>>>> >>> >> >> >> >> >> >> >> >
>>>> >>> >> >> >> >> >> >> >> >> Opened https://issues.apache.org/jira/browse/STANBOL-1279
>>>> >>> >> >> >> >> >> >> >> >>
>>>> >>> >> >> >> >> >> >> >> >>
>>>> >>> >> >> >> >> >> >> >> >> 2014-02-07 10:53 GMT+02:00 Cristian Petroaca <
>>>> >>> >> >> >> >> >> >> >> >> cristian.petroaca@gmail.com>:
>>>> >>> >> >> >> >> >> >> >> >>
>>>> >>> >> >> >> >> >> >> >> >>> Hi Rupert,
>>>> >>> >> >> >> >> >> >> >> >>>
>>>> >>> >> >> >> >> >> >> >> >>> The "spatial" dimension is a good idea. I'll also take a look at Yago.
>>>> >>> >> >> >> >> >> >> >> >>>
>>>> >>> >> >> >> >> >> >> >> >>> I will create a Jira with what we talked about here. It will probably
>>>> >>> >> >> >> >> >> >> >> >>> have just a draft-like description for now and will be updated as I go
>>>> >>> >> >> >> >> >> >> >> >>> along.
>>>> >>> >> >> >> >> >> >> >> >>>
>>>> >>> >> >> >> >> >> >> >> >>> Thanks,
>>>> >>> >> >> >> >> >> >> >> >>> Cristian
>>>> >>> >> >> >> >> >> >> >> >>>
>>>> >>> >> >> >> >> >> >> >> >>>
>>>> >>> >> >> >> >> >> >> >> >>> 2014-02-06 15:39 GMT+02:00 Rupert Westenthaler <
>>>> >>> >> >> >> >> >> >> >> >>> rupert.westenthaler@gmail.com>:
>>>> >>> >> >> >> >> >> >> >> >>>
>>>> >>> >> >> >> >> >> >> >> >>>> Hi Cristian,
>>>> >>> >> >> >> >> >> >> >> >>>>
>>>> >>> >> >> >> >> >> >> >> >>>> definitely an interesting approach. You should have a look at Yago2
>>>> >>> >> >> >> >> >> >> >> >>>> [1]. As far as I can remember the Yago taxonomy is much better
>>>> >>> >> >> >> >> >> >> >> >>>> structured than the one used by dbpedia. Mapping suggestions of dbpedia
>>>> >>> >> >> >> >> >> >> >> >>>> to concepts in Yago2 is easy as both dbpedia and yago2 do provide
>>>> >>> >> >> >> >> >> >> >> >>>> mappings [2] and [3]
>>>> >>> >> >> >> >> >> >> >> >>>>
>>>> >>> >> >> >> >> >> >> >> >>>> > 2014-02-05 15:39 GMT+02:00 Rafa Haro <rh...@apache.org>:
>>>> >>> >> >> >> >> >> >> >> >>>> >>
>>>> >>> >> >> >> >> >> >> >> >>>> >> "Microsoft posted its 2013 earnings. The Redmond's company made a
>>>> >>> >> >> >> >> >> >> >> >>>> >> huge profit".
>>>> >>> >> >> >> >> >> >> >> >>>>
>>>> >>> >> >> >> >> >> >> >> >>>> That's actually a very good example. Spatial contexts are very
>>>> >>> >> >> >> >> >> >> >> >>>> important as they tend to be often used for referencing. So I would
>>>> >>> >> >> >> >> >> >> >> >>>> suggest to specially treat the spatial context. For spatial Entities
>>>> >>> >> >> >> >> >> >> >> >>>> (like a City) this is easy, but even for others (like a Person,
>>>> >>> >> >> >> >> >> >> >> >>>> Company) you could use relations to spatial entities to define their
>>>> >>> >> >> >> >> >> >> >> >>>> spatial context. This context could then be used to correctly link
>>>> >>> >> >> >> >> >> >> >> >>>> "The Redmond's company" to "Microsoft".
>>>> >>> >> >> >> >> >> >> >> >>>>
>>>> >>> >> >> >> >> >> >> >> >>>> In addition I would suggest to use the "spatial" context of each
>>>> >>> >> >> >> >> >> >> >> >>>> entity (basically relations to entities that are cities, regions,
>>>> >>> >> >> >> >> >> >> >> >>>> countries) as a separate dimension, because those are very often used
>>>> >>> >> >> >> >> >> >> >> >>>> for coreferences.
>>>> >>> >> >> >> >> >> >> >> >>>>
>>>> >>> >> >> >> >> >> >> >> >>>> [1] http://www.mpi-inf.mpg.de/yago-naga/yago/
>>>> >>> >> >> >> >> >> >> >> >>>> [2] http://downloads.dbpedia.org/3.9/links/yago_links.nt.bz2
>>>> >>> >> >> >> >> >> >> >> >>>> [3] http://www.mpi-inf.mpg.de/yago-naga/yago/download/yago/yagoDBpediaInstances.ttl.7z
>>>> >>> >> >> >> >> >> >> >> >>>>
>>>> >>> >> >> >> >> >> >> >> >>>>
>>>> >>> >> >> >> >> >> >> >> >>>> On Thu, Feb 6, 2014 at 10:33 AM, Cristian Petroaca
>>>> >>> >> >> >> >> >> >> >> >>>> <cr...@gmail.com> wrote:
>>>> >>> >> >> >> >> >> >> >> >>>> > There are several dbpedia categories for each entity, in this case for
>>>> >>> >> >> >> >> >> >> >> >>>> > Microsoft we have :
>>>> >>> >> >> >> >> >> >> >> >>>> >
>>>> >>> >> >> >> >> >> >> >> >>>> > category:Companies_in_the_NASDAQ-100_Index
>>>> >>> >> >> >> >> >> >> >> >>>> > category:Microsoft
>>>> >>> >> >> >> >> >> >> >> >>>> > category:Software_companies_of_the_United_States
>>>> >>> >> >> >> >> >> >> >> >>>> > category:Software_companies_based_in_Washington_(state)
>>>> >>> >> >> >> >> >> >> >> >>>> > category:Companies_established_in_1975
>>>> >>> >> >> >> >> >> >> >> >>>> > category:1975_establishments_in_the_United_States
>>>> >>> >> >> >> >> >> >> >> >>>> > category:Companies_based_in_Redmond,_Washington
>>>> >>> >> >> >> >> >> >> >> >>>> > category:Multinational_companies_headquartered_in_the_United_States
>>>> >>> >> >> >> >> >> >> >> >>>> > category:Cloud_computing_providers
>>>> >>> >> >> >> >> >> >> >> >>>> > category:Companies_in_the_Dow_Jones_Industrial_Average
>>>> >>> >> >> >> >> >> >> >> >>>> >
>>>> >>> >> >> >> >> >> >> >> >>>> > So we also have "Companies based in Redmond, Washington" which could be
>>>> >>> >> >> >> >> >> >> >> >>>> > matched.
>>>> >>> >> >> >> >> >> >> >> >>>> >
>>>> >>> >> >> >> >> >> >> >> >>>> > There is still other contextual information from dbpedia which can be
>>>> >>> >> >> >> >> >> >> >> >>>> > used. For example for an Organization we could also include :
>>>> >>> >> >> >> >> >> >> >> >>>> > dbpprop:industry = Software
>>>> >>> >> >> >> >> >> >> >> >>>> > dbpprop:service = Online Service Providers
>>>> >>> >> >> >> >> >> >> >> >>>> >
>>>> >>> >> >> >> >> >> >> >> >>>> > and for a Person (that's for Barack Obama) :
>>>> >>> >> >> >> >> >> >> >> >>>> >
>>>> >>> >> >> >> >> >> >> >> >>>> > dbpedia-owl:profession:
>>>> >>> >> >> >> >> >> >> >> >>>> >                dbpedia:Author
>>>> >>> >> >> >> >> >> >> >> >>>> >                dbpedia:Constitutional_law
>>>> >>> >> >> >> >> >> >> >> >>>> >                dbpedia:Lawyer
>>>> >>> >> >> >> >> >> >> >> >>>> >                dbpedia:Community_organizing
>>>> >>> >> >> >> >> >> >> >> >>>> >
>>>> >>> >> >> >> >> >> >> >> >>>> > I'd like to continue investigating this as I think that it may have some
>>>> >>> >> >> >> >> >> >> >> >>>> > value in increasing the number of coreference resolutions and I'd like to
>>>> >>> >> >> >> >> >> >> >> >>>> > concentrate more on precision rather than recall since we already have a
>>>> >>> >> >> >> >> >> >> >> >>>> > set of coreferences detected by the stanford nlp tool and this would be as
>>>> >>> >> >> >> >> >> >> >> >>>> > an addition to that (at least this is how I would like to use it).
>>>> >>> >> >> >> >> >> >> >> >>>> >
>>>> >>> >> >> >> >> >> >> >> >>>> > Is it ok if I track this by opening a jira? I could update it to show my
>>>> >>> >> >> >> >> >> >> >> >>>> > progress and also my conclusions, and if it turns out that it was a bad
>>>> >>> >> >> >> >> >> >> >> >>>> > idea then that's the situation; at least I'll end up with more knowledge
>>>> >>> >> >> >> >> >> >> >> >>>> > about Stanbol in the end :).
>>>> >>> >> >> >> >> >> >> >> >>>> >
>>>> >>> >> >> >> >> >> >> >> >>>> >
>>>> >>> >> >> >> >> >> >> >> >>>> > 2014-02-05 15:39 GMT+02:00 Rafa Haro <rh...@apache.org>:
>>>> >>> >> >> >> >> >> >> >> >>>> >
>>>> >>> >> >> >> >> >> >> >> >>>> >> Hi Cristian,
>>>> >>> >> >> >> >> >> >> >> >>>> >>
>>>> >>> >> >> >> >> >> >> >> >>>> >> The approach sounds nice. I don't want to be the devil's advocate but
>>>> >>> >> >> >> >> >> >> >> >>>> >> I'm just not sure about the recall using the dbpedia categories feature.
>>>> >>> >> >> >> >> >> >> >> >>>> >> For example, your sentence could also be "Microsoft posted its 2013
>>>> >>> >> >> >> >> >> >> >> >>>> >> earnings. The Redmond's company made a huge profit". So, maybe including
>>>> >>> >> >> >> >> >> >> >> >>>> >> more contextual information from dbpedia could increase the recall but
>>>> >>> >> >> >> >> >> >> >> >>>> >> of course will reduce the precision.
>>>> >>> >> >> >> >> >> >> >> >>>> >>
>>>> >>> >> >> >> >> >> >> >> >>>> >> Cheers,
>>>> >>> >> >> >> >> >> >> >> >>>> >> Rafa
>>>> >>> >> >> >> >> >> >> >> >>>> >>
>>>> >>> >> the
>>>> >>> >> >> >> >> >> >> >> >>>> >>>>>> Event
>>>> >>> >> >> >> >> >> >> >> extraction
>>>> >>> >> >> >> >> >> >> >> >>>> Engine
>>>> >>> >> >> >> >> >> >> >> >>>> >>>>>> feature :
>>>> >>> >> >> >> >> >> >>
>>>> https://issues.apache.org/jira/browse/STANBOL-1121is
>>>> >>> >> >> >> >> >> >> >> >>>> to
>>>> >>> >> >> >> >> >> >> >> >>>> >>>>>> have
>>>> >>> >> >> >> >> >> >> >> >>>> >>>>>> coreference resolution in the
>>>> given text.
>>>> >>> >> This
>>>> >>> >> >> is
>>>> >>> >> >> >> >> >> provided
>>>> >>> >> >> >> >> >> >> now
>>>> >>> >> >> >> >> >> >> >> >>>> via the
>>>> >>> >> >> >> >> >> >> >> >>>> >>>>>> stanford-nlp project but as far as
>>>> I saw
>>>> >>> this
>>>> >>> >> >> >> >> >> >> >> >>>> >>>>>> module
>>>> >>> >> >> >> >> is
>>>> >>> >> >> >> >> >> >> >> performing
>>>> >>> >> >> >> >> >> >> >> >>>> >>>>>> mostly
>>>> >>> >> >> >> >> >> >> >> >>>> >>>>>> pronomial (He, She) or nominal
>>>> (Barack
>>>> >>> Obama
>>>> >>> >> and
>>>> >>> >> >> >> >> >> >> >> >>>> >>>>>> Mr.
>>>> >>> >> >> >> >> >> Obama)
>>>> >>> >> >> >> >> >> >> >> >>>> coreference
>>>> >>> >> >> >> >> >> >> >> >>>> >>>>>> resolution.
>>>> >>> >> >> >> >> >> >> >> >>>> >>>>>>
>>>> >>> >> >> >> >> >> >> >> >>>> >>>>>> In order to get more coreferences
>>>> from
>>>> >>> the
>>>> >>> >> text
>>>> >>> >> >> I
>>>> >>> >> >> >> >> though
>>>> >>> >> >> >> >> >> of
>>>> >>> >> >> >> >> >> >> >> >>>> creating
>>>> >>> >> >> >> >> >> >> >> >>>> >>>>>> some
>>>> >>> >> >> >> >> >> >> >> >>>> >>>>>> logic that would detect this kind
>>>> of
>>>> >>> >> >> coreference :
>>>> >>> >> >> >> >> >> >> >> >>>> >>>>>> "Apple reaches new profit heights.
>>>> The
>>>> >>> >> software
>>>> >>> >> >> >> >> company
>>>> >>> >> >> >> >> >> just
>>>> >>> >> >> >> >> >> >> >> >>>> announced
>>>> >>> >> >> >> >> >> >> >> >>>> >>>>>> its
>>>> >>> >> >> >> >> >> >> >> >>>> >>>>>> 2013 earnings."
>>>> >>> >> >> >> >> >> >> >> >>>> >>>>>> Here "The software company"
>>>> obviously
>>>> >>> refers
>>>> >>> >> to
>>>> >>> >> >> >> >> "Apple".
>>>> >>> >> >> >> >> >> >> >> >>>> >>>>>> So I'd like to detect coreferences
>>>> of
>>>> >>> Named
>>>> >>> >> >> >> >> >> >> >> >>>> >>>>>> Entities
>>>> >>> >> >> >> >> >> which
>>>> >>> >> >> >> >> >> >> are
>>>> >>> >> >> >> >> >> >> >> of
>>>> >>> >> >> >> >> >> >> >> >>>> the
>>>> >>> >> >> >> >> >> >> >> >>>> >>>>>> rdf:type of the Named Entity , in
>>>> this
>>>> >>> case
>>>> >>> >> >> >> >> >> >> >> >>>> >>>>>> "company"
>>>> >>> >> >> >> >> and
>>>> >>> >> >> >> >> >> >> also
>>>> >>> >> >> >> >> >> >> >> >>>> have
>>>> >>> >> >> >> >> >> >> >> >>>> >>>>>> attributes which can be found in
>>>> the
>>>> >>> dbpedia
>>>> >>> >> >> >> >> categories
>>>> >>> >> >> >> >> >> of
>>>> >>> >> >> >> >> >> >> the
>>>> >>> >> >> >> >> >> >> >> >>>> named
>>>> >>> >> >> >> >> >> >> >> >>>> >>>>>> entity, in this case "software".
>>>> >>> >> >> >> >> >> >> >> >>>> >>>>>>
>>>> >>> >> >> >> >> >> >> >> >>>> >>>>>> The detection of coreferences such
>>>> as
>>>> >>> "The
>>>> >>> >> >> >> >> >> >> >> >>>> >>>>>> software
>>>> >>> >> >> >> >> >> >> company" in
>>>> >>> >> >> >> >> >> >> >> >>>> the
>>>> >>> >> >> >> >> >> >> >> >>>> >>>>>> text
>>>> >>> >> >> >> >> >> >> >> >>>> >>>>>> would also be done by either using
>>>> the
>>>> >>> new
>>>> >>> >> Pos
>>>> >>> >> >> Tag
>>>> >>> >> >> >> >> Based
>>>> >>> >> >> >> >> >> >> Phrase
>>>> >>> >> >> >> >> >> >> >> >>>> >>>>>> extraction
>>>> >>> >> >> >> >> >> >> >> >>>> >>>>>> Engine (noun phrases) or by using a
>>>> >>> >> dependency
>>>> >>> >> >> >> >> >> >> >> >>>> >>>>>> tree of
>>>> >>> >> >> >> >> >> the
>>>> >>> >> >> >> >> >> >> >> >>>> sentence and
>>>> >>> >> >> >> >> >> >> >> >>>> >>>>>> picking up only subjects or
>>>> objects.
>>>> >>> >> >> >> >> >> >> >> >>>> >>>>>>
>>>> >>> >> >> >> >> >> >> >> >>>> >>>>>> At this point I'd like to know if
>>>> this
>>>> >>> kind
>>>> >>> >> of
>>>> >>> >> >> >> >> >> >> >> >>>> >>>>>> logic
>>>> >>> >> >> >> >> >> would
>>>> >>> >> >> >> >> >> >> be
>>>> >>> >> >> >> >> >> >> >> >>>> useful
>>>> >>> >> >> >> >> >> >> >> >>>> >>>>>> as a
>>>> >>> >> >> >> >> >> >> >> >>>> >>>>>> separate Enhancement Engine (in
>>>> case the
>>>> >>> >> >> precision
>>>> >>> >> >> >> >> >> >> >> >>>> >>>>>> and
>>>> >>> >> >> >> >> >> >> recall
>>>> >>> >> >> >> >> >> >> >> are
>>>> >>> >> >> >> >> >> >> >> >>>> good
>>>> >>> >> >> >> >> >> >> >> >>>> >>>>>> enough) in Stanbol?
>>>> >>> >> >> >> >> >> >> >> >>>> >>>>>>
>>>> >>> >> >> >> >> >> >> >> >>>> >>>>>> Thanks,
>>>> >>> >> >> >> >> >> >> >> >>>> >>>>>> Cristian
>>>> >>> >> >> >> >> >> >> >> >>>> >>>>>>
>>>> >>> >> >> >> >> >> >> >> >>>> >>>>>>
>>>> >>> >> >> >> >> >> >> >> >>>> >>>>>>
>>>> >>> >> >> >> >> >> >> >> >>>> >>
>>>> >>> >> >> >> >> >> >> >> >>>>
>>>> >>> >> >> >> >> >> >> >> >>>>
>>>> >>> >> >> >> >> >> >> >> >>>>
>>>> >>> >> >> >> >> >> >> >> >>>> --
>>>> >>> >> >> >> >> >> >> >> >>>> | Rupert Westenthaler
>>>> >>> >> >> >> >> rupert.westenthaler@gmail.com
>>>> >>> >> >> >> >> >> >> >> >>>> | Bodenlehenstraße 11
>>>> >>> >> >> >> >> >> >> ++43-699-11108907
>>>> >>> >> >> >> >> >> >> >> >>>> | A-5500 Bischofshofen
>>>> >>> >> >> >> >> >> >> >> >>>>
>>>> >>> >> >> >> >> >> >> >> >>>
>>>> >>> >> >> >> >> >> >> >> >>>
>>>> >>> >> >> >> >> >> >> >> >>
>>>> >>> >> >> >> >> >> >> >>
>>>> >>> >> >> >> >> >> >> >>
>>>> >>> >> >> >> >> >> >> >>
>>>> >>> >> >> >> >> >> >> >> --
>>>> >>> >> >> >> >> >> >> >> | Rupert Westenthaler
>>>> >>> >> >> >> >> >> >> >> rupert.westenthaler@gmail.com
>>>> >>> >> >> >> >> >> >> >> | Bodenlehenstraße 11
>>>> >>> >> >> >> >> ++43-699-11108907
>>>> >>> >> >> >> >> >> >> >> | A-5500 Bischofshofen
>>>> >>> >> >> >> >> >> >> >>
>>>> >>> >> >> >> >> >> >>
>>>> >>> >> >> >> >> >> >>
>>>> >>> >> >> >> >> >> >>
>>>> >>> >> >> >> >> >> >> --
>>>> >>> >> >> >> >> >> >> | Rupert Westenthaler
>>>> >>> >> >> rupert.westenthaler@gmail.com
>>>> >>> >> >> >> >> >> >> | Bodenlehenstraße 11
>>>> >>> >> >> >> >> >> >> ++43-699-11108907
>>>> >>> >> >> >> >> >> >> | A-5500 Bischofshofen
>>>> >>> >> >> >> >> >> >>
>>>> >>> >> >> >> >> >>
>>>> >>> >> >> >> >> >>
>>>> >>> >> >> >> >> >>
>>>> >>> >> >> >> >> >> --
>>>> >>> >> >> >> >> >> | Rupert Westenthaler
>>>> >>> >> rupert.westenthaler@gmail.com
>>>> >>> >> >> >> >> >> | Bodenlehenstraße 11
>>>> >>> >> >> ++43-699-11108907
>>>> >>> >> >> >> >> >> | A-5500 Bischofshofen
>>>> >>> >> >> >> >> >>
>>>> >>> >> >> >> >>
>>>> >>> >> >> >> >>
>>>> >>> >> >> >> >>
>>>> >>> >> >> >> >> --
>>>> >>> >> >> >> >> | Rupert Westenthaler
>>>> >>> rupert.westenthaler@gmail.com
>>>> >>> >> >> >> >> | Bodenlehenstraße 11
>>>> >>> >> ++43-699-11108907
>>>> >>> >> >> >> >> | A-5500 Bischofshofen
>>>> >>> >> >> >> >>
>>>> >>> >> >> >>
>>>> >>> >> >> >>
>>>> >>> >> >> >>
>>>> >>> >> >> >> --
>>>> >>> >> >> >> | Rupert Westenthaler
>>>> rupert.westenthaler@gmail.com
>>>> >>> >> >> >> | Bodenlehenstraße 11
>>>> >>> ++43-699-11108907
>>>> >>> >> >> >> | A-5500 Bischofshofen
>>>> >>> >> >> >
>>>> >>> >> >> >
>>>> >>> >> >>
>>>> >>> >> >>
>>>> >>> >> >>
>>>> >>> >> >> --
>>>> >>> >> >> | Rupert Westenthaler
>>>> rupert.westenthaler@gmail.com
>>>> >>> >> >> | Bodenlehenstraße 11
>>>> ++43-699-11108907
>>>> >>> >> >> | A-5500 Bischofshofen
>>>> >>> >> >>
>>>> >>> >>
>>>> >>> >>
>>>> >>> >>
>>>> >>> >> --
>>>> >>> >> | Rupert Westenthaler             rupert.westenthaler@gmail.com
>>>> >>> >> | Bodenlehenstraße 11
>>>> ++43-699-11108907
>>>> >>> >> | A-5500 Bischofshofen
>>>> >>> >>
>>>> >>>
>>>> >>>
>>>> >>>
>>>> >>> --
>>>> >>> | Rupert Westenthaler             rupert.westenthaler@gmail.com
>>>> >>> | Bodenlehenstraße 11                             ++43-699-11108907
>>>> >>> | A-5500 Bischofshofen
>>>> >>>
>>>> >>
>>>> >>
>>>>
>>>>
>>>>
>>>> --
>>>> | Rupert Westenthaler             rupert.westenthaler@gmail.com
>>>> | Bodenlehenstraße 11                             ++43-699-11108907
>>>> | A-5500 Bischofshofen
>>>>
>>>
>>>
>>
>

Re: Named entity coref resolution based on dbpedia categories and rdf:type

Posted by Cristian Petroaca <cr...@gmail.com>.
Ok, I think I kind of figured it out. If I want to use the dbpedia data
index I need to use the SiteManager to get the Site with id = "dbpedia".
Then I can query the Site directly.
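
For my own reference, the lookup I have in mind looks roughly like this (just a sketch: the method names getSite(), getQueryFactory(), findEntities() and the TextConstraint usage are my assumptions from reading the entityhub servicesapi, so they may need adjusting):

    import org.apache.stanbol.entityhub.servicesapi.model.Entity;
    import org.apache.stanbol.entityhub.servicesapi.query.FieldQuery;
    import org.apache.stanbol.entityhub.servicesapi.query.QueryResultList;
    import org.apache.stanbol.entityhub.servicesapi.query.TextConstraint;
    import org.apache.stanbol.entityhub.servicesapi.site.Site;
    import org.apache.stanbol.entityhub.servicesapi.site.SiteManager;

    // the SiteManager is expected to be injected via OSGi (e.g. an @Reference in the engine)
    public Entity lookupEntity(SiteManager siteManager, String nerSurfaceForm) {
        Site dbpedia = siteManager.getSite("dbpedia");
        FieldQuery query = dbpedia.getQueryFactory().createFieldQuery();
        // constrain on rdfs:label using the surface form of the named entity
        query.setConstraint("http://www.w3.org/2000/01/rdf-schema#label",
                new TextConstraint(nerSurfaceForm));
        query.setLimit(1);
        QueryResultList<Entity> results = dbpedia.findEntities(query);
        return results.isEmpty() ? null : results.iterator().next();
    }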

I have some additional questions though :
1. In my particular case I want to be able to also get the yago class of
the given entity. These properties come with the yago-types.nt file from
dbpedia and this file is not present in the entityhub dbpedia data fetch
scripts here :
https://svn.apache.org/repos/asf/stanbol/trunk/entityhub/indexing/dbpedia/dbpedia-3.8/fetch_data_en_int.sh.
Also this file comes with dbpedia 3.9. This means that I need to rebuild
the dbpedia index data with 3.9 and the new yago-types.nt file. Is this
correct?

2. I also need to be able to get some specific dbpedia properties from the
index, such as dbpedia-owl:locationCity and others for a given entity. At
the moment these are not available when doing a query on the dbpedia Site.
I suppose I need to place them in
https://svn.apache.org/repos/asf/stanbol/trunk/entityhub/indexing/dbpedia/src/main/resources/indexing/config/mappings.txt and
do a rebuild of the dbpedia index?
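
If both rebuilds work, reading the new values from a returned entity should then be simple. A rough sketch (the Representation accessors getReferences()/getReference() and the exact field URIs are assumptions on my side):

    import java.util.Iterator;
    import org.apache.stanbol.entityhub.servicesapi.model.Reference;
    import org.apache.stanbol.entityhub.servicesapi.model.Representation;

    public void printSpatialAndYagoInfo(Representation rep) {
        // dbpedia-owl:locationCity - only available once it is added to mappings.txt
        Iterator<Reference> cities = rep.getReferences("http://dbpedia.org/ontology/locationCity");
        while (cities.hasNext()) {
            System.out.println("locationCity: " + cities.next().getReference());
        }
        // rdf:type values; the yago classes should show up here once yago-types.nt is indexed
        Iterator<Reference> types = rep.getReferences("http://www.w3.org/1999/02/22-rdf-syntax-ns#type");
        while (types.hasNext()) {
            String type = types.next().getReference();
            if (type.startsWith("http://dbpedia.org/class/yago/")) {
                System.out.println("yago class: " + type);
            }
        }
    }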

Thanks.
Cristian


2014-04-28 16:58 GMT+03:00 Cristian Petroaca <cr...@gmail.com>:

> Hi,
>
> I've started to implement the dbpedia properties logic and I'd like to get
> some feedback on some things that I am doing :
> I want to get a NER from the text and search for it in the dbpedia data so
> that I can get certain dbpedia properties.
> The way I'm trying to do this is by getting the NER_ANNOTATION chunk's
> text and searching for that in the Entityhub (which from what I saw is by default
> configured with dbpedia data). I haven't yet performed a query to actually
> get the data but before I continue I'd like to ask if this is the way to go?
>
> Thanks,
> Cristian
>
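
For the NER_ANNOTATION lookup described in the quoted message above, the chunk text could be extracted roughly like this (only a sketch, following the pattern of the AnalysedText demo code; the NerTag accessor names are assumptions on my side):

    // iterate the Chunk spans of the AnalysedText and collect the surface form of each NER
    Iterator<Span> chunks = at.getEnclosed(EnumSet.of(SpanTypeEnum.Chunk));
    while (chunks.hasNext()) {
        Span chunk = chunks.next();
        Value<NerTag> ner = chunk.getAnnotation(NlpAnnotations.NER_ANNOTATION);
        if (ner != null) {
            String surfaceForm = chunk.getSpan();   // e.g. "Microsoft"
            // ner.value().getType() should give the dc:type of the entity if needed
            // surfaceForm is what would be passed to the dbpedia Site query sketched above
        }
    }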
>
> 2014-03-28 15:12 GMT+02:00 Cristian Petroaca <cr...@gmail.com>
> :
>
>> Examples :
>>
>> 1. Group membership :
>>     a. Spatial membership :
>>
>>         "Microsoft anounced its 2013 earnings. <coref>The Richmond-based
>> company</coref> made huge profits."
>>
>>     b. Organisational membership :
>>
>>        "Mick Jagger started a new solo album. <coref>The Rolling Stones
>> singer</coref> did not say what the theme will be."
>>
>> 2. Functional membership :
>>
>>    "Allianz announced its 2013 earnings. <coref>The financial services
>> company</coref> made a huge profit."
>>
>> 3. If no matches were found for the current NER with the rules above, but
>> the yago:class which matched has more than 2 nouns, then we also consider
>> this a good co-reference, though maybe with a lower confidence.
>>
>>    "Boris Becker will take part in a demonstrative tennis match.
>> <coref>The former tennis player</coref> will play again after 10 years."
>>
>>
>> 2014-03-28 12:22 GMT+02:00 Rupert Westenthaler <
>> rupert.westenthaler@gmail.com>:
>>
>>> Hi Cristian, all
>>>
>>> Looks good to me, but I am not sure if I got everything. If you could
>>> provide example texts where those rules apply it would make it much
>>> easier to understand.
>>>
>>> Instead of using dbpedia properties you should define your own domain
>>> model (ontology). You can then align the dbpedia properties to your
>>> model. This will allow it to apply this approach also to knowledge
>>> bases other than dbpedia.
>>>
>>> For people new to this thread: The above message adds to the
>>> suggestion first made by Cristian on 4th February. Also the following
>>> 4 messages (until 7th Feb) provide additional context.
>>>
>>> best
>>> Rupert
>>>
>>>
>>> On Fri, Mar 28, 2014 at 9:23 AM, Cristian Petroaca
>>> <cr...@gmail.com> wrote:
>>> > Hi guys,
>>> >
>>> > After Rupert's last suggestions related to this enhancement engine I
>>> > devised a more comprehensive algorithm for matching the noun phrases
>>> > against the NER properties. Please take a look and let me know what you
>>> > think. Thanks.
>>> >
>>> > The following rules will be applied to every noun phrase in order to
>>> find
>>> > co-references:
>>> >
>>> > 1. For each NER prior to the current noun phrase in the text match the
>>> > yago:class label to the contents of the noun phrase.
>>> >
>>> > For the NERs which have a yago:class which matches, apply:
>>> >
>>> > 2. Group membership rules :
>>> >
>>> >     a. spatial membership : the NER is part of a Location. If the noun
>>> > phrase contains a LOCATION or a demonym then check any location
>>> properties
>>> > of the matching NER.
>>> >
>>> >     If matching NER is a :
>>> >     - person, match against :birthPlace, :region, :nationality
>>> >     - organisation, match against :foundationPlace, :locationCity,
>>> > :location, :hometown
>>> >     - place, match against :country, :subdivisionName, :location,
>>> >
>>> >     Ex: The Italian President, The Redmond-based company
>>> >
>>> >     b. organisational membership : the NER is part of an Organisation.
>>> If
>>> > the noun phrase contains an ORGANISATION then check the following
>>> > properties of the matching NER:
>>> >
>>> >     If matching NER is :
>>> >     - person, match against :occupation, :associatedActs
>>> >     - organisation ?
>>> >     - location ?
>>> >
>>> > Ex: The Microsoft executive, The Pink Floyd singer
>>> >
>>> > 3. Functional description rule: the noun phrase describes what the NER
>>> does
>>> > conceptually.
>>> > If there are no NERs in the noun phrase then match the following
>>> properties
>>> > of the matching NER to the contents of the noun phrase (aside from the
>>> > nouns which are part of the yago:class) :
>>> >
>>> >    If NER is a:
>>> >    - person ?
>>> >    - organisation : match against :service, :industry, :genre
>>> >    - location ?
>>> >
>>> > Ex:  The software company.
>>> >
>>> > 4. If no matches were found for the current NER with rules 2 or 3, but the
>>> > yago:class which matched has more than 2 nouns, then we also consider this
>>> > a good co-reference, though maybe with a lower confidence.
>>> >
>>> > Ex: The former tennis player, the theoretical physicist.
>>> >
>>> > 5. Based on the number of nouns which matched we create a confidence
>>> level.
>>> > The number of matched nouns cannot be lower than 2 and we must have a
>>> > yago:class match.
>>> >
>>> > For all NERs which got to this point, select the closest ones in the
>>> text
>>> > to the noun phrase which matched against the same properties
>>> (yago:class
>>> > and dbpedia) and mark them as co-references.
>>> >
>>> > Note: all noun phrases need to be lemmatized before all of this in case
>>> > there are any plurals.
>>> >
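
To make rule 5 concrete, a rough sketch of the selection step (ScoredCandidate is a hypothetical helper, not an existing Stanbol class; it captures the reading that at least 2 matched nouns plus a yago:class match are required and that the closest preceding NER wins):

    // hypothetical value object for a NER that already passed the yago:class match (rule 1)
    class ScoredCandidate {
        final String entityUri;   // URI of the matched named entity
        final int matchedNouns;   // nouns matched against the yago:class label + dbpedia properties
        final int distance;       // distance (e.g. in tokens) between the NER and the noun phrase
        ScoredCandidate(String entityUri, int matchedNouns, int distance) {
            this.entityUri = entityUri;
            this.matchedNouns = matchedNouns;
            this.distance = distance;
        }
    }

    // rule 5: require at least 2 matched nouns, then prefer the closest preceding NER;
    // the confidence of the created co-reference would grow with matchedNouns
    static ScoredCandidate selectCoreference(java.util.List<ScoredCandidate> candidates) {
        ScoredCandidate best = null;
        for (ScoredCandidate c : candidates) {
            if (c.matchedNouns < 2) {
                continue; // below the minimum evidence required by rule 5
            }
            if (best == null || c.distance < best.distance) {
                best = c;
            }
        }
        return best; // null means no co-reference is created for this noun phrase
    }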
>>> >
>>> > 2014-03-25 20:50 GMT+02:00 Cristian Petroaca <
>>> cristian.petroaca@gmail.com>:
>>> >
>>> >> That worked. Thanks.
>>> >>
>>> >> So, there are no exceptions during the startup of the launcher.
>>> >> The component tab in the felix console shows 6 WeightedChains the
>>> first
>>> >> time, including the default one but after my changes and a restart
>>> there
>>> >> are only 5 - the default one is missing altogether.
>>> >>
>>> >>
>>> >> 2014-03-24 20:18 GMT+02:00 Rupert Westenthaler <
>>> >> rupert.westenthaler@gmail.com>:
>>> >>
>>> >> Hi Cristian,
>>> >>>
>>> >>> I do see the same problem since last Friday. The solution as mentioned
>>> >>> by [1] works for me.
>>> >>>
>>> >>>     mvn -Djsse.enableSNIExtension=false {goals}
>>> >>>
>>> >>> No idea why https connections to github currently cause this. I
>>> >>> could not find anything related via Google. So I suggest to use the
>>> >>> system property for now. If this persists for longer we can adapt the
>>> >>> build files accordingly.
>>> >>>
>>> >>> best
>>> >>> Rupert
>>> >>>
>>> >>>
>>> >>>
>>> >>>
>>> >>> [1]
>>> >>>
>>> http://stackoverflow.com/questions/7615645/ssl-handshake-alert-unrecognized-name-error-since-upgrade-to-java-1-7-0
>>> >>>
>>> >>> On Mon, Mar 24, 2014 at 7:01 PM, Cristian Petroaca
>>> >>> <cr...@gmail.com> wrote:
>>> >>> > I did a clean on the whole project and now I wanted to do another "mvn
>>> >>> > clean install" but I am getting this :
>>> >>> >
>>> >>> > "[INFO]
>>> >>> > ------------------------------------------------------------------------
>>> >>> > [ERROR] Failed to execute goal
>>> >>> > org.apache.maven.plugins:maven-antrun-plugin:1.6:run (download) on project
>>> >>> > org.apache.stanbol.data.opennlp.lang.es: An Ant BuildException has occured:
>>> >>> > The following error occurred while executing this line:
>>> >>> > [ERROR] C:\Data\Projects\Stanbol\main\data\opennlp\lang\es\download_models.xml:33:
>>> >>> > Failed to copy
>>> >>> > https://github.com/utcompling/OpenNLP-Models/raw/58ef0c60031403e66e47ae35edaf58d3478b67af/models/es/opennlp-es-maxent-pos-es.bin
>>> >>> > to C:\Data\Projects\Stanbol\main\data\opennlp\lang\es\downloads\resources\org\apache\stanbol\data\opennlp\es-pos-maxent.bin
>>> >>> > due to javax.net.ssl.SSLProtocolException handshake alert : unrecognized_name"
>>> >>> >
>>> >>> >
>>> >>> >
>>> >>> > 2014-03-20 11:25 GMT+02:00 Rupert Westenthaler <
>>> >>> > rupert.westenthaler@gmail.com>:
>>> >>> >
>>> >>> >> Hi Cristian,
>>> >>> >>
>>> >>> >> On Thu, Mar 20, 2014 at 10:00 AM, Cristian Petroaca
>>> >>> >> <cr...@gmail.com> wrote:
>>> >>> >> >
>>> >>> >>
>>> >>>
>>> stanbol.enhancer.chain.weighted.chain=["tika;optional","langdetect","opennlp-sentence","opennlp-token","opennlp-pos","opennlp-ner","dbpediaLinking","entityhubExtraction","dbpedia-dereference","pos-chunker"]
>>> >>> >> > service.ranking=I"-2147483648"
>>> >>> >> > stanbol.enhancer.chain.name="default"
>>> >>> >>
>>> >>> >> Does look fine to me. Do you see any exception during the startup
>>> of
>>> >>> >> the launcher. Can you check the status of this component in the
>>> >>> >> component tab of the felix web console [1] (search for
>>> >>> >> "org.apache.stanbol.enhancer.chain.weighted.impl.WeightedChain").
>>> If
>>> >>> >> you have multiple you can find the correct one by comparing the
>>> >>> >> "Properties" with those in the configuration file.
>>> >>> >>
>>> >>> >> I guess that the corresponding service is in the 'unsatisfied' state as you
>>> do
>>> >>> >> not see it in the web interface. But if this is the case you
>>> should
>>> >>> >> also see the according exception in the log. You can also manually
>>> >>> >> stop/start the component. In this case the exception should be
>>> >>> >> re-thrown and you do not need to search the log for it.
>>> >>> >>
>>> >>> >> best
>>> >>> >> Rupert
>>> >>> >>
>>> >>> >>
>>> >>> >> [1] http://localhost:8080/system/console/components
>>> >>> >>
>>> >>> >> >
>>> >>> >> >
>>> >>> >> >
>>> >>> >> > 2014-03-20 7:39 GMT+02:00 Rupert Westenthaler <
>>> >>> >> rupert.westenthaler@gmail.com
>>> >>> >> >>:
>>> >>> >> >
>>> >>> >> >> Hi Cristian,
>>> >>> >> >>
>>> >>> >> >> you can not send attachments to the list. Please copy the
>>> contents
>>> >>> >> >> directly to the mail
>>> >>> >> >>
>>> >>> >> >> thx
>>> >>> >> >> Rupert
>>> >>> >> >>
>>> >>> >> >> On Wed, Mar 19, 2014 at 9:20 PM, Cristian Petroaca
>>> >>> >> >> <cr...@gmail.com> wrote:
>>> >>> >> >> > The config attached.
>>> >>> >> >> >
>>> >>> >> >> >
>>> >>> >> >> > 2014-03-19 9:09 GMT+02:00 Rupert Westenthaler
>>> >>> >> >> > <ru...@gmail.com>:
>>> >>> >> >> >
>>> >>> >> >> >> Hi Cristian,
>>> >>> >> >> >>
>>> >>> >> >> >> can you provide the contents of the chain after your
>>> >>> modifications?
>>> >>> >> >> >> Would be interesting to test why the chain is no longer
>>> active
>>> >>> after
>>> >>> >> >> >> the restart.
>>> >>> >> >> >>
>>> >>> >> >> >> You can find the config file in the 'stanbol/fileinstall'
>>> folder.
>>> >>> >> >> >>
>>> >>> >> >> >> best
>>> >>> >> >> >> Rupert
>>> >>> >> >> >>
>>> >>> >> >> >> On Tue, Mar 18, 2014 at 8:24 PM, Cristian Petroaca
>>> >>> >> >> >> <cr...@gmail.com> wrote:
>>> >>> >> >> >> > Related to the default chain selection rules : before
>>> restart I
>>> >>> >> had a
>>> >>> >> >> >> > chain
>>> >>> >> >> >> > with the name 'default' as in I could access it via
>>> >>> >> >> >> > enhancer/chain/default.
>>> >>> >> >> >> > Then I just added another engine to the 'default' chain. I
>>> >>> assumed
>>> >>> >> >> that
>>> >>> >> >> >> > after the restart the chain with the 'default' name would
>>> be
>>> >>> >> >> persisted.
>>> >>> >> >> >> > So
>>> >>> >> >> >> > the first rule should have been applied after the restart
>>> as
>>> >>> well.
>>> >>> >> But
>>> >>> >> >> >> > instead I cannot reach it via enhancer/chain/default
>>> anymore
>>> >>> so its
>>> >>> >> >> >> > gone.
>>> >>> >> >> >> > Anyway, this is not a big deal, it's not blocking me in
>>> any
>>> >>> way, I
>>> >>> >> >> just
>>> >>> >> >> >> > wanted to understand where the problem is.
>>> >>> >> >> >> >
>>> >>> >> >> >> >
>>> >>> >> >> >> > 2014-03-18 7:15 GMT+02:00 Rupert Westenthaler
>>> >>> >> >> >> > <rupert.westenthaler@gmail.com
>>> >>> >> >> >> >>:
>>> >>> >> >> >> >
>>> >>> >> >> >> >> Hi Cristian
>>> >>> >> >> >> >>
>>> >>> >> >> >> >> On Mon, Mar 17, 2014 at 9:43 PM, Cristian Petroaca
>>> >>> >> >> >> >> <cr...@gmail.com> wrote:
>>> >>> >> >> >> >> > 1. Updated to the latest code and it's gone. Cool
>>> >>> >> >> >> >> >
>>> >>> >> >> >> >> > 2. I start the stable launcher -> create a new
>>> instance of
>>> >>> the
>>> >>> >> >> >> >> > PosChunkerEngine -> add it to the default chain. At
>>> this
>>> >>> point
>>> >>> >> >> >> >> > everything
>>> >>> >> >> >> >> > looks good and works ok.
>>> >>> >> >> >> >> > After I restart the server the default chain is gone
>>> and
>>> >>> >> instead I
>>> >>> >> >> >> >> > see
>>> >>> >> >> >> >> this
>>> >>> >> >> >> >> > in the enhancement chains page : all-active (default,
>>> id:
>>> >>> 149,
>>> >>> >> >> >> >> > ranking:
>>> >>> >> >> >> >> 0,
>>> >>> >> >> >> >> > impl: AllActiveEnginesChain ). all-active did not
>>> contain
>>> >>> the
>>> >>> >> >> >> >> > 'default'
>>> >>> >> >> >> >> > word before the restart.
>>> >>> >> >> >> >> >
>>> >>> >> >> >> >>
>>> >>> >> >> >> >> Please note the default chain selection rules as
>>> described at
>>> >>> [1].
>>> >>> >> >> You
>>> >>> >> >> >> >> can also access chains chains under
>>> >>> '/enhancer/chain/{chain-name}'
>>> >>> >> >> >> >>
>>> >>> >> >> >> >> best
>>> >>> >> >> >> >> Rupert
>>> >>> >> >> >> >>
>>> >>> >> >> >> >> [1]
>>> >>> >> >> >> >>
>>> >>> >> >> >> >>
>>> >>> >> >>
>>> >>> >>
>>> >>>
>>> http://stanbol.staging.apache.org/docs/trunk/components/enhancer/chains/#default-chain
>>> >>> >> >> >> >>
>>> >>> >> >> >> >> > It looks like the config files are exactly what I need.
>>> >>> Thanks.
>>> >>> >> >> >> >> >
>>> >>> >> >> >> >> >
>>> >>> >> >> >> >> > 2014-03-17 9:26 GMT+02:00 Rupert Westenthaler <
>>> >>> >> >> >> >> rupert.westenthaler@gmail.com
>>> >>> >> >> >> >> >>:
>>> >>> >> >> >> >> >
>>> >>> >> >> >> >> >> On Sat, Mar 15, 2014 at 8:34 PM, Cristian Petroaca
>>> >>> >> >> >> >> >> <cr...@gmail.com> wrote:
>>> >>> >> >> >> >> >> > Thanks Rupert.
>>> >>> >> >> >> >> >> >
>>> >>> >> >> >> >> >> > A couple more questions/issues :
>>> >>> >> >> >> >> >> >
>>> >>> >> >> >> >> >> > 1. Whenever I start the stanbol server I'm seeing
>>> this
>>> >>> in the
>>> >>> >> >> >> >> >> > console
>>> >>> >> >> >> >> >> > output :
>>> >>> >> >> >> >> >> >
>>> >>> >> >> >> >> >>
>>> >>> >> >> >> >> >> This should be fixed with STANBOL-1278 [1] [2]
>>> >>> >> >> >> >> >>
>>> >>> >> >> >> >> >> > 2. Whenever I restart the server the Weighted
>>> Chains get
>>> >>> >> messed
>>> >>> >> >> >> >> >> > up. I
>>> >>> >> >> >> >> >> > usually use the 'default' chain and add my engine
>>> to it
>>> >>> so
>>> >>> >> there
>>> >>> >> >> >> >> >> > are
>>> >>> >> >> >> >> 11
>>> >>> >> >> >> >> >> > engines in it. After the restart this chain now
>>> contains
>>> >>> >> around
>>> >>> >> >> 23
>>> >>> >> >> >> >> >> engines
>>> >>> >> >> >> >> >> > in total.
>>> >>> >> >> >> >> >>
>>> >>> >> >> >> >> >> I was not able to replicate this. What I tried was
>>> >>> >> >> >> >> >>
>>> >>> >> >> >> >> >> (1) start up the stable launcher
>>> >>> >> >> >> >> >> (2) add an additional engine to the default chain
>>> >>> >> >> >> >> >> (3) restart the launcher
>>> >>> >> >> >> >> >>
>>> >>> >> >> >> >> >> The default chain was not changed after (2) and (3).
>>> So I
>>> >>> would
>>> >>> >> >> need
>>> >>> >> >> >> >> >> further information for knowing why this is happening.
>>> >>> >> >> >> >> >>
>>> >>> >> >> >> >> >> Generally it is better to create your own chain instance
>>> >>> >> >> >> >> >> instead of modifying one that is provided by the default
>>> >>> >> >> >> >> >> configuration. I would also recommend that you keep your
>>> >>> >> >> >> >> >> test configuration in text files and to copy those to the
>>> >>> >> >> >> >> >> 'stanbol/fileinstall' folder. Doing so prevents you from
>>> >>> >> >> >> >> >> manually entering the configuration after a software update.
>>> >>> >> >> >> >> >> The production-mode section [3] provides information on how
>>> >>> >> >> >> >> >> to do that.
>>> >>> >> >> >> >> >>
>>> >>> >> >> >> >> >> best
>>> >>> >> >> >> >> >> Rupert
>>> >>> >> >> >> >> >>
>>> >>> >> >> >> >> >> [1]
>>> https://issues.apache.org/jira/browse/STANBOL-1278
>>> >>> >> >> >> >> >> [2] http://svn.apache.org/r1576623
>>> >>> >> >> >> >> >> [3]
>>> http://stanbol.apache.org/docs/trunk/production-mode
>>> >>> >> >> >> >> >>
>>> >>> >> >> >> >> >> > ERROR: Bundle org.apache.stanbol.enhancer.engine.topic.web [153]: Error starting
>>> >>> >> >> >> >> >> > slinginstall:c:\Data\Projects\Stanbol\main\launchers\stable\target\stanbol\startup\35\org.apache.stanbol.enhancer.engine.topic.web-1.0.0-SNAPSHOT.jar
>>> >>> >> >> >> >> >> > (org.osgi.framework.BundleException: Unresolved constraint in bundle
>>> >>> >> >> >> >> >> > org.apache.stanbol.enhancer.engine.topic.web [153]: Unable to resolve 153.0: missing
>>> >>> >> >> >> >> >> > requirement [153.0] package; (&(package=javax.ws.rs)(version>=0.0.0)(!(version>=2.0.0))))
>>> >>> >> >> >> >> >> > org.osgi.framework.BundleException: Unresolved constraint in bundle
>>> >>> >> >> >> >> >> > org.apache.stanbol.enhancer.engine.topic.web [153]: Unable to resolve 153.0: missing
>>> >>> >> >> >> >> >> > requirement [153.0] package; (&(package=javax.ws.rs)(version>=0.0.0)(!(version>=2.0.0)))
>>> >>> >> >> >> >> >> >         at org.apache.felix.framework.Felix.resolveBundle(Felix.java:3443)
>>> >>> >> >> >> >> >> >         at org.apache.felix.framework.Felix.startBundle(Felix.java:1727)
>>> >>> >> >> >> >> >> >         at org.apache.felix.framework.Felix.setActiveStartLevel(Felix.java:1156)
>>> >>> >> >> >> >> >> >         at org.apache.felix.framework.StartLevelImpl.run(StartLevelImpl.java:264)
>>> >>> >> >> >> >> >> >         at java.lang.Thread.run(Unknown Source)
>>> >>> >> >> >> >> >> >
>>> >>> >> >> >> >> >> > Despite this, the server starts fine and I can
>>> use the
>>> >>> >> >> enhancer
>>> >>> >> >> >> >> fine.
>>> >>> >> >> >> >> >> Do
>>> >>> >> >> >> >> >> > you guys see this as well?
>>> >>> >> >> >> >> >> >
>>> >>> >> >> >> >> >> >
>>> >>> >> >> >> >> >> > 2. Whenever I restart the server the Weighted
>>> Chains get
>>> >>> >> messed
>>> >>> >> >> >> >> >> > up. I
>>> >>> >> >> >> >> >> > usually use the 'default' chain and add my engine
>>> to it
>>> >>> so
>>> >>> >> there
>>> >>> >> >> >> >> >> > are
>>> >>> >> >> >> >> 11
>>> >>> >> >> >> >> >> > engines in it. After the restart this chain now
>>> contains
>>> >>> >> around
>>> >>> >> >> 23
>>> >>> >> >> >> >> >> engines
>>> >>> >> >> >> >> >> > in total.
>>> >>> >> >> >> >> >> >
>>> >>> >> >> >> >> >> >
>>> >>> >> >> >> >> >> >
>>> >>> >> >> >> >> >> >
>>> >>> >> >> >> >> >> > 2014-03-11 9:47 GMT+02:00 Rupert Westenthaler <
>>> >>> >> >> >> >> >> rupert.westenthaler@gmail.com
>>> >>> >> >> >> >> >> >>:
>>> >>> >> >> >> >> >> >
>>> >>> >> >> >> >> >> >> Hi Cristian,
>>> >>> >> >> >> >> >> >>
>>> >>> >> >> >> >> >> >> NER Annotations are typically available as both
>>> >>> >> >> >> >> >> >> NlpAnnotations.NER_ANNOTATION and
>>>  fise:TextAnnotation
>>> >>> [1]
>>> >>> >> in
>>> >>> >> >> the
>>> >>> >> >> >> >> >> >> enhancement metadata. As you are already accessing
>>> the
>>> >>> >> >> >> >> >> >> AnayzedText I
>>> >>> >> >> >> >> >> >> would prefer using the
>>>  NlpAnnotations.NER_ANNOTATION.
>>> >>> >> >> >> >> >> >>
>>> >>> >> >> >> >> >> >> best
>>> >>> >> >> >> >> >> >> Rupert
>>> >>> >> >> >> >> >> >>
>>> >>> >> >> >> >> >> >> [1]
>>> >>> >> >> >> >> >> >>
>>> >>> >> >> >> >> >>
>>> >>> >> >> >> >>
>>> >>> >> >> >> >>
>>> >>> >> >>
>>> >>> >>
>>> >>>
>>> http://stanbol.apache.org/docs/trunk/components/enhancer/enhancementstructure.html#fisetextannotation
>>> >>> >> >> >> >> >> >>
>>> >>> >> >> >> >> >> >> On Mon, Mar 10, 2014 at 10:07 PM, Cristian Petroaca
>>> >>> >> >> >> >> >> >> <cr...@gmail.com> wrote:
>>> >>> >> >> >> >> >> >> > Thanks.
>>> >>> >> >> >> >> >> >> > I assume I should get the Named entities using
>>> the
>>> >>> same
>>> >>> >> but
>>> >>> >> >> >> >> >> >> > with
>>> >>> >> >> >> >> >> >> > NlpAnnotations.NER_ANNOTATION?
>>> >>> >> >> >> >> >> >> >
>>> >>> >> >> >> >> >> >> >
>>> >>> >> >> >> >> >> >> >
>>> >>> >> >> >> >> >> >> > 2014-03-10 13:29 GMT+02:00 Rupert Westenthaler <
>>> >>> >> >> >> >> >> >> > rupert.westenthaler@gmail.com>:
>>> >>> >> >> >> >> >> >> >
>>> >>> >> >> >> >> >> >> >> Hallo Cristian,
>>> >>> >> >> >> >> >> >> >>
>>> >>> >> >> >> >> >> >> >> NounPhrases are not added to the RDF enhancement
>>> >>> results.
>>> >>> >> >> You
>>> >>> >> >> >> >> need to
>>> >>> >> >> >> >> >> >> >> use the AnalyzedText ContentPart [1]
>>> >>> >> >> >> >> >> >> >>
>>> >>> >> >> >> >> >> >> >> here is some demo code you can use in the
>>> >>> >> computeEnhancement
>>> >>> >> >> >> >> method
>>> >>> >> >> >> >> >> >> >>
>>> >>> >> >> >> >> >> >> >>         AnalysedText at = NlpEngineHelper.getAnalysedText(this, ci, true);
>>> >>> >> >> >> >> >> >> >>         Iterator<? extends Section> sections = at.getSentences();
>>> >>> >> >> >> >> >> >> >>         if(!sections.hasNext()){ //process as single sentence
>>> >>> >> >> >> >> >> >> >>             sections = Collections.singleton(at).iterator();
>>> >>> >> >> >> >> >> >> >>         }
>>> >>> >> >> >> >> >> >> >>
>>> >>> >> >> >> >> >> >> >>         while(sections.hasNext()){
>>> >>> >> >> >> >> >> >> >>             Section section = sections.next();
>>> >>> >> >> >> >> >> >> >>             Iterator<Span> chunks =
>>> >>> >> >> >> >> >> >> >>                 section.getEnclosed(EnumSet.of(SpanTypeEnum.Chunk));
>>> >>> >> >> >> >> >> >> >>             while(chunks.hasNext()){
>>> >>> >> >> >> >> >> >> >>                 Span chunk = chunks.next();
>>> >>> >> >> >> >> >> >> >>                 Value<PhraseTag> phrase =
>>> >>> >> >> >> >> >> >> >>                     chunk.getAnnotation(NlpAnnotations.PHRASE_ANNOTATION);
>>> >>> >> >> >> >> >> >> >>                 if(phrase.value().getCategory() == LexicalCategory.Noun){
>>> >>> >> >> >> >> >> >> >>                     log.info(" - NounPhrase [{},{}] {}", new Object[]{
>>> >>> >> >> >> >> >> >> >>                             chunk.getStart(),chunk.getEnd(),chunk.getSpan()});
>>> >>> >> >> >> >> >> >> >>                 }
>>> >>> >> >> >> >> >> >> >>             }
>>> >>> >> >> >> >> >> >> >>         }
>>> >>> >> >> >> >> >> >> >>
>>> >>> >> >> >> >> >> >> >> hope this helps
>>> >>> >> >> >> >> >> >> >>
>>> >>> >> >> >> >> >> >> >> best
>>> >>> >> >> >> >> >> >> >> Rupert
>>> >>> >> >> >> >> >> >> >>
>>> >>> >> >> >> >> >> >> >> [1]
>>> >>> >> >> >> >> >> >> >>
>>> >>> >> >> >> >> >> >>
>>> >>> >> >> >> >> >>
>>> >>> >> >> >> >>
>>> >>> >> >> >> >>
>>> >>> >> >>
>>> >>> >>
>>> >>>
>>> http://stanbol.apache.org/docs/trunk/components/enhancer/nlp/analyzedtext
>>> >>> >> >> >> >> >> >> >>
>>> >>> >> >> >> >> >> >> >> On Sun, Mar 9, 2014 at 6:07 PM, Cristian
>>> Petroaca
>>> >>> >> >> >> >> >> >> >> <cr...@gmail.com> wrote:
>>> >>> >> >> >> >> >> >> >> > I started to implement the engine and I'm
>>> having
>>> >>> >> problems
>>> >>> >> >> >> >> >> >> >> > with
>>> >>> >> >> >> >> >> getting
>>> >>> >> >> >> >> >> >> >> > results for noun phrases. I modified the
>>> "default"
>>> >>> >> >> weighted
>>> >>> >> >> >> >> chain
>>> >>> >> >> >> >> >> to
>>> >>> >> >> >> >> >> >> also
>>> >>> >> >> >> >> >> >> >> > include the PosChunkerEngine and ran a sample
>>> text
>>> >>> :
>>> >>> >> >> "Angela
>>> >>> >> >> >> >> Merkel
>>> >>> >> >> >> >> >> >> >> visted
>>> >>> >> >> >> >> >> >> >> > China. The german chancellor met with various
>>> >>> people".
>>> >>> >> I
>>> >>> >> >> >> >> expected
>>> >>> >> >> >> >> >> that
>>> >>> >> >> >> >> >> >> >> the
>>> >>> >> >> >> >> >> >> >> > RDF XML output would contain some info about
>>> the
>>> >>> noun
>>> >>> >> >> >> >> >> >> >> > phrases
>>> >>> >> >> >> >> but I
>>> >>> >> >> >> >> >> >> >> cannot
>>> >>> >> >> >> >> >> >> >> > see any.
>>> >>> >> >> >> >> >> >> >> > Could you point me to the correct way to
>>> generate
>>> >>> the
>>> >>> >> noun
>>> >>> >> >> >> >> phrases?
>>> >>> >> >> >> >> >> >> >> >
>>> >>> >> >> >> >> >> >> >> > Thanks,
>>> >>> >> >> >> >> >> >> >> > Cristian
>>> >>> >> >> >> >> >> >> >> >
>>> >>> >> >> >> >> >> >> >> >
>>> >>> >> >> >> >> >> >> >> > 2014-02-09 14:15 GMT+02:00 Cristian Petroaca <
>>> >>> >> >> >> >> >> >> >> cristian.petroaca@gmail.com>:
>>> >>> >> >> >> >> >> >> >> >
>>> >>> >> >> >> >> >> >> >> >> Opened
>>> >>> >> >> https://issues.apache.org/jira/browse/STANBOL-1279
>>> >>> >> >> >> >> >> >> >> >>
>>> >>> >> >> >> >> >> >> >> >>
>>> >>> >> >> >> >> >> >> >> >> 2014-02-07 10:53 GMT+02:00 Cristian Petroaca
>>> <
>>> >>> >> >> >> >> >> >> >> cristian.petroaca@gmail.com>
>>> >>> >> >> >> >> >> >> >> >> :
>>> >>> >> >> >> >> >> >> >> >>
>>> >>> >> >> >> >> >> >> >> >> Hi Rupert,
>>> >>> >> >> >> >> >> >> >> >>>
>>> >>> >> >> >> >> >> >> >> >>> The "spatial" dimension is a good idea.
>>> I'll also
>>> >>> >> take a
>>> >>> >> >> >> >> >> >> >> >>> look
>>> >>> >> >> >> >> at
>>> >>> >> >> >> >> >> >> Yago.
>>> >>> >> >> >> >> >> >> >> >>>
>>> >>> >> >> >> >> >> >> >> >>> I will create a Jira with what we talked
>>> about
>>> >>> here.
>>> >>> >> It
>>> >>> >> >> >> >> >> >> >> >>> will
>>> >>> >> >> >> >> >> >> probably
>>> >>> >> >> >> >> >> >> >> >>> have just a draft-like description for now
>>> and
>>> >>> will
>>> >>> >> be
>>> >>> >> >> >> >> >> >> >> >>> updated
>>> >>> >> >> >> >> >> as I
>>> >>> >> >> >> >> >> >> go
>>> >>> >> >> >> >> >> >> >> >>> along.
>>> >>> >> >> >> >> >> >> >> >>>
>>> >>> >> >> >> >> >> >> >> >>> Thanks,
>>> >>> >> >> >> >> >> >> >> >>> Cristian
>>> >>> >> >> >> >> >> >> >> >>>
>>> >>> >> >> >> >> >> >> >> >>>
>>> >>> >> >> >> >> >> >> >> >>> 2014-02-06 15:39 GMT+02:00 Rupert
>>> Westenthaler <
>>> >>> >> >> >> >> >> >> >> >>> rupert.westenthaler@gmail.com>:
>>> >>> >> >> >> >> >> >> >> >>>
>>> >>> >> >> >> >> >> >> >> >>> Hi Cristian,
>>> >>> >> >> >> >> >> >> >> >>>>
>>> >>> >> >> >> >> >> >> >> >>>> definitely an interesting approach. You
>>> should
>>> >>> have
>>> >>> >> a
>>> >>> >> >> >> >> >> >> >> >>>> look at
>>> >>> >> >> >> >> >> Yago2
>>> >>> >> >> >> >> >> >> >> >>>> [1]. As far as I can remember the Yago
>>> taxonomy
>>> >>> is
>>> >>> >> much
>>> >>> >> >> >> >> better
>>> >>> >> >> >> >> >> >> >> >>>> structured as the one used by dbpedia.
>>> Mapping
>>> >>> >> >> >> >> >> >> >> >>>> suggestions of
>>> >>> >> >> >> >> >> >> dbpedia
>>> >>> >> >> >> >> >> >> >> >>>> to concepts in Yago2 is easy as both
>>> dbpedia and
>>> >>> >> yago2
>>> >>> >> >> do
>>> >>> >> >> >> >> >> provide
>>> >>> >> >> >> >> >> >> >> >>>> mappings [2] and [3]
>>> >>> >> >> >> >> >> >> >> >>>>
>>> >>> >> >> >> >> >> >> >> >>>> > 2014-02-05 15:39 GMT+02:00 Rafa Haro
>>> >>> >> >> >> >> >> >> >> >>>> > <rh...@apache.org>:
>>> >>> >> >> >> >> >> >> >> >>>> >>
>>> >>> >> >> >> >> >> >> >> >>>> >> "Microsoft posted its 2013 earnings. The
>>> >>> >> Redmond's
>>> >>> >> >> >> >> >> >> >> >>>> >> company
>>> >>> >> >> >> >> >> made
>>> >>> >> >> >> >> >> >> a
>>> >>> >> >> >> >> >> >> >> >>>> >> huge profit".
>>> >>> >> >> >> >> >> >> >> >>>>
>>> >>> >> >> >> >> >> >> >> >>>> That's actually a very good example. Spatial
>>> >>> contexts
>>> >>> >> >> are
>>> >>> >> >> >> >> >> >> >> >>>> very
>>> >>> >> >> >> >> >> >> >> >>>> important as they tend to be often used for
>>> >>> >> >> referencing.
>>> >>> >> >> >> >> >> >> >> >>>> So I
>>> >>> >> >> >> >> >> would
>>> >>> >> >> >> >> >> >> >> >>>> suggest to specially treat the spatial
>>> context.
>>> >>> For
>>> >>> >> >> >> >> >> >> >> >>>> spatial
>>> >>> >> >> >> >> >> >> Entities
>>> >>> >> >> >> >> >> >> >> >>>> (like a City) this is easy, but even for
>>> other
>>> >>> >> (like a
>>> >>> >> >> >> >> Person,
>>> >>> >> >> >> >> >> >> >> >>>> Company) you could use relations to spatial
>>> >>> entities
>>> >>> >> >> >> >> >> >> >> >>>> define
>>> >>> >> >> >> >> >> their
>>> >>> >> >> >> >> >> >> >> >>>> spatial context. This context could than be
>>> >>> used to
>>> >>> >> >> >> >> >> >> >> >>>> correctly
>>> >>> >> >> >> >> >> link
>>> >>> >> >> >> >> >> >> >> >>>> "The Redmond's company" to "Microsoft".
>>> >>> >> >> >> >> >> >> >> >>>>
>>> >>> >> >> >> >> >> >> >> >>>> In addition I would suggest to use the
>>> "spatial"
>>> >>> >> >> context
>>> >>> >> >> >> >> >> >> >> >>>> of
>>> >>> >> >> >> >> each
>>> >>> >> >> >> >> >> >> >> >>>> entity (basically relation to entities
>>> that are
>>> >>> >> cities,
>>> >>> >> >> >> >> regions,
>>> >>> >> >> >> >> >> >> >> >>>> countries) as a separate dimension, because
>>> >>> those
>>> >>> >> are
>>> >>> >> >> >> >> >> >> >> >>>> very
>>> >>> >> >> >> >> often
>>> >>> >> >> >> >> >> >> used
>>> >>> >> >> >> >> >> >> >> >>>> for coreferences.
>>> >>> >> >> >> >> >> >> >> >>>>
>>> >>> >> >> >> >> >> >> >> >>>> [1]
>>> http://www.mpi-inf.mpg.de/yago-naga/yago/
>>> >>> >> >> >> >> >> >> >> >>>> [2]
>>> >>> >> >> >> >> >> >> >> >>>>
>>> >>> >> >> http://downloads.dbpedia.org/3.9/links/yago_links.nt.bz2
>>> >>> >> >> >> >> >> >> >> >>>> [3]
>>> >>> >> >> >> >> >> >> >> >>>>
>>> >>> >> >> >> >> >> >> >>
>>> >>> >> >> >> >> >> >>
>>> >>> >> >> >> >> >>
>>> >>> >> >> >> >>
>>> >>> >> >> >> >>
>>> >>> >> >>
>>> >>> >>
>>> >>>
>>> http://www.mpi-inf.mpg.de/yago-naga/yago/download/yago/yagoDBpediaInstances.ttl.7z
>>> >>> >> >> >> >> >> >> >> >>>>
>>> >>> >> >> >> >> >> >> >> >>>>
>>> >>> >> >> >> >> >> >> >> >>>> On Thu, Feb 6, 2014 at 10:33 AM, Cristian
>>> >>> Petroaca
>>> >>> >> >> >> >> >> >> >> >>>> <cr...@gmail.com> wrote:
>>> >>> >> >> >> >> >> >> >> >>>> > There are several dbpedia categories for
>>> each
>>> >>> >> entity,
>>> >>> >> >> >> >> >> >> >> >>>> > in
>>> >>> >> >> >> >> this
>>> >>> >> >> >> >> >> >> case
>>> >>> >> >> >> >> >> >> >> for
>>> >>> >> >> >> >> >> >> >> >>>> > Microsoft we have :
>>> >>> >> >> >> >> >> >> >> >>>> >
>>> >>> >> >> >> >> >> >> >> >>>> >
>>> category:Companies_in_the_NASDAQ-100_Index
>>> >>> >> >> >> >> >> >> >> >>>> > category:Microsoft
>>> >>> >> >> >> >> >> >> >> >>>> >
>>> >>> category:Software_companies_of_the_United_States
>>> >>> >> >> >> >> >> >> >> >>>> >
>>> >>> >> >> category:Software_companies_based_in_Washington_(state)
>>> >>> >> >> >> >> >> >> >> >>>> > category:Companies_established_in_1975
>>> >>> >> >> >> >> >> >> >> >>>> >
>>> >>> category:1975_establishments_in_the_United_States
>>> >>> >> >> >> >> >> >> >> >>>> >
>>> >>> category:Companies_based_in_Redmond,_Washington
>>> >>> >> >> >> >> >> >> >> >>>> >
>>> >>> >> >> >> >> >> >>
>>> >>> >> >> >> >> >> >>
>>> >>> >> >>
>>> category:Multinational_companies_headquartered_in_the_United_States
>>> >>> >> >> >> >> >> >> >> >>>> > category:Cloud_computing_providers
>>> >>> >> >> >> >> >> >> >> >>>> >
>>> >>> >> >> category:Companies_in_the_Dow_Jones_Industrial_Average
>>> >>> >> >> >> >> >> >> >> >>>> >
>>> >>> >> >> >> >> >> >> >> >>>> > So we also have "Companies based in
>>> >>> >> >> Redmont,Washington"
>>> >>> >> >> >> >> which
>>> >>> >> >> >> >> >> >> could
>>> >>> >> >> >> >> >> >> >> be
>>> >>> >> >> >> >> >> >> >> >>>> > matched.
>>> >>> >> >> >> >> >> >> >> >>>> >
>>> >>> >> >> >> >> >> >> >> >>>> >
>>> >>> >> >> >> >> >> >> >> >>>> > There is still other contextual
>>> information
>>> >>> from
>>> >>> >> >> >> >> >> >> >> >>>> > dbpedia
>>> >>> >> >> >> >> which
>>> >>> >> >> >> >> >> >> can
>>> >>> >> >> >> >> >> >> >> be
>>> >>> >> >> >> >> >> >> >> >>>> used.
>>> >>> >> >> >> >> >> >> >> >>>> > For example for an Organization we could
>>> also
>>> >>> >> >> include :
>>> >>> >> >> >> >> >> >> >> >>>> > dbpprop:industry = Software
>>> >>> >> >> >> >> >> >> >> >>>> > dbpprop:service = Online Service
>>> Providers
>>> >>> >> >> >> >> >> >> >> >>>> >
>>> >>> >> >> >> >> >> >> >> >>>> > and for a Person (that's for Barack
>>> Obama) :
>>> >>> >> >> >> >> >> >> >> >>>> >
>>> >>> >> >> >> >> >> >> >> >>>> > dbpedia-owl:profession:
>>> >>> >> >> >> >> >> >> >> >>>> >
>>>  dbpedia:Author
>>> >>> >> >> >> >> >> >> >> >>>> >
>>> >>> >> >> >> >> >> >> >> >>>> > dbpedia:Constitutional_law
>>> >>> >> >> >> >> >> >> >> >>>> >
>>>  dbpedia:Lawyer
>>> >>> >> >> >> >> >> >> >> >>>> >
>>> >>> >> >> >> >> >> >> >> >>>> > dbpedia:Community_organizing
>>> >>> >> >> >> >> >> >> >> >>>> >
>>> >>> >> >> >> >> >> >> >> >>>> > I'd like to continue investigating this
>>> as I
>>> >>> think
>>> >>> >> >> that
>>> >>> >> >> >> >> >> >> >> >>>> > it
>>> >>> >> >> >> >> may
>>> >>> >> >> >> >> >> >> have
>>> >>> >> >> >> >> >> >> >> >>>> some
>>> >>> >> >> >> >> >> >> >> >>>> > value in increasing the number of
>>> coreference
>>> >>> >> >> >> >> >> >> >> >>>> > resolutions
>>> >>> >> >> >> >> and
>>> >>> >> >> >> >> >> I'd
>>> >>> >> >> >> >> >> >> >> like
>>> >>> >> >> >> >> >> >> >> >>>> to
>>> >>> >> >> >> >> >> >> >> >>>> > concentrate more on precision rather than
>>> >>> recall
>>> >>> >> >> since
>>> >>> >> >> >> >> >> >> >> >>>> > we
>>> >>> >> >> >> >> >> already
>>> >>> >> >> >> >> >> >> >> have
>>> >>> >> >> >> >> >> >> >> >>>> a
>>> >>> >> >> >> >> >> >> >> >>>> > set of coreferences detected by the
>>> stanford
>>> >>> nlp
>>> >>> >> tool
>>> >>> >> >> >> >> >> >> >> >>>> > and
>>> >>> >> >> >> >> this
>>> >>> >> >> >> >> >> >> would
>>> >>> >> >> >> >> >> >> >> >>>> be as
>>> >>> >> >> >> >> >> >> >> >>>> > an addition to that (at least this is
>>> how I
>>> >>> would
>>> >>> >> >> like
>>> >>> >> >> >> >> >> >> >> >>>> > to
>>> >>> >> >> >> >> use
>>> >>> >> >> >> >> >> >> it).
>>> >>> >> >> >> >> >> >> >> >>>> >
>>> >>> >> >> >> >> >> >> >> >>>> > Is it ok if I track this by opening a
>>> jira? I
>>> >>> >> could
>>> >>> >> >> >> >> >> >> >> >>>> > update
>>> >>> >> >> >> >> it
>>> >>> >> >> >> >> >> to
>>> >>> >> >> >> >> >> >> >> show
>>> >>> >> >> >> >> >> >> >> >>>> my
>>> >>> >> >> >> >> >> >> >> >>>> > progress and also my conclusions and if
>>> it
>>> >>> turns
>>> >>> >> out
>>> >>> >> >> >> >> >> >> >> >>>> > that
>>> >>> >> >> >> >> it
>>> >>> >> >> >> >> >> was
>>> >>> >> >> >> >> >> >> a
>>> >>> >> >> >> >> >> >> >> bad
>>> >>> >> >> >> >> >> >> >> >>>> idea
>>> >>> >> >> >> >> >> >> >> >>>> > then that's the situation at least I'll
>>> end up
>>> >>> >> with
>>> >>> >> >> >> >> >> >> >> >>>> > more
>>> >>> >> >> >> >> >> >> knowledge
>>> >>> >> >> >> >> >> >> >> >>>> about
>>> >>> >> >> >> >> >> >> >> >>>> > Stanbol in the end :).
>>> >>> >> >> >> >> >> >> >> >>>> >
>>> >>> >> >> >> >> >> >> >> >>>> >
>>> >>> >> >> >> >> >> >> >> >>>> > 2014-02-05 15:39 GMT+02:00 Rafa Haro
>>> >>> >> >> >> >> >> >> >> >>>> > <rh...@apache.org>:
>>> >>> >> >> >> >> >> >> >> >>>> >
>>> >>> >> >> >> >> >> >> >> >>>> >> Hi Cristian,
>>> >>> >> >> >> >> >> >> >> >>>> >>
>>> >>> >> >> >> >> >> >> >> >>>> >> The approach sounds nice. I don't want
>>> to be
>>> >>> the
>>> >>> >> >> >> >> >> >> >> >>>> >> devil's
>>> >>> >> >> >> >> >> >> advocate
>>> >>> >> >> >> >> >> >> >> but
>>> >>> >> >> >> >> >> >> >> >>>> I'm
>>> >>> >> >> >> >> >> >> >> >>>> >> just not sure about the recall using the
>>> >>> dbpedia
>>> >>> >> >> >> >> categories
>>> >>> >> >> >> >> >> >> >> feature.
>>> >>> >> >> >> >> >> >> >> >>>> For
>>> >>> >> >> >> >> >> >> >> >>>> >> example, your sentence could be also
>>> >>> "Microsoft
>>> >>> >> >> posted
>>> >>> >> >> >> >> >> >> >> >>>> >> its
>>> >>> >> >> >> >> >> 2013
>>> >>> >> >> >> >> >> >> >> >>>> earnings.
>>> >>> >> >> >> >> >> >> >> >>>> >> The Redmond's company made a huge
>>> profit".
>>> >>> So,
>>> >>> >> maybe
>>> >>> >> >> >> >> >> including
>>> >>> >> >> >> >> >> >> more
>>> >>> >> >> >> >> >> >> >> >>>> >> contextual information from dbpedia
>>> could
>>> >>> >> increase
>>> >>> >> >> the
>>> >>> >> >> >> >> recall
>>> >>> >> >> >> >> >> >> but
>>> >>> >> >> >> >> >> >> >> of
>>> >>> >> >> >> >> >> >> >> >>>> course
>>> >>> >> >> >> >> >> >> >> >>>> >> will reduce the precision.
>>> >>> >> >> >> >> >> >> >> >>>> >>
>>> >>> >> >> >> >> >> >> >> >>>> >> Cheers,
>>> >>> >> >> >> >> >> >> >> >>>> >> Rafa
>>> >>> >> >> >> >> >> >> >> >>>> >>
>>> >>> >> >> >> >> >> >> >> >>>> >>>>> I'm
>>> >>> >> >> >> >> >> >> >> >>>> >>>>> not an expert in Natural Language
>>> >>> >> Understanding,
>>> >>> >> >> I
>>> >>> >> >> >> >> would
>>> >>> >> >> >> >> >> say
>>> >>> >> >> >> >> >> >> it
>>> >>> >> >> >> >> >> >> >> is
>>> >>> >> >> >> >> >> >> >> >>>> quite
>>> >>> >> >> >> >> >> >> >> >>>> >>>>> difficult to get decent
>>> precision/recall
>>> >>> rates
>>> >>> >> >> for
>>> >>> >> >> >> >> >> >> coreferencing
>>> >>> >> >> >> >> >> >> >> >>>> using
>>> >>> >> >> >> >> >> >> >> >>>> >>>>> fixed rules. Maybe you can give a
>>> try to
>>> >>> >> others
>>> >>> >> >> >> >> >> >> >> >>>> >>>>> tools
>>> >>> >> >> >> >> like
>>> >>> >> >> >> >> >> >> BART
>>> >>> >> >> >> >> >> >> >> (
>>> >>> >> >> >> >> >> >> >> >>>> >>>>> http://www.bart-coref.org/).
>>> >>> >> >> >> >> >> >> >> >>>> >>>>>
>>> >>> >> >> >> >> >> >> >> >>>> >>>>> Cheers,
>>> >>> >> >> >> >> >> >> >> >>>> >>>>> Rafa Haro
>>> >>> >> >> >> >> >> >> >> >>>> >>>>>
>>> >>> >> >> >> >> >> >> >> >>>> >>>>> El 30/01/14 10:33, Cristian Petroaca
>>> >>> escribió:
>>> >>> >> >> >> >> >> >> >> >>>> >>>>>
>>> >>> >> >> >> >> >> >> >> >>>> >>>>>   Hi,
>>> >>> >> >> >> >> >> >> >> >>>> >>>>>
>>> >>> >> >> >> >> >> >> >> >>>> >>>>>> One of the necessary steps for
>>> >>> implementing
>>> >>> >> the
>>> >>> >> >> >> >> >> >> >> >>>> >>>>>> Event
>>> >>> >> >> >> >> >> >> >> extraction
>>> >>> >> >> >> >> >> >> >> >>>> Engine
>>> >>> >> >> >> >> >> >> >> >>>> >>>>>> feature :
>>> >>> >> >> >> >> >> >>
>>> https://issues.apache.org/jira/browse/STANBOL-1121is
>>> >>> >> >> >> >> >> >> >> >>>> to
>>> >>> >> >> >> >> >> >> >> >>>> >>>>>> have
>>> >>> >> >> >> >> >> >> >> >>>> >>>>>> coreference resolution in the given
>>> text.
>>> >>> >> This
>>> >>> >> >> is
>>> >>> >> >> >> >> >> provided
>>> >>> >> >> >> >> >> >> now
>>> >>> >> >> >> >> >> >> >> >>>> via the
>>> >>> >> >> >> >> >> >> >> >>>> >>>>>> stanford-nlp project but as far as
>>> I saw
>>> >>> this
>>> >>> >> >> >> >> >> >> >> >>>> >>>>>> module
>>> >>> >> >> >> >> is
>>> >>> >> >> >> >> >> >> >> performing
>>> >>> >> >> >> >> >> >> >> >>>> >>>>>> mostly
>>> >>> >> >> >> >> >> >> >> >>>> >>>>>> pronomial (He, She) or nominal
>>> (Barack
>>> >>> Obama
>>> >>> >> and
>>> >>> >> >> >> >> >> >> >> >>>> >>>>>> Mr.
>>> >>> >> >> >> >> >> Obama)
>>> >>> >> >> >> >> >> >> >> >>>> coreference
>>> >>> >> >> >> >> >> >> >> >>>> >>>>>> resolution.
>>> >>> >> >> >> >> >> >> >> >>>> >>>>>>
>>> >>> >> >> >> >> >> >> >> >>>> >>>>>> In order to get more coreferences
>>> from
>>> >>> the
>>> >>> >> text
>>> >>> >> >> I
>>> >>> >> >> >> >> though
>>> >>> >> >> >> >> >> of
>>> >>> >> >> >> >> >> >> >> >>>> creating
>>> >>> >> >> >> >> >> >> >> >>>> >>>>>> some
>>> >>> >> >> >> >> >> >> >> >>>> >>>>>> logic that would detect this kind of
>>> >>> >> >> coreference :
>>> >>> >> >> >> >> >> >> >> >>>> >>>>>> "Apple reaches new profit heights.
>>> The
>>> >>> >> software
>>> >>> >> >> >> >> company
>>> >>> >> >> >> >> >> just
>>> >>> >> >> >> >> >> >> >> >>>> announced
>>> >>> >> >> >> >> >> >> >> >>>> >>>>>> its
>>> >>> >> >> >> >> >> >> >> >>>> >>>>>> 2013 earnings."
>>> >>> >> >> >> >> >> >> >> >>>> >>>>>> Here "The software company"
>>> obviously
>>> >>> refers
>>> >>> >> to
>>> >>> >> >> >> >> "Apple".
>>> >>> >> >> >> >> >> >> >> >>>> >>>>>> So I'd like to detect coreferences
>>> of
>>> >>> Named
>>> >>> >> >> >> >> >> >> >> >>>> >>>>>> Entities
>>> >>> >> >> >> >> >> which
>>> >>> >> >> >> >> >> >> are
>>> >>> >> >> >> >> >> >> >> of
>>> >>> >> >> >> >> >> >> >> >>>> the
>>> >>> >> >> >> >> >> >> >> >>>> >>>>>> rdf:type of the Named Entity , in
>>> this
>>> >>> case
>>> >>> >> >> >> >> >> >> >> >>>> >>>>>> "company"
>>> >>> >> >> >> >> and
>>> >>> >> >> >> >> >> >> also
>>> >>> >> >> >> >> >> >> >> >>>> have
>>> >>> >> >> >> >> >> >> >> >>>> >>>>>> attributes which can be found in the
>>> >>> dbpedia
>>> >>> >> >> >> >> categories
>>> >>> >> >> >> >> >> of
>>> >>> >> >> >> >> >> >> the
>>> >>> >> >> >> >> >> >> >> >>>> named
>>> >>> >> >> >> >> >> >> >> >>>> >>>>>> entity, in this case "software".
>>> >>> >> >> >> >> >> >> >> >>>> >>>>>>
>>> >>> >> >> >> >> >> >> >> >>>> >>>>>> The detection of coreferences such
>>> as
>>> >>> "The
>>> >>> >> >> >> >> >> >> >> >>>> >>>>>> software
>>> >>> >> >> >> >> >> >> company" in
>>> >>> >> >> >> >> >> >> >> >>>> the
>>> >>> >> >> >> >> >> >> >> >>>> >>>>>> text
>>> >>> >> >> >> >> >> >> >> >>>> >>>>>> would also be done by either using
>>> the
>>> >>> new
>>> >>> >> Pos
>>> >>> >> >> Tag
>>> >>> >> >> >> >> Based
>>> >>> >> >> >> >> >> >> Phrase
>>> >>> >> >> >> >> >> >> >> >>>> >>>>>> extraction
>>> >>> >> >> >> >> >> >> >> >>>> >>>>>> Engine (noun phrases) or by using a
>>> >>> >> dependency
>>> >>> >> >> >> >> >> >> >> >>>> >>>>>> tree of
>>> >>> >> >> >> >> >> the
>>> >>> >> >> >> >> >> >> >> >>>> sentence and
>>> >>> >> >> >> >> >> >> >> >>>> >>>>>> picking up only subjects or objects.
>>> >>> >> >> >> >> >> >> >> >>>> >>>>>>
>>> >>> >> >> >> >> >> >> >> >>>> >>>>>> At this point I'd like to know if
>>> this
>>> >>> kind
>>> >>> >> of
>>> >>> >> >> >> >> >> >> >> >>>> >>>>>> logic
>>> >>> >> >> >> >> >> would
>>> >>> >> >> >> >> >> >> be
>>> >>> >> >> >> >> >> >> >> >>>> useful
>>> >>> >> >> >> >> >> >> >> >>>> >>>>>> as a
>>> >>> >> >> >> >> >> >> >> >>>> >>>>>> separate Enhancement Engine (in
>>> case the
>>> >>> >> >> precision
>>> >>> >> >> >> >> >> >> >> >>>> >>>>>> and
>>> >>> >> >> >> >> >> >> recall
>>> >>> >> >> >> >> >> >> >> are
>>> >>> >> >> >> >> >> >> >> >>>> good
>>> >>> >> >> >> >> >> >> >> >>>> >>>>>> enough) in Stanbol?
>>> >>> >> >> >> >> >> >> >> >>>> >>>>>>
>>> >>> >> >> >> >> >> >> >> >>>> >>>>>> Thanks,
>>> >>> >> >> >> >> >> >> >> >>>> >>>>>> Cristian
>>> >>> >> >> >> >> >> >> >> >>>> >>>>>>
>>> >>> >> >> >> >> >> >> >> >>>> >>>>>>
>>> >>> >> >> >> >> >> >> >> >>>> >>>>>>
>>> >>> >> >> >> >> >> >> >> >>>> >>
>>> >>> >> >> >> >> >> >> >> >>>>
>>> >>> >> >> >> >> >> >> >> >>>>
>>> >>> >> >> >> >> >> >> >> >>>>
>>> >>> >> >> >> >> >> >> >> >>>> --
>>> >>> >> >> >> >> >> >> >> >>>> | Rupert Westenthaler
>>> >>> >> >> >> >> rupert.westenthaler@gmail.com
>>> >>> >> >> >> >> >> >> >> >>>> | Bodenlehenstraße 11
>>> >>> >> >> >> >> >> >> ++43-699-11108907
>>> >>> >> >> >> >> >> >> >> >>>> | A-5500 Bischofshofen
>>> >>> >> >> >> >> >> >> >> >>>>
>>> >>> >> >> >> >> >> >> >> >>>
>>> >>> >> >> >> >> >> >> >> >>>
>>> >>> >> >> >> >> >> >> >> >>
>>> >>> >> >> >> >> >> >> >>
>>> >>> >> >> >> >> >> >> >>
>>> >>> >> >> >> >> >> >> >>
>>> >>> >> >> >> >> >> >> >> --
>>> >>> >> >> >> >> >> >> >> | Rupert Westenthaler
>>> >>> >> >> >> >> >> >> >> rupert.westenthaler@gmail.com
>>> >>> >> >> >> >> >> >> >> | Bodenlehenstraße 11
>>> >>> >> >> >> >> ++43-699-11108907
>>> >>> >> >> >> >> >> >> >> | A-5500 Bischofshofen
>>> >>> >> >> >> >> >> >> >>
>>> >>> >> >> >> >> >> >>
>>> >>> >> >> >> >> >> >>
>>> >>> >> >> >> >> >> >>
>>> >>> >> >> >> >> >> >> --
>>> >>> >> >> >> >> >> >> | Rupert Westenthaler
>>> >>> >> >> rupert.westenthaler@gmail.com
>>> >>> >> >> >> >> >> >> | Bodenlehenstraße 11
>>> >>> >> >> >> >> >> >> ++43-699-11108907
>>> >>> >> >> >> >> >> >> | A-5500 Bischofshofen
>>> >>> >> >> >> >> >> >>
>>> >>> >> >> >> >> >>
>>> >>> >> >> >> >> >>
>>> >>> >> >> >> >> >>
>>> >>> >> >> >> >> >> --
>>> >>> >> >> >> >> >> | Rupert Westenthaler
>>> >>> >> rupert.westenthaler@gmail.com
>>> >>> >> >> >> >> >> | Bodenlehenstraße 11
>>> >>> >> >> ++43-699-11108907
>>> >>> >> >> >> >> >> | A-5500 Bischofshofen
>>> >>> >> >> >> >> >>
>>> >>> >> >> >> >>
>>> >>> >> >> >> >>
>>> >>> >> >> >> >>
>>> >>> >> >> >> >> --
>>> >>> >> >> >> >> | Rupert Westenthaler
>>> >>> rupert.westenthaler@gmail.com
>>> >>> >> >> >> >> | Bodenlehenstraße 11
>>> >>> >> ++43-699-11108907
>>> >>> >> >> >> >> | A-5500 Bischofshofen
>>> >>> >> >> >> >>
>>> >>> >> >> >>
>>> >>> >> >> >>
>>> >>> >> >> >>
>>> >>> >> >> >> --
>>> >>> >> >> >> | Rupert Westenthaler
>>> rupert.westenthaler@gmail.com
>>> >>> >> >> >> | Bodenlehenstraße 11
>>> >>> ++43-699-11108907
>>> >>> >> >> >> | A-5500 Bischofshofen
>>> >>> >> >> >
>>> >>> >> >> >
>>> >>> >> >>
>>> >>> >> >>
>>> >>> >> >>
>>> >>> >> >> --
>>> >>> >> >> | Rupert Westenthaler
>>> rupert.westenthaler@gmail.com
>>> >>> >> >> | Bodenlehenstraße 11
>>> ++43-699-11108907
>>> >>> >> >> | A-5500 Bischofshofen
>>> >>> >> >>
>>> >>> >>
>>> >>> >>
>>> >>> >>
>>> >>> >> --
>>> >>> >> | Rupert Westenthaler             rupert.westenthaler@gmail.com
>>> >>> >> | Bodenlehenstraße 11
>>> ++43-699-11108907
>>> >>> >> | A-5500 Bischofshofen
>>> >>> >>
>>> >>>
>>> >>>
>>> >>>
>>> >>> --
>>> >>> | Rupert Westenthaler             rupert.westenthaler@gmail.com
>>> >>> | Bodenlehenstraße 11                             ++43-699-11108907
>>> >>> | A-5500 Bischofshofen
>>> >>>
>>> >>
>>> >>
>>>
>>>
>>>
>>> --
>>> | Rupert Westenthaler             rupert.westenthaler@gmail.com
>>> | Bodenlehenstraße 11                             ++43-699-11108907
>>> | A-5500 Bischofshofen
>>>
>>
>>
>

Re: Named entity coref resolution based on dbpedia categories and rdf:type

Posted by Cristian Petroaca <cr...@gmail.com>.
Hi,

I've started to implement the dbpedia properties logic and I'd like to get
some feedback on my approach:
I want to get a NER (named entity) from the text and search for it in the
dbpedia data so that I can get certain dbpedia properties.
The way I'm trying to do this is by getting the NER_ANNOTATION chunk's text
and searching for that in the Entityhub (which, from what I saw, is by default
configured with dbpedia data). I haven't yet performed a query to actually
get the data, but before I continue I'd like to ask if this is the way to go?
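
Roughly, this is what I have in mind inside the computeEnhancement method -
just a sketch for now. I'm assuming getEnclosed() can also be called directly
on the AnalysedText, and lookupEntity() is a made-up placeholder for the
actual Entityhub query which I still need to figure out:

    AnalysedText at = NlpEngineHelper.getAnalysedText(this, ci, true);
    Iterator<Span> chunks = at.getEnclosed(EnumSet.of(SpanTypeEnum.Chunk));
    while(chunks.hasNext()){
        Span chunk = chunks.next();
        // NER results should be available on chunks as NlpAnnotations.NER_ANNOTATION
        Value<NerTag> ner = chunk.getAnnotation(NlpAnnotations.NER_ANNOTATION);
        if(ner != null){
            String nerText = chunk.getSpan();
            // made-up helper: query the Entityhub (default dbpedia index) by label
            // and read the dbpedia properties of the best matching entity
            Representation entity = lookupEntity(nerText);
        }
    }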

Thanks,
Cristian


2014-03-28 15:12 GMT+02:00 Cristian Petroaca <cr...@gmail.com>:

> Examples :
>
> 1. Group membership :
>     a. Spatial membership :
>
>         "Microsoft announced its 2013 earnings. <coref>The Richmond-based
> company</coref> made huge profits."
>
>     b. Organisational membership :
>
>        "Mick Jagger started a new solo album. <coref>The Rolling Stones
> singer</coref> did not say what the theme will be."
>
> 2. Functional membership :
>
>    "Allianz announced its 2013 earnings. <coref>The financial services
> company</coref> made a huge profit."
>
> 3. If no matches were found for the current NER with the rules above, but
> the yago:class which matched has more than 2 nouns, then we also
> consider this a good co-reference, though maybe with a lower confidence.
>
>    "Boris Becker will take part in a demonstrative tennis match.
> <coref>The former tennis player</coref> will play again after 10 years."
>
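> In rough pseudo-Java the spatial check for 1.a would look something like the
> following (the helper names are made up, and a demonym -> place mapping, e.g.
> "Italian" -> "Italy", would also be needed):
>
>     // noun phrase "The Richmond-based company", candidate NER from earlier in the text
>     String place = findLocationToken(nounPhrase);         // a LOCATION inside the phrase
>     if(place == null){
>         place = demonymToPlace(findDemonym(nounPhrase));  // or a demonym like "Italian"
>     }
>     // compare against the location valued properties of the NER
>     // (:birthPlace, :foundationPlace, :locationCity, :location, ...)
>     if(place != null && locationProperties(ner).contains(place)){
>         markAsCandidateCoref(ner, nounPhrase);
>     }
>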
>
> 2014-03-28 12:22 GMT+02:00 Rupert Westenthaler <
> rupert.westenthaler@gmail.com>:
>
>> Hi Cristian, all
>>
>> Looks good to me, but I am not sure if I got everything. If you could
>> provide example texts where those rules apply it would make it much
>> easier to understand.
>>
>> Instead of using dbpedia properties you should define your own domain
>> model (ontology). You can then align the dbpedia properties to your
>> model. This will allow you to apply this approach also to knowledge
>> bases other than dbpedia.
>>
>> For people new to this thread: The above message adds to the
>> suggestion first made by Cristian on 4th February. Also the following
>> 4 messages (until 7th Feb) provide additional context.
>>
>> best
>> Rupert
>>
>>
>> On Fri, Mar 28, 2014 at 9:23 AM, Cristian Petroaca
>> <cr...@gmail.com> wrote:
>> > Hi guys,
>> >
>> > After Rupert's last suggestions related to this enhancement engine I
>> > devised a more comprehensive algorithm for matching the noun phrases
>> > against the NER properties. Please take a look and let me know what you
>> > think. Thanks.
>> >
>> > The following rules will be applied to every noun phrase in order to find
>> > co-references:
>> >
>> > 1. For each NER prior to the current noun phrase in the text match the
>> > yago:class label to the contents of the noun phrase.
>> >
>> > For the NERs which have a yago:class which matches, apply:
>> >
>> > 2. Group membership rules :
>> >
>> >     a. spatial membership : the NER is part of a Location. If the noun
>> > phrase contains a LOCATION or a demonym then check any location properties
>> > of the matching NER.
>> >
>> >     If the matching NER is a :
>> >     - person, match against :birthPlace, :region, :nationality
>> >     - organisation, match against :foundationPlace, :locationCity,
>> > :location, :hometown
>> >     - place, match against :country, :subdivisionName, :location
>> >
>> >     Ex: The Italian President, The Richmond-based company
>> >
>> >     b. organisational membership : the NER is part of an Organisation. If
>> > the noun phrase contains an ORGANISATION then check the following
>> > properties of the matching NER:
>> >
>> >     If the matching NER is a :
>> >     - person, match against :occupation, :associatedActs
>> >     - organisation ?
>> >     - location ?
>> >
>> > Ex: The Microsoft executive, The Pink Floyd singer
>> >
>> > 3. Functional description rule: the noun phrase describes what the NER does
>> > conceptually.
>> > If there are no NERs in the noun phrase then match the following properties
>> > of the matching NER to the contents of the noun phrase (aside from the
>> > nouns which are part of the yago:class) :
>> >
>> >    If the NER is a:
>> >    - person ?
>> >    - organisation : match against :service, :industry, :genre
>> >    - location ?
>> >
>> > Ex: The software company.
>> >
>> > 4. If no matches were found for the current NER with rules 2 or 3, but
>> > the yago:class which matched has more than 2 nouns, then we also consider
>> > this a good co-reference, though maybe with a lower confidence.
>> >
>> > Ex: The former tennis player, the theoretical physicist.
>> >
>> > 5. Based on the number of nouns which matched we create a confidence
>> > level. The number of matched nouns cannot be lower than 2 and we must
>> > have a yago:class match.
>> >
>> > For all NERs which got to this point, select the closest ones in the text
>> > to the noun phrase which matched against the same properties (yago:class
>> > and dbpedia) and mark them as co-references.
>> >
>> > Note: all noun phrases need to be lemmatized before all of this in case
>> > there are any plurals.
>> >
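>> > In very rough pseudo-Java the whole flow would be something like this (all
>> > the helper names and types here are made up, just to show where each rule
>> > applies):
>> >
>> >     for(NamedEntity ner : nersBefore(nounPhrase)){
>> >         int matches = matchNouns(nounPhrase, yagoClassLabel(ner));        // rule 1
>> >         if(matches == 0) continue;
>> >         matches += matchNouns(nounPhrase, locationProperties(ner));       // rule 2.a
>> >         matches += matchNouns(nounPhrase, organisationProperties(ner));   // rule 2.b
>> >         matches += matchNouns(nounPhrase, functionalProperties(ner));     // rule 3
>> >         // rule 4 would go here: a yago:class-only match with more than 2
>> >         // nouns is kept as a candidate with a lower confidence
>> >         if(matches >= 2){                                                 // rule 5
>> >             double confidence = (double) matches / nounCount(nounPhrase);
>> >             candidates.add(new Candidate(ner, confidence));
>> >         }
>> >     }
>> >     // of the candidates which matched against the same properties, keep the
>> >     // one closest to the noun phrase in the text and mark it as a co-reference
>> >     markCoreference(closest(candidates, nounPhrase), nounPhrase);
>> >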
>> >
>> > 2014-03-25 20:50 GMT+02:00 Cristian Petroaca <
>> cristian.petroaca@gmail.com>:
>> >
>> >> That worked. Thanks.
>> >>
>> >> So, there are no exceptions during the startup of the launcher.
>> >> The component tab in the felix console shows 6 WeightedChains the first
>> >> time, including the default one but after my changes and a restart
>> there
>> >> are only 5 - the default one is missing altogether.
>> >>
>> >>
>> >> 2014-03-24 20:18 GMT+02:00 Rupert Westenthaler <
>> >> rupert.westenthaler@gmail.com>:
>> >>
>> >> Hi Cristian,
>> >>>
>> >>> I do see the same problem since last Friday. The solution mentioned
>> >>> in [1] works for me.
>> >>>
>> >>>     mvn -Djsse.enableSNIExtension=false {goals}
>> >>>
>> >>> No idea why https connections to github currently cause this. I
>> >>> could not find anything related via Google. So I suggest to use the
>> >>> system property for now. If this persists for longer we can adapt the
>> >>> build files accordingly.
>> >>>
>> >>> best
>> >>> Rupert
>> >>>
>> >>>
>> >>>
>> >>>
>> >>> [1]
>> >>>
>> http://stackoverflow.com/questions/7615645/ssl-handshake-alert-unrecognized-name-error-since-upgrade-to-java-1-7-0
>> >>>
>> >>> On Mon, Mar 24, 2014 at 7:01 PM, Cristian Petroaca
>> >>> <cr...@gmail.com> wrote:
>> >>> > I did a clean on the whole project and now I wanted to do another
>> "mvn
>> >>> > clean install" but I am getting this :
>> >>> >
>> >>> > "[INFO] ------------------------------------------------------------------------
>> >>> > [ERROR] Failed to execute goal org.apache.maven.plugins:maven-antrun-plugin:1.6:run
>> >>> > (download) on project org.apache.stanbol.data.opennlp.lang.es: An Ant BuildException
>> >>> > has occured: The following error occurred while executing this line:
>> >>> > [ERROR] C:\Data\Projects\Stanbol\main\data\opennlp\lang\es\download_models.xml:33:
>> >>> > Failed to copy
>> >>> > https://github.com/utcompling/OpenNLP-Models/raw/58ef0c60031403e66e47ae35edaf58d3478b67af/models/es/opennlp-es-maxent-pos-es.bin
>> >>> > to C:\Data\Projects\Stanbol\main\data\opennlp\lang\es\downloads\resources\org\apache\stanbol\data\opennlp\es-pos-maxent.bin
>> >>> > due to javax.net.ssl.SSLProtocolException handshake alert : unrecognized_name"
>> >>> >
>> >>> >
>> >>> >
>> >>> > 2014-03-20 11:25 GMT+02:00 Rupert Westenthaler <
>> >>> > rupert.westenthaler@gmail.com>:
>> >>> >
>> >>> >> Hi Cristian,
>> >>> >>
>> >>> >> On Thu, Mar 20, 2014 at 10:00 AM, Cristian Petroaca
>> >>> >> <cr...@gmail.com> wrote:
>> >>> >> >
>> >>> >>
>> >>>
>> stanbol.enhancer.chain.weighted.chain=["tika;optional","langdetect","opennlp-sentence","opennlp-token","opennlp-pos","opennlp-ner","dbpediaLinking","entityhubExtraction","dbpedia-dereference","pos-chunker"]
>> >>> >> > service.ranking=I"-2147483648"
>> >>> >> > stanbol.enhancer.chain.name="default"
>> >>> >>
>> >>> >> Does look fine to me. Do you see any exception during the startup of
>> >>> >> the launcher? Can you check the status of this component in the
>> >>> >> component tab of the felix web console [1] (search for
>> >>> >> "org.apache.stanbol.enhancer.chain.weighted.impl.WeightedChain")? If
>> >>> >> you have multiple you can find the correct one by comparing the
>> >>> >> "Properties" with those in the configuration file.
>> >>> >>
>> >>> >> I guess that the corresponding service is in the 'unsatisfied' state as
>> >>> >> you do not see it in the web interface. But if this is the case you should
>> >>> >> also see the corresponding exception in the log. You can also manually
>> >>> >> stop/start the component. In this case the exception should be
>> >>> >> re-thrown and you do not need to search the log for it.
>> >>> >>
>> >>> >> best
>> >>> >> Rupert
>> >>> >>
>> >>> >>
>> >>> >> [1] http://localhost:8080/system/console/components
>> >>> >>
>> >>> >> >
>> >>> >> >
>> >>> >> >
>> >>> >> > 2014-03-20 7:39 GMT+02:00 Rupert Westenthaler <
>> >>> >> rupert.westenthaler@gmail.com
>> >>> >> >>:
>> >>> >> >
>> >>> >> >> Hi Cristian,
>> >>> >> >>
>> >>> >> >> you can not send attachments to the list. Please copy the
>> contents
>> >>> >> >> directly to the mail
>> >>> >> >>
>> >>> >> >> thx
>> >>> >> >> Rupert
>> >>> >> >>
>> >>> >> >> On Wed, Mar 19, 2014 at 9:20 PM, Cristian Petroaca
>> >>> >> >> <cr...@gmail.com> wrote:
>> >>> >> >> > The config attached.
>> >>> >> >> >
>> >>> >> >> >
>> >>> >> >> > 2014-03-19 9:09 GMT+02:00 Rupert Westenthaler
>> >>> >> >> > <ru...@gmail.com>:
>> >>> >> >> >
>> >>> >> >> >> Hi Cristian,
>> >>> >> >> >>
>> >>> >> >> >> can you provide the contents of the chain after your
>> >>> modifications?
>> >>> >> >> >> Would be interesting to test why the chain is no longer
>> active
>> >>> after
>> >>> >> >> >> the restart.
>> >>> >> >> >>
>> >>> >> >> >> You can find the config file in the 'stanbol/fileinstall'
>> folder.
>> >>> >> >> >>
>> >>> >> >> >> best
>> >>> >> >> >> Rupert
>> >>> >> >> >>
>> >>> >> >> >> On Tue, Mar 18, 2014 at 8:24 PM, Cristian Petroaca
>> >>> >> >> >> <cr...@gmail.com> wrote:
>> >>> >> >> >> > Related to the default chain selection rules : before
>> restart I
>> >>> >> had a
>> >>> >> >> >> > chain
>> >>> >> >> >> > with the name 'default' as in I could access it via
>> >>> >> >> >> > enhancer/chain/default.
>> >>> >> >> >> > Then I just added another engine to the 'default' chain. I
>> >>> assumed
>> >>> >> >> that
>> >>> >> >> >> > after the restart the chain with the 'default' name would
>> be
>> >>> >> >> persisted.
>> >>> >> >> >> > So
>> >>> >> >> >> > the first rule should have been applied after the restart
>> as
>> >>> well.
>> >>> >> But
>> >>> >> >> >> > instead I cannot reach it via enhancer/chain/default
>> anymore
>> >>> so its
>> >>> >> >> >> > gone.
>> >>> >> >> >> > Anyway, this is not a big deal, it's not blocking me in any
>> >>> way, I
>> >>> >> >> just
>> >>> >> >> >> > wanted to understand where the problem is.
>> >>> >> >> >> >
>> >>> >> >> >> >
>> >>> >> >> >> > 2014-03-18 7:15 GMT+02:00 Rupert Westenthaler
>> >>> >> >> >> > <rupert.westenthaler@gmail.com
>> >>> >> >> >> >>:
>> >>> >> >> >> >
>> >>> >> >> >> >> Hi Cristian
>> >>> >> >> >> >>
>> >>> >> >> >> >> On Mon, Mar 17, 2014 at 9:43 PM, Cristian Petroaca
>> >>> >> >> >> >> <cr...@gmail.com> wrote:
>> >>> >> >> >> >> > 1. Updated to the latest code and it's gone. Cool
>> >>> >> >> >> >> >
>> >>> >> >> >> >> > 2. I start the stable launcher -> create a new instance
>> of
>> >>> the
>> >>> >> >> >> >> > PosChunkerEngine -> add it to the default chain. At this
>> >>> point
>> >>> >> >> >> >> > everything
>> >>> >> >> >> >> > looks good and works ok.
>> >>> >> >> >> >> > After I restart the server the default chain is gone and
>> >>> >> instead I
>> >>> >> >> >> >> > see
>> >>> >> >> >> >> this
>> >>> >> >> >> >> > in the enhancement chains page : all-active (default,
>> id:
>> >>> 149,
>> >>> >> >> >> >> > ranking:
>> >>> >> >> >> >> 0,
>> >>> >> >> >> >> > impl: AllActiveEnginesChain ). all-active did not
>> contain
>> >>> the
>> >>> >> >> >> >> > 'default'
>> >>> >> >> >> >> > word before the restart.
>> >>> >> >> >> >> >
>> >>> >> >> >> >>
>> >>> >> >> >> >> Please note the default chain selection rules as
>> described at
>> >>> [1].
>> >>> >> >> You
>> >>> >> >> >> >> can also access chains under
>> >>> '/enhancer/chain/{chain-name}'
>> >>> >> >> >> >>
>> >>> >> >> >> >> best
>> >>> >> >> >> >> Rupert
>> >>> >> >> >> >>
>> >>> >> >> >> >> [1]
>> >>> >> >> >> >>
>> >>> >> >> >> >>
>> >>> >> >>
>> >>> >>
>> >>>
>> http://stanbol.staging.apache.org/docs/trunk/components/enhancer/chains/#default-chain
>> >>> >> >> >> >>
>> >>> >> >> >> >> > It looks like the config files are exactly what I need.
>> >>> Thanks.
>> >>> >> >> >> >> >
>> >>> >> >> >> >> >
>> >>> >> >> >> >> > 2014-03-17 9:26 GMT+02:00 Rupert Westenthaler <
>> >>> >> >> >> >> rupert.westenthaler@gmail.com
>> >>> >> >> >> >> >>:
>> >>> >> >> >> >> >
>> >>> >> >> >> >> >> On Sat, Mar 15, 2014 at 8:34 PM, Cristian Petroaca
>> >>> >> >> >> >> >> <cr...@gmail.com> wrote:
>> >>> >> >> >> >> >> > Thanks Rupert.
>> >>> >> >> >> >> >> >
>> >>> >> >> >> >> >> > A couple more questions/issues :
>> >>> >> >> >> >> >> >
>> >>> >> >> >> >> >> > 1. Whenever I start the stanbol server I'm seeing
>> this
>> >>> in the
>> >>> >> >> >> >> >> > console
>> >>> >> >> >> >> >> > output :
>> >>> >> >> >> >> >> >
>> >>> >> >> >> >> >>
>> >>> >> >> >> >> >> This should be fixed with STANBOL-1278 [1] [2]
>> >>> >> >> >> >> >>
>> >>> >> >> >> >> >> > 2. Whenever I restart the server the Weighted Chains
>> get
>> >>> >> messed
>> >>> >> >> >> >> >> > up. I
>> >>> >> >> >> >> >> > usually use the 'default' chain and add my engine to
>> it
>> >>> so
>> >>> >> there
>> >>> >> >> >> >> >> > are
>> >>> >> >> >> >> 11
>> >>> >> >> >> >> >> > engines in it. After the restart this chain now
>> contains
>> >>> >> around
>> >>> >> >> 23
>> >>> >> >> >> >> >> engines
>> >>> >> >> >> >> >> > in total.
>> >>> >> >> >> >> >>
>> >>> >> >> >> >> >> I was not able to replicate this. What I tried was
>> >>> >> >> >> >> >>
>> >>> >> >> >> >> >> (1) start up the stable launcher
>> >>> >> >> >> >> >> (2) add an additional engine to the default chain
>> >>> >> >> >> >> >> (3) restart the launcher
>> >>> >> >> >> >> >>
>> >>> >> >> >> >> >> The default chain was not changed after (2) and (3).
>> So I
>> >>> would
>> >>> >> >> need
>> >>> >> >> >> >> >> further information for knowing why this is happening.
>> >>> >> >> >> >> >>
>> >>> >> >> >> >> >> Generally it is better to create you own chain
>> instance as
>> >>> >> >> modifying
>> >>> >> >> >> >> >> one that is provided by the default configuration. I
>> would
>> >>> also
>> >>> >> >> >> >> >> recommend that you keep your test configuration in text
>> >>> files
>> >>> >> and
>> >>> >> >> to
>> >>> >> >> >> >> >> copy those to the 'stanbol/fileinstall' folder. Doing
>> so
>> >>> >> prevent
>> >>> >> >> you
>> >>> >> >> >> >> >> from manually entering the configuration after a
>> software
>> >>> >> update.
>> >>> >> >> >> >> >> The
>> >>> >> >> >> >> >> production-mode section [3] provides information on
>> how to
>> >>> do
>> >>> >> >> that.
>> >>> >> >> >> >> >>
>> >>> >> >> >> >> >> best
>> >>> >> >> >> >> >> Rupert
>> >>> >> >> >> >> >>
>> >>> >> >> >> >> >> [1] https://issues.apache.org/jira/browse/STANBOL-1278
>> >>> >> >> >> >> >> [2] http://svn.apache.org/r1576623
>> >>> >> >> >> >> >> [3]
>> http://stanbol.apache.org/docs/trunk/production-mode
>> >>> >> >> >> >> >>
>> >>> >> >> >> >> >> > ERROR: Bundle org.apache.stanbol.enhancer.engine.topic.web [153]: Error starting
>> >>> >> >> >> >> >> > slinginstall:c:\Data\Projects\Stanbol\main\launchers\stable\target\stanbol\startup\35\org.apache.stanbol.enhancer.engine.topic.web-1.0.0-SNAPSHOT.jar
>> >>> >> >> >> >> >> > (org.osgi.framework.BundleException: Unresolved constraint in bundle
>> >>> >> >> >> >> >> > org.apache.stanbol.enhancer.engine.topic.web [153]: Unable to resolve 153.0: missing
>> >>> >> >> >> >> >> > requirement [153.0] package; (&(package=javax.ws.rs)(version>=0.0.0)(!(version>=2.0.0))))
>> >>> >> >> >> >> >> > org.osgi.framework.BundleException: Unresolved constraint in bundle
>> >>> >> >> >> >> >> > org.apache.stanbol.enhancer.engine.topic.web [153]: Unable to resolve 153.0: missing
>> >>> >> >> >> >> >> > requirement [153.0] package; (&(package=javax.ws.rs)(version>=0.0.0)(!(version>=2.0.0)))
>> >>> >> >> >> >> >> >         at org.apache.felix.framework.Felix.resolveBundle(Felix.java:3443)
>> >>> >> >> >> >> >> >         at org.apache.felix.framework.Felix.startBundle(Felix.java:1727)
>> >>> >> >> >> >> >> >         at org.apache.felix.framework.Felix.setActiveStartLevel(Felix.java:1156)
>> >>> >> >> >> >> >> >         at org.apache.felix.framework.StartLevelImpl.run(StartLevelImpl.java:264)
>> >>> >> >> >> >> >> >         at java.lang.Thread.run(Unknown Source)
>> >>> >> >> >> >> >> >
>> >>> >> >> >> >> >> > Despite of this the server starts fine and I can use
>> the
>> >>> >> >> enhancer
>> >>> >> >> >> >> fine.
>> >>> >> >> >> >> >> Do
>> >>> >> >> >> >> >> > you guys see this as well?
>> >>> >> >> >> >> >> >
>> >>> >> >> >> >> >> >
>> >>> >> >> >> >> >> > 2. Whenever I restart the server the Weighted Chains
>> get
>> >>> >> messed
>> >>> >> >> >> >> >> > up. I
>> >>> >> >> >> >> >> > usually use the 'default' chain and add my engine to
>> it
>> >>> so
>> >>> >> there
>> >>> >> >> >> >> >> > are
>> >>> >> >> >> >> 11
>> >>> >> >> >> >> >> > engines in it. After the restart this chain now
>> contains
>> >>> >> around
>> >>> >> >> 23
>> >>> >> >> >> >> >> engines
>> >>> >> >> >> >> >> > in total.
>> >>> >> >> >> >> >> >
>> >>> >> >> >> >> >> >
>> >>> >> >> >> >> >> >
>> >>> >> >> >> >> >> >
>> >>> >> >> >> >> >> > 2014-03-11 9:47 GMT+02:00 Rupert Westenthaler <
>> >>> >> >> >> >> >> rupert.westenthaler@gmail.com
>> >>> >> >> >> >> >> >>:
>> >>> >> >> >> >> >> >
>> >>> >> >> >> >> >> >> Hi Cristian,
>> >>> >> >> >> >> >> >>
>> >>> >> >> >> >> >> >> NER Annotations are typically available as both
>> >>> >> >> >> >> >> >> NlpAnnotations.NER_ANNOTATION and
>>  fise:TextAnnotation
>> >>> [1]
>> >>> >> in
>> >>> >> >> the
>> >>> >> >> >> >> >> >> enhancement metadata. As you are already accessing
>> the
>> >>> >> >> >> >> >> >> AnayzedText I
>> >>> >> >> >> >> >> >> would prefer using the
>>  NlpAnnotations.NER_ANNOTATION.
>> >>> >> >> >> >> >> >>
>> >>> >> >> >> >> >> >> best
>> >>> >> >> >> >> >> >> Rupert
>> >>> >> >> >> >> >> >>
>> >>> >> >> >> >> >> >> [1]
>> >>> >> >> >> >> >> >>
>> >>> >> >> >> >> >>
>> >>> >> >> >> >>
>> >>> >> >> >> >>
>> >>> >> >>
>> >>> >>
>> >>>
>> http://stanbol.apache.org/docs/trunk/components/enhancer/enhancementstructure.html#fisetextannotation
>> >>> >> >> >> >> >> >>
>> >>> >> >> >> >> >> >> On Mon, Mar 10, 2014 at 10:07 PM, Cristian Petroaca
>> >>> >> >> >> >> >> >> <cr...@gmail.com> wrote:
>> >>> >> >> >> >> >> >> > Thanks.
>> >>> >> >> >> >> >> >> > I assume I should get the Named entities using the
>> >>> same
>> >>> >> but
>> >>> >> >> >> >> >> >> > with
>> >>> >> >> >> >> >> >> > NlpAnnotations.NER_ANNOTATION?
>> >>> >> >> >> >> >> >> >
>> >>> >> >> >> >> >> >> >
>> >>> >> >> >> >> >> >> >
>> >>> >> >> >> >> >> >> > 2014-03-10 13:29 GMT+02:00 Rupert Westenthaler <
>> >>> >> >> >> >> >> >> > rupert.westenthaler@gmail.com>:
>> >>> >> >> >> >> >> >> >
>> >>> >> >> >> >> >> >> >> Hallo Cristian,
>> >>> >> >> >> >> >> >> >>
>> >>> >> >> >> >> >> >> >> NounPhrases are not added to the RDF enhancement
>> >>> results.
>> >>> >> >> You
>> >>> >> >> >> >> need to
>> >>> >> >> >> >> >> >> >> use the AnalyzedText ContentPart [1]
>> >>> >> >> >> >> >> >> >>
>> >>> >> >> >> >> >> >> >> here is some demo code you can use in the
>> >>> >> computeEnhancement
>> >>> >> >> >> >> method
>> >>> >> >> >> >> >> >> >>
>> >>> >> >> >> >> >> >> >>         AnalysedText at = NlpEngineHelper.getAnalysedText(this, ci, true);
>> >>> >> >> >> >> >> >> >>         Iterator<? extends Section> sections = at.getSentences();
>> >>> >> >> >> >> >> >> >>         if(!sections.hasNext()){ //process as single sentence
>> >>> >> >> >> >> >> >> >>             sections = Collections.singleton(at).iterator();
>> >>> >> >> >> >> >> >> >>         }
>> >>> >> >> >> >> >> >> >>
>> >>> >> >> >> >> >> >> >>         while(sections.hasNext()){
>> >>> >> >> >> >> >> >> >>             Section section = sections.next();
>> >>> >> >> >> >> >> >> >>             Iterator<Span> chunks =
>> >>> >> >> >> >> >> >> >>                 section.getEnclosed(EnumSet.of(SpanTypeEnum.Chunk));
>> >>> >> >> >> >> >> >> >>             while(chunks.hasNext()){
>> >>> >> >> >> >> >> >> >>                 Span chunk = chunks.next();
>> >>> >> >> >> >> >> >> >>                 Value<PhraseTag> phrase =
>> >>> >> >> >> >> >> >> >>                     chunk.getAnnotation(NlpAnnotations.PHRASE_ANNOTATION);
>> >>> >> >> >> >> >> >> >>                 if(phrase.value().getCategory() == LexicalCategory.Noun){
>> >>> >> >> >> >> >> >> >>                     log.info(" - NounPhrase [{},{}] {}", new Object[]{
>> >>> >> >> >> >> >> >> >>                         chunk.getStart(),chunk.getEnd(),chunk.getSpan()});
>> >>> >> >> >> >> >> >> >>                 }
>> >>> >> >> >> >> >> >> >>             }
>> >>> >> >> >> >> >> >> >>         }
>> >>> >> >> >> >> >> >> >>
>> >>> >> >> >> >> >> >> >> hope this helps
>> >>> >> >> >> >> >> >> >>
>> >>> >> >> >> >> >> >> >> best
>> >>> >> >> >> >> >> >> >> Rupert
>> >>> >> >> >> >> >> >> >>
>> >>> >> >> >> >> >> >> >> [1]
>> >>> >> >> >> >> >> >> >>
>> >>> >> >> >> >> >> >>
>> >>> >> >> >> >> >>
>> >>> >> >> >> >>
>> >>> >> >> >> >>
>> >>> >> >>
>> >>> >>
>> >>>
>> http://stanbol.apache.org/docs/trunk/components/enhancer/nlp/analyzedtext
>> >>> >> >> >> >> >> >> >>
>> >>> >> >> >> >> >> >> >> On Sun, Mar 9, 2014 at 6:07 PM, Cristian Petroaca
>> >>> >> >> >> >> >> >> >> <cr...@gmail.com> wrote:
>> >>> >> >> >> >> >> >> >> > I started to implement the engine and I'm
>> having
>> >>> >> problems
>> >>> >> >> >> >> >> >> >> > with
>> >>> >> >> >> >> >> getting
>> >>> >> >> >> >> >> >> >> > results for noun phrases. I modified the
>> "default"
>> >>> >> >> weighted
>> >>> >> >> >> >> chain
>> >>> >> >> >> >> >> to
>> >>> >> >> >> >> >> >> also
>> >>> >> >> >> >> >> >> >> > include the PosChunkerEngine and ran a sample
>> text
>> >>> :
>> >>> >> >> "Angela
>> >>> >> >> >> >> Merkel
>> >>> >> >> >> >> >> >> >> visted
>> >>> >> >> >> >> >> >> >> > China. The german chancellor met with various
>> >>> people".
>> >>> >> I
>> >>> >> >> >> >> expected
>> >>> >> >> >> >> >> that
>> >>> >> >> >> >> >> >> >> the
>> >>> >> >> >> >> >> >> >> > RDF XML output would contain some info about
>> the
>> >>> noun
>> >>> >> >> >> >> >> >> >> > phrases
>> >>> >> >> >> >> but I
>> >>> >> >> >> >> >> >> >> cannot
>> >>> >> >> >> >> >> >> >> > see any.
>> >>> >> >> >> >> >> >> >> > Could you point me to the correct way to
>> generate
>> >>> the
>> >>> >> noun
>> >>> >> >> >> >> phrases?
>> >>> >> >> >> >> >> >> >> >
>> >>> >> >> >> >> >> >> >> > Thanks,
>> >>> >> >> >> >> >> >> >> > Cristian
>> >>> >> >> >> >> >> >> >> >
>> >>> >> >> >> >> >> >> >> >
>> >>> >> >> >> >> >> >> >> > 2014-02-09 14:15 GMT+02:00 Cristian Petroaca <
>> >>> >> >> >> >> >> >> >> cristian.petroaca@gmail.com>:
>> >>> >> >> >> >> >> >> >> >
>> >>> >> >> >> >> >> >> >> >> Opened
>> >>> >> >> https://issues.apache.org/jira/browse/STANBOL-1279
>> >>> >> >> >> >> >> >> >> >>
>> >>> >> >> >> >> >> >> >> >>
>> >>> >> >> >> >> >> >> >> >> 2014-02-07 10:53 GMT+02:00 Cristian Petroaca <
>> >>> >> >> >> >> >> >> >> cristian.petroaca@gmail.com>
>> >>> >> >> >> >> >> >> >> >> :
>> >>> >> >> >> >> >> >> >> >>
>> >>> >> >> >> >> >> >> >> >> Hi Rupert,
>> >>> >> >> >> >> >> >> >> >>>
>> >>> >> >> >> >> >> >> >> >>> The "spatial" dimension is a good idea. I'll
>> also
>> >>> >> take a
>> >>> >> >> >> >> >> >> >> >>> look
>> >>> >> >> >> >> at
>> >>> >> >> >> >> >> >> Yago.
>> >>> >> >> >> >> >> >> >> >>>
>> >>> >> >> >> >> >> >> >> >>> I will create a Jira with what we talked
>> about
>> >>> here.
>> >>> >> It
>> >>> >> >> >> >> >> >> >> >>> will
>> >>> >> >> >> >> >> >> probably
>> >>> >> >> >> >> >> >> >> >>> have just a draft-like description for now
>> and
>> >>> will
>> >>> >> be
>> >>> >> >> >> >> >> >> >> >>> updated
>> >>> >> >> >> >> >> as I
>> >>> >> >> >> >> >> >> go
>> >>> >> >> >> >> >> >> >> >>> along.
>> >>> >> >> >> >> >> >> >> >>>
>> >>> >> >> >> >> >> >> >> >>> Thanks,
>> >>> >> >> >> >> >> >> >> >>> Cristian
>> >>> >> >> >> >> >> >> >> >>>
>> >>> >> >> >> >> >> >> >> >>>
>> >>> >> >> >> >> >> >> >> >>> 2014-02-06 15:39 GMT+02:00 Rupert
>> Westenthaler <
>> >>> >> >> >> >> >> >> >> >>> rupert.westenthaler@gmail.com>:
>> >>> >> >> >> >> >> >> >> >>>
>> >>> >> >> >> >> >> >> >> >>> Hi Cristian,
>> >>> >> >> >> >> >> >> >> >>>>
>> >>> >> >> >> >> >> >> >> >>>> definitely an interesting approach. You
>> should
>> >>> have
>> >>> >> a
>> >>> >> >> >> >> >> >> >> >>>> look at
>> >>> >> >> >> >> >> Yago2
>> >>> >> >> >> >> >> >> >> >>>> [1]. As far as I can remember the Yago
>> taxonomy
>> >>> is
>> >>> >> much
>> >>> >> >> >> >> better
>> >>> >> >> >> >> >> >> >> >>>> structured as the one used by dbpedia.
>> Mapping
>> >>> >> >> >> >> >> >> >> >>>> suggestions of
>> >>> >> >> >> >> >> >> dbpedia
>> >>> >> >> >> >> >> >> >> >>>> to concepts in Yago2 is easy as both
>> dbpedia and
>> >>> >> yago2
>> >>> >> >> do
>> >>> >> >> >> >> >> provide
>> >>> >> >> >> >> >> >> >> >>>> mappings [2] and [3]
>> >>> >> >> >> >> >> >> >> >>>>
>> >>> >> >> >> >> >> >> >> >>>> > 2014-02-05 15:39 GMT+02:00 Rafa Haro
>> >>> >> >> >> >> >> >> >> >>>> > <rh...@apache.org>:
>> >>> >> >> >> >> >> >> >> >>>> >>
>> >>> >> >> >> >> >> >> >> >>>> >> "Microsoft posted its 2013 earnings. The
>> >>> >> Redmond's
>> >>> >> >> >> >> >> >> >> >>>> >> company
>> >>> >> >> >> >> >> made
>> >>> >> >> >> >> >> >> a
>> >>> >> >> >> >> >> >> >> >>>> >> huge profit".
>> >>> >> >> >> >> >> >> >> >>>>
>> >>> >> >> >> >> >> >> >> >>>> Thats actually a very good example. Spatial
>> >>> contexts
>> >>> >> >> are
>> >>> >> >> >> >> >> >> >> >>>> very
>> >>> >> >> >> >> >> >> >> >>>> important as they tend to be often used for
>> >>> >> >> referencing.
>> >>> >> >> >> >> >> >> >> >>>> So I
>> >>> >> >> >> >> >> would
>> >>> >> >> >> >> >> >> >> >>>> suggest to specially treat the spatial
>> context.
>> >>> For
>> >>> >> >> >> >> >> >> >> >>>> spatial
>> >>> >> >> >> >> >> >> Entities
>> >>> >> >> >> >> >> >> >> >>>> (like a City) this is easy, but even for
>> other
>> >>> >> (like a
>> >>> >> >> >> >> Person,
>> >>> >> >> >> >> >> >> >> >>>> Company) you could use relations to spatial
>> >>> entities
>> >>> >> >> >> >> >> >> >> >>>> define
>> >>> >> >> >> >> >> their
>> >>> >> >> >> >> >> >> >> >>>> spatial context. This context could than be
>> >>> used to
>> >>> >> >> >> >> >> >> >> >>>> correctly
>> >>> >> >> >> >> >> link
>> >>> >> >> >> >> >> >> >> >>>> "The Redmond's company" to "Microsoft".
>> >>> >> >> >> >> >> >> >> >>>>
>> >>> >> >> >> >> >> >> >> >>>> In addition I would suggest to use the
>> "spatial"
>> >>> >> >> context
>> >>> >> >> >> >> >> >> >> >>>> of
>> >>> >> >> >> >> each
>> >>> >> >> >> >> >> >> >> >>>> entity (basically relation to entities that
>> are
>> >>> >> cities,
>> >>> >> >> >> >> regions,
>> >>> >> >> >> >> >> >> >> >>>> countries) as a separate dimension, because
>> >>> those
>> >>> >> are
>> >>> >> >> >> >> >> >> >> >>>> very
>> >>> >> >> >> >> often
>> >>> >> >> >> >> >> >> used
>> >>> >> >> >> >> >> >> >> >>>> for coreferences.
>> >>> >> >> >> >> >> >> >> >>>>
>> >>> >> >> >> >> >> >> >> >>>> [1]
>> http://www.mpi-inf.mpg.de/yago-naga/yago/
>> >>> >> >> >> >> >> >> >> >>>> [2]
>> >>> >> >> >> >> >> >> >> >>>>
>> >>> >> >> http://downloads.dbpedia.org/3.9/links/yago_links.nt.bz2
>> >>> >> >> >> >> >> >> >> >>>> [3]
>> >>> >> >> >> >> >> >> >> >>>>
>> >>> >> >> >> >> >> >> >>
>> >>> >> >> >> >> >> >>
>> >>> >> >> >> >> >>
>> >>> >> >> >> >>
>> >>> >> >> >> >>
>> >>> >> >>
>> >>> >>
>> >>>
>> http://www.mpi-inf.mpg.de/yago-naga/yago/download/yago/yagoDBpediaInstances.ttl.7z
>> >>> >> >> >> >> >> >> >> >>>>
>> >>> >> >> >> >> >> >> >> >>>>
>> >>> >> >> >> >> >> >> >> >>>> On Thu, Feb 6, 2014 at 10:33 AM, Cristian
>> >>> Petroaca
>> >>> >> >> >> >> >> >> >> >>>> <cr...@gmail.com> wrote:
>> >>> >> >> >> >> >> >> >> >>>> > There are several dbpedia categories for
>> each
>> >>> >> entity,
>> >>> >> >> >> >> >> >> >> >>>> > in
>> >>> >> >> >> >> this
>> >>> >> >> >> >> >> >> case
>> >>> >> >> >> >> >> >> >> for
>> >>> >> >> >> >> >> >> >> >>>> > Microsoft we have :
>> >>> >> >> >> >> >> >> >> >>>> >
>> >>> >> >> >> >> >> >> >> >>>> > category:Companies_in_the_NASDAQ-100_Index
>> >>> >> >> >> >> >> >> >> >>>> > category:Microsoft
>> >>> >> >> >> >> >> >> >> >>>> >
>> >>> category:Software_companies_of_the_United_States
>> >>> >> >> >> >> >> >> >> >>>> >
>> >>> >> >> category:Software_companies_based_in_Washington_(state)
>> >>> >> >> >> >> >> >> >> >>>> > category:Companies_established_in_1975
>> >>> >> >> >> >> >> >> >> >>>> >
>> >>> category:1975_establishments_in_the_United_States
>> >>> >> >> >> >> >> >> >> >>>> >
>> >>> category:Companies_based_in_Redmond,_Washington
>> >>> >> >> >> >> >> >> >> >>>> >
>> >>> >> >> >> >> >> >>
>> >>> >> >> >> >> >> >>
>> >>> >> >>
>> category:Multinational_companies_headquartered_in_the_United_States
>> >>> >> >> >> >> >> >> >> >>>> > category:Cloud_computing_providers
>> >>> >> >> >> >> >> >> >> >>>> >
>> >>> >> >> category:Companies_in_the_Dow_Jones_Industrial_Average
>> >>> >> >> >> >> >> >> >> >>>> >
>> >>> >> >> >> >> >> >> >> >>>> > So we also have "Companies based in
>> >>> >> >> Redmont,Washington"
>> >>> >> >> >> >> which
>> >>> >> >> >> >> >> >> could
>> >>> >> >> >> >> >> >> >> be
>> >>> >> >> >> >> >> >> >> >>>> > matched.
>> >>> >> >> >> >> >> >> >> >>>> >
>> >>> >> >> >> >> >> >> >> >>>> >
>> >>> >> >> >> >> >> >> >> >>>> > There is still other contextual
>> information
>> >>> from
>> >>> >> >> >> >> >> >> >> >>>> > dbpedia
>> >>> >> >> >> >> which
>> >>> >> >> >> >> >> >> can
>> >>> >> >> >> >> >> >> >> be
>> >>> >> >> >> >> >> >> >> >>>> used.
>> >>> >> >> >> >> >> >> >> >>>> > For example for an Organization we could
>> also
>> >>> >> >> include :
>> >>> >> >> >> >> >> >> >> >>>> > dbpprop:industry = Software
>> >>> >> >> >> >> >> >> >> >>>> > dbpprop:service = Online Service Providers
>> >>> >> >> >> >> >> >> >> >>>> >
>> >>> >> >> >> >> >> >> >> >>>> > and for a Person (that's for Barack
>> Obama) :
>> >>> >> >> >> >> >> >> >> >>>> >
>> >>> >> >> >> >> >> >> >> >>>> > dbpedia-owl:profession:
>> >>> >> >> >> >> >> >> >> >>>> >
>>  dbpedia:Author
>> >>> >> >> >> >> >> >> >> >>>> >
>> >>> >> >> >> >> >> >> >> >>>> > dbpedia:Constitutional_law
>> >>> >> >> >> >> >> >> >> >>>> >
>>  dbpedia:Lawyer
>> >>> >> >> >> >> >> >> >> >>>> >
>> >>> >> >> >> >> >> >> >> >>>> > dbpedia:Community_organizing
>> >>> >> >> >> >> >> >> >> >>>> >
>> >>> >> >> >> >> >> >> >> >>>> > I'd like to continue investigating this
>> as I
>> >>> think
>> >>> >> >> that
>> >>> >> >> >> >> >> >> >> >>>> > it
>> >>> >> >> >> >> may
>> >>> >> >> >> >> >> >> have
>> >>> >> >> >> >> >> >> >> >>>> some
>> >>> >> >> >> >> >> >> >> >>>> > value in increasing the number of
>> coreference
>> >>> >> >> >> >> >> >> >> >>>> > resolutions
>> >>> >> >> >> >> and
>> >>> >> >> >> >> >> I'd
>> >>> >> >> >> >> >> >> >> like
>> >>> >> >> >> >> >> >> >> >>>> to
>> >>> >> >> >> >> >> >> >> >>>> > concentrate more on precision rather than
>> >>> recall
>> >>> >> >> since
>> >>> >> >> >> >> >> >> >> >>>> > we
>> >>> >> >> >> >> >> already
>> >>> >> >> >> >> >> >> >> have
>> >>> >> >> >> >> >> >> >> >>>> a
>> >>> >> >> >> >> >> >> >> >>>> > set of coreferences detected by the
>> stanford
>> >>> nlp
>> >>> >> tool
>> >>> >> >> >> >> >> >> >> >>>> > and
>> >>> >> >> >> >> this
>> >>> >> >> >> >> >> >> would
>> >>> >> >> >> >> >> >> >> >>>> be as
>> >>> >> >> >> >> >> >> >> >>>> > an addition to that (at least this is how
>> I
>> >>> would
>> >>> >> >> like
>> >>> >> >> >> >> >> >> >> >>>> > to
>> >>> >> >> >> >> use
>> >>> >> >> >> >> >> >> it).
>> >>> >> >> >> >> >> >> >> >>>> >
>> >>> >> >> >> >> >> >> >> >>>> > Is it ok if I track this by opening a
>> jira? I
>> >>> >> could
>> >>> >> >> >> >> >> >> >> >>>> > update
>> >>> >> >> >> >> it
>> >>> >> >> >> >> >> to
>> >>> >> >> >> >> >> >> >> show
>> >>> >> >> >> >> >> >> >> >>>> my
>> >>> >> >> >> >> >> >> >> >>>> > progress and also my conclusions and if it
>> >>> turns
>> >>> >> out
>> >>> >> >> >> >> >> >> >> >>>> > that
>> >>> >> >> >> >> it
>> >>> >> >> >> >> >> was
>> >>> >> >> >> >> >> >> a
>> >>> >> >> >> >> >> >> >> bad
>> >>> >> >> >> >> >> >> >> >>>> idea
>> >>> >> >> >> >> >> >> >> >>>> > then that's the situation at least I'll
>> end up
>> >>> >> with
>> >>> >> >> >> >> >> >> >> >>>> > more
>> >>> >> >> >> >> >> >> knowledge
>> >>> >> >> >> >> >> >> >> >>>> about
>> >>> >> >> >> >> >> >> >> >>>> > Stanbol in the end :).
>> >>> >> >> >> >> >> >> >> >>>> >
>> >>> >> >> >> >> >> >> >> >>>> >
>> >>> >> >> >> >> >> >> >> >>>> > 2014-02-05 15:39 GMT+02:00 Rafa Haro
>> >>> >> >> >> >> >> >> >> >>>> > <rh...@apache.org>:
>> >>> >> >> >> >> >> >> >> >>>> >
>> >>> >> >> >> >> >> >> >> >>>> >> Hi Cristian,
>> >>> >> >> >> >> >> >> >> >>>> >>
>> >>> >> >> >> >> >> >> >> >>>> >> The approach sounds nice. I don't want
>> to be
>> >>> the
>> >>> >> >> >> >> >> >> >> >>>> >> devil's
>> >>> >> >> >> >> >> >> advocate
>> >>> >> >> >> >> >> >> >> but
>> >>> >> >> >> >> >> >> >> >>>> I'm
>> >>> >> >> >> >> >> >> >> >>>> >> just not sure about the recall using the
>> >>> dbpedia
>> >>> >> >> >> >> categories
>> >>> >> >> >> >> >> >> >> feature.
>> >>> >> >> >> >> >> >> >> >>>> For
>> >>> >> >> >> >> >> >> >> >>>> >> example, your sentence could be also
>> >>> "Microsoft
>> >>> >> >> posted
>> >>> >> >> >> >> >> >> >> >>>> >> its
>> >>> >> >> >> >> >> 2013
>> >>> >> >> >> >> >> >> >> >>>> earnings.
>> >>> >> >> >> >> >> >> >> >>>> >> The Redmond's company made a huge
>> profit".
>> >>> So,
>> >>> >> maybe
>> >>> >> >> >> >> >> including
>> >>> >> >> >> >> >> >> more
>> >>> >> >> >> >> >> >> >> >>>> >> contextual information from dbpedia could
>> >>> >> increase
>> >>> >> >> the
>> >>> >> >> >> >> recall
>> >>> >> >> >> >> >> >> but
>> >>> >> >> >> >> >> >> >> of
>> >>> >> >> >> >> >> >> >> >>>> course
>> >>> >> >> >> >> >> >> >> >>>> >> will reduce the precision.
>> >>> >> >> >> >> >> >> >> >>>> >>
>> >>> >> >> >> >> >> >> >> >>>> >> Cheers,
>> >>> >> >> >> >> >> >> >> >>>> >> Rafa
>> >>> >> >> >> >> >> >> >> >>>> >>
>> >>> >> >> >> >> >> >> >> >>>> >> El 04/02/14 09:50, Cristian Petroaca
>> >>> escribió:
>> >>> >> >> >> >> >> >> >> >>>> >>
>> >>> >> >> >> >> >> >> >> >>>> >>  Back with a more detailed description
>> of the
>> >>> >> steps
>> >>> >> >> >> >> >> >> >> >>>> >> for
>> >>> >> >> >> >> >> making
>> >>> >> >> >> >> >> >> this
>> >>> >> >> >> >> >> >> >> >>>> kind of
>> >>> >> >> >> >> >> >> >> >>>> >>> coreference work.
>> >>> >> >> >> >> >> >> >> >>>> >>>
>> >>> >> >> >> >> >> >> >> >>>> >>> I will be using references to the
>> following
>> >>> >> text in
>> >>> >> >> >> >> >> >> >> >>>> >>> the
>> >>> >> >> >> >> >> steps
>> >>> >> >> >> >> >> >> >> below
>> >>> >> >> >> >> >> >> >> >>>> in
>> >>> >> >> >> >> >> >> >> >>>> >>> order to make things clearer :
>> "Microsoft
>> >>> posted
>> >>> >> >> its
>> >>> >> >> >> >> >> >> >> >>>> >>> 2013
>> >>> >> >> >> >> >> >> >> earnings.
>> >>> >> >> >> >> >> >> >> >>>> The
>> >>> >> >> >> >> >> >> >> >>>> >>> software company made a huge profit."
>> >>> >> >> >> >> >> >> >> >>>> >>>
>> >>> >> >> >> >> >> >> >> >>>> >>> 1. For every noun phrase in the text
>> which
>> >>> has :
>> >>> >> >> >> >> >> >> >> >>>> >>>      a. a determinate pos which implies
>> >>> >> reference
>> >>> >> >> to
>> >>> >> >> >> >> >> >> >> >>>> >>> an
>> >>> >> >> >> >> >> entity
>> >>> >> >> >> >> >> >> >> local
>> >>> >> >> >> >> >> >> >> >>>> to
>> >>> >> >> >> >> >> >> >> >>>> >>> the
>> >>> >> >> >> >> >> >> >> >>>> >>> text, such as "the, this, these") but
>> not
>> >>> >> "another,
>> >>> >> >> >> >> every",
>> >>> >> >> >> >> >> etc
>> >>> >> >> >> >> >> >> >> which
>> >>> >> >> >> >> >> >> >> >>>> >>> implies a reference to an entity
>> outside of
>> >>> the
>> >>> >> >> text.
>> >>> >> >> >> >> >> >> >> >>>> >>>      b. having at least another noun
>> aside
>> >>> from
>> >>> >> the
>> >>> >> >> >> >> >> >> >> >>>> >>> main
>> >>> >> >> >> >> >> >> required
>> >>> >> >> >> >> >> >> >> >>>> noun
>> >>> >> >> >> >> >> >> >> >>>> >>> which
>> >>> >> >> >> >> >> >> >> >>>> >>> further describes it. For example I
>> will not
>> >>> >> count
>> >>> >> >> >> >> >> >> >> >>>> >>> "The
>> >>> >> >> >> >> >> >> company"
>> >>> >> >> >> >> >> >> >> as
>> >>> >> >> >> >> >> >> >> >>>> being
>> >>> >> >> >> >> >> >> >> >>>> >>> a
>> >>> >> >> >> >> >> >> >> >>>> >>> legitimate candidate since this could
>> >>> create a
>> >>> >> lot
>> >>> >> >> of
>> >>> >> >> >> >> false
>> >>> >> >> >> >> >> >> >> >>>> positives by
>> >>> >> >> >> >> >> >> >> >>>> >>> considering the double meaning of some
>> words
>> >>> >> such
>> >>> >> >> as
>> >>> >> >> >> >> >> >> >> >>>> >>> "in
>> >>> >> >> >> >> the
>> >>> >> >> >> >> >> >> >> company
>> >>> >> >> >> >> >> >> >> >>>> of
>> >>> >> >> >> >> >> >> >> >>>> >>> good people".
>> >>> >> >> >> >> >> >> >> >>>> >>> "The software company" is a good
>> candidate
>> >>> >> since we
>> >>> >> >> >> >> >> >> >> >>>> >>> also
>> >>> >> >> >> >> >> have
>> >>> >> >> >> >> >> >> >> >>>> "software".
>> >>> >> >> >> >> >> >> >> >>>> >>>
>> >>> >> >> >> >> >> >> >> >>>> >>> 2. match the nouns in the noun phrase
>> to the
>> >>> >> >> contents
>> >>> >> >> >> >> >> >> >> >>>> >>> of
>> >>> >> >> >> >> the
>> >>> >> >> >> >> >> >> >> dbpedia
>> >>> >> >> >> >> >> >> >> >>>> >>> categories of each named entity found
>> prior
>> >>> to
>> >>> >> the
>> >>> >> >> >> >> location
>> >>> >> >> >> >> >> of
>> >>> >> >> >> >> >> >> the
>> >>> >> >> >> >> >> >> >> >>>> noun
>> >>> >> >> >> >> >> >> >> >>>> >>> phrase in the text.
>> >>> >> >> >> >> >> >> >> >>>> >>> The dbpedia categories are in the
>> following
>> >>> >> format
>> >>> >> >> >> >> >> >> >> >>>> >>> (for
>> >>> >> >> >> >> >> >> Microsoft
>> >>> >> >> >> >> >> >> >> for
>> >>> >> >> >> >> >> >> >> >>>> >>> example) : "Software companies of the
>> United
>> >>> >> >> States".
>> >>> >> >> >> >> >> >> >> >>>> >>>   So we try to match "software company"
>> with
>> >>> >> that.
>> >>> >> >> >> >> >> >> >> >>>> >>> First, as you can see, the main noun in
>> the
>> >>> >> dbpedia
>> >>> >> >> >> >> category
>> >>> >> >> >> >> >> >> has a
>> >>> >> >> >> >> >> >> >> >>>> plural
>> >>> >> >> >> >> >> >> >> >>>> >>> form and it's the same for all
>> categories
>> >>> which
>> >>> >> I
>> >>> >> >> >> >> >> >> >> >>>> >>> saw. I
>> >>> >> >> >> >> >> don't
>> >>> >> >> >> >> >> >> >> know
>> >>> >> >> >> >> >> >> >> >>>> if
>> >>> >> >> >> >> >> >> >> >>>> >>> there's an easier way to do this but I
>> >>> thought
>> >>> >> of
>> >>> >> >> >> >> applying a
>> >>> >> >> >> >> >> >> >> >>>> lemmatizer on
>> >>> >> >> >> >> >> >> >> >>>> >>> the category and the noun phrase in
>> order
>> >>> for
>> >>> >> them
>> >>> >> >> to
>> >>> >> >> >> >> have a
>> >>> >> >> >> >> >> >> >> common

Re: Named entity coref resolution based on dbpedia categories and rdf:type

Posted by Cristian Petroaca <cr...@gmail.com>.
Examples :

1. Group membership :
    a. Spatial membership :

        "Microsoft anounced its 2013 earnings. <coref>The Richmond-based
company</coref> made huge profits."

    b. Organisational membership :

       "Mick Jagger started a new solo album. <coref>The Rolling Stones
singer</coref> did not say what the theme will be."

2. Functional membership :

   "Allianz announced its 2013 earnings. <coref>The financial services
company</coref> made a huge profit."

3. If no matches were found for the current NER with the rules above, but the
yago:class which matched has more than 2 nouns, then we also consider this a
co-reference, though with a lower confidence.

   "Boris Becker will take part in a demonstrative tennis match. <coref>The
former tennis player</coref> will play again after 10 years."
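
Roughly, the matching behind these examples could look like the sketch below
(plain Java, just an illustration - the method and its inputs are hypothetical
helpers, not the Entityhub API). The idea: no overlap with the yago:class label
means no candidate at all, and every additional noun of the phrase that shows up
among the entity's dbpedia property values raises the confidence.

    import java.util.HashSet;
    import java.util.Set;

    public class CorefScorer {

        // All inputs are lemmatized, lower-cased nouns.
        public static double score(Set<String> phraseNouns,            // nouns of the noun phrase
                                    Set<String> yagoClassNouns,         // nouns of the matched yago:class label
                                    Set<String> entityPropertyValues) { // values of the dbpedia properties checked by the rules
            Set<String> common = new HashSet<>(phraseNouns);
            common.retainAll(yagoClassNouns);
            if (common.isEmpty()) {
                return 0;                     // no yago:class match -> not a candidate
            }
            double confidence = common.size();
            for (String noun : phraseNouns) {
                if (!yagoClassNouns.contains(noun) && entityPropertyValues.contains(noun)) {
                    confidence += 1;          // extra noun explained by a dbpedia property
                }
            }
            return confidence;
        }
    }

For "The software company" against Microsoft this would give "company" from the
yago:class plus "software" from dbpprop:industry, so two matched nouns.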


2014-03-28 12:22 GMT+02:00 Rupert Westenthaler <
rupert.westenthaler@gmail.com>:

> Hi Cristian, all
>
> Looks good to me, but I am not sure if I got everything. If you could
> provide example texts where those rules apply it would make it much
> easier to understand.
>
> Instead of using dbpedia properties you should define your own domain
> model (ontology). You can then align the dbpedia properties to your
> model. This will also allow the approach to be applied to knowledge
> bases other than dbpedia.
>
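> For example (just a sketch with made-up names, nothing that exists in
> Stanbol yet), the alignment could be a simple lookup from your own
> relations to the dbpedia properties, so that supporting another
> knowledge base only means providing another mapping:
>
>     import java.util.List;
>     import java.util.Map;
>
>     public class CorefOntologyMapping {
>
>         // abstract relations the engine reasons about (made-up names)
>         enum Relation { SPATIAL_CONTEXT, ORG_MEMBERSHIP, FUNCTIONAL_ROLE }
>
>         // alignment of those relations to dbpedia properties, per entity type
>         static final Map<String, Map<Relation, List<String>>> DBPEDIA = Map.of(
>             "person", Map.of(
>                 Relation.SPATIAL_CONTEXT, List.of("dbpedia-owl:birthPlace", "dbpprop:nationality"),
>                 Relation.ORG_MEMBERSHIP,  List.of("dbpedia-owl:occupation", "dbpedia-owl:associatedActs")),
>             "organisation", Map.of(
>                 Relation.SPATIAL_CONTEXT, List.of("dbpedia-owl:foundationPlace", "dbpedia-owl:locationCity"),
>                 Relation.FUNCTIONAL_ROLE, List.of("dbpprop:industry", "dbpprop:service", "dbpedia-owl:genre")));
>     }
>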
> For people new to this thread: The above message adds to the
> suggestion first made by Cristian on 4th February. Also the following
> 4 messages (until 7th Feb) provide additional context.
>
> best
> Rupert
>
>
> On Fri, Mar 28, 2014 at 9:23 AM, Cristian Petroaca
> <cr...@gmail.com> wrote:
> > Hi guys,
> >
> > After Rupert's last suggestions related to this enhancement engine I
> > devised a more comprehensive algorithm for matching the noun phrases
> > against the NER properties. Please take a look and let me know what you
> > think. Thanks.
> >
> > The following rules will be applied to every noun phrase in order to find
> > co-references:
> >
> > 1. For each NER prior to the current noun phrase in the text match the
> > yago:class label to the contents of the noun phrase.
> >
> > For the NERs which have a yago:class which matches, apply:
> >
> > 2. Group membership rules :
> >
> >     a. spatial membership : the NER is part of a Location. If the noun
> > phrase contains a LOCATION or a demonym then check any location
> properties
> > of the matching NER.
> >
> >     If matching NER is a :
> >     - person, match against :birthPlace, :region, :nationality
> >     - organisation, match against :foundationPlace, :locationCity,
> > :location, :hometown
> >     - place, match against :country, :subdivisionName, :location,
> >
> >     Ex: The Italian President, The Richmond-based company
> >
> >     b. organisational membership : the NER is part of an Organisation. If
> > the noun phrase contains an ORGANISATION then check the following
> > properties of the matching NER:
> >
> >     If matching NER is :
> >     - person, match against :occupation, :associatedActs
> >     - organisation ?
> >     - location ?
> >
> > Ex: The Microsoft executive, The Pink Floyd singer
> >
> > 3. Functional description rule: the noun phrase describes what the NER
> does
> > conceptually.
> > If there are no NERs in the noun phrase then match the following
> properties
> > of the matching NER to the contents of the noun phrase (aside from the
> > nouns which are part of the yago:class) :
> >
> >    If NER is a:
> >    - person ?
> >    - organisation : , match against :service, :industry, :genre
> >    - location ?
> >
> > Ex:  The software company.
> >
> > 4. If no matches were found for the current NER with rules 2 or 3 then if
> > the yago:class which matched has more than 2 nouns then we also consider
> > this a good co-reference but with a lower confidence maybe.
> >
> > Ex: The former tennis player, the theoretical physicist.
> >
> > 5. Based on the number of nouns which matched we create a confidence
> level.
> > The number of matched nouns cannot be lower than 2 and we must have a
> > yago:class match.
> >
> > For all NERs which got to this point, select the closest ones in the text
> > to the noun phrase which matched against the same properties (yago:class
> > and dbpedia) and mark them as co-references.
> >
> > Note: all noun phrases need to be lemmatized before all of this in case
> > there are any plurals.
> >
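> > Just to illustrate the final selection (a rough sketch, hypothetical
> > types, not Stanbol API): among all NERs that passed the rules above, the
> > one closest before the noun phrase wins.
> >
> >     import java.util.List;
> >
> >     public class ClosestMatchSelector {
> >
> >         // hypothetical holder for a scored NER candidate
> >         record Candidate(String entityUri, int offset, double confidence) {}
> >
> >         // candidates are assumed to have confidence > 0 and at least 2 matched nouns
> >         static Candidate selectClosest(List<Candidate> candidates, int nounPhraseOffset) {
> >             Candidate best = null;
> >             for (Candidate c : candidates) {
> >                 if (c.offset() >= nounPhraseOffset) {
> >                     continue;               // only NERs prior to the noun phrase count
> >                 }
> >                 if (best == null || c.offset() > best.offset()) {
> >                     best = c;               // closer to the noun phrase wins
> >                 }
> >             }
> >             return best;
> >         }
> >     }
> >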
> >
> > 2014-03-25 20:50 GMT+02:00 Cristian Petroaca <
> cristian.petroaca@gmail.com>:
> >
> >> That worked. Thanks.
> >>
> >> So, there are no exceptions during the startup of the launcher.
> >> The component tab in the felix console shows 6 WeightedChains the first
> >> time, including the default one but after my changes and a restart there
> >> are only 5 - the default one is missing altogether.
> >>
> >>
> >> 2014-03-24 20:18 GMT+02:00 Rupert Westenthaler <
> >> rupert.westenthaler@gmail.com>:
> >>
> >> Hi Cristian,
> >>>
> >>> I do see the same problem since last Friday. The solution as mentions
> >>> by [1] works for me.
> >>>
> >>>     mvn -Djsse.enableSNIExtension=false {goals}
> >>>
> >>> No idea why https connections to github currently cause this. I
> >>> could not find anything related via Google. So I suggest to use the
> >>> system property for now. If this persists for longer we can adapt the
> >>> build files accordingly.
> >>>
> >>> best
> >>> Rupert
> >>>
> >>>
> >>>
> >>>
> >>> [1]
> >>>
> http://stackoverflow.com/questions/7615645/ssl-handshake-alert-unrecognized-name-error-since-upgrade-to-java-1-7-0
> >>>
> >>> On Mon, Mar 24, 2014 at 7:01 PM, Cristian Petroaca
> >>> <cr...@gmail.com> wrote:
> >>> > I did a clean on the whole project and now I wanted to do another
> "mvn
> >>> > clean install" but I am getting this :
> >>> >
> >>> > "[INFO]
> >>> >
> ------------------------------------------------------------------------
> >>> > [ERROR] Failed to execute goal
> >>> > org.apache.maven.plugins:maven-antrun-plugin:1.6:
> >>> > run (download) on project org.apache.stanbol.data.opennlp.lang.es:
> An
> >>> Ant
> >>> > BuildE
> >>> > xception has occured: The following error occurred while executing
> this
> >>> > line:
> >>> > [ERROR]
> >>> >
> C:\Data\Projects\Stanbol\main\data\opennlp\lang\es\download_models.xml:3
> >>> > 3: Failed to copy
> >>> > https://github.com/utcompling/OpenNLP-Models/raw/58ef0c6003140
> >>> > 3e66e47ae35edaf58d3478b67af/models/es/opennlp-es-maxent-pos-es.bin to
> >>> > C:\Data\Pr
> >>> >
> >>>
> ojects\Stanbol\main\data\opennlp\lang\es\downloads\resources\org\apache\stanbol\
> >>> > data\opennlp\es-pos-maxent.bin due to
> javax.net.ssl.SSLProtocolException
> >>> > handshake alert : unrecognized_name"
> >>> >
> >>> >
> >>> >
> >>> > 2014-03-20 11:25 GMT+02:00 Rupert Westenthaler <
> >>> > rupert.westenthaler@gmail.com>:
> >>> >
> >>> >> Hi Cristian,
> >>> >>
> >>> >> On Thu, Mar 20, 2014 at 10:00 AM, Cristian Petroaca
> >>> >> <cr...@gmail.com> wrote:
> >>> >> >
> >>> >>
> >>>
> stanbol.enhancer.chain.weighted.chain=["tika;optional","langdetect","opennlp-sentence","opennlp-token","opennlp-pos","opennlp-ner","dbpediaLinking","entityhubExtraction","dbpedia-dereference","pos-chunker"]
> >>> >> > service.ranking=I"-2147483648"
> >>> >> > stanbol.enhancer.chain.name="default"
> >>> >>
> >>> >> Does look fine to me. Do you see any exception during the startup of
> >>> >> the launcher. Can you check the status of this component in the
> >>> >> component tab of the felix web console [1] (search for
> >>> >> "org.apache.stanbol.enhancer.chain.weighted.impl.WeightedChain"). If
> >>> >> you have multiple you can find the correct one by comparing the
> >>> >> "Properties" with those in the configuration file.
> >>> >>
> >>> >> I guess that the according service is in the 'unsatisfied' as you do
> >>> >> not see it in the web interface. But if this is the case you should
> >>> >> also see the according exception in the log. You can also manually
> >>> >> stop/start the component. In this case the exception should be
> >>> >> re-thrown and you do not need to search the log for it.
> >>> >>
> >>> >> best
> >>> >> Rupert
> >>> >>
> >>> >>
> >>> >> [1] http://localhost:8080/system/console/components
> >>> >>
> >>> >> >
> >>> >> >
> >>> >> >
> >>> >> > 2014-03-20 7:39 GMT+02:00 Rupert Westenthaler <
> >>> >> rupert.westenthaler@gmail.com
> >>> >> >>:
> >>> >> >
> >>> >> >> Hi Cristian,
> >>> >> >>
> >>> >> >> you can not send attachments to the list. Please copy the
> contents
> >>> >> >> directly to the mail
> >>> >> >>
> >>> >> >> thx
> >>> >> >> Rupert
> >>> >> >>
> >>> >> >> On Wed, Mar 19, 2014 at 9:20 PM, Cristian Petroaca
> >>> >> >> <cr...@gmail.com> wrote:
> >>> >> >> > The config attached.
> >>> >> >> >
> >>> >> >> >
> >>> >> >> > 2014-03-19 9:09 GMT+02:00 Rupert Westenthaler
> >>> >> >> > <ru...@gmail.com>:
> >>> >> >> >
> >>> >> >> >> Hi Cristian,
> >>> >> >> >>
> >>> >> >> >> can you provide the contents of the chain after your
> >>> modifications?
> >>> >> >> >> Would be interesting to test why the chain is no longer active
> >>> after
> >>> >> >> >> the restart.
> >>> >> >> >>
> >>> >> >> >> You can find the config file in the 'stanbol/fileinstall'
> folder.
> >>> >> >> >>
> >>> >> >> >> best
> >>> >> >> >> Rupert
> >>> >> >> >>
> >>> >> >> >> On Tue, Mar 18, 2014 at 8:24 PM, Cristian Petroaca
> >>> >> >> >> <cr...@gmail.com> wrote:
> >>> >> >> >> > Related to the default chain selection rules : before
> restart I
> >>> >> had a
> >>> >> >> >> > chain
> >>> >> >> >> > with the name 'default' as in I could access it via
> >>> >> >> >> > enhancer/chain/default.
> >>> >> >> >> > Then I just added another engine to the 'default' chain. I
> >>> assumed
> >>> >> >> that
> >>> >> >> >> > after the restart the chain with the 'default' name would be
> >>> >> >> persisted.
> >>> >> >> >> > So
> >>> >> >> >> > the first rule should have been applied after the restart as
> >>> well.
> >>> >> But
> >>> >> >> >> > instead I cannot reach it via enhancer/chain/default anymore
> >>> so its
> >>> >> >> >> > gone.
> >>> >> >> >> > Anyway, this is not a big deal, it's not blocking me in any
> >>> way, I
> >>> >> >> just
> >>> >> >> >> > wanted to understand where the problem is.
> >>> >> >> >> >
> >>> >> >> >> >
> >>> >> >> >> > 2014-03-18 7:15 GMT+02:00 Rupert Westenthaler
> >>> >> >> >> > <rupert.westenthaler@gmail.com
> >>> >> >> >> >>:
> >>> >> >> >> >
> >>> >> >> >> >> Hi Cristian
> >>> >> >> >> >>
> >>> >> >> >> >> On Mon, Mar 17, 2014 at 9:43 PM, Cristian Petroaca
> >>> >> >> >> >> <cr...@gmail.com> wrote:
> >>> >> >> >> >> > 1. Updated to the latest code and it's gone. Cool
> >>> >> >> >> >> >
> >>> >> >> >> >> > 2. I start the stable launcher -> create a new instance
> of
> >>> the
> >>> >> >> >> >> > PosChunkerEngine -> add it to the default chain. At this
> >>> point
> >>> >> >> >> >> > everything
> >>> >> >> >> >> > looks good and works ok.
> >>> >> >> >> >> > After I restart the server the default chain is gone and
> >>> >> instead I
> >>> >> >> >> >> > see
> >>> >> >> >> >> this
> >>> >> >> >> >> > in the enhancement chains page : all-active (default, id:
> >>> 149,
> >>> >> >> >> >> > ranking:
> >>> >> >> >> >> 0,
> >>> >> >> >> >> > impl: AllActiveEnginesChain ). all-active did not contain
> >>> the
> >>> >> >> >> >> > 'default'
> >>> >> >> >> >> > word before the restart.
> >>> >> >> >> >> >
> >>> >> >> >> >>
> >>> >> >> >> >> Please note the default chain selection rules as described
> at
> >>> [1].
> >>> >> >> You
> >>> >> >> >> >> can also access chains chains under
> >>> '/enhancer/chain/{chain-name}'
> >>> >> >> >> >>
> >>> >> >> >> >> best
> >>> >> >> >> >> Rupert
> >>> >> >> >> >>
> >>> >> >> >> >> [1]
> >>> >> >> >> >>
> >>> >> >> >> >>
> >>> >> >>
> >>> >>
> >>>
> http://stanbol.staging.apache.org/docs/trunk/components/enhancer/chains/#default-chain
> >>> >> >> >> >>
> >>> >> >> >> >> > It looks like the config files are exactly what I need.
> >>> Thanks.
> >>> >> >> >> >> >
> >>> >> >> >> >> >
> >>> >> >> >> >> > 2014-03-17 9:26 GMT+02:00 Rupert Westenthaler <
> >>> >> >> >> >> rupert.westenthaler@gmail.com
> >>> >> >> >> >> >>:
> >>> >> >> >> >> >
> >>> >> >> >> >> >> On Sat, Mar 15, 2014 at 8:34 PM, Cristian Petroaca
> >>> >> >> >> >> >> <cr...@gmail.com> wrote:
> >>> >> >> >> >> >> > Thanks Rupert.
> >>> >> >> >> >> >> >
> >>> >> >> >> >> >> > A couple more questions/issues :
> >>> >> >> >> >> >> >
> >>> >> >> >> >> >> > 1. Whenever I start the stanbol server I'm seeing this
> >>> in the
> >>> >> >> >> >> >> > console
> >>> >> >> >> >> >> > output :
> >>> >> >> >> >> >> >
> >>> >> >> >> >> >>
> >>> >> >> >> >> >> This should be fixed with STANBOL-1278 [1] [2]
> >>> >> >> >> >> >>
> >>> >> >> >> >> >> > 2. Whenever I restart the server the Weighted Chains
> get
> >>> >> messed
> >>> >> >> >> >> >> > up. I
> >>> >> >> >> >> >> > usually use the 'default' chain and add my engine to
> it
> >>> so
> >>> >> there
> >>> >> >> >> >> >> > are
> >>> >> >> >> >> 11
> >>> >> >> >> >> >> > engines in it. After the restart this chain now
> contains
> >>> >> around
> >>> >> >> 23
> >>> >> >> >> >> >> engines
> >>> >> >> >> >> >> > in total.
> >>> >> >> >> >> >>
> >>> >> >> >> >> >> I was not able to replicate this. What I tried was
> >>> >> >> >> >> >>
> >>> >> >> >> >> >> (1) start up the stable launcher
> >>> >> >> >> >> >> (2) add an additional engine to the default chain
> >>> >> >> >> >> >> (3) restart the launcher
> >>> >> >> >> >> >>
> >>> >> >> >> >> >> The default chain was not changed after (2) and (3). So
> I
> >>> would
> >>> >> >> need
> >>> >> >> >> >> >> further information for knowing why this is happening.
> >>> >> >> >> >> >>
> >>> >> >> >> >> >> Generally it is better to create you own chain instance
> as
> >>> >> >> modifying
> >>> >> >> >> >> >> one that is provided by the default configuration. I
> would
> >>> also
> >>> >> >> >> >> >> recommend that you keep your test configuration in text
> >>> files
> >>> >> and
> >>> >> >> to
> >>> >> >> >> >> >> copy those to the 'stanbol/fileinstall' folder. Doing so
> >>> >> prevent
> >>> >> >> you
> >>> >> >> >> >> >> from manually entering the configuration after a
> software
> >>> >> update.
> >>> >> >> >> >> >> The
> >>> >> >> >> >> >> production-mode section [3] provides information on how
> to
> >>> do
> >>> >> >> that.
> >>> >> >> >> >> >>
> >>> >> >> >> >> >> best
> >>> >> >> >> >> >> Rupert
> >>> >> >> >> >> >>
> >>> >> >> >> >> >> [1] https://issues.apache.org/jira/browse/STANBOL-1278
> >>> >> >> >> >> >> [2] http://svn.apache.org/r1576623
> >>> >> >> >> >> >> [3]
> http://stanbol.apache.org/docs/trunk/production-mode
> >>> >> >> >> >> >>
> >>> >> >> >> >> >> > ERROR: Bundle
> >>> org.apache.stanbol.enhancer.engine.topic.web
> >>> >> >> [153]:
> >>> >> >> >> >> Error
> >>> >> >> >> >> >> > starting
> >>> >> >> >> >> >> >
> >>> >> >> >> >> >>
> >>> >> >> >> >>
> >>> >> >> >> >>
> >>> >> >>
> >>> >>
> >>>
> slinginstall:c:\Data\Projects\Stanbol\main\launchers\stable\target\stanbol\star
> >>> >> >> >> >> >> >
> >>> >> >> >> >> >> >
> >>> >> >>
> >>> tup\35\org.apache.stanbol.enhancer.engine.topic.web-1.0.0-SNAPSHOT.jar
> >>> >> >> >> >> >> > (org.osgi
> >>> >> >> >> >> >> > .framework.BundleException: Unresolved constraint in
> >>> bundle
> >>> >> >> >> >> >> > org.apache.stanbol.e
> >>> >> >> >> >> >> > nhancer.engine.topic.web [153]: Unable to resolve
> 153.0:
> >>> >> missing
> >>> >> >> >> >> >> > requirement [15
> >>> >> >> >> >> >> > 3.0] package; (&(package=javax.ws.rs
> >>> >> >> >> >> >> )(version>=0.0.0)(!(version>=2.0.0))))
> >>> >> >> >> >> >> > org.osgi.framework.BundleException: Unresolved
> >>> constraint in
> >>> >> >> >> >> >> > bundle
> >>> >> >> >> >> >> > org.apache.s
> >>> >> >> >> >> >> > tanbol.enhancer.engine.topic.web [153]: Unable to
> resolve
> >>> >> 153.0:
> >>> >> >> >> >> missing
> >>> >> >> >> >> >> > require
> >>> >> >> >> >> >> > ment [153.0] package; (&(package=javax.ws.rs
> >>> >> >> >> >> >> > )(version>=0.0.0)(!(version>=2.0.0))
> >>> >> >> >> >> >> > )
> >>> >> >> >> >> >> >         at
> >>> >> >> >> >> >>
> >>> org.apache.felix.framework.Felix.resolveBundle(Felix.java:3443)
> >>> >> >> >> >> >> >         at
> >>> >> >> >> >>
> org.apache.felix.framework.Felix.startBundle(Felix.java:1727)
> >>> >> >> >> >> >> >         at
> >>> >> >> >> >> >> >
> >>> >> >> >> >> >> >
> >>> >> >>
> >>> org.apache.felix.framework.Felix.setActiveStartLevel(Felix.java:1156)
> >>> >> >> >> >> >> >
> >>> >> >> >> >> >> >         at
> >>> >> >> >> >> >> >
> >>> >> >> >> >> >> >
> >>> >> >>
> >>> org.apache.felix.framework.StartLevelImpl.run(StartLevelImpl.java:264
> >>> >> >> >> >> >> > )
> >>> >> >> >> >> >> >         at java.lang.Thread.run(Unknown Source)
> >>> >> >> >> >> >> >
> >>> >> >> >> >> >> > Despite of this the server starts fine and I can use
> the
> >>> >> >> enhancer
> >>> >> >> >> >> fine.
> >>> >> >> >> >> >> Do
> >>> >> >> >> >> >> > you guys see this as well?
> >>> >> >> >> >> >> >
> >>> >> >> >> >> >> >
> >>> >> >> >> >> >> > 2. Whenever I restart the server the Weighted Chains
> get
> >>> >> messed
> >>> >> >> >> >> >> > up. I
> >>> >> >> >> >> >> > usually use the 'default' chain and add my engine to
> it
> >>> so
> >>> >> there
> >>> >> >> >> >> >> > are
> >>> >> >> >> >> 11
> >>> >> >> >> >> >> > engines in it. After the restart this chain now
> contains
> >>> >> around
> >>> >> >> 23
> >>> >> >> >> >> >> engines
> >>> >> >> >> >> >> > in total.
> >>> >> >> >> >> >> >
> >>> >> >> >> >> >> >
> >>> >> >> >> >> >> >
> >>> >> >> >> >> >> >
> >>> >> >> >> >> >> > 2014-03-11 9:47 GMT+02:00 Rupert Westenthaler <
> >>> >> >> >> >> >> rupert.westenthaler@gmail.com
> >>> >> >> >> >> >> >>:
> >>> >> >> >> >> >> >
> >>> >> >> >> >> >> >> Hi Cristian,
> >>> >> >> >> >> >> >>
> >>> >> >> >> >> >> >> NER Annotations are typically available as both
> >>> >> >> >> >> >> >> NlpAnnotations.NER_ANNOTATION and
>  fise:TextAnnotation
> >>> [1]
> >>> >> in
> >>> >> >> the
> >>> >> >> >> >> >> >> enhancement metadata. As you are already accessing
> the
> >>> >> >> >> >> >> >> AnayzedText I
> >>> >> >> >> >> >> >> would prefer using the
>  NlpAnnotations.NER_ANNOTATION.
> >>> >> >> >> >> >> >>
> >>> >> >> >> >> >> >> best
> >>> >> >> >> >> >> >> Rupert
> >>> >> >> >> >> >> >>
> >>> >> >> >> >> >> >> [1]
> >>> >> >> >> >> >> >>
> >>> >> >> >> >> >>
> >>> >> >> >> >>
> >>> >> >> >> >>
> >>> >> >>
> >>> >>
> >>>
> http://stanbol.apache.org/docs/trunk/components/enhancer/enhancementstructure.html#fisetextannotation
> >>> >> >> >> >> >> >>
> >>> >> >> >> >> >> >> On Mon, Mar 10, 2014 at 10:07 PM, Cristian Petroaca
> >>> >> >> >> >> >> >> <cr...@gmail.com> wrote:
> >>> >> >> >> >> >> >> > Thanks.
> >>> >> >> >> >> >> >> > I assume I should get the Named entities using the
> >>> same
> >>> >> but
> >>> >> >> >> >> >> >> > with
> >>> >> >> >> >> >> >> > NlpAnnotations.NER_ANNOTATION?
> >>> >> >> >> >> >> >> >
> >>> >> >> >> >> >> >> >
> >>> >> >> >> >> >> >> >
> >>> >> >> >> >> >> >> > 2014-03-10 13:29 GMT+02:00 Rupert Westenthaler <
> >>> >> >> >> >> >> >> > rupert.westenthaler@gmail.com>:
> >>> >> >> >> >> >> >> >
> >>> >> >> >> >> >> >> >> Hallo Cristian,
> >>> >> >> >> >> >> >> >>
> >>> >> >> >> >> >> >> >> NounPhrases are not added to the RDF enhancement
> >>> results.
> >>> >> >> You
> >>> >> >> >> >> need to
> >>> >> >> >> >> >> >> >> use the AnalyzedText ContentPart [1]
> >>> >> >> >> >> >> >> >>
> >>> >> >> >> >> >> >> >> here is some demo code you can use in the
> >>> >> computeEnhancement
> >>> >> >> >> >> method
> >>> >> >> >> >> >> >> >>
> >>> >> >> >> >> >> >> >>         AnalysedText at = NlpEngineHelper.getAnalysedText(this, ci, true);
> >>> >> >> >> >> >> >> >>         Iterator<? extends Section> sections = at.getSentences();
> >>> >> >> >> >> >> >> >>         if(!sections.hasNext()){ //process as single sentence
> >>> >> >> >> >> >> >> >>             sections = Collections.singleton(at).iterator();
> >>> >> >> >> >> >> >> >>         }
> >>> >> >> >> >> >> >> >>
> >>> >> >> >> >> >> >> >>         while(sections.hasNext()){
> >>> >> >> >> >> >> >> >>             Section section = sections.next();
> >>> >> >> >> >> >> >> >>             Iterator<Span> chunks = section.getEnclosed(EnumSet.of(SpanTypeEnum.Chunk));
> >>> >> >> >> >> >> >> >>             while(chunks.hasNext()){
> >>> >> >> >> >> >> >> >>                 Span chunk = chunks.next();
> >>> >> >> >> >> >> >> >>                 Value<PhraseTag> phrase = chunk.getAnnotation(NlpAnnotations.PHRASE_ANNOTATION);
> >>> >> >> >> >> >> >> >>                 if(phrase.value().getCategory() == LexicalCategory.Noun){
> >>> >> >> >> >> >> >> >>                     log.info(" - NounPhrase [{},{}] {}", new Object[]{
> >>> >> >> >> >> >> >> >>                             chunk.getStart(),chunk.getEnd(),chunk.getSpan()});
> >>> >> >> >> >> >> >> >>                 }
> >>> >> >> >> >> >> >> >>             }
> >>> >> >> >> >> >> >> >>         }
> >>> >> >> >> >> >> >> >>
> >>> >> >> >> >> >> >> >> hope this helps
> >>> >> >> >> >> >> >> >>
> >>> >> >> >> >> >> >> >> best
> >>> >> >> >> >> >> >> >> Rupert
> >>> >> >> >> >> >> >> >>
> >>> >> >> >> >> >> >> >> [1]
> >>> >> >> >> >> >> >> >>
> >>> >> >> >> >> >> >>
> >>> >> >> >> >> >>
> >>> >> >> >> >>
> >>> >> >> >> >>
> >>> >> >>
> >>> >>
> >>>
> http://stanbol.apache.org/docs/trunk/components/enhancer/nlp/analyzedtext
> >>> >> >> >> >> >> >> >>
> >>> >> >> >> >> >> >> >> On Sun, Mar 9, 2014 at 6:07 PM, Cristian Petroaca
> >>> >> >> >> >> >> >> >> <cr...@gmail.com> wrote:
> >>> >> >> >> >> >> >> >> > I started to implement the engine and I'm having
> >>> >> problems
> >>> >> >> >> >> >> >> >> > with
> >>> >> >> >> >> >> getting
> >>> >> >> >> >> >> >> >> > results for noun phrases. I modified the
> "default"
> >>> >> >> weighted
> >>> >> >> >> >> chain
> >>> >> >> >> >> >> to
> >>> >> >> >> >> >> >> also
> >>> >> >> >> >> >> >> >> > include the PosChunkerEngine and ran a sample
> text
> >>> :
> >>> >> >> "Angela
> >>> >> >> >> >> Merkel
> >>> >> >> >> >> >> >> >> visted
> >>> >> >> >> >> >> >> >> > China. The german chancellor met with various
> >>> people".
> >>> >> I
> >>> >> >> >> >> expected
> >>> >> >> >> >> >> that
> >>> >> >> >> >> >> >> >> the
> >>> >> >> >> >> >> >> >> > RDF XML output would contain some info about the
> >>> noun
> >>> >> >> >> >> >> >> >> > phrases
> >>> >> >> >> >> but I
> >>> >> >> >> >> >> >> >> cannot
> >>> >> >> >> >> >> >> >> > see any.
> >>> >> >> >> >> >> >> >> > Could you point me to the correct way to
> generate
> >>> the
> >>> >> noun
> >>> >> >> >> >> phrases?
> >>> >> >> >> >> >> >> >> >
> >>> >> >> >> >> >> >> >> > Thanks,
> >>> >> >> >> >> >> >> >> > Cristian
> >>> >> >> >> >> >> >> >> >
> >>> >> >> >> >> >> >> >> >
> >>> >> >> >> >> >> >> >> > 2014-02-09 14:15 GMT+02:00 Cristian Petroaca <
> >>> >> >> >> >> >> >> >> cristian.petroaca@gmail.com>:
> >>> >> >> >> >> >> >> >> >
> >>> >> >> >> >> >> >> >> >> Opened
> >>> >> >> https://issues.apache.org/jira/browse/STANBOL-1279
> >>> >> >> >> >> >> >> >> >>
> >>> >> >> >> >> >> >> >> >>
> >>> >> >> >> >> >> >> >> >> 2014-02-07 10:53 GMT+02:00 Cristian Petroaca <
> >>> >> >> >> >> >> >> >> cristian.petroaca@gmail.com>
> >>> >> >> >> >> >> >> >> >> :
> >>> >> >> >> >> >> >> >> >>
> >>> >> >> >> >> >> >> >> >> Hi Rupert,
> >>> >> >> >> >> >> >> >> >>>
> >>> >> >> >> >> >> >> >> >>> The "spatial" dimension is a good idea. I'll
> also
> >>> >> take a
> >>> >> >> >> >> >> >> >> >>> look
> >>> >> >> >> >> at
> >>> >> >> >> >> >> >> Yago.
> >>> >> >> >> >> >> >> >> >>>
> >>> >> >> >> >> >> >> >> >>> I will create a Jira with what we talked about
> >>> here.
> >>> >> It
> >>> >> >> >> >> >> >> >> >>> will
> >>> >> >> >> >> >> >> probably
> >>> >> >> >> >> >> >> >> >>> have just a draft-like description for now and
> >>> will
> >>> >> be
> >>> >> >> >> >> >> >> >> >>> updated
> >>> >> >> >> >> >> as I
> >>> >> >> >> >> >> >> go
> >>> >> >> >> >> >> >> >> >>> along.
> >>> >> >> >> >> >> >> >> >>>
> >>> >> >> >> >> >> >> >> >>> Thanks,
> >>> >> >> >> >> >> >> >> >>> Cristian
> >>> >> >> >> >> >> >> >> >>>
> >>> >> >> >> >> >> >> >> >>>
> >>> >> >> >> >> >> >> >> >>> 2014-02-06 15:39 GMT+02:00 Rupert
> Westenthaler <
> >>> >> >> >> >> >> >> >> >>> rupert.westenthaler@gmail.com>:
> >>> >> >> >> >> >> >> >> >>>
> >>> >> >> >> >> >> >> >> >>> Hi Cristian,
> >>> >> >> >> >> >> >> >> >>>>
> >>> >> >> >> >> >> >> >> >>>> definitely an interesting approach. You
> should
> >>> have
> >>> >> a
> >>> >> >> >> >> >> >> >> >>>> look at
> >>> >> >> >> >> >> Yago2
> >>> >> >> >> >> >> >> >> >>>> [1]. As far as I can remember the Yago
> taxonomy
> >>> is
> >>> >> much
> >>> >> >> >> >> better
> >>> >> >> >> >> >> >> >> >>>> structured as the one used by dbpedia.
> Mapping
> >>> >> >> >> >> >> >> >> >>>> suggestions of
> >>> >> >> >> >> >> >> dbpedia
> >>> >> >> >> >> >> >> >> >>>> to concepts in Yago2 is easy as both dbpedia
> and
> >>> >> yago2
> >>> >> >> do
> >>> >> >> >> >> >> provide
> >>> >> >> >> >> >> >> >> >>>> mappings [2] and [3]
> >>> >> >> >> >> >> >> >> >>>>
> >>> >> >> >> >> >> >> >> >>>> > 2014-02-05 15:39 GMT+02:00 Rafa Haro
> >>> >> >> >> >> >> >> >> >>>> > <rh...@apache.org>:
> >>> >> >> >> >> >> >> >> >>>> >>
> >>> >> >> >> >> >> >> >> >>>> >> "Microsoft posted its 2013 earnings. The
> >>> >> Redmond's
> >>> >> >> >> >> >> >> >> >>>> >> company
> >>> >> >> >> >> >> made
> >>> >> >> >> >> >> >> a
> >>> >> >> >> >> >> >> >> >>>> >> huge profit".
> >>> >> >> >> >> >> >> >> >>>>
> >>> >> >> >> >> >> >> >> >>>> Thats actually a very good example. Spatial
> >>> contexts
> >>> >> >> are
> >>> >> >> >> >> >> >> >> >>>> very
> >>> >> >> >> >> >> >> >> >>>> important as they tend to be often used for
> >>> >> >> referencing.
> >>> >> >> >> >> >> >> >> >>>> So I
> >>> >> >> >> >> >> would
> >>> >> >> >> >> >> >> >> >>>> suggest to specially treat the spatial
> context.
> >>> For
> >>> >> >> >> >> >> >> >> >>>> spatial
> >>> >> >> >> >> >> >> Entities
> >>> >> >> >> >> >> >> >> >>>> (like a City) this is easy, but even for
> other
> >>> >> (like a
> >>> >> >> >> >> Person,
> >>> >> >> >> >> >> >> >> >>>> Company) you could use relations to spatial
> >>> entities
> >>> >> >> >> >> >> >> >> >>>> define
> >>> >> >> >> >> >> their
> >>> >> >> >> >> >> >> >> >>>> spatial context. This context could than be
> >>> used to
> >>> >> >> >> >> >> >> >> >>>> correctly
> >>> >> >> >> >> >> link
> >>> >> >> >> >> >> >> >> >>>> "The Redmond's company" to "Microsoft".
> >>> >> >> >> >> >> >> >> >>>>
> >>> >> >> >> >> >> >> >> >>>> In addition I would suggest to use the
> "spatial"
> >>> >> >> context
> >>> >> >> >> >> >> >> >> >>>> of
> >>> >> >> >> >> each
> >>> >> >> >> >> >> >> >> >>>> entity (basically relation to entities that
> are
> >>> >> cities,
> >>> >> >> >> >> regions,
> >>> >> >> >> >> >> >> >> >>>> countries) as a separate dimension, because
> >>> those
> >>> >> are
> >>> >> >> >> >> >> >> >> >>>> very
> >>> >> >> >> >> often
> >>> >> >> >> >> >> >> used
> >>> >> >> >> >> >> >> >> >>>> for coreferences.
> >>> >> >> >> >> >> >> >> >>>>
> >>> >> >> >> >> >> >> >> >>>> [1]
> http://www.mpi-inf.mpg.de/yago-naga/yago/
> >>> >> >> >> >> >> >> >> >>>> [2]
> >>> >> >> >> >> >> >> >> >>>>
> >>> >> >> http://downloads.dbpedia.org/3.9/links/yago_links.nt.bz2
> >>> >> >> >> >> >> >> >> >>>> [3]
> >>> >> >> >> >> >> >> >> >>>>
> >>> >> >> >> >> >> >> >>
> >>> >> >> >> >> >> >>
> >>> >> >> >> >> >>
> >>> >> >> >> >>
> >>> >> >> >> >>
> >>> >> >>
> >>> >>
> >>>
> http://www.mpi-inf.mpg.de/yago-naga/yago/download/yago/yagoDBpediaInstances.ttl.7z
> >>> >> >> >> >> >> >> >> >>>>
> >>> >> >> >> >> >> >> >> >>>>
> >>> >> >> >> >> >> >> >> >>>> On Thu, Feb 6, 2014 at 10:33 AM, Cristian
> >>> Petroaca
> >>> >> >> >> >> >> >> >> >>>> <cr...@gmail.com> wrote:
> >>> >> >> >> >> >> >> >> >>>> > There are several dbpedia categories for
> each
> >>> >> entity,
> >>> >> >> >> >> >> >> >> >>>> > in
> >>> >> >> >> >> this
> >>> >> >> >> >> >> >> case
> >>> >> >> >> >> >> >> >> for
> >>> >> >> >> >> >> >> >> >>>> > Microsoft we have :
> >>> >> >> >> >> >> >> >> >>>> >
> >>> >> >> >> >> >> >> >> >>>> > category:Companies_in_the_NASDAQ-100_Index
> >>> >> >> >> >> >> >> >> >>>> > category:Microsoft
> >>> >> >> >> >> >> >> >> >>>> >
> >>> category:Software_companies_of_the_United_States
> >>> >> >> >> >> >> >> >> >>>> >
> >>> >> >> category:Software_companies_based_in_Washington_(state)
> >>> >> >> >> >> >> >> >> >>>> > category:Companies_established_in_1975
> >>> >> >> >> >> >> >> >> >>>> >
> >>> category:1975_establishments_in_the_United_States
> >>> >> >> >> >> >> >> >> >>>> >
> >>> category:Companies_based_in_Redmond,_Washington
> >>> >> >> >> >> >> >> >> >>>> >
> >>> >> >> >> >> >> >>
> >>> >> >> >> >> >> >>
> >>> >> >>
> category:Multinational_companies_headquartered_in_the_United_States
> >>> >> >> >> >> >> >> >> >>>> > category:Cloud_computing_providers
> >>> >> >> >> >> >> >> >> >>>> >
> >>> >> >> category:Companies_in_the_Dow_Jones_Industrial_Average
> >>> >> >> >> >> >> >> >> >>>> >
> >>> >> >> >> >> >> >> >> >>>> > So we also have "Companies based in
> >>> >> >> Redmont,Washington"
> >>> >> >> >> >> which
> >>> >> >> >> >> >> >> could
> >>> >> >> >> >> >> >> >> be
> >>> >> >> >> >> >> >> >> >>>> > matched.
> >>> >> >> >> >> >> >> >> >>>> >
> >>> >> >> >> >> >> >> >> >>>> >
> >>> >> >> >> >> >> >> >> >>>> > There is still other contextual information
> >>> from
> >>> >> >> >> >> >> >> >> >>>> > dbpedia
> >>> >> >> >> >> which
> >>> >> >> >> >> >> >> can
> >>> >> >> >> >> >> >> >> be
> >>> >> >> >> >> >> >> >> >>>> used.
> >>> >> >> >> >> >> >> >> >>>> > For example for an Organization we could
> also
> >>> >> >> include :
> >>> >> >> >> >> >> >> >> >>>> > dbpprop:industry = Software
> >>> >> >> >> >> >> >> >> >>>> > dbpprop:service = Online Service Providers
> >>> >> >> >> >> >> >> >> >>>> >
> >>> >> >> >> >> >> >> >> >>>> > and for a Person (that's for Barack Obama)
> :
> >>> >> >> >> >> >> >> >> >>>> >
> >>> >> >> >> >> >> >> >> >>>> > dbpedia-owl:profession:
> >>> >> >> >> >> >> >> >> >>>> >
>  dbpedia:Author
> >>> >> >> >> >> >> >> >> >>>> >
> >>> >> >> >> >> >> >> >> >>>> > dbpedia:Constitutional_law
> >>> >> >> >> >> >> >> >> >>>> >
>  dbpedia:Lawyer
> >>> >> >> >> >> >> >> >> >>>> >
> >>> >> >> >> >> >> >> >> >>>> > dbpedia:Community_organizing
> >>> >> >> >> >> >> >> >> >>>> >
> >>> >> >> >> >> >> >> >> >>>> > I'd like to continue investigating this as
> I
> >>> think
> >>> >> >> that
> >>> >> >> >> >> >> >> >> >>>> > it
> >>> >> >> >> >> may
> >>> >> >> >> >> >> >> have
> >>> >> >> >> >> >> >> >> >>>> some
> >>> >> >> >> >> >> >> >> >>>> > value in increasing the number of
> coreference
> >>> >> >> >> >> >> >> >> >>>> > resolutions
> >>> >> >> >> >> and
> >>> >> >> >> >> >> I'd
> >>> >> >> >> >> >> >> >> like
> >>> >> >> >> >> >> >> >> >>>> to
> >>> >> >> >> >> >> >> >> >>>> > concentrate more on precision rather than
> >>> recall
> >>> >> >> since
> >>> >> >> >> >> >> >> >> >>>> > we
> >>> >> >> >> >> >> already
> >>> >> >> >> >> >> >> >> have
> >>> >> >> >> >> >> >> >> >>>> a
> >>> >> >> >> >> >> >> >> >>>> > set of coreferences detected by the
> stanford
> >>> nlp
> >>> >> tool
> >>> >> >> >> >> >> >> >> >>>> > and
> >>> >> >> >> >> this
> >>> >> >> >> >> >> >> would
> >>> >> >> >> >> >> >> >> >>>> be as
> >>> >> >> >> >> >> >> >> >>>> > an addition to that (at least this is how I
> >>> would
> >>> >> >> like
> >>> >> >> >> >> >> >> >> >>>> > to
> >>> >> >> >> >> use
> >>> >> >> >> >> >> >> it).
> >>> >> >> >> >> >> >> >> >>>> >
> >>> >> >> >> >> >> >> >> >>>> > Is it ok if I track this by opening a
> jira? I
> >>> >> could
> >>> >> >> >> >> >> >> >> >>>> > update
> >>> >> >> >> >> it
> >>> >> >> >> >> >> to
> >>> >> >> >> >> >> >> >> show
> >>> >> >> >> >> >> >> >> >>>> my
> >>> >> >> >> >> >> >> >> >>>> > progress and also my conclusions and if it
> >>> turns
> >>> >> out
> >>> >> >> >> >> >> >> >> >>>> > that
> >>> >> >> >> >> it
> >>> >> >> >> >> >> was
> >>> >> >> >> >> >> >> a
> >>> >> >> >> >> >> >> >> bad
> >>> >> >> >> >> >> >> >> >>>> idea
> >>> >> >> >> >> >> >> >> >>>> > then that's the situation at least I'll
> end up
> >>> >> with
> >>> >> >> >> >> >> >> >> >>>> > more
> >>> >> >> >> >> >> >> knowledge
> >>> >> >> >> >> >> >> >> >>>> about
> >>> >> >> >> >> >> >> >> >>>> > Stanbol in the end :).
> >>> >> >> >> >> >> >> >> >>>> >
> >>> >> >> >> >> >> >> >> >>>> >
> >>> >> >> >> >> >> >> >> >>>> > 2014-02-05 15:39 GMT+02:00 Rafa Haro
> >>> >> >> >> >> >> >> >> >>>> > <rh...@apache.org>:
> >>> >> >> >> >> >> >> >> >>>> >
> >>> >> >> >> >> >> >> >> >>>> >> Hi Cristian,
> >>> >> >> >> >> >> >> >> >>>> >>
> >>> >> >> >> >> >> >> >> >>>> >> The approach sounds nice. I don't want to
> be
> >>> the
> >>> >> >> >> >> >> >> >> >>>> >> devil's
> >>> >> >> >> >> >> >> advocate
> >>> >> >> >> >> >> >> >> but
> >>> >> >> >> >> >> >> >> >>>> I'm
> >>> >> >> >> >> >> >> >> >>>> >> just not sure about the recall using the
> >>> dbpedia
> >>> >> >> >> >> categories
> >>> >> >> >> >> >> >> >> feature.
> >>> >> >> >> >> >> >> >> >>>> For
> >>> >> >> >> >> >> >> >> >>>> >> example, your sentence could be also
> >>> "Microsoft
> >>> >> >> posted
> >>> >> >> >> >> >> >> >> >>>> >> its
> >>> >> >> >> >> >> 2013
> >>> >> >> >> >> >> >> >> >>>> earnings.
> >>> >> >> >> >> >> >> >> >>>> >> The Redmond's company made a huge profit".
> >>> So,
> >>> >> maybe
> >>> >> >> >> >> >> including
> >>> >> >> >> >> >> >> more
> >>> >> >> >> >> >> >> >> >>>> >> contextual information from dbpedia could
> >>> >> increase
> >>> >> >> the
> >>> >> >> >> >> recall
> >>> >> >> >> >> >> >> but
> >>> >> >> >> >> >> >> >> of
> >>> >> >> >> >> >> >> >> >>>> course
> >>> >> >> >> >> >> >> >> >>>> >> will reduce the precision.
> >>> >> >> >> >> >> >> >> >>>> >>
> >>> >> >> >> >> >> >> >> >>>> >> Cheers,
> >>> >> >> >> >> >> >> >> >>>> >> Rafa
> >>> >> >> >> >> >> >> >> >>>> >>
> >>> >> >> >> >> >> >> >> >>>> >> El 04/02/14 09:50, Cristian Petroaca
> >>> escribió:
> >>> >> >> >> >> >> >> >> >>>> >>
> >>> >> >> >> >> >> >> >> >>>> >>  Back with a more detailed description of
> the
> >>> >> steps
> >>> >> >> >> >> >> >> >> >>>> >> for
> >>> >> >> >> >> >> making
> >>> >> >> >> >> >> >> this
> >>> >> >> >> >> >> >> >> >>>> kind of
> >>> >> >> >> >> >> >> >> >>>> >>> coreference work.
> >>> >> >> >> >> >> >> >> >>>> >>>
> >>> >> >> >> >> >> >> >> >>>> >>> I will be using references to the
> following
> >>> >> text in
> >>> >> >> >> >> >> >> >> >>>> >>> the
> >>> >> >> >> >> >> steps
> >>> >> >> >> >> >> >> >> below
> >>> >> >> >> >> >> >> >> >>>> in
> >>> >> >> >> >> >> >> >> >>>> >>> order to make things clearer : "Microsoft
> >>> posted
> >>> >> >> its
> >>> >> >> >> >> >> >> >> >>>> >>> 2013
> >>> >> >> >> >> >> >> >> earnings.
> >>> >> >> >> >> >> >> >> >>>> The
> >>> >> >> >> >> >> >> >> >>>> >>> software company made a huge profit."
> >>> >> >> >> >> >> >> >> >>>> >>>
> >>> >> >> >> >> >> >> >> >>>> >>> 1. For every noun phrase in the text
> which
> >>> has :
> >>> >> >> >> >> >> >> >> >>>> >>>      a. a determinate pos which implies
> >>> >> reference
> >>> >> >> to
> >>> >> >> >> >> >> >> >> >>>> >>> an
> >>> >> >> >> >> >> entity
> >>> >> >> >> >> >> >> >> local
> >>> >> >> >> >> >> >> >> >>>> to
> >>> >> >> >> >> >> >> >> >>>> >>> the
> >>> >> >> >> >> >> >> >> >>>> >>> text, such as "the, this, these") but not
> >>> >> "another,
> >>> >> >> >> >> every",
> >>> >> >> >> >> >> etc
> >>> >> >> >> >> >> >> >> which
> >>> >> >> >> >> >> >> >> >>>> >>> implies a reference to an entity outside
> of
> >>> the
> >>> >> >> text.
> >>> >> >> >> >> >> >> >> >>>> >>>      b. having at least another noun
> aside
> >>> from
> >>> >> the
> >>> >> >> >> >> >> >> >> >>>> >>> main
> >>> >> >> >> >> >> >> required
> >>> >> >> >> >> >> >> >> >>>> noun
> >>> >> >> >> >> >> >> >> >>>> >>> which
> >>> >> >> >> >> >> >> >> >>>> >>> further describes it. For example I will
> not
> >>> >> count
> >>> >> >> >> >> >> >> >> >>>> >>> "The
> >>> >> >> >> >> >> >> company"
> >>> >> >> >> >> >> >> >> as
> >>> >> >> >> >> >> >> >> >>>> being
> >>> >> >> >> >> >> >> >> >>>> >>> a
> >>> >> >> >> >> >> >> >> >>>> >>> legitimate candidate since this could
> >>> create a
> >>> >> lot
> >>> >> >> of
> >>> >> >> >> >> false
> >>> >> >> >> >> >> >> >> >>>> positives by
> >>> >> >> >> >> >> >> >> >>>> >>> considering the double meaning of some
> words
> >>> >> such
> >>> >> >> as
> >>> >> >> >> >> >> >> >> >>>> >>> "in
> >>> >> >> >> >> the
> >>> >> >> >> >> >> >> >> company
> >>> >> >> >> >> >> >> >> >>>> of
> >>> >> >> >> >> >> >> >> >>>> >>> good people".
> >>> >> >> >> >> >> >> >> >>>> >>> "The software company" is a good
> candidate
> >>> >> since we
> >>> >> >> >> >> >> >> >> >>>> >>> also
> >>> >> >> >> >> >> have
> >>> >> >> >> >> >> >> >> >>>> "software".
> >>> >> >> >> >> >> >> >> >>>> >>>
> >>> >> >> >> >> >> >> >> >>>> >>> 2. match the nouns in the noun phrase to
> the
> >>> >> >> contents
> >>> >> >> >> >> >> >> >> >>>> >>> of
> >>> >> >> >> >> the
> >>> >> >> >> >> >> >> >> dbpedia
> >>> >> >> >> >> >> >> >> >>>> >>> categories of each named entity found
> prior
> >>> to
> >>> >> the
> >>> >> >> >> >> location
> >>> >> >> >> >> >> of
> >>> >> >> >> >> >> >> the
> >>> >> >> >> >> >> >> >> >>>> noun
> >>> >> >> >> >> >> >> >> >>>> >>> phrase in the text.
> >>> >> >> >> >> >> >> >> >>>> >>> The dbpedia categories are in the
> following
> >>> >> format
> >>> >> >> >> >> >> >> >> >>>> >>> (for
> >>> >> >> >> >> >> >> Microsoft
> >>> >> >> >> >> >> >> >> for
> >>> >> >> >> >> >> >> >> >>>> >>> example) : "Software companies of the
> United
> >>> >> >> States".
> >>> >> >> >> >> >> >> >> >>>> >>>   So we try to match "software company"
> with
> >>> >> that.
> >>> >> >> >> >> >> >> >> >>>> >>> First, as you can see, the main noun in
> the
> >>> >> dbpedia
> >>> >> >> >> >> category
> >>> >> >> >> >> >> >> has a
> >>> >> >> >> >> >> >> >> >>>> plural
> >>> >> >> >> >> >> >> >> >>>> >>> form and it's the same for all categories
> >>> which
> >>> >> I
> >>> >> >> >> >> >> >> >> >>>> >>> saw. I
> >>> >> >> >> >> >> don't
> >>> >> >> >> >> >> >> >> know
> >>> >> >> >> >> >> >> >> >>>> if
> >>> >> >> >> >> >> >> >> >>>> >>> there's an easier way to do this but I
> >>> thought
> >>> >> of
> >>> >> >> >> >> applying a
> >>> >> >> >> >> >> >> >> >>>> lemmatizer on
> >>> >> >> >> >> >> >> >> >>>> >>> the category and the noun phrase in order
> >>> for
> >>> >> them
> >>> >> >> to
> >>> >> >> >> >> have a
> >>> >> >> >> >> >> >> >> common
> >>> >> >> >> >> >> >> >> >>>> >>> denominator.This also works if the noun
> >>> phrase
> >>> >> >> itself
> >>> >> >> >> >> has a
> >>> >> >> >> >> >> >> plural
> >>> >> >> >> >> >> >> >> >>>> form.
> >>> >> >> >> >> >> >> >> >>>> >>>
> >>> >> >> >> >> >> >> >> >>>> >>> Second, I'll need to use for comparison
> >>> only the
> >>> >> >> >> >> >> >> >> >>>> >>> words in
> >>> >> >> >> >> >> the
> >>> >> >> >> >> >> >> >> >>>> category
> >>> >> >> >> >> >> >> >> >>>> >>> which are themselves nouns and not
> >>> prepositions
> >>> >> or
> >>> >> >> >> >> >> determiners
> >>> >> >> >> >> >> >> >> such
> >>> >> >> >> >> >> >> >> >>>> as "of
> >>> >> >> >> >> >> >> >> >>>> >>> the".This means that I need to pos tag
> the
> >>> >> >> categories
> >>> >> >> >> >> >> contents
> >>> >> >> >> >> >> >> as
> >>> >> >> >> >> >> >> >> >>>> well.
> >>> >> >> >> >> >> >> >> >>>> >>> I was thinking of running the pos and
> lemma
> >>> on
> >>> >> the
> >>> >> >> >> >> dbpedia
> >>> >> >> >> >> >> >> >> >>>> categories when
> >>> >> >> >> >> >> >> >> >>>> >>> building the dbpedia backed entity hub
> and
> >>> >> storing
> >>> >> >> >> >> >> >> >> >>>> >>> them
> >>> >> >> >> >> for
> >>> >> >> >> >> >> >> later
> >>> >> >> >> >> >> >> >> >>>> use - I
> >>> >> >> >> >> >> >> >> >>>> >>> don't know how feasible this is at the
> >>> moment.
> >>> >> >> >> >> >> >> >> >>>> >>>
> >>> >> >> >> >> >> >> >> >>>> >>> After this I can compare each noun in the
> >>> noun
> >>> >> >> phrase
> >>> >> >> >> >> with
> >>> >> >> >> >> >> the
> >>> >> >> >> >> >> >> >> >>>> equivalent
> >>> >> >> >> >> >> >> >> >>>> >>> nouns in the categories and based on the
> >>> number
> >>> >> of
> >>> >> >> >> >> matches I
> >>> >> >> >> >> >> >> can
> >>> >> >> >> >> >> >> >> >>>> create a
> >>> >> >> >> >> >> >> >> >>>> >>> confidence level.
> >>> >> >> >> >> >> >> >> >>>> >>>
> >>> >> >> >> >> >> >> >> >>>> >>> 3. match the noun of the noun phrase with
> >>> the
> >>> >> >> >> >> >> >> >> >>>> >>> rdf:type
> >>> >> >> >> >> from
> >>> >> >> >> >> >> >> >> dbpedia
> >>> >> >> >> >> >> >> >> >>>> of the
> >>> >> >> >> >> >> >> >> >>>> >>> named entity. If this matches increase
> the
> >>> >> >> confidence
> >>> >> >> >> >> level.
> >>> >> >> >> >> >> >> >> >>>> >>>
> >>> >> >> >> >> >> >> >> >>>> >>> 4. If there are multiple named entities
> >>> which
> >>> >> can
> >>> >> >> >> >> >> >> >> >>>> >>> match a
> >>> >> >> >> >> >> >> certain
> >>> >> >> >> >> >> >> >> >>>> noun
> >>> >> >> >> >> >> >> >> >>>> >>> phrase then link the noun phrase with the
> >>> >> closest
> >>> >> >> >> >> >> >> >> >>>> >>> named
> >>> >> >> >> >> >> entity
> >>> >> >> >> >> >> >> >> prior
> >>> >> >> >> >> >> >> >> >>>> to it
> >>> >> >> >> >> >> >> >> >>>> >>> in the text.
> >>> >> >> >> >> >> >> >> >>>> >>>
> >>> >> >> >> >> >> >> >> >>>> >>> What do you think?
> >>> >> >> >> >> >> >> >> >>>> >>>
> >>> >> >> >> >> >> >> >> >>>> >>> Cristian
> >>> >> >> >> >> >> >> >> >>>> >>>
> >>> >> >> >> >> >> >> >> >>>> >>> 2014-01-31 Cristian Petroaca <
> >>> >> >> >> >> cristian.petroaca@gmail.com>:
> >>> >> >> >> >> >> >> >> >>>> >>>
> >>> >> >> >> >> >> >> >> >>>> >>>  Hi Rafa,
> >>> >> >> >> >> >> >> >> >>>> >>>>
> >>> >> >> >> >> >> >> >> >>>> >>>> I don't yet have a concrete heursitic
> but
> >>> I'm
> >>> >> >> >> >> >> >> >> >>>> >>>> working on
> >>> >> >> >> >> >> it.
> >>> >> >> >> >> >> >> I'll
> >>> >> >> >> >> >> >> >> >>>> provide
> >>> >> >> >> >> >> >> >> >>>> >>>> it here so that you guys can give me a
> >>> >> feedback on
> >>> >> >> >> >> >> >> >> >>>> >>>> it.
> >>> >> >> >> >> >> >> >> >>>> >>>>
> >>> >> >> >> >> >> >> >> >>>> >>>> What are "locality" features?
> >>> >> >> >> >> >> >> >> >>>> >>>>

Re: Named entity coref resolution based on dbpedia categories and rdf:type

Posted by Rupert Westenthaler <ru...@gmail.com>.
Hi Cristian, all

Looks good to me, but I am not sure I got everything. If you could
provide example texts where those rules apply, it would be much easier
to understand.

Instead of using dbpedia properties directly you should define your own
domain model (ontology). You can then align the dbpedia properties to
your model. This makes it possible to apply the approach to knowledge
bases other than dbpedia as well.
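
To illustrate the alignment idea, here is a minimal sketch of how such a
mapping could be kept inside the engine (the coref:* keys are made-up
placeholder names for the own model; the values are only the
dbpedia-owl:/dbpprop: property names already mentioned in this thread):

    import java.util.*;

    public class CorefPropertyAlignment {
        // Hypothetical alignment of an own coreference model to dbpedia
        // properties. To support another knowledge base only the values
        // of this map would need to change.
        static final Map<String, List<String>> ALIGNMENT = new HashMap<String, List<String>>();
        static {
            ALIGNMENT.put("coref:spatialContext", Arrays.asList(
                    "dbpedia-owl:birthPlace", "dbpedia-owl:locationCity",
                    "dbpedia-owl:country"));
            ALIGNMENT.put("coref:functionalContext", Arrays.asList(
                    "dbpprop:industry", "dbpprop:service", "dbpedia-owl:genre"));
        }
    }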

For people new to this thread: The above message adds to the
suggestion first made by Cristian on 4th February. Also the following
4 messages (until 7th Feb) provide additional context.

best
Rupert


On Fri, Mar 28, 2014 at 9:23 AM, Cristian Petroaca
<cr...@gmail.com> wrote:
> Hi guys,
>
> After Rupert's last suggestions related to this enhancement engine, I
> devised a more comprehensive algorithm for matching the noun phrases
> against the NER properties. Please take a look and let me know what you
> think. Thanks.
>
> The following rules will be applied to every noun phrase in order to find
> co-references:
>
> 1. For each NER prior to the current noun phrase in the text match the
> yago:class label to the contents of the noun phrase.
>
> For the NERs which have a yago:class which matches, apply:
>
> 2. Group membership rules :
>
>     a. spatial membership : the NER is part of a Location. If the noun
> phrase contains a LOCATION or a demonym then check any location properties
> of the matching NER.
>
>     If matching NER is a :
>     - person, match against :birthPlace, :region, :nationality
>     - organisation, match against :foundationPlace, :locationCity,
> :location, :hometown
>     - place, match against :country, :subdivisionName, :location,
>
>     Ex: The Italian President, The Richmond-based company
>
>     b. organisational membership : the NER is part of an Organisation. If
> the noun phrase contains an ORGANISATION then check the following
> properties of the matching NER:
>
>     If matching NER is :
>     - person, match against :occupation, :associatedActs
>     - organisation ?
>     - location ?
>
> Ex: The Microsoft executive, The Pink Floyd singer
>
> 3. Functional description rule: the noun phrase describes what the NER does
> conceptually.
> If there are no NERs in the noun phrase then match the following properties
> of the matching NER to the contents of the noun phrase (aside from the
> nouns which are part of the yago:class) :
>
>    If NER is a:
>    - person ?
>    - organisation : , match against :service, :industry, :genre
>    - location ?
>
> Ex:  The software company.
>
> 4. If no matches were found for the current NER with rules 2 or 3 then if
> the yago:class which matched has more than 2 nouns then we also consider
> this a good co-reference but with a lower confidence maybe.
>
> Ex: The former tennis player, the theoretical physicist.
>
> 5. Based on the number of nouns which matched we create a confidence level.
> The number of matched nouns cannot be lower than 2 and we must have a
> yago:class match.
>
> For all NERs which got to this point, select the closest ones in the text
> to the noun phrase which matched against the same properties (yago:class
> and dbpedia) and mark them as co-references.
>
> Note: all noun phrases need to be lemmatized before all of this in case
> there are any plurals.
>
>
> 2014-03-25 20:50 GMT+02:00 Cristian Petroaca <cr...@gmail.com>:
>
>> That worked. Thanks.
>>
>> So, there are no exceptions during the startup of the launcher.
>> The component tab in the felix console shows 6 WeightedChains the first
>> time, including the default one but after my changes and a restart there
>> are only 5 - the default one is missing altogether.
>>
>>
>> 2014-03-24 20:18 GMT+02:00 Rupert Westenthaler <
>> rupert.westenthaler@gmail.com>:
>>
>> Hi Cristian,
>>>
>>> I have seen the same problem since last Friday. The solution mentioned
>>> in [1] works for me.
>>>
>>>     mvn -Djsse.enableSNIExtension=false {goals}
>>>
>>> No idea why https connections to github currently cause this. I
>>> could not find anything related via Google. So I suggest using the
>>> system property for now. If this persists for longer we can adapt the
>>> build files accordingly.
>>>
>>> best
>>> Rupert
>>>
>>>
>>>
>>>
>>> [1]
>>> http://stackoverflow.com/questions/7615645/ssl-handshake-alert-unrecognized-name-error-since-upgrade-to-java-1-7-0
>>>
>>> On Mon, Mar 24, 2014 at 7:01 PM, Cristian Petroaca
>>> <cr...@gmail.com> wrote:
>>> > I did a clean on the whole project and now I wanted to do another "mvn
>>> > clean install" but I am getting this :
>>> >
>>> > "[INFO]
>>> > ------------------------------------------------------------------------
>>> > [ERROR] Failed to execute goal
>>> > org.apache.maven.plugins:maven-antrun-plugin:1.6:
>>> > run (download) on project org.apache.stanbol.data.opennlp.lang.es: An
>>> Ant
>>> > BuildE
>>> > xception has occured: The following error occurred while executing this
>>> > line:
>>> > [ERROR]
>>> > C:\Data\Projects\Stanbol\main\data\opennlp\lang\es\download_models.xml:3
>>> > 3: Failed to copy
>>> > https://github.com/utcompling/OpenNLP-Models/raw/58ef0c6003140
>>> > 3e66e47ae35edaf58d3478b67af/models/es/opennlp-es-maxent-pos-es.bin to
>>> > C:\Data\Pr
>>> >
>>> ojects\Stanbol\main\data\opennlp\lang\es\downloads\resources\org\apache\stanbol\
>>> > data\opennlp\es-pos-maxent.bin due to javax.net.ssl.SSLProtocolException
>>> > handshake alert : unrecognized_name"
>>> >
>>> >
>>> >
>>> > 2014-03-20 11:25 GMT+02:00 Rupert Westenthaler <
>>> > rupert.westenthaler@gmail.com>:
>>> >
>>> >> Hi Cristian,
>>> >>
>>> >> On Thu, Mar 20, 2014 at 10:00 AM, Cristian Petroaca
>>> >> <cr...@gmail.com> wrote:
>>> >> >
>>> >> > stanbol.enhancer.chain.weighted.chain=["tika;optional","langdetect","opennlp-sentence","opennlp-token","opennlp-pos","opennlp-ner","dbpediaLinking","entityhubExtraction","dbpedia-dereference","pos-chunker"]
>>> >> > service.ranking=I"-2147483648"
>>> >> > stanbol.enhancer.chain.name="default"
>>> >>
>>> >> Does look fine to me. Do you see any exception during the startup of
>>> >> the launcher? Can you check the status of this component in the
>>> >> component tab of the felix web console [1] (search for
>>> >> "org.apache.stanbol.enhancer.chain.weighted.impl.WeightedChain"). If
>>> >> you have multiple you can find the correct one by comparing the
>>> >> "Properties" with those in the configuration file.
>>> >>
>>> >> I guess that the service is in the 'unsatisfied' state, as you do
>>> >> not see it in the web interface. But if this is the case you should
>>> >> also see the corresponding exception in the log. You can also manually
>>> >> stop/start the component. In this case the exception should be
>>> >> re-thrown and you do not need to search the log for it.
>>> >>
>>> >> best
>>> >> Rupert
>>> >>
>>> >>
>>> >> [1] http://localhost:8080/system/console/components
>>> >>
>>> >> >
>>> >> >
>>> >> >
>>> >> > 2014-03-20 7:39 GMT+02:00 Rupert Westenthaler <
>>> >> rupert.westenthaler@gmail.com
>>> >> >>:
>>> >> >
>>> >> >> Hi Cristian,
>>> >> >>
>>> >> >> you can not send attachments to the list. Please copy the contents
>>> >> >> directly to the mail
>>> >> >>
>>> >> >> thx
>>> >> >> Rupert
>>> >> >>
>>> >> >> On Wed, Mar 19, 2014 at 9:20 PM, Cristian Petroaca
>>> >> >> <cr...@gmail.com> wrote:
>>> >> >> > The config attached.
>>> >> >> >
>>> >> >> >
>>> >> >> > 2014-03-19 9:09 GMT+02:00 Rupert Westenthaler
>>> >> >> > <ru...@gmail.com>:
>>> >> >> >
>>> >> >> >> Hi Cristian,
>>> >> >> >>
>>> >> >> >> can you provide the contents of the chain after your
>>> modifications?
>>> >> >> >> Would be interesting to test why the chain is no longer active
>>> after
>>> >> >> >> the restart.
>>> >> >> >>
>>> >> >> >> You can find the config file in the 'stanbol/fileinstall' folder.
>>> >> >> >>
>>> >> >> >> best
>>> >> >> >> Rupert
>>> >> >> >>
>>> >> >> >> On Tue, Mar 18, 2014 at 8:24 PM, Cristian Petroaca
>>> >> >> >> <cr...@gmail.com> wrote:
>>> >> >> >> > Related to the default chain selection rules : before restart I
>>> >> had a
>>> >> >> >> > chain
>>> >> >> >> > with the name 'default' as in I could access it via
>>> >> >> >> > enhancer/chain/default.
>>> >> >> >> > Then I just added another engine to the 'default' chain. I
>>> assumed
>>> >> >> that
>>> >> >> >> > after the restart the chain with the 'default' name would be
>>> >> >> persisted.
>>> >> >> >> > So
>>> >> >> >> > the first rule should have been applied after the restart as
>>> well.
>>> >> But
>>> >> >> >> > instead I cannot reach it via enhancer/chain/default anymore
>>> so its
>>> >> >> >> > gone.
>>> >> >> >> > Anyway, this is not a big deal, it's not blocking me in any
>>> way, I
>>> >> >> just
>>> >> >> >> > wanted to understand where the problem is.
>>> >> >> >> >
>>> >> >> >> >
>>> >> >> >> > 2014-03-18 7:15 GMT+02:00 Rupert Westenthaler
>>> >> >> >> > <rupert.westenthaler@gmail.com
>>> >> >> >> >>:
>>> >> >> >> >
>>> >> >> >> >> Hi Cristian
>>> >> >> >> >>
>>> >> >> >> >> On Mon, Mar 17, 2014 at 9:43 PM, Cristian Petroaca
>>> >> >> >> >> <cr...@gmail.com> wrote:
>>> >> >> >> >> > 1. Updated to the latest code and it's gone. Cool
>>> >> >> >> >> >
>>> >> >> >> >> > 2. I start the stable launcher -> create a new instance of
>>> the
>>> >> >> >> >> > PosChunkerEngine -> add it to the default chain. At this
>>> point
>>> >> >> >> >> > everything
>>> >> >> >> >> > looks good and works ok.
>>> >> >> >> >> > After I restart the server the default chain is gone and
>>> >> instead I
>>> >> >> >> >> > see
>>> >> >> >> >> this
>>> >> >> >> >> > in the enhancement chains page : all-active (default, id:
>>> 149,
>>> >> >> >> >> > ranking:
>>> >> >> >> >> 0,
>>> >> >> >> >> > impl: AllActiveEnginesChain ). all-active did not contain
>>> the
>>> >> >> >> >> > 'default'
>>> >> >> >> >> > word before the restart.
>>> >> >> >> >> >
>>> >> >> >> >>
>>> >> >> >> >> Please note the default chain selection rules as described at
>>> [1].
>>> >> >> You
>>> >> >> >> >> can also access chains chains under
>>> '/enhancer/chain/{chain-name}'
>>> >> >> >> >>
>>> >> >> >> >> best
>>> >> >> >> >> Rupert
>>> >> >> >> >>
>>> >> >> >> >> [1]
>>> >> >> >> >>
>>> >> >> >> >>
>>> >> >>
>>> >>
>>> http://stanbol.staging.apache.org/docs/trunk/components/enhancer/chains/#default-chain
>>> >> >> >> >>
>>> >> >> >> >> > It looks like the config files are exactly what I need.
>>> Thanks.
>>> >> >> >> >> >
>>> >> >> >> >> >
>>> >> >> >> >> > 2014-03-17 9:26 GMT+02:00 Rupert Westenthaler <
>>> >> >> >> >> rupert.westenthaler@gmail.com
>>> >> >> >> >> >>:
>>> >> >> >> >> >
>>> >> >> >> >> >> On Sat, Mar 15, 2014 at 8:34 PM, Cristian Petroaca
>>> >> >> >> >> >> <cr...@gmail.com> wrote:
>>> >> >> >> >> >> > Thanks Rupert.
>>> >> >> >> >> >> >
>>> >> >> >> >> >> > A couple more questions/issues :
>>> >> >> >> >> >> >
>>> >> >> >> >> >> > 1. Whenever I start the stanbol server I'm seeing this
>>> in the
>>> >> >> >> >> >> > console
>>> >> >> >> >> >> > output :
>>> >> >> >> >> >> >
>>> >> >> >> >> >>
>>> >> >> >> >> >> This should be fixed with STANBOL-1278 [1] [2]
>>> >> >> >> >> >>
>>> >> >> >> >> >> > 2. Whenever I restart the server the Weighted Chains get
>>> >> messed
>>> >> >> >> >> >> > up. I
>>> >> >> >> >> >> > usually use the 'default' chain and add my engine to it
>>> so
>>> >> there
>>> >> >> >> >> >> > are
>>> >> >> >> >> 11
>>> >> >> >> >> >> > engines in it. After the restart this chain now contains
>>> >> around
>>> >> >> 23
>>> >> >> >> >> >> engines
>>> >> >> >> >> >> > in total.
>>> >> >> >> >> >>
>>> >> >> >> >> >> I was not able to replicate this. What I tried was
>>> >> >> >> >> >>
>>> >> >> >> >> >> (1) start up the stable launcher
>>> >> >> >> >> >> (2) add an additional engine to the default chain
>>> >> >> >> >> >> (3) restart the launcher
>>> >> >> >> >> >>
>>> >> >> >> >> >> The default chain was not changed after (2) and (3). So I
>>> would
>>> >> >> need
>>> >> >> >> >> >> further information for knowing why this is happening.
>>> >> >> >> >> >>
>>> >> >> >> >> >> Generally it is better to create you own chain instance as
>>> >> >> modifying
>>> >> >> >> >> >> one that is provided by the default configuration. I would
>>> also
>>> >> >> >> >> >> recommend that you keep your test configuration in text
>>> files
>>> >> and
>>> >> >> to
>>> >> >> >> >> >> copy those to the 'stanbol/fileinstall' folder. Doing so
>>> >> prevent
>>> >> >> you
>>> >> >> >> >> >> from manually entering the configuration after a software
>>> >> update.
>>> >> >> >> >> >> The
>>> >> >> >> >> >> production-mode section [3] provides information on how to
>>> do
>>> >> >> that.
>>> >> >> >> >> >>
>>> >> >> >> >> >> best
>>> >> >> >> >> >> Rupert
>>> >> >> >> >> >>
>>> >> >> >> >> >> [1] https://issues.apache.org/jira/browse/STANBOL-1278
>>> >> >> >> >> >> [2] http://svn.apache.org/r1576623
>>> >> >> >> >> >> [3] http://stanbol.apache.org/docs/trunk/production-mode
>>> >> >> >> >> >>
>>> >> >> >> >> >> > ERROR: Bundle
>>> org.apache.stanbol.enhancer.engine.topic.web
>>> >> >> [153]:
>>> >> >> >> >> Error
>>> >> >> >> >> >> > starting
>>> >> >> >> >> >> >
>>> >> >> >> >> >>
>>> >> >> >> >>
>>> >> >> >> >>
>>> >> >>
>>> >>
>>> slinginstall:c:\Data\Projects\Stanbol\main\launchers\stable\target\stanbol\star
>>> >> >> >> >> >> >
>>> >> >> >> >> >> >
>>> >> >>
>>> tup\35\org.apache.stanbol.enhancer.engine.topic.web-1.0.0-SNAPSHOT.jar
>>> >> >> >> >> >> > (org.osgi
>>> >> >> >> >> >> > .framework.BundleException: Unresolved constraint in
>>> bundle
>>> >> >> >> >> >> > org.apache.stanbol.e
>>> >> >> >> >> >> > nhancer.engine.topic.web [153]: Unable to resolve 153.0:
>>> >> missing
>>> >> >> >> >> >> > requirement [15
>>> >> >> >> >> >> > 3.0] package; (&(package=javax.ws.rs
>>> >> >> >> >> >> )(version>=0.0.0)(!(version>=2.0.0))))
>>> >> >> >> >> >> > org.osgi.framework.BundleException: Unresolved
>>> constraint in
>>> >> >> >> >> >> > bundle
>>> >> >> >> >> >> > org.apache.s
>>> >> >> >> >> >> > tanbol.enhancer.engine.topic.web [153]: Unable to resolve
>>> >> 153.0:
>>> >> >> >> >> missing
>>> >> >> >> >> >> > require
>>> >> >> >> >> >> > ment [153.0] package; (&(package=javax.ws.rs
>>> >> >> >> >> >> > )(version>=0.0.0)(!(version>=2.0.0))
>>> >> >> >> >> >> > )
>>> >> >> >> >> >> >         at
>>> >> >> >> >> >>
>>> org.apache.felix.framework.Felix.resolveBundle(Felix.java:3443)
>>> >> >> >> >> >> >         at
>>> >> >> >> >> org.apache.felix.framework.Felix.startBundle(Felix.java:1727)
>>> >> >> >> >> >> >         at
>>> >> >> >> >> >> >
>>> >> >> >> >> >> >
>>> >> >>
>>> org.apache.felix.framework.Felix.setActiveStartLevel(Felix.java:1156)
>>> >> >> >> >> >> >
>>> >> >> >> >> >> >         at
>>> >> >> >> >> >> >
>>> >> >> >> >> >> >
>>> >> >>
>>> org.apache.felix.framework.StartLevelImpl.run(StartLevelImpl.java:264
>>> >> >> >> >> >> > )
>>> >> >> >> >> >> >         at java.lang.Thread.run(Unknown Source)
>>> >> >> >> >> >> >
>>> >> >> >> >> >> > Despite of this the server starts fine and I can use the
>>> >> >> enhancer
>>> >> >> >> >> fine.
>>> >> >> >> >> >> Do
>>> >> >> >> >> >> > you guys see this as well?
>>> >> >> >> >> >> >
>>> >> >> >> >> >> >
>>> >> >> >> >> >> > 2. Whenever I restart the server the Weighted Chains get
>>> >> messed
>>> >> >> >> >> >> > up. I
>>> >> >> >> >> >> > usually use the 'default' chain and add my engine to it
>>> so
>>> >> there
>>> >> >> >> >> >> > are
>>> >> >> >> >> 11
>>> >> >> >> >> >> > engines in it. After the restart this chain now contains
>>> >> around
>>> >> >> 23
>>> >> >> >> >> >> engines
>>> >> >> >> >> >> > in total.
>>> >> >> >> >> >> >
>>> >> >> >> >> >> >
>>> >> >> >> >> >> >
>>> >> >> >> >> >> >
>>> >> >> >> >> >> > 2014-03-11 9:47 GMT+02:00 Rupert Westenthaler <
>>> >> >> >> >> >> rupert.westenthaler@gmail.com
>>> >> >> >> >> >> >>:
>>> >> >> >> >> >> >
>>> >> >> >> >> >> >> Hi Cristian,
>>> >> >> >> >> >> >>
>>> >> >> >> >> >> >> NER Annotations are typically available as both
>>> >> >> >> >> >> >> NlpAnnotations.NER_ANNOTATION and  fise:TextAnnotation
>>> [1]
>>> >> in
>>> >> >> the
>>> >> >> >> >> >> >> enhancement metadata. As you are already accessing the
>>> >> >> >> >> >> >> AnayzedText I
>>> >> >> >> >> >> >> would prefer using the  NlpAnnotations.NER_ANNOTATION.
>>> >> >> >> >> >> >>
>>> >> >> >> >> >> >> best
>>> >> >> >> >> >> >> Rupert
>>> >> >> >> >> >> >>
>>> >> >> >> >> >> >> [1]
>>> >> >> >> >> >> >>
>>> >> >> >> >> >>
>>> >> >> >> >>
>>> >> >> >> >>
>>> >> >>
>>> >>
>>> http://stanbol.apache.org/docs/trunk/components/enhancer/enhancementstructure.html#fisetextannotation
>>> >> >> >> >> >> >>
>>> >> >> >> >> >> >> On Mon, Mar 10, 2014 at 10:07 PM, Cristian Petroaca
>>> >> >> >> >> >> >> <cr...@gmail.com> wrote:
>>> >> >> >> >> >> >> > Thanks.
>>> >> >> >> >> >> >> > I assume I should get the Named entities using the
>>> same
>>> >> but
>>> >> >> >> >> >> >> > with
>>> >> >> >> >> >> >> > NlpAnnotations.NER_ANNOTATION?
>>> >> >> >> >> >> >> >
>>> >> >> >> >> >> >> >
>>> >> >> >> >> >> >> >
>>> >> >> >> >> >> >> > 2014-03-10 13:29 GMT+02:00 Rupert Westenthaler <
>>> >> >> >> >> >> >> > rupert.westenthaler@gmail.com>:
>>> >> >> >> >> >> >> >
>>> >> >> >> >> >> >> >> Hallo Cristian,
>>> >> >> >> >> >> >> >>
>>> >> >> >> >> >> >> >> NounPhrases are not added to the RDF enhancement
>>> results.
>>> >> >> You
>>> >> >> >> >> need to
>>> >> >> >> >> >> >> >> use the AnalyzedText ContentPart [1]
>>> >> >> >> >> >> >> >>
>>> >> >> >> >> >> >> >> here is some demo code you can use in the
>>> >> computeEnhancement
>>> >> >> >> >> method
>>> >> >> >> >> >> >> >>
>>> >> >> >> >> >> >> >> >>         AnalysedText at = NlpEngineHelper.getAnalysedText(this, ci, true);
>>> >> >> >> >> >> >> >> >>         Iterator<? extends Section> sections = at.getSentences();
>>> >> >> >> >> >> >> >> >>         if(!sections.hasNext()){ //process as single sentence
>>> >> >> >> >> >> >> >> >>             sections = Collections.singleton(at).iterator();
>>> >> >> >> >> >> >> >> >>         }
>>> >> >> >> >> >> >> >> >>
>>> >> >> >> >> >> >> >> >>         while(sections.hasNext()){
>>> >> >> >> >> >> >> >> >>             Section section = sections.next();
>>> >> >> >> >> >> >> >> >>             Iterator<Span> chunks = section.getEnclosed(EnumSet.of(SpanTypeEnum.Chunk));
>>> >> >> >> >> >> >> >> >>             while(chunks.hasNext()){
>>> >> >> >> >> >> >> >> >>                 Span chunk = chunks.next();
>>> >> >> >> >> >> >> >> >>                 Value<PhraseTag> phrase = chunk.getAnnotation(NlpAnnotations.PHRASE_ANNOTATION);
>>> >> >> >> >> >> >> >> >>                 if(phrase.value().getCategory() == LexicalCategory.Noun){
>>> >> >> >> >> >> >> >> >>                     log.info(" - NounPhrase [{},{}] {}", new Object[]{
>>> >> >> >> >> >> >> >> >>                             chunk.getStart(),chunk.getEnd(),chunk.getSpan()});
>>> >> >> >> >> >> >> >> >>                 }
>>> >> >> >> >> >> >> >> >>             }
>>> >> >> >> >> >> >> >> >>         }
>>> >> >> >> >> >> >> >>
>>> >> >> >> >> >> >> >> hope this helps
>>> >> >> >> >> >> >> >>
>>> >> >> >> >> >> >> >> best
>>> >> >> >> >> >> >> >> Rupert
>>> >> >> >> >> >> >> >>
>>> >> >> >> >> >> >> >> [1]
>>> >> >> >> >> >> >> >>
>>> >> >> >> >> >> >>
>>> >> >> >> >> >>
>>> >> >> >> >>
>>> >> >> >> >>
>>> >> >>
>>> >>
>>> http://stanbol.apache.org/docs/trunk/components/enhancer/nlp/analyzedtext
>>> >> >> >> >> >> >> >>
>>> >> >> >> >> >> >> >> On Sun, Mar 9, 2014 at 6:07 PM, Cristian Petroaca
>>> >> >> >> >> >> >> >> <cr...@gmail.com> wrote:
>>> >> >> >> >> >> >> >> > I started to implement the engine and I'm having
>>> >> problems
>>> >> >> >> >> >> >> >> > with
>>> >> >> >> >> >> getting
>>> >> >> >> >> >> >> >> > results for noun phrases. I modified the "default"
>>> >> >> weighted
>>> >> >> >> >> chain
>>> >> >> >> >> >> to
>>> >> >> >> >> >> >> also
>>> >> >> >> >> >> >> >> > include the PosChunkerEngine and ran a sample text
>>> :
>>> >> >> "Angela
>>> >> >> >> >> Merkel
>>> >> >> >> >> >> >> >> visted
>>> >> >> >> >> >> >> >> > China. The german chancellor met with various
>>> people".
>>> >> I
>>> >> >> >> >> expected
>>> >> >> >> >> >> that
>>> >> >> >> >> >> >> >> the
>>> >> >> >> >> >> >> >> > RDF XML output would contain some info about the
>>> noun
>>> >> >> >> >> >> >> >> > phrases
>>> >> >> >> >> but I
>>> >> >> >> >> >> >> >> cannot
>>> >> >> >> >> >> >> >> > see any.
>>> >> >> >> >> >> >> >> > Could you point me to the correct way to generate
>>> the
>>> >> noun
>>> >> >> >> >> phrases?
>>> >> >> >> >> >> >> >> >
>>> >> >> >> >> >> >> >> > Thanks,
>>> >> >> >> >> >> >> >> > Cristian
>>> >> >> >> >> >> >> >> >
>>> >> >> >> >> >> >> >> >
>>> >> >> >> >> >> >> >> > 2014-02-09 14:15 GMT+02:00 Cristian Petroaca <
>>> >> >> >> >> >> >> >> cristian.petroaca@gmail.com>:
>>> >> >> >> >> >> >> >> >
>>> >> >> >> >> >> >> >> >> Opened
>>> >> >> https://issues.apache.org/jira/browse/STANBOL-1279
>>> >> >> >> >> >> >> >> >>
>>> >> >> >> >> >> >> >> >>
>>> >> >> >> >> >> >> >> >> 2014-02-07 10:53 GMT+02:00 Cristian Petroaca <
>>> >> >> >> >> >> >> >> cristian.petroaca@gmail.com>
>>> >> >> >> >> >> >> >> >> :
>>> >> >> >> >> >> >> >> >>
>>> >> >> >> >> >> >> >> >> Hi Rupert,
>>> >> >> >> >> >> >> >> >>>
>>> >> >> >> >> >> >> >> >>> The "spatial" dimension is a good idea. I'll also
>>> >> take a
>>> >> >> >> >> >> >> >> >>> look
>>> >> >> >> >> at
>>> >> >> >> >> >> >> Yago.
>>> >> >> >> >> >> >> >> >>>
>>> >> >> >> >> >> >> >> >>> I will create a Jira with what we talked about
>>> here.
>>> >> It
>>> >> >> >> >> >> >> >> >>> will
>>> >> >> >> >> >> >> probably
>>> >> >> >> >> >> >> >> >>> have just a draft-like description for now and
>>> will
>>> >> be
>>> >> >> >> >> >> >> >> >>> updated
>>> >> >> >> >> >> as I
>>> >> >> >> >> >> >> go
>>> >> >> >> >> >> >> >> >>> along.
>>> >> >> >> >> >> >> >> >>>
>>> >> >> >> >> >> >> >> >>> Thanks,
>>> >> >> >> >> >> >> >> >>> Cristian
>>> >> >> >> >> >> >> >> >>>
>>> >> >> >> >> >> >> >> >>>
>>> >> >> >> >> >> >> >> >>> 2014-02-06 15:39 GMT+02:00 Rupert Westenthaler <
>>> >> >> >> >> >> >> >> >>> rupert.westenthaler@gmail.com>:
>>> >> >> >> >> >> >> >> >>>
>>> >> >> >> >> >> >> >> >>> Hi Cristian,
>>> >> >> >> >> >> >> >> >>>>
>>> >> >> >> >> >> >> >> >>>> definitely an interesting approach. You should
>>> have
>>> >> a
>>> >> >> >> >> >> >> >> >>>> look at
>>> >> >> >> >> >> Yago2
>>> >> >> >> >> >> >> >> >>>> [1]. As far as I can remember the Yago taxonomy
>>> is
>>> >> much
>>> >> >> >> >> better
>>> >> >> >> >> >> >> >> >>>> structured as the one used by dbpedia. Mapping
>>> >> >> >> >> >> >> >> >>>> suggestions of
>>> >> >> >> >> >> >> dbpedia
>>> >> >> >> >> >> >> >> >>>> to concepts in Yago2 is easy as both dbpedia and
>>> >> yago2
>>> >> >> do
>>> >> >> >> >> >> provide
>>> >> >> >> >> >> >> >> >>>> mappings [2] and [3]
>>> >> >> >> >> >> >> >> >>>>
>>> >> >> >> >> >> >> >> >>>> > 2014-02-05 15:39 GMT+02:00 Rafa Haro
>>> >> >> >> >> >> >> >> >>>> > <rh...@apache.org>:
>>> >> >> >> >> >> >> >> >>>> >>
>>> >> >> >> >> >> >> >> >>>> >> "Microsoft posted its 2013 earnings. The
>>> >> Redmond's
>>> >> >> >> >> >> >> >> >>>> >> company
>>> >> >> >> >> >> made
>>> >> >> >> >> >> >> a
>>> >> >> >> >> >> >> >> >>>> >> huge profit".
>>> >> >> >> >> >> >> >> >>>>
>>> >> >> >> >> >> >> >> >>>> Thats actually a very good example. Spatial
>>> contexts
>>> >> >> are
>>> >> >> >> >> >> >> >> >>>> very
>>> >> >> >> >> >> >> >> >>>> important as they tend to be often used for
>>> >> >> referencing.
>>> >> >> >> >> >> >> >> >>>> So I
>>> >> >> >> >> >> would
>>> >> >> >> >> >> >> >> >>>> suggest to specially treat the spatial context.
>>> For
>>> >> >> >> >> >> >> >> >>>> spatial
>>> >> >> >> >> >> >> Entities
>>> >> >> >> >> >> >> >> >>>> (like a City) this is easy, but even for other
>>> >> (like a
>>> >> >> >> >> Person,
>>> >> >> >> >> >> >> >> >>>> Company) you could use relations to spatial
>>> entities
>>> >> >> >> >> >> >> >> >>>> define
>>> >> >> >> >> >> their
>>> >> >> >> >> >> >> >> >>>> spatial context. This context could than be
>>> used to
>>> >> >> >> >> >> >> >> >>>> correctly
>>> >> >> >> >> >> link
>>> >> >> >> >> >> >> >> >>>> "The Redmond's company" to "Microsoft".
>>> >> >> >> >> >> >> >> >>>>
>>> >> >> >> >> >> >> >> >>>> In addition I would suggest to use the "spatial"
>>> >> >> context
>>> >> >> >> >> >> >> >> >>>> of
>>> >> >> >> >> each
>>> >> >> >> >> >> >> >> >>>> entity (basically relation to entities that are
>>> >> cities,
>>> >> >> >> >> regions,
>>> >> >> >> >> >> >> >> >>>> countries) as a separate dimension, because
>>> those
>>> >> are
>>> >> >> >> >> >> >> >> >>>> very
>>> >> >> >> >> often
>>> >> >> >> >> >> >> used
>>> >> >> >> >> >> >> >> >>>> for coreferences.
>>> >> >> >> >> >> >> >> >>>>
>>> >> >> >> >> >> >> >> >>>> [1] http://www.mpi-inf.mpg.de/yago-naga/yago/
>>> >> >> >> >> >> >> >> >>>> [2]
>>> >> >> >> >> >> >> >> >>>>
>>> >> >> http://downloads.dbpedia.org/3.9/links/yago_links.nt.bz2
>>> >> >> >> >> >> >> >> >>>> [3]
>>> >> >> >> >> >> >> >> >>>>
>>> >> >> >> >> >> >> >>
>>> >> >> >> >> >> >>
>>> >> >> >> >> >>
>>> >> >> >> >>
>>> >> >> >> >>
>>> >> >>
>>> >>
>>> http://www.mpi-inf.mpg.de/yago-naga/yago/download/yago/yagoDBpediaInstances.ttl.7z
>>> >> >> >> >> >> >> >> >>>>
>>> >> >> >> >> >> >> >> >>>>
>>> >> >> >> >> >> >> >> >>>> On Thu, Feb 6, 2014 at 10:33 AM, Cristian
>>> Petroaca
>>> >> >> >> >> >> >> >> >>>> <cr...@gmail.com> wrote:
>>> >> >> >> >> >> >> >> >>>> > There are several dbpedia categories for each
>>> >> entity,
>>> >> >> >> >> >> >> >> >>>> > in
>>> >> >> >> >> this
>>> >> >> >> >> >> >> case
>>> >> >> >> >> >> >> >> for
>>> >> >> >> >> >> >> >> >>>> > Microsoft we have :
>>> >> >> >> >> >> >> >> >>>> >
>>> >> >> >> >> >> >> >> >>>> > category:Companies_in_the_NASDAQ-100_Index
>>> >> >> >> >> >> >> >> >>>> > category:Microsoft
>>> >> >> >> >> >> >> >> >>>> >
>>> category:Software_companies_of_the_United_States
>>> >> >> >> >> >> >> >> >>>> >
>>> >> >> category:Software_companies_based_in_Washington_(state)
>>> >> >> >> >> >> >> >> >>>> > category:Companies_established_in_1975
>>> >> >> >> >> >> >> >> >>>> >
>>> category:1975_establishments_in_the_United_States
>>> >> >> >> >> >> >> >> >>>> >
>>> category:Companies_based_in_Redmond,_Washington
>>> >> >> >> >> >> >> >> >>>> >
>>> >> >> >> >> >> >>
>>> >> >> >> >> >> >>
>>> >> >> category:Multinational_companies_headquartered_in_the_United_States
>>> >> >> >> >> >> >> >> >>>> > category:Cloud_computing_providers
>>> >> >> >> >> >> >> >> >>>> >
>>> >> >> category:Companies_in_the_Dow_Jones_Industrial_Average
>>> >> >> >> >> >> >> >> >>>> >
>>> >> >> >> >> >> >> >> >>>> > So we also have "Companies based in
>>> >> >> Redmont,Washington"
>>> >> >> >> >> which
>>> >> >> >> >> >> >> could
>>> >> >> >> >> >> >> >> be
>>> >> >> >> >> >> >> >> >>>> > matched.
>>> >> >> >> >> >> >> >> >>>> >
>>> >> >> >> >> >> >> >> >>>> >
>>> >> >> >> >> >> >> >> >>>> > There is still other contextual information
>>> from
>>> >> >> >> >> >> >> >> >>>> > dbpedia
>>> >> >> >> >> which
>>> >> >> >> >> >> >> can
>>> >> >> >> >> >> >> >> be
>>> >> >> >> >> >> >> >> >>>> used.
>>> >> >> >> >> >> >> >> >>>> > For example for an Organization we could also
>>> >> >> include :
>>> >> >> >> >> >> >> >> >>>> > dbpprop:industry = Software
>>> >> >> >> >> >> >> >> >>>> > dbpprop:service = Online Service Providers
>>> >> >> >> >> >> >> >> >>>> >
>>> >> >> >> >> >> >> >> >>>> > and for a Person (that's for Barack Obama) :
>>> >> >> >> >> >> >> >> >>>> >
>>> >> >> >> >> >> >> >> >>>> > dbpedia-owl:profession:
>>> >> >> >> >> >> >> >> >>>> >                                dbpedia:Author
>>> >> >> >> >> >> >> >> >>>> >
>>> >> >> >> >> >> >> >> >>>> > dbpedia:Constitutional_law
>>> >> >> >> >> >> >> >> >>>> >                                dbpedia:Lawyer
>>> >> >> >> >> >> >> >> >>>> >
>>> >> >> >> >> >> >> >> >>>> > dbpedia:Community_organizing
>>> >> >> >> >> >> >> >> >>>> >
>>> >> >> >> >> >> >> >> >>>> > I'd like to continue investigating this as I
>>> think
>>> >> >> that
>>> >> >> >> >> >> >> >> >>>> > it
>>> >> >> >> >> may
>>> >> >> >> >> >> >> have
>>> >> >> >> >> >> >> >> >>>> some
>>> >> >> >> >> >> >> >> >>>> > value in increasing the number of coreference
>>> >> >> >> >> >> >> >> >>>> > resolutions
>>> >> >> >> >> and
>>> >> >> >> >> >> I'd
>>> >> >> >> >> >> >> >> like
>>> >> >> >> >> >> >> >> >>>> to
>>> >> >> >> >> >> >> >> >>>> > concentrate more on precision rather than
>>> recall
>>> >> >> since
>>> >> >> >> >> >> >> >> >>>> > we
>>> >> >> >> >> >> already
>>> >> >> >> >> >> >> >> have
>>> >> >> >> >> >> >> >> >>>> a
>>> >> >> >> >> >> >> >> >>>> > set of coreferences detected by the stanford
>>> nlp
>>> >> tool
>>> >> >> >> >> >> >> >> >>>> > and
>>> >> >> >> >> this
>>> >> >> >> >> >> >> would
>>> >> >> >> >> >> >> >> >>>> be as
>>> >> >> >> >> >> >> >> >>>> > an addition to that (at least this is how I
>>> would
>>> >> >> like
>>> >> >> >> >> >> >> >> >>>> > to
>>> >> >> >> >> use
>>> >> >> >> >> >> >> it).
>>> >> >> >> >> >> >> >> >>>> >
>>> >> >> >> >> >> >> >> >>>> > Is it ok if I track this by opening a jira? I
>>> >> could
>>> >> >> >> >> >> >> >> >>>> > update
>>> >> >> >> >> it
>>> >> >> >> >> >> to
>>> >> >> >> >> >> >> >> show
>>> >> >> >> >> >> >> >> >>>> my
>>> >> >> >> >> >> >> >> >>>> > progress and also my conclusions and if it
>>> turns
>>> >> out
>>> >> >> >> >> >> >> >> >>>> > that
>>> >> >> >> >> it
>>> >> >> >> >> >> was
>>> >> >> >> >> >> >> a
>>> >> >> >> >> >> >> >> bad
>>> >> >> >> >> >> >> >> >>>> idea
>>> >> >> >> >> >> >> >> >>>> > then that's the situation at least I'll end up
>>> >> with
>>> >> >> >> >> >> >> >> >>>> > more
>>> >> >> >> >> >> >> knowledge
>>> >> >> >> >> >> >> >> >>>> about
>>> >> >> >> >> >> >> >> >>>> > Stanbol in the end :).
>>> >> >> >> >> >> >> >> >>>> >
>>> >> >> >> >> >> >> >> >>>> >
>>> >> >> >> >> >> >> >> >>>> > 2014-02-05 15:39 GMT+02:00 Rafa Haro
>>> >> >> >> >> >> >> >> >>>> > <rh...@apache.org>:
>>> >> >> >> >> >> >> >> >>>> >
>>> >> >> >> >> >> >> >> >>>> >> Hi Cristian,
>>> >> >> >> >> >> >> >> >>>> >>
>>> >> >> >> >> >> >> >> >>>> >> The approach sounds nice. I don't want to be
>>> the
>>> >> >> >> >> >> >> >> >>>> >> devil's
>>> >> >> >> >> >> >> advocate
>>> >> >> >> >> >> >> >> but
>>> >> >> >> >> >> >> >> >>>> I'm
>>> >> >> >> >> >> >> >> >>>> >> just not sure about the recall using the
>>> dbpedia
>>> >> >> >> >> categories
>>> >> >> >> >> >> >> >> feature.
>>> >> >> >> >> >> >> >> >>>> For
>>> >> >> >> >> >> >> >> >>>> >> example, your sentence could be also
>>> "Microsoft
>>> >> >> posted
>>> >> >> >> >> >> >> >> >>>> >> its
>>> >> >> >> >> >> 2013
>>> >> >> >> >> >> >> >> >>>> earnings.
>>> >> >> >> >> >> >> >> >>>> >> The Redmond's company made a huge profit".
>>> So,
>>> >> maybe
>>> >> >> >> >> >> including
>>> >> >> >> >> >> >> more
>>> >> >> >> >> >> >> >> >>>> >> contextual information from dbpedia could
>>> >> increase
>>> >> >> the
>>> >> >> >> >> recall
>>> >> >> >> >> >> >> but
>>> >> >> >> >> >> >> >> of
>>> >> >> >> >> >> >> >> >>>> course
>>> >> >> >> >> >> >> >> >>>> >> will reduce the precision.
>>> >> >> >> >> >> >> >> >>>> >>
>>> >> >> >> >> >> >> >> >>>> >> Cheers,
>>> >> >> >> >> >> >> >> >>>> >> Rafa
>>> >> >> >> >> >> >> >> >>>> >>
>>> >> >> >> >> >> >> >> >>>> >> El 04/02/14 09:50, Cristian Petroaca
>>> escribió:
>>> >> >> >> >> >> >> >> >>>> >>
>>> >> >> >> >> >> >> >> >>>> >>  Back with a more detailed description of the
>>> >> steps
>>> >> >> >> >> >> >> >> >>>> >> for
>>> >> >> >> >> >> making
>>> >> >> >> >> >> >> this
>>> >> >> >> >> >> >> >> >>>> kind of
>>> >> >> >> >> >> >> >> >>>> >>> coreference work.
>>> >> >> >> >> >> >> >> >>>> >>>
>>> >> >> >> >> >> >> >> >>>> >>> I will be using references to the following
>>> >> text in
>>> >> >> >> >> >> >> >> >>>> >>> the
>>> >> >> >> >> >> steps
>>> >> >> >> >> >> >> >> below
>>> >> >> >> >> >> >> >> >>>> in
>>> >> >> >> >> >> >> >> >>>> >>> order to make things clearer : "Microsoft
>>> posted
>>> >> >> its
>>> >> >> >> >> >> >> >> >>>> >>> 2013
>>> >> >> >> >> >> >> >> earnings.
>>> >> >> >> >> >> >> >> >>>> The
>>> >> >> >> >> >> >> >> >>>> >>> software company made a huge profit."
>>> >> >> >> >> >> >> >> >>>> >>>
>>> >> >> >> >> >> >> >> >>>> >>> 1. For every noun phrase in the text which
>>> has :
>>> >> >> >> >> >> >> >> >>>> >>>      a. a determinate pos which implies
>>> >> reference
>>> >> >> to
>>> >> >> >> >> >> >> >> >>>> >>> an
>>> >> >> >> >> >> entity
>>> >> >> >> >> >> >> >> local
>>> >> >> >> >> >> >> >> >>>> to
>>> >> >> >> >> >> >> >> >>>> >>> the
>>> >> >> >> >> >> >> >> >>>> >>> text, such as "the, this, these") but not
>>> >> "another,
>>> >> >> >> >> every",
>>> >> >> >> >> >> etc
>>> >> >> >> >> >> >> >> which
>>> >> >> >> >> >> >> >> >>>> >>> implies a reference to an entity outside of
>>> the
>>> >> >> text.
>>> >> >> >> >> >> >> >> >>>> >>>      b. having at least another noun aside
>>> from
>>> >> the
>>> >> >> >> >> >> >> >> >>>> >>> main
>>> >> >> >> >> >> >> required
>>> >> >> >> >> >> >> >> >>>> noun
>>> >> >> >> >> >> >> >> >>>> >>> which
>>> >> >> >> >> >> >> >> >>>> >>> further describes it. For example I will not
>>> >> count
>>> >> >> >> >> >> >> >> >>>> >>> "The
>>> >> >> >> >> >> >> company"
>>> >> >> >> >> >> >> >> as
>>> >> >> >> >> >> >> >> >>>> being
>>> >> >> >> >> >> >> >> >>>> >>> a
>>> >> >> >> >> >> >> >> >>>> >>> legitimate candidate since this could
>>> create a
>>> >> lot
>>> >> >> of
>>> >> >> >> >> false
>>> >> >> >> >> >> >> >> >>>> positives by
>>> >> >> >> >> >> >> >> >>>> >>> considering the double meaning of some words
>>> >> such
>>> >> >> as
>>> >> >> >> >> >> >> >> >>>> >>> "in
>>> >> >> >> >> the
>>> >> >> >> >> >> >> >> company
>>> >> >> >> >> >> >> >> >>>> of
>>> >> >> >> >> >> >> >> >>>> >>> good people".
>>> >> >> >> >> >> >> >> >>>> >>> "The software company" is a good candidate
>>> >> since we
>>> >> >> >> >> >> >> >> >>>> >>> also
>>> >> >> >> >> >> have
>>> >> >> >> >> >> >> >> >>>> "software".
>>> >> >> >> >> >> >> >> >>>> >>>
>>> >> >> >> >> >> >> >> >>>> >>> 2. match the nouns in the noun phrase to the
>>> >> >> contents
>>> >> >> >> >> >> >> >> >>>> >>> of
>>> >> >> >> >> the
>>> >> >> >> >> >> >> >> dbpedia
>>> >> >> >> >> >> >> >> >>>> >>> categories of each named entity found prior
>>> to
>>> >> the
>>> >> >> >> >> location
>>> >> >> >> >> >> of
>>> >> >> >> >> >> >> the
>>> >> >> >> >> >> >> >> >>>> noun
>>> >> >> >> >> >> >> >> >>>> >>> phrase in the text.
>>> >> >> >> >> >> >> >> >>>> >>> The dbpedia categories are in the following
>>> >> format
>>> >> >> >> >> >> >> >> >>>> >>> (for
>>> >> >> >> >> >> >> Microsoft
>>> >> >> >> >> >> >> >> for
>>> >> >> >> >> >> >> >> >>>> >>> example) : "Software companies of the United
>>> >> >> States".
>>> >> >> >> >> >> >> >> >>>> >>>   So we try to match "software company" with
>>> >> that.
>>> >> >> >> >> >> >> >> >>>> >>> First, as you can see, the main noun in the
>>> >> dbpedia
>>> >> >> >> >> category
>>> >> >> >> >> >> >> has a
>>> >> >> >> >> >> >> >> >>>> plural
>>> >> >> >> >> >> >> >> >>>> >>> form and it's the same for all categories
>>> which
>>> >> I
>>> >> >> >> >> >> >> >> >>>> >>> saw. I
>>> >> >> >> >> >> don't
>>> >> >> >> >> >> >> >> know
>>> >> >> >> >> >> >> >> >>>> if
>>> >> >> >> >> >> >> >> >>>> >>> there's an easier way to do this but I
>>> thought
>>> >> of
>>> >> >> >> >> applying a
>>> >> >> >> >> >> >> >> >>>> lemmatizer on
>>> >> >> >> >> >> >> >> >>>> >>> the category and the noun phrase in order
>>> for
>>> >> them
>>> >> >> to
>>> >> >> >> >> have a
>>> >> >> >> >> >> >> >> common
>>> >> >> >> >> >> >> >> >>>> >>> denominator.This also works if the noun
>>> phrase
>>> >> >> itself
>>> >> >> >> >> has a
>>> >> >> >> >> >> >> plural
>>> >> >> >> >> >> >> >> >>>> form.
>>> >> >> >> >> >> >> >> >>>> >>>
>>> >> >> >> >> >> >> >> >>>> >>> Second, I'll need to use for comparison
>>> only the
>>> >> >> >> >> >> >> >> >>>> >>> words in
>>> >> >> >> >> >> the
>>> >> >> >> >> >> >> >> >>>> category
>>> >> >> >> >> >> >> >> >>>> >>> which are themselves nouns and not
>>> prepositions
>>> >> or
>>> >> >> >> >> >> determiners
>>> >> >> >> >> >> >> >> such
>>> >> >> >> >> >> >> >> >>>> as "of
>>> >> >> >> >> >> >> >> >>>> >>> the".This means that I need to pos tag the
>>> >> >> categories
>>> >> >> >> >> >> contents
>>> >> >> >> >> >> >> as
>>> >> >> >> >> >> >> >> >>>> well.
>>> >> >> >> >> >> >> >> >>>> >>> I was thinking of running the pos and lemma
>>> on
>>> >> the
>>> >> >> >> >> dbpedia
>>> >> >> >> >> >> >> >> >>>> categories when
>>> >> >> >> >> >> >> >> >>>> >>> building the dbpedia backed entity hub and
>>> >> storing
>>> >> >> >> >> >> >> >> >>>> >>> them
>>> >> >> >> >> for
>>> >> >> >> >> >> >> later
>>> >> >> >> >> >> >> >> >>>> use - I
>>> >> >> >> >> >> >> >> >>>> >>> don't know how feasible this is at the
>>> moment.
>>> >> >> >> >> >> >> >> >>>> >>>
>>> >> >> >> >> >> >> >> >>>> >>> After this I can compare each noun in the
>>> noun
>>> >> >> phrase
>>> >> >> >> >> with
>>> >> >> >> >> >> the
>>> >> >> >> >> >> >> >> >>>> equivalent
>>> >> >> >> >> >> >> >> >>>> >>> nouns in the categories and based on the
>>> number
>>> >> of
>>> >> >> >> >> matches I
>>> >> >> >> >> >> >> can
>>> >> >> >> >> >> >> >> >>>> create a
>>> >> >> >> >> >> >> >> >>>> >>> confidence level.
>>> >> >> >> >> >> >> >> >>>> >>>
>>> >> >> >> >> >> >> >> >>>> >>> 3. match the noun of the noun phrase with
>>> the
>>> >> >> >> >> >> >> >> >>>> >>> rdf:type
>>> >> >> >> >> from
>>> >> >> >> >> >> >> >> dbpedia
>>> >> >> >> >> >> >> >> >>>> of the
>>> >> >> >> >> >> >> >> >>>> >>> named entity. If this matches increase the
>>> >> >> confidence
>>> >> >> >> >> level.
>>> >> >> >> >> >> >> >> >>>> >>>
>>> >> >> >> >> >> >> >> >>>> >>> 4. If there are multiple named entities
>>> which
>>> >> can
>>> >> >> >> >> >> >> >> >>>> >>> match a
>>> >> >> >> >> >> >> certain
>>> >> >> >> >> >> >> >> >>>> noun
>>> >> >> >> >> >> >> >> >>>> >>> phrase then link the noun phrase with the
>>> >> closest
>>> >> >> >> >> >> >> >> >>>> >>> named
>>> >> >> >> >> >> entity
>>> >> >> >> >> >> >> >> prior
>>> >> >> >> >> >> >> >> >>>> to it
>>> >> >> >> >> >> >> >> >>>> >>> in the text.
>>> >> >> >> >> >> >> >> >>>> >>>
>>> >> >> >> >> >> >> >> >>>> >>> What do you think?
>>> >> >> >> >> >> >> >> >>>> >>>
>>> >> >> >> >> >> >> >> >>>> >>> Cristian
>>> >> >> >> >> >> >> >> >>>> >>>
>>> >> >> >> >> >> >> >> >>>> >>> 2014-01-31 Cristian Petroaca <
>>> >> >> >> >> cristian.petroaca@gmail.com>:
>>> >> >> >> >> >> >> >> >>>> >>>
>>> >> >> >> >> >> >> >> >>>> >>>  Hi Rafa,
>>> >> >> >> >> >> >> >> >>>> >>>>
>>> >> >> >> >> >> >> >> >>>> >>>> I don't yet have a concrete heursitic but
>>> I'm
>>> >> >> >> >> >> >> >> >>>> >>>> working on
>>> >> >> >> >> >> it.
>>> >> >> >> >> >> >> I'll
>>> >> >> >> >> >> >> >> >>>> provide
>>> >> >> >> >> >> >> >> >>>> >>>> it here so that you guys can give me a
>>> >> feedback on
>>> >> >> >> >> >> >> >> >>>> >>>> it.
>>> >> >> >> >> >> >> >> >>>> >>>>
>>> >> >> >> >> >> >> >> >>>> >>>> What are "locality" features?
>>> >> >> >> >> >> >> >> >>>> >>>>
>>> >> >> >> >> >> >> >> >>>> >>>> I looked at Bart and other coref tools
>>> such as
>>> >> >> >> >> >> >> >> >>>> >>>> ArkRef
>>> >> >> >> >> and
>>> >> >> >> >> >> >> >> >>>> CherryPicker
>>> >> >> >> >> >> >> >> >>>> >>>> and
>>> >> >> >> >> >> >> >> >>>> >>>> they don't provide such a coreference.
>>> >> >> >> >> >> >> >> >>>> >>>>
>>> >> >> >> >> >> >> >> >>>> >>>> Cristian
>>> >> >> >> >> >> >> >> >>>> >>>>
>>> >> >> >> >> >> >> >> >>>> >>>>
>>> >> >> >> >> >> >> >> >>>> >>>> 2014-01-30 Rafa Haro <rh...@apache.org>:
>>> >> >> >> >> >> >> >> >>>> >>>>
>>> >> >> >> >> >> >> >> >>>> >>>> Hi Cristian,
>>> >> >> >> >> >> >> >> >>>> >>>>
>>> >> >> >> >> >> >> >> >>>> >>>>> Without having more details about your
>>> >> concrete
>>> >> >> >> >> heuristic,
>>> >> >> >> >> >> >> in my
>>> >> >> >> >> >> >> >> >>>> honest
>>> >> >> >> >> >> >> >> >>>> >>>>> opinion, such approach could produce a
>>> lot of
>>> >> >> false
>>> >> >> >> >> >> >> positives. I
>>> >> >> >> >> >> >> >> >>>> don't
>>> >> >> >> >> >> >> >> >>>> >>>>> know
>>> >> >> >> >> >> >> >> >>>> >>>>> if you are planning to use some "locality"
>>> >> >> features
>>> >> >> >> >> >> >> >> >>>> >>>>> to
>>> >> >> >> >> >> detect
>>> >> >> >> >> >> >> >> such
>>> >> >> >> >> >> >> >> >>>> >>>>> coreferences but you need to take into
>>> account
>>> >> >> that
>>> >> >> >> >> >> >> >> >>>> >>>>> it
>>> >> >> >> >> is
>>> >> >> >> >> >> >> quite
>>> >> >> >> >> >> >> >> >>>> usual
>>> >> >> >> >> >> >> >> >>>> >>>>> that
>>> >> >> >> >> >> >> >> >>>> >>>>> coreferenced mentions can occurs even in
>>> >> >> different
>>> >> >> >> >> >> >> paragraphs.
>>> >> >> >> >> >> >> >> >>>> Although
>>> >> >> >> >> >> >> >> >>>> >>>>> I'm
>>> >> >> >> >> >> >> >> >>>> >>>>> not an expert in Natural Language
>>> >> Understanding,
>>> >> >> I
>>> >> >> >> >> would
>>> >> >> >> >> >> say
>>> >> >> >> >> >> >> it
>>> >> >> >> >> >> >> >> is
>>> >> >> >> >> >> >> >> >>>> quite
>>> >> >> >> >> >> >> >> >>>> >>>>> difficult to get decent precision/recall
>>> rates
>>> >> >> for
>>> >> >> >> >> >> >> coreferencing
>>> >> >> >> >> >> >> >> >>>> using
>>> >> >> >> >> >> >> >> >>>> >>>>> fixed rules. Maybe you can give a try to
>>> >> others
>>> >> >> >> >> >> >> >> >>>> >>>>> tools
>>> >> >> >> >> like
>>> >> >> >> >> >> >> BART
>>> >> >> >> >> >> >> >> (
>>> >> >> >> >> >> >> >> >>>> >>>>> http://www.bart-coref.org/).
>>> >> >> >> >> >> >> >> >>>> >>>>>
>>> >> >> >> >> >> >> >> >>>> >>>>> Cheers,
>>> >> >> >> >> >> >> >> >>>> >>>>> Rafa Haro
>>> >> >> >> >> >> >> >> >>>> >>>>>
>>> >> >> >> >> >> >> >> >>>> >>>>> El 30/01/14 10:33, Cristian Petroaca
>>> escribió:
>>> >> >> >> >> >> >> >> >>>> >>>>>
>>> >> >> >> >> >> >> >> >>>> >>>>>   Hi,
>>> >> >> >> >> >> >> >> >>>> >>>>>
>>> >> >> >> >> >> >> >> >>>> >>>>>> One of the necessary steps for
>>> implementing
>>> >> the
>>> >> >> >> >> >> >> >> >>>> >>>>>> Event
>>> >> >> >> >> >> >> >> extraction
>>> >> >> >> >> >> >> >> >>>> Engine
>>> >> >> >> >> >> >> >> >>>> >>>>>> feature :
>>> >> >> >> >> >> >> https://issues.apache.org/jira/browse/STANBOL-1121is
>>> >> >> >> >> >> >> >> >>>> to
>>> >> >> >> >> >> >> >> >>>> >>>>>> have
>>> >> >> >> >> >> >> >> >>>> >>>>>> coreference resolution in the given text.
>>> >> This
>>> >> >> is
>>> >> >> >> >> >> provided
>>> >> >> >> >> >> >> now
>>> >> >> >> >> >> >> >> >>>> via the
>>> >> >> >> >> >> >> >> >>>> >>>>>> stanford-nlp project but as far as I saw
>>> this
>>> >> >> >> >> >> >> >> >>>> >>>>>> module
>>> >> >> >> >> is
>>> >> >> >> >> >> >> >> performing
>>> >> >> >> >> >> >> >> >>>> >>>>>> mostly
>>> >> >> >> >> >> >> >> >>>> >>>>>> pronomial (He, She) or nominal (Barack
>>> Obama
>>> >> and
>>> >> >> >> >> >> >> >> >>>> >>>>>> Mr.
>>> >> >> >> >> >> Obama)
>>> >> >> >> >> >> >> >> >>>> coreference
>>> >> >> >> >> >> >> >> >>>> >>>>>> resolution.
>>> >> >> >> >> >> >> >> >>>> >>>>>>
>>> >> >> >> >> >> >> >> >>>> >>>>>> In order to get more coreferences from
>>> the
>>> >> text
>>> >> >> I
>>> >> >> >> >> though
>>> >> >> >> >> >> of
>>> >> >> >> >> >> >> >> >>>> creating
>>> >> >> >> >> >> >> >> >>>> >>>>>> some
>>> >> >> >> >> >> >> >> >>>> >>>>>> logic that would detect this kind of
>>> >> >> coreference :
>>> >> >> >> >> >> >> >> >>>> >>>>>> "Apple reaches new profit heights. The
>>> >> software
>>> >> >> >> >> company
>>> >> >> >> >> >> just
>>> >> >> >> >> >> >> >> >>>> announced
>>> >> >> >> >> >> >> >> >>>> >>>>>> its
>>> >> >> >> >> >> >> >> >>>> >>>>>> 2013 earnings."
>>> >> >> >> >> >> >> >> >>>> >>>>>> Here "The software company" obviously
>>> refers
>>> >> to
>>> >> >> >> >> "Apple".
>>> >> >> >> >> >> >> >> >>>> >>>>>> So I'd like to detect coreferences of
>>> Named
>>> >> >> >> >> >> >> >> >>>> >>>>>> Entities
>>> >> >> >> >> >> which
>>> >> >> >> >> >> >> are
>>> >> >> >> >> >> >> >> of
>>> >> >> >> >> >> >> >> >>>> the
>>> >> >> >> >> >> >> >> >>>> >>>>>> rdf:type of the Named Entity , in this
>>> case
>>> >> >> >> >> >> >> >> >>>> >>>>>> "company"
>>> >> >> >> >> and
>>> >> >> >> >> >> >> also
>>> >> >> >> >> >> >> >> >>>> have
>>> >> >> >> >> >> >> >> >>>> >>>>>> attributes which can be found in the
>>> dbpedia
>>> >> >> >> >> categories
>>> >> >> >> >> >> of
>>> >> >> >> >> >> >> the
>>> >> >> >> >> >> >> >> >>>> named
>>> >> >> >> >> >> >> >> >>>> >>>>>> entity, in this case "software".
>>> >> >> >> >> >> >> >> >>>> >>>>>>
>>> >> >> >> >> >> >> >> >>>> >>>>>> The detection of coreferences such as
>>> "The
>>> >> >> >> >> >> >> >> >>>> >>>>>> software
>>> >> >> >> >> >> >> company" in
>>> >> >> >> >> >> >> >> >>>> the
>>> >> >> >> >> >> >> >> >>>> >>>>>> text
>>> >> >> >> >> >> >> >> >>>> >>>>>> would also be done by either using the
>>> new
>>> >> Pos
>>> >> >> Tag
>>> >> >> >> >> Based
>>> >> >> >> >> >> >> Phrase
>>> >> >> >> >> >> >> >> >>>> >>>>>> extraction
>>> >> >> >> >> >> >> >> >>>> >>>>>> Engine (noun phrases) or by using a
>>> >> dependency
>>> >> >> >> >> >> >> >> >>>> >>>>>> tree of
>>> >> >> >> >> >> the
>>> >> >> >> >> >> >> >> >>>> sentence and
>>> >> >> >> >> >> >> >> >>>> >>>>>> picking up only subjects or objects.
>>> >> >> >> >> >> >> >> >>>> >>>>>>
>>> >> >> >> >> >> >> >> >>>> >>>>>> At this point I'd like to know if this
>>> kind
>>> >> of
>>> >> >> >> >> >> >> >> >>>> >>>>>> logic
>>> >> >> >> >> >> would
>>> >> >> >> >> >> >> be
>>> >> >> >> >> >> >> >> >>>> useful
>>> >> >> >> >> >> >> >> >>>> >>>>>> as a
>>> >> >> >> >> >> >> >> >>>> >>>>>> separate Enhancement Engine (in case the
>>> >> >> precision
>>> >> >> >> >> >> >> >> >>>> >>>>>> and
>>> >> >> >> >> >> >> recall
>>> >> >> >> >> >> >> >> are
>>> >> >> >> >> >> >> >> >>>> good
>>> >> >> >> >> >> >> >> >>>> >>>>>> enough) in Stanbol?
>>> >> >> >> >> >> >> >> >>>> >>>>>>
>>> >> >> >> >> >> >> >> >>>> >>>>>> Thanks,
>>> >> >> >> >> >> >> >> >>>> >>>>>> Cristian
>>> >> >> >> >> >> >> >> >>>> >>>>>>
>>> >> >> >> >> >> >> >> >>>> >>>>>>
>>> >> >> >> >> >> >> >> >>>> >>>>>>
>>> >> >> >> >> >> >> >> >>>> >>



-- 
| Rupert Westenthaler             rupert.westenthaler@gmail.com
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen

Re: Named entity coref resolution based on dbpedia categories and rdf:type

Posted by Cristian Petroaca <cr...@gmail.com>.
Hi guys,

After Rupert's last suggestions related to this enhancement engine, I
devised a more comprehensive algorithm for matching the noun phrases
against the NER properties. Please take a look and let me know what you
think. Thanks.

The following rules will be applied to every noun phrase in order to find
co-references (rough code sketches for some of the rules are given after
the list):

1. For each NER prior to the current noun phrase in the text match the
yago:class label to the contents of the noun phrase.

For the NERs which have a yago:class which matches, apply:

2. Group membership rules:

    a. spatial membership: the NER is part of a Location. If the noun
phrase contains a LOCATION or a demonym, check the location properties
of the matching NER.

    If the matching NER is a:
    - person, match against :birthPlace, :region, :nationality
    - organisation, match against :foundationPlace, :locationCity,
:location, :hometown
    - place, match against :country, :subdivisionName, :location

    Ex: The Italian President, The Richmond-based company

    b. organisational membership: the NER is part of an Organisation. If
the noun phrase contains an ORGANISATION, check the following
properties of the matching NER:

    If the matching NER is a:
    - person, match against :occupation, :associatedActs
    - organisation ?
    - location ?

Ex: The Microsoft executive, The Pink Floyd singer

3. Functional description rule: the noun phrase describes what the NER does
conceptually.
If there are no NERs in the noun phrase, match the following properties
of the matching NER against the contents of the noun phrase (aside from
the nouns which are part of the yago:class):

   If the NER is a:
   - person ?
   - organisation, match against :service, :industry, :genre
   - location ?

Ex: The software company.

4. If no matches were found for the current NER with rules 2 or 3, but the
yago:class which matched contains more than 2 nouns, we still consider this
a good co-reference, possibly with a lower confidence.

Ex: The former tennis player, the theoretical physicist.

5. Based on the number of nouns which matched, we compute a confidence
level. The number of matched nouns cannot be lower than 2 and there must
be a yago:class match.

Of all NERs which got to this point, select the ones closest in the text
to the noun phrase which matched against the same properties (yago:class
and dbpedia) and mark them as co-references.

Note: all noun phrases need to be lemmatized before any of this is applied,
in case they contain plurals.
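
To make the rules above a bit more concrete, here is a minimal, illustrative
Java sketch of how they could fit together. It is not engine code: all type
and helper names (NerMention, resolve, ...) are made up, and it assumes the
yago:class label nouns and the relevant dbpedia property values have already
been fetched (e.g. from the dereferenced Entityhub data) and lemmatized.

import java.util.*;

public class CorefRuleSketch {

    /** Minimal stand-in for a NER found earlier in the text (hypothetical). */
    static class NerMention {
        int offset;                                  // start offset of the NER in the text
        Set<String> yagoClassNouns = new HashSet<>(); // lemmatized nouns of the yago:class label
        Map<String, Set<String>> props = new HashMap<>(); // dbpedia property -> lemmatized value nouns
    }

    /** Returns the best matching antecedent for a noun phrase, or null. */
    static NerMention resolve(List<String> phraseNouns, int phraseOffset,
                              List<NerMention> previousNers) {
        NerMention best = null;
        int bestScore = 0;
        for (NerMention ner : previousNers) {
            if (ner.offset >= phraseOffset) {
                continue;                             // only NERs prior to the noun phrase
            }
            Set<String> classMatches = new HashSet<>(phraseNouns);
            classMatches.retainAll(ner.yagoClassNouns);   // rule 1: yago:class match
            if (classMatches.isEmpty()) {
                continue;
            }
            int score = classMatches.size();
            for (Set<String> values : ner.props.values()) {   // rules 2 and 3
                for (String noun : phraseNouns) {
                    if (values.contains(noun)) {
                        score++;
                    }
                }
            }
            // rule 4: accept a pure yago:class match if the label has more than 2 nouns;
            // rule 5: otherwise require at least 2 matched nouns in total
            if (score < 2 && ner.yagoClassNouns.size() <= 2) {
                continue;
            }
            // rule 5: on equal score prefer the NER closest to the noun phrase
            if (score > bestScore
                    || (score == bestScore && best != null && ner.offset > best.offset)) {
                best = ner;
                bestScore = score;
            }
        }
        return best;
    }
}

Applied to the earlier example, "The software company" following a Microsoft
NER would match on both the yago:class label and dbpprop:industry = Software.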


2014-03-25 20:50 GMT+02:00 Cristian Petroaca <cr...@gmail.com>:

> That worked. Thanks.
>
> So, there are no exceptions during the startup of the launcher.
> The component tab in the felix console shows 6 WeightedChains the first
> time, including the default one but after my changes and a restart there
> are only 5 - the default one is missing altogether.
>
>
> 2014-03-24 20:18 GMT+02:00 Rupert Westenthaler <
> rupert.westenthaler@gmail.com>:
>
> Hi Cristian,
>>
>> I do see the same problem since last Friday. The solution as mentions
>> by [1] works for me.
>>
>>     mvn -Djsse.enableSNIExtension=false {goals}
>>
>> No Idea why https connections to github do currently cause this. I
>> could not find anything related via Google. So I suggest to use the
>> system property for now. If this persists for longer we can adapt the
>> build files accordingly.
>>
>> best
>> Rupert
>>
>>
>>
>>
>> [1]
>> http://stackoverflow.com/questions/7615645/ssl-handshake-alert-unrecognized-name-error-since-upgrade-to-java-1-7-0
>>
>> On Mon, Mar 24, 2014 at 7:01 PM, Cristian Petroaca
>> <cr...@gmail.com> wrote:
>> > I did a clean on the whole project and now I wanted to do another "mvn
>> > clean install" but I am getting this :
>> >
>> > "[INFO]
>> > ------------------------------------------------------------------------
>> > [ERROR] Failed to execute goal
>> > org.apache.maven.plugins:maven-antrun-plugin:1.6:
>> > run (download) on project org.apache.stanbol.data.opennlp.lang.es: An
>> Ant
>> > BuildE
>> > xception has occured: The following error occurred while executing this
>> > line:
>> > [ERROR]
>> > C:\Data\Projects\Stanbol\main\data\opennlp\lang\es\download_models.xml:3
>> > 3: Failed to copy
>> > https://github.com/utcompling/OpenNLP-Models/raw/58ef0c6003140
>> > 3e66e47ae35edaf58d3478b67af/models/es/opennlp-es-maxent-pos-es.bin to
>> > C:\Data\Pr
>> >
>> ojects\Stanbol\main\data\opennlp\lang\es\downloads\resources\org\apache\stanbol\
>> > data\opennlp\es-pos-maxent.bin due to javax.net.ssl.SSLProtocolException
>> > handshake alert : unrecognized_name"
>> >
>> >
>> >
>> > 2014-03-20 11:25 GMT+02:00 Rupert Westenthaler <
>> > rupert.westenthaler@gmail.com>:
>> >
>> >> Hi Cristian,
>> >>
>> >> On Thu, Mar 20, 2014 at 10:00 AM, Cristian Petroaca
>> >> <cr...@gmail.com> wrote:
>> >> >
>> >>
>> stanbol.enhancer.chain.weighted.chain=["tika;optional","langdetect","opennlp-sentence","opennlp-token","opennlp-pos","opennlp-ner","dbpediaLinking","entityhubExtraction","dbpedia-dereference","pos-chunker"]
>> >> > service.ranking=I"-2147483648"
>> >> > stanbol.enhancer.chain.name="default"
>> >>
>> >> Does look fine to me. Do you see any exception during the startup of
>> >> the launcher. Can you check the status of this component in the
>> >> component tab of the felix web console [1] (search for
>> >> "org.apache.stanbol.enhancer.chain.weighted.impl.WeightedChain"). If
>> >> you have multiple you can find the correct one by comparing the
>> >> "Properties" with those in the configuration file.
>> >>
>> >> I guess that the according service is in the 'unsatisfied' as you do
>> >> not see it in the web interface. But if this is the case you should
>> >> also see the according exception in the log. You can also manually
>> >> stop/start the component. In this case the exception should be
>> >> re-thrown and you do not need to search the log for it.
>> >>
>> >> best
>> >> Rupert
>> >>
>> >>
>> >> [1] http://localhost:8080/system/console/components
>> >>
>> >> >
>> >> >
>> >> >
>> >> > 2014-03-20 7:39 GMT+02:00 Rupert Westenthaler <
>> >> rupert.westenthaler@gmail.com
>> >> >>:
>> >> >
>> >> >> Hi Cristian,
>> >> >>
>> >> >> you can not send attachments to the list. Please copy the contents
>> >> >> directly to the mail
>> >> >>
>> >> >> thx
>> >> >> Rupert
>> >> >>
>> >> >> On Wed, Mar 19, 2014 at 9:20 PM, Cristian Petroaca
>> >> >> <cr...@gmail.com> wrote:
>> >> >> > The config attached.
>> >> >> >
>> >> >> >
>> >> >> > 2014-03-19 9:09 GMT+02:00 Rupert Westenthaler
>> >> >> > <ru...@gmail.com>:
>> >> >> >
>> >> >> >> Hi Cristian,
>> >> >> >>
>> >> >> >> can you provide the contents of the chain after your
>> modifications?
>> >> >> >> Would be interesting to test why the chain is no longer active
>> after
>> >> >> >> the restart.
>> >> >> >>
>> >> >> >> You can find the config file in the 'stanbol/fileinstall' folder.
>> >> >> >>
>> >> >> >> best
>> >> >> >> Rupert
>> >> >> >>
>> >> >> >> On Tue, Mar 18, 2014 at 8:24 PM, Cristian Petroaca
>> >> >> >> <cr...@gmail.com> wrote:
>> >> >> >> > Related to the default chain selection rules : before restart I
>> >> had a
>> >> >> >> > chain
>> >> >> >> > with the name 'default' as in I could access it via
>> >> >> >> > enhancer/chain/default.
>> >> >> >> > Then I just added another engine to the 'default' chain. I
>> assumed
>> >> >> that
>> >> >> >> > after the restart the chain with the 'default' name would be
>> >> >> persisted.
>> >> >> >> > So
>> >> >> >> > the first rule should have been applied after the restart as
>> well.
>> >> But
>> >> >> >> > instead I cannot reach it via enhancer/chain/default anymore
>> so its
>> >> >> >> > gone.
>> >> >> >> > Anyway, this is not a big deal, it's not blocking me in any
>> way, I
>> >> >> just
>> >> >> >> > wanted to understand where the problem is.
>> >> >> >> >
>> >> >> >> >
>> >> >> >> > 2014-03-18 7:15 GMT+02:00 Rupert Westenthaler
>> >> >> >> > <rupert.westenthaler@gmail.com
>> >> >> >> >>:
>> >> >> >> >
>> >> >> >> >> Hi Cristian
>> >> >> >> >>
>> >> >> >> >> On Mon, Mar 17, 2014 at 9:43 PM, Cristian Petroaca
>> >> >> >> >> <cr...@gmail.com> wrote:
>> >> >> >> >> > 1. Updated to the latest code and it's gone. Cool
>> >> >> >> >> >
>> >> >> >> >> > 2. I start the stable launcher -> create a new instance of
>> the
>> >> >> >> >> > PosChunkerEngine -> add it to the default chain. At this
>> point
>> >> >> >> >> > everything
>> >> >> >> >> > looks good and works ok.
>> >> >> >> >> > After I restart the server the default chain is gone and
>> >> instead I
>> >> >> >> >> > see
>> >> >> >> >> this
>> >> >> >> >> > in the enhancement chains page : all-active (default, id:
>> 149,
>> >> >> >> >> > ranking:
>> >> >> >> >> 0,
>> >> >> >> >> > impl: AllActiveEnginesChain ). all-active did not contain
>> the
>> >> >> >> >> > 'default'
>> >> >> >> >> > word before the restart.
>> >> >> >> >> >
>> >> >> >> >>
>> >> >> >> >> Please note the default chain selection rules as described at
>> [1].
>> >> >> You
>> >> >> >> >> can also access chains chains under
>> '/enhancer/chain/{chain-name}'
>> >> >> >> >>
>> >> >> >> >> best
>> >> >> >> >> Rupert
>> >> >> >> >>
>> >> >> >> >> [1]
>> >> >> >> >>
>> >> >> >> >>
>> >> >>
>> >>
>> http://stanbol.staging.apache.org/docs/trunk/components/enhancer/chains/#default-chain
>> >> >> >> >>
>> >> >> >> >> > It looks like the config files are exactly what I need.
>> Thanks.
>> >> >> >> >> >
>> >> >> >> >> >
>> >> >> >> >> > 2014-03-17 9:26 GMT+02:00 Rupert Westenthaler <
>> >> >> >> >> rupert.westenthaler@gmail.com
>> >> >> >> >> >>:
>> >> >> >> >> >
>> >> >> >> >> >> On Sat, Mar 15, 2014 at 8:34 PM, Cristian Petroaca
>> >> >> >> >> >> <cr...@gmail.com> wrote:
>> >> >> >> >> >> > Thanks Rupert.
>> >> >> >> >> >> >
>> >> >> >> >> >> > A couple more questions/issues :
>> >> >> >> >> >> >
>> >> >> >> >> >> > 1. Whenever I start the stanbol server I'm seeing this
>> in the
>> >> >> >> >> >> > console
>> >> >> >> >> >> > output :
>> >> >> >> >> >> >
>> >> >> >> >> >>
>> >> >> >> >> >> This should be fixed with STANBOL-1278 [1] [2]
>> >> >> >> >> >>
>> >> >> >> >> >> > 2. Whenever I restart the server the Weighted Chains get
>> >> messed
>> >> >> >> >> >> > up. I
>> >> >> >> >> >> > usually use the 'default' chain and add my engine to it
>> so
>> >> there
>> >> >> >> >> >> > are
>> >> >> >> >> 11
>> >> >> >> >> >> > engines in it. After the restart this chain now contains
>> >> around
>> >> >> 23
>> >> >> >> >> >> engines
>> >> >> >> >> >> > in total.
>> >> >> >> >> >>
>> >> >> >> >> >> I was not able to replicate this. What I tried was
>> >> >> >> >> >>
>> >> >> >> >> >> (1) start up the stable launcher
>> >> >> >> >> >> (2) add an additional engine to the default chain
>> >> >> >> >> >> (3) restart the launcher
>> >> >> >> >> >>
>> >> >> >> >> >> The default chain was not changed after (2) and (3). So I
>> would
>> >> >> need
>> >> >> >> >> >> further information for knowing why this is happening.
>> >> >> >> >> >>
>> >> >> >> >> >> Generally it is better to create you own chain instance as
>> >> >> modifying
>> >> >> >> >> >> one that is provided by the default configuration. I would
>> also
>> >> >> >> >> >> recommend that you keep your test configuration in text
>> files
>> >> and
>> >> >> to
>> >> >> >> >> >> copy those to the 'stanbol/fileinstall' folder. Doing so
>> >> prevent
>> >> >> you
>> >> >> >> >> >> from manually entering the configuration after a software
>> >> update.
>> >> >> >> >> >> The
>> >> >> >> >> >> production-mode section [3] provides information on how to
>> do
>> >> >> that.
>> >> >> >> >> >>
>> >> >> >> >> >> best
>> >> >> >> >> >> Rupert
>> >> >> >> >> >>
>> >> >> >> >> >> [1] https://issues.apache.org/jira/browse/STANBOL-1278
>> >> >> >> >> >> [2] http://svn.apache.org/r1576623
>> >> >> >> >> >> [3] http://stanbol.apache.org/docs/trunk/production-mode
>> >> >> >> >> >>
>> >> >> >> >> >> > ERROR: Bundle
>> org.apache.stanbol.enhancer.engine.topic.web
>> >> >> [153]:
>> >> >> >> >> Error
>> >> >> >> >> >> > starting
>> >> >> >> >> >> >
>> >> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >>
>> >> >>
>> >>
>> slinginstall:c:\Data\Projects\Stanbol\main\launchers\stable\target\stanbol\star
>> >> >> >> >> >> >
>> >> >> >> >> >> >
>> >> >>
>> tup\35\org.apache.stanbol.enhancer.engine.topic.web-1.0.0-SNAPSHOT.jar
>> >> >> >> >> >> > (org.osgi
>> >> >> >> >> >> > .framework.BundleException: Unresolved constraint in
>> bundle
>> >> >> >> >> >> > org.apache.stanbol.e
>> >> >> >> >> >> > nhancer.engine.topic.web [153]: Unable to resolve 153.0:
>> >> missing
>> >> >> >> >> >> > requirement [15
>> >> >> >> >> >> > 3.0] package; (&(package=javax.ws.rs
>> >> >> >> >> >> )(version>=0.0.0)(!(version>=2.0.0))))
>> >> >> >> >> >> > org.osgi.framework.BundleException: Unresolved
>> constraint in
>> >> >> >> >> >> > bundle
>> >> >> >> >> >> > org.apache.s
>> >> >> >> >> >> > tanbol.enhancer.engine.topic.web [153]: Unable to resolve
>> >> 153.0:
>> >> >> >> >> missing
>> >> >> >> >> >> > require
>> >> >> >> >> >> > ment [153.0] package; (&(package=javax.ws.rs
>> >> >> >> >> >> > )(version>=0.0.0)(!(version>=2.0.0))
>> >> >> >> >> >> > )
>> >> >> >> >> >> >         at
>> >> >> >> >> >>
>> org.apache.felix.framework.Felix.resolveBundle(Felix.java:3443)
>> >> >> >> >> >> >         at
>> >> >> >> >> org.apache.felix.framework.Felix.startBundle(Felix.java:1727)
>> >> >> >> >> >> >         at
>> >> >> >> >> >> >
>> >> >> >> >> >> >
>> >> >>
>> org.apache.felix.framework.Felix.setActiveStartLevel(Felix.java:1156)
>> >> >> >> >> >> >
>> >> >> >> >> >> >         at
>> >> >> >> >> >> >
>> >> >> >> >> >> >
>> >> >>
>> org.apache.felix.framework.StartLevelImpl.run(StartLevelImpl.java:264
>> >> >> >> >> >> > )
>> >> >> >> >> >> >         at java.lang.Thread.run(Unknown Source)
>> >> >> >> >> >> >
>> >> >> >> >> >> > Despite of this the server starts fine and I can use the
>> >> >> enhancer
>> >> >> >> >> fine.
>> >> >> >> >> >> Do
>> >> >> >> >> >> > you guys see this as well?
>> >> >> >> >> >> >
>> >> >> >> >> >> >
>> >> >> >> >> >> > 2. Whenever I restart the server the Weighted Chains get
>> >> messed
>> >> >> >> >> >> > up. I
>> >> >> >> >> >> > usually use the 'default' chain and add my engine to it
>> so
>> >> there
>> >> >> >> >> >> > are
>> >> >> >> >> 11
>> >> >> >> >> >> > engines in it. After the restart this chain now contains
>> >> around
>> >> >> 23
>> >> >> >> >> >> engines
>> >> >> >> >> >> > in total.
>> >> >> >> >> >> >
>> >> >> >> >> >> >
>> >> >> >> >> >> >
>> >> >> >> >> >> >
>> >> >> >> >> >> > 2014-03-11 9:47 GMT+02:00 Rupert Westenthaler <
>> >> >> >> >> >> rupert.westenthaler@gmail.com
>> >> >> >> >> >> >>:
>> >> >> >> >> >> >
>> >> >> >> >> >> >> Hi Cristian,
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> NER Annotations are typically available as both
>> >> >> >> >> >> >> NlpAnnotations.NER_ANNOTATION and  fise:TextAnnotation
>> [1]
>> >> in
>> >> >> the
>> >> >> >> >> >> >> enhancement metadata. As you are already accessing the
>> >> >> >> >> >> >> AnayzedText I
>> >> >> >> >> >> >> would prefer using the  NlpAnnotations.NER_ANNOTATION.
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> best
>> >> >> >> >> >> >> Rupert
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> [1]
>> >> >> >> >> >> >>
>> >> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >>
>> >> >>
>> >>
>> http://stanbol.apache.org/docs/trunk/components/enhancer/enhancementstructure.html#fisetextannotation
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> On Mon, Mar 10, 2014 at 10:07 PM, Cristian Petroaca
>> >> >> >> >> >> >> <cr...@gmail.com> wrote:
>> >> >> >> >> >> >> > Thanks.
>> >> >> >> >> >> >> > I assume I should get the Named entities using the
>> same
>> >> but
>> >> >> >> >> >> >> > with
>> >> >> >> >> >> >> > NlpAnnotations.NER_ANNOTATION?
>> >> >> >> >> >> >> >
>> >> >> >> >> >> >> >
>> >> >> >> >> >> >> >
>> >> >> >> >> >> >> > 2014-03-10 13:29 GMT+02:00 Rupert Westenthaler <
>> >> >> >> >> >> >> > rupert.westenthaler@gmail.com>:
>> >> >> >> >> >> >> >
>> >> >> >> >> >> >> >> Hallo Cristian,
>> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >> NounPhrases are not added to the RDF enhancement
>> results.
>> >> >> You
>> >> >> >> >> need to
>> >> >> >> >> >> >> >> use the AnalyzedText ContentPart [1]
>> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >> here is some demo code you can use in the
>> >> computeEnhancement
>> >> >> >> >> method
>> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >>         AnalysedText at =
>> >> >> >> >> >> >> >> NlpEngineHelper.getAnalysedText(this,
>> >> >> >> >> ci,
>> >> >> >> >> >> >> true);
>> >> >> >> >> >> >> >>         Iterator<? extends Section> sections =
>> >> >> >> >> >> >> >> at.getSentences();
>> >> >> >> >> >> >> >>         if(!sections.hasNext()){ //process as single
>> >> >> sentence
>> >> >> >> >> >> >> >>             sections =
>> >> Collections.singleton(at).iterator();
>> >> >> >> >> >> >> >>         }
>> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >>         while(sections.hasNext()){
>> >> >> >> >> >> >> >>             Section section = sections.next();
>> >> >> >> >> >> >> >>             Iterator<Span> chunks =
>> >> >> >> >> >> >> >> section.getEnclosed(EnumSet.of(SpanTypeEnum.Chunk));
>> >> >> >> >> >> >> >>             while(chunks.hasNext()){
>> >> >> >> >> >> >> >>                 Span chunk = chunks.next();
>> >> >> >> >> >> >> >>                 Value<PhraseTag> phrase =
>> >> >> >> >> >> >> >>
>> chunk.getAnnotation(NlpAnnotations.PHRASE_ANNOTATION);
>> >> >> >> >> >> >> >>                 if(phrase.value().getCategory() ==
>> >> >> >> >> >> >> LexicalCategory.Noun){
>> >> >> >> >> >> >> >>                     log.info(" - NounPhrase [{},{}]
>> {}",
>> >> >> new
>> >> >> >> >> >> Object[]{
>> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >> chunk.getStart(),chunk.getEnd(),chunk.getSpan()});
>> >> >> >> >> >> >> >>                 }
>> >> >> >> >> >> >> >>             }
>> >> >> >> >> >> >> >>         }
>> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >> hope this helps
>> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >> best
>> >> >> >> >> >> >> >> Rupert
>> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >> [1]
>> >> >> >> >> >> >> >>
>> >> >> >> >> >> >>
>> >> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >>
>> >> >>
>> >>
>> http://stanbol.apache.org/docs/trunk/components/enhancer/nlp/analyzedtext
>> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >> On Sun, Mar 9, 2014 at 6:07 PM, Cristian Petroaca
>> >> >> >> >> >> >> >> <cr...@gmail.com> wrote:
>> >> >> >> >> >> >> >> > I started to implement the engine and I'm having
>> >> problems
>> >> >> >> >> >> >> >> > with
>> >> >> >> >> >> getting
>> >> >> >> >> >> >> >> > results for noun phrases. I modified the "default"
>> >> >> weighted
>> >> >> >> >> chain
>> >> >> >> >> >> to
>> >> >> >> >> >> >> also
>> >> >> >> >> >> >> >> > include the PosChunkerEngine and ran a sample text
>> :
>> >> >> "Angela
>> >> >> >> >> Merkel
>> >> >> >> >> >> >> >> visted
>> >> >> >> >> >> >> >> > China. The german chancellor met with various
>> people".
>> >> I
>> >> >> >> >> expected
>> >> >> >> >> >> that
>> >> >> >> >> >> >> >> the
>> >> >> >> >> >> >> >> > RDF XML output would contain some info about the
>> noun
>> >> >> >> >> >> >> >> > phrases
>> >> >> >> >> but I
>> >> >> >> >> >> >> >> cannot
>> >> >> >> >> >> >> >> > see any.
>> >> >> >> >> >> >> >> > Could you point me to the correct way to generate
>> the
>> >> noun
>> >> >> >> >> phrases?
>> >> >> >> >> >> >> >> >
>> >> >> >> >> >> >> >> > Thanks,
>> >> >> >> >> >> >> >> > Cristian
>> >> >> >> >> >> >> >> >
>> >> >> >> >> >> >> >> >
>> >> >> >> >> >> >> >> > 2014-02-09 14:15 GMT+02:00 Cristian Petroaca <
>> >> >> >> >> >> >> >> cristian.petroaca@gmail.com>:
>> >> >> >> >> >> >> >> >
>> >> >> >> >> >> >> >> >> Opened
>> >> >> https://issues.apache.org/jira/browse/STANBOL-1279
>> >> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >> >> 2014-02-07 10:53 GMT+02:00 Cristian Petroaca <
>> >> >> >> >> >> >> >> cristian.petroaca@gmail.com>
>> >> >> >> >> >> >> >> >> :
>> >> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >> >> Hi Rupert,
>> >> >> >> >> >> >> >> >>>
>> >> >> >> >> >> >> >> >>> The "spatial" dimension is a good idea. I'll also
>> >> take a
>> >> >> >> >> >> >> >> >>> look
>> >> >> >> >> at
>> >> >> >> >> >> >> Yago.
>> >> >> >> >> >> >> >> >>>
>> >> >> >> >> >> >> >> >>> I will create a Jira with what we talked about
>> here.
>> >> It
>> >> >> >> >> >> >> >> >>> will
>> >> >> >> >> >> >> probably
>> >> >> >> >> >> >> >> >>> have just a draft-like description for now and
>> will
>> >> be
>> >> >> >> >> >> >> >> >>> updated
>> >> >> >> >> >> as I
>> >> >> >> >> >> >> go
>> >> >> >> >> >> >> >> >>> along.
>> >> >> >> >> >> >> >> >>>
>> >> >> >> >> >> >> >> >>> Thanks,
>> >> >> >> >> >> >> >> >>> Cristian
>> >> >> >> >> >> >> >> >>>
>> >> >> >> >> >> >> >> >>>
>> >> >> >> >> >> >> >> >>> 2014-02-06 15:39 GMT+02:00 Rupert Westenthaler <
>> >> >> >> >> >> >> >> >>> rupert.westenthaler@gmail.com>:
>> >> >> >> >> >> >> >> >>>
>> >> >> >> >> >> >> >> >>> Hi Cristian,
>> >> >> >> >> >> >> >> >>>>
>> >> >> >> >> >> >> >> >>>> definitely an interesting approach. You should
>> have
>> >> a
>> >> >> >> >> >> >> >> >>>> look at
>> >> >> >> >> >> Yago2
>> >> >> >> >> >> >> >> >>>> [1]. As far as I can remember the Yago taxonomy
>> is
>> >> much
>> >> >> >> >> better
>> >> >> >> >> >> >> >> >>>> structured as the one used by dbpedia. Mapping
>> >> >> >> >> >> >> >> >>>> suggestions of
>> >> >> >> >> >> >> dbpedia
>> >> >> >> >> >> >> >> >>>> to concepts in Yago2 is easy as both dbpedia and
>> >> yago2
>> >> >> do
>> >> >> >> >> >> provide
>> >> >> >> >> >> >> >> >>>> mappings [2] and [3]
>> >> >> >> >> >> >> >> >>>>
>> >> >> >> >> >> >> >> >>>> > 2014-02-05 15:39 GMT+02:00 Rafa Haro
>> >> >> >> >> >> >> >> >>>> > <rh...@apache.org>:
>> >> >> >> >> >> >> >> >>>> >>
>> >> >> >> >> >> >> >> >>>> >> "Microsoft posted its 2013 earnings. The
>> >> Redmond's
>> >> >> >> >> >> >> >> >>>> >> company
>> >> >> >> >> >> made
>> >> >> >> >> >> >> a
>> >> >> >> >> >> >> >> >>>> >> huge profit".
>> >> >> >> >> >> >> >> >>>>
>> >> >> >> >> >> >> >> >>>> Thats actually a very good example. Spatial
>> contexts
>> >> >> are
>> >> >> >> >> >> >> >> >>>> very
>> >> >> >> >> >> >> >> >>>> important as they tend to be often used for
>> >> >> referencing.
>> >> >> >> >> >> >> >> >>>> So I
>> >> >> >> >> >> would
>> >> >> >> >> >> >> >> >>>> suggest to specially treat the spatial context.
>> For
>> >> >> >> >> >> >> >> >>>> spatial
>> >> >> >> >> >> >> Entities
>> >> >> >> >> >> >> >> >>>> (like a City) this is easy, but even for other
>> >> (like a
>> >> >> >> >> Person,
>> >> >> >> >> >> >> >> >>>> Company) you could use relations to spatial
>> entities
>> >> >> >> >> >> >> >> >>>> define
>> >> >> >> >> >> their
>> >> >> >> >> >> >> >> >>>> spatial context. This context could than be
>> used to
>> >> >> >> >> >> >> >> >>>> correctly
>> >> >> >> >> >> link
>> >> >> >> >> >> >> >> >>>> "The Redmond's company" to "Microsoft".
>> >> >> >> >> >> >> >> >>>>
>> >> >> >> >> >> >> >> >>>> In addition I would suggest to use the "spatial"
>> >> >> context
>> >> >> >> >> >> >> >> >>>> of
>> >> >> >> >> each
>> >> >> >> >> >> >> >> >>>> entity (basically relation to entities that are
>> >> cities,
>> >> >> >> >> regions,
>> >> >> >> >> >> >> >> >>>> countries) as a separate dimension, because
>> those
>> >> are
>> >> >> >> >> >> >> >> >>>> very
>> >> >> >> >> often
>> >> >> >> >> >> >> used
>> >> >> >> >> >> >> >> >>>> for coreferences.
>> >> >> >> >> >> >> >> >>>>
>> >> >> >> >> >> >> >> >>>> [1] http://www.mpi-inf.mpg.de/yago-naga/yago/
>> >> >> >> >> >> >> >> >>>> [2]
>> >> >> >> >> >> >> >> >>>>
>> >> >> http://downloads.dbpedia.org/3.9/links/yago_links.nt.bz2
>> >> >> >> >> >> >> >> >>>> [3]
>> >> >> >> >> >> >> >> >>>>
>> >> >> >> >> >> >> >>
>> >> >> >> >> >> >>
>> >> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >>
>> >> >>
>> >>
>> http://www.mpi-inf.mpg.de/yago-naga/yago/download/yago/yagoDBpediaInstances.ttl.7z
>> >> >> >> >> >> >> >> >>>>
>> >> >> >> >> >> >> >> >>>>
>> >> >> >> >> >> >> >> >>>> On Thu, Feb 6, 2014 at 10:33 AM, Cristian
>> Petroaca
>> >> >> >> >> >> >> >> >>>> <cr...@gmail.com> wrote:
>> >> >> >> >> >> >> >> >>>> > There are several dbpedia categories for each
>> >> entity,
>> >> >> >> >> >> >> >> >>>> > in
>> >> >> >> >> this
>> >> >> >> >> >> >> case
>> >> >> >> >> >> >> >> for
>> >> >> >> >> >> >> >> >>>> > Microsoft we have :
>> >> >> >> >> >> >> >> >>>> >
>> >> >> >> >> >> >> >> >>>> > category:Companies_in_the_NASDAQ-100_Index
>> >> >> >> >> >> >> >> >>>> > category:Microsoft
>> >> >> >> >> >> >> >> >>>> >
>> category:Software_companies_of_the_United_States
>> >> >> >> >> >> >> >> >>>> >
>> >> >> category:Software_companies_based_in_Washington_(state)
>> >> >> >> >> >> >> >> >>>> > category:Companies_established_in_1975
>> >> >> >> >> >> >> >> >>>> >
>> category:1975_establishments_in_the_United_States
>> >> >> >> >> >> >> >> >>>> >
>> category:Companies_based_in_Redmond,_Washington
>> >> >> >> >> >> >> >> >>>> >
>> >> >> >> >> >> >>
>> >> >> >> >> >> >>
>> >> >> category:Multinational_companies_headquartered_in_the_United_States
>> >> >> >> >> >> >> >> >>>> > category:Cloud_computing_providers
>> >> >> >> >> >> >> >> >>>> >
>> >> >> category:Companies_in_the_Dow_Jones_Industrial_Average
>> >> >> >> >> >> >> >> >>>> >
>> >> >> >> >> >> >> >> >>>> > So we also have "Companies based in
>> >> >> Redmont,Washington"
>> >> >> >> >> which
>> >> >> >> >> >> >> could
>> >> >> >> >> >> >> >> be
>> >> >> >> >> >> >> >> >>>> > matched.
>> >> >> >> >> >> >> >> >>>> >
>> >> >> >> >> >> >> >> >>>> >
>> >> >> >> >> >> >> >> >>>> > There is still other contextual information
>> from
>> >> >> >> >> >> >> >> >>>> > dbpedia
>> >> >> >> >> which
>> >> >> >> >> >> >> can
>> >> >> >> >> >> >> >> be
>> >> >> >> >> >> >> >> >>>> used.
>> >> >> >> >> >> >> >> >>>> > For example for an Organization we could also
>> >> >> include :
>> >> >> >> >> >> >> >> >>>> > dbpprop:industry = Software
>> >> >> >> >> >> >> >> >>>> > dbpprop:service = Online Service Providers
>> >> >> >> >> >> >> >> >>>> >
>> >> >> >> >> >> >> >> >>>> > and for a Person (that's for Barack Obama) :
>> >> >> >> >> >> >> >> >>>> >
>> >> >> >> >> >> >> >> >>>> > dbpedia-owl:profession:
>> >> >> >> >> >> >> >> >>>> >                                dbpedia:Author
>> >> >> >> >> >> >> >> >>>> >
>> >> >> >> >> >> >> >> >>>> > dbpedia:Constitutional_law
>> >> >> >> >> >> >> >> >>>> >                                dbpedia:Lawyer
>> >> >> >> >> >> >> >> >>>> >
>> >> >> >> >> >> >> >> >>>> > dbpedia:Community_organizing
>> >> >> >> >> >> >> >> >>>> >
>> >> >> >> >> >> >> >> >>>> > I'd like to continue investigating this as I
>> think
>> >> >> that
>> >> >> >> >> >> >> >> >>>> > it
>> >> >> >> >> may
>> >> >> >> >> >> >> have
>> >> >> >> >> >> >> >> >>>> some
>> >> >> >> >> >> >> >> >>>> > value in increasing the number of coreference
>> >> >> >> >> >> >> >> >>>> > resolutions
>> >> >> >> >> and
>> >> >> >> >> >> I'd
>> >> >> >> >> >> >> >> like
>> >> >> >> >> >> >> >> >>>> to
>> >> >> >> >> >> >> >> >>>> > concentrate more on precision rather than
>> recall
>> >> >> since
>> >> >> >> >> >> >> >> >>>> > we
>> >> >> >> >> >> already
>> >> >> >> >> >> >> >> have
>> >> >> >> >> >> >> >> >>>> a
>> >> >> >> >> >> >> >> >>>> > set of coreferences detected by the stanford
>> nlp
>> >> tool
>> >> >> >> >> >> >> >> >>>> > and
>> >> >> >> >> this
>> >> >> >> >> >> >> would
>> >> >> >> >> >> >> >> >>>> be as
>> >> >> >> >> >> >> >> >>>> > an addition to that (at least this is how I
>> would
>> >> >> like
>> >> >> >> >> >> >> >> >>>> > to
>> >> >> >> >> use
>> >> >> >> >> >> >> it).
>> >> >> >> >> >> >> >> >>>> >
>> >> >> >> >> >> >> >> >>>> > Is it ok if I track this by opening a jira? I
>> >> could
>> >> >> >> >> >> >> >> >>>> > update
>> >> >> >> >> it
>> >> >> >> >> >> to
>> >> >> >> >> >> >> >> show
>> >> >> >> >> >> >> >> >>>> my
>> >> >> >> >> >> >> >> >>>> > progress and also my conclusions and if it
>> turns
>> >> out
>> >> >> >> >> >> >> >> >>>> > that
>> >> >> >> >> it
>> >> >> >> >> >> was
>> >> >> >> >> >> >> a
>> >> >> >> >> >> >> >> bad
>> >> >> >> >> >> >> >> >>>> idea
>> >> >> >> >> >> >> >> >>>> > then that's the situation at least I'll end up
>> >> with
>> >> >> >> >> >> >> >> >>>> > more
>> >> >> >> >> >> >> knowledge
>> >> >> >> >> >> >> >> >>>> about
>> >> >> >> >> >> >> >> >>>> > Stanbol in the end :).
>> >> >> >> >> >> >> >> >>>> >
>> >> >> >> >> >> >> >> >>>> >
>> >> >> >> >> >> >> >> >>>> > 2014-02-05 15:39 GMT+02:00 Rafa Haro
>> >> >> >> >> >> >> >> >>>> > <rh...@apache.org>:
>> >> >> >> >> >> >> >> >>>> >
>> >> >> >> >> >> >> >> >>>> >> Hi Cristian,
>> >> >> >> >> >> >> >> >>>> >>
>> >> >> >> >> >> >> >> >>>> >> The approach sounds nice. I don't want to be
>> the
>> >> >> >> >> >> >> >> >>>> >> devil's
>> >> >> >> >> >> >> advocate
>> >> >> >> >> >> >> >> but
>> >> >> >> >> >> >> >> >>>> I'm
>> >> >> >> >> >> >> >> >>>> >> just not sure about the recall using the
>> dbpedia
>> >> >> >> >> categories
>> >> >> >> >> >> >> >> feature.
>> >> >> >> >> >> >> >> >>>> For
>> >> >> >> >> >> >> >> >>>> >> example, your sentence could be also
>> "Microsoft
>> >> >> posted
>> >> >> >> >> >> >> >> >>>> >> its
>> >> >> >> >> >> 2013
>> >> >> >> >> >> >> >> >>>> earnings.
>> >> >> >> >> >> >> >> >>>> >> The Redmond's company made a huge profit".
>> So,
>> >> maybe
>> >> >> >> >> >> including
>> >> >> >> >> >> >> more
>> >> >> >> >> >> >> >> >>>> >> contextual information from dbpedia could
>> >> increase
>> >> >> the
>> >> >> >> >> recall
>> >> >> >> >> >> >> but
>> >> >> >> >> >> >> >> of
>> >> >> >> >> >> >> >> >>>> course
>> >> >> >> >> >> >> >> >>>> >> will reduce the precision.
>> >> >> >> >> >> >> >> >>>> >>
>> >> >> >> >> >> >> >> >>>> >> Cheers,
>> >> >> >> >> >> >> >> >>>> >> Rafa
>> >> >> >> >> >> >> >> >>>> >>
>> >> >> >> >> >> >> >> >>>> >> El 04/02/14 09:50, Cristian Petroaca
>> escribió:
>> >> >> >> >> >> >> >> >>>> >>
>> >> >> >> >> >> >> >> >>>> >>  Back with a more detailed description of the
>> >> steps
>> >> >> >> >> >> >> >> >>>> >> for
>> >> >> >> >> >> making
>> >> >> >> >> >> >> this
>> >> >> >> >> >> >> >> >>>> kind of
>> >> >> >> >> >> >> >> >>>> >>> coreference work.
>> >> >> >> >> >> >> >> >>>> >>>
>> >> >> >> >> >> >> >> >>>> >>> I will be using references to the following
>> >> text in
>> >> >> >> >> >> >> >> >>>> >>> the
>> >> >> >> >> >> steps
>> >> >> >> >> >> >> >> below
>> >> >> >> >> >> >> >> >>>> in
>> >> >> >> >> >> >> >> >>>> >>> order to make things clearer : "Microsoft
>> posted
>> >> >> its
>> >> >> >> >> >> >> >> >>>> >>> 2013
>> >> >> >> >> >> >> >> earnings.
>> >> >> >> >> >> >> >> >>>> The
>> >> >> >> >> >> >> >> >>>> >>> software company made a huge profit."
>> >> >> >> >> >> >> >> >>>> >>>
>> >> >> >> >> >> >> >> >>>> >>> 1. For every noun phrase in the text which
>> has :
>> >> >> >> >> >> >> >> >>>> >>>      a. a determinate pos which implies
>> >> reference
>> >> >> to
>> >> >> >> >> >> >> >> >>>> >>> an
>> >> >> >> >> >> entity
>> >> >> >> >> >> >> >> local
>> >> >> >> >> >> >> >> >>>> to
>> >> >> >> >> >> >> >> >>>> >>> the
>> >> >> >> >> >> >> >> >>>> >>> text, such as "the, this, these") but not
>> >> "another,
>> >> >> >> >> every",
>> >> >> >> >> >> etc
>> >> >> >> >> >> >> >> which
>> >> >> >> >> >> >> >> >>>> >>> implies a reference to an entity outside of
>> the
>> >> >> text.
>> >> >> >> >> >> >> >> >>>> >>>      b. having at least another noun aside
>> from
>> >> the
>> >> >> >> >> >> >> >> >>>> >>> main
>> >> >> >> >> >> >> required
>> >> >> >> >> >> >> >> >>>> noun
>> >> >> >> >> >> >> >> >>>> >>> which
>> >> >> >> >> >> >> >> >>>> >>> further describes it. For example I will not
>> >> count
>> >> >> >> >> >> >> >> >>>> >>> "The
>> >> >> >> >> >> >> company"
>> >> >> >> >> >> >> >> as
>> >> >> >> >> >> >> >> >>>> being
>> >> >> >> >> >> >> >> >>>> >>> a
>> >> >> >> >> >> >> >> >>>> >>> legitimate candidate since this could
>> create a
>> >> lot
>> >> >> of
>> >> >> >> >> false
>> >> >> >> >> >> >> >> >>>> positives by
>> >> >> >> >> >> >> >> >>>> >>> considering the double meaning of some words
>> >> such
>> >> >> as
>> >> >> >> >> >> >> >> >>>> >>> "in
>> >> >> >> >> the
>> >> >> >> >> >> >> >> company
>> >> >> >> >> >> >> >> >>>> of
>> >> >> >> >> >> >> >> >>>> >>> good people".
>> >> >> >> >> >> >> >> >>>> >>> "The software company" is a good candidate
>> >> since we
>> >> >> >> >> >> >> >> >>>> >>> also
>> >> >> >> >> >> have
>> >> >> >> >> >> >> >> >>>> "software".
>> >> >> >> >> >> >> >> >>>> >>>
>> >> >> >> >> >> >> >> >>>> >>> 2. match the nouns in the noun phrase to the
>> >> >> contents
>> >> >> >> >> >> >> >> >>>> >>> of
>> >> >> >> >> the
>> >> >> >> >> >> >> >> dbpedia
>> >> >> >> >> >> >> >> >>>> >>> categories of each named entity found prior
>> to
>> >> the
>> >> >> >> >> location
>> >> >> >> >> >> of
>> >> >> >> >> >> >> the
>> >> >> >> >> >> >> >> >>>> noun
>> >> >> >> >> >> >> >> >>>> >>> phrase in the text.
>> >> >> >> >> >> >> >> >>>> >>> The dbpedia categories are in the following
>> >> format
>> >> >> >> >> >> >> >> >>>> >>> (for
>> >> >> >> >> >> >> Microsoft
>> >> >> >> >> >> >> >> for
>> >> >> >> >> >> >> >> >>>> >>> example) : "Software companies of the United
>> >> >> States".
>> >> >> >> >> >> >> >> >>>> >>>   So we try to match "software company" with
>> >> that.
>> >> >> >> >> >> >> >> >>>> >>> First, as you can see, the main noun in the
>> >> dbpedia
>> >> >> >> >> category
>> >> >> >> >> >> >> has a
>> >> >> >> >> >> >> >> >>>> plural
>> >> >> >> >> >> >> >> >>>> >>> form and it's the same for all categories
>> which
>> >> I
>> >> >> >> >> >> >> >> >>>> >>> saw. I
>> >> >> >> >> >> don't
>> >> >> >> >> >> >> >> know
>> >> >> >> >> >> >> >> >>>> if
>> >> >> >> >> >> >> >> >>>> >>> there's an easier way to do this but I
>> thought
>> >> of
>> >> >> >> >> applying a
>> >> >> >> >> >> >> >> >>>> lemmatizer on
>> >> >> >> >> >> >> >> >>>> >>> the category and the noun phrase in order
>> for
>> >> them
>> >> >> to
>> >> >> >> >> have a
>> >> >> >> >> >> >> >> common
>> >> >> >> >> >> >> >> >>>> >>> denominator.This also works if the noun
>> phrase
>> >> >> itself
>> >> >> >> >> has a
>> >> >> >> >> >> >> plural
>> >> >> >> >> >> >> >> >>>> form.
>> >> >> >> >> >> >> >> >>>> >>>
>> >> >> >> >> >> >> >> >>>> >>> Second, I'll need to use for comparison
>> only the
>> >> >> >> >> >> >> >> >>>> >>> words in
>> >> >> >> >> >> the
>> >> >> >> >> >> >> >> >>>> category
>> >> >> >> >> >> >> >> >>>> >>> which are themselves nouns and not
>> prepositions
>> >> or
>> >> >> >> >> >> determiners
>> >> >> >> >> >> >> >> such
>> >> >> >> >> >> >> >> >>>> as "of
>> >> >> >> >> >> >> >> >>>> >>> the".This means that I need to pos tag the
>> >> >> categories
>> >> >> >> >> >> contents
>> >> >> >> >> >> >> as
>> >> >> >> >> >> >> >> >>>> well.
>> >> >> >> >> >> >> >> >>>> >>> I was thinking of running the pos and lemma
>> on
>> >> the
>> >> >> >> >> dbpedia
>> >> >> >> >> >> >> >> >>>> categories when
>> >> >> >> >> >> >> >> >>>> >>> building the dbpedia backed entity hub and
>> >> storing
>> >> >> >> >> >> >> >> >>>> >>> them
>> >> >> >> >> for
>> >> >> >> >> >> >> later
>> >> >> >> >> >> >> >> >>>> use - I
>> >> >> >> >> >> >> >> >>>> >>> don't know how feasible this is at the
>> moment.
>> >> >> >> >> >> >> >> >>>> >>>
>> >> >> >> >> >> >> >> >>>> >>> After this I can compare each noun in the
>> noun
>> >> >> phrase
>> >> >> >> >> with
>> >> >> >> >> >> the
>> >> >> >> >> >> >> >> >>>> equivalent
>> >> >> >> >> >> >> >> >>>> >>> nouns in the categories and based on the
>> number
>> >> of
>> >> >> >> >> matches I
>> >> >> >> >> >> >> can
>> >> >> >> >> >> >> >> >>>> create a
>> >> >> >> >> >> >> >> >>>> >>> confidence level.
>> >> >> >> >> >> >> >> >>>> >>>
>> >> >> >> >> >> >> >> >>>> >>> 3. match the noun of the noun phrase with
>> the
>> >> >> >> >> >> >> >> >>>> >>> rdf:type
>> >> >> >> >> from
>> >> >> >> >> >> >> >> dbpedia
>> >> >> >> >> >> >> >> >>>> of the
>> >> >> >> >> >> >> >> >>>> >>> named entity. If this matches increase the
>> >> >> confidence
>> >> >> >> >> level.
>> >> >> >> >> >> >> >> >>>> >>>
>> >> >> >> >> >> >> >> >>>> >>> 4. If there are multiple named entities
>> which
>> >> can
>> >> >> >> >> >> >> >> >>>> >>> match a
>> >> >> >> >> >> >> certain
>> >> >> >> >> >> >> >> >>>> noun
>> >> >> >> >> >> >> >> >>>> >>> phrase then link the noun phrase with the
>> >> closest
>> >> >> >> >> >> >> >> >>>> >>> named
>> >> >> >> >> >> entity
>> >> >> >> >> >> >> >> prior
>> >> >> >> >> >> >> >> >>>> to it
>> >> >> >> >> >> >> >> >>>> >>> in the text.
>> >> >> >> >> >> >> >> >>>> >>>
>> >> >> >> >> >> >> >> >>>> >>> What do you think?
>> >> >> >> >> >> >> >> >>>> >>>
>> >> >> >> >> >> >> >> >>>> >>> Cristian
>> >> >> >> >> >> >> >> >>>> >>>
>> >> >> >> >> >> >> >> >>>> >>> 2014-01-31 Cristian Petroaca <
>> >> >> >> >> cristian.petroaca@gmail.com>:
>> >> >> >> >> >> >> >> >>>> >>>
>> >> >> >> >> >> >> >> >>>> >>>  Hi Rafa,
>> >> >> >> >> >> >> >> >>>> >>>>
>> >> >> >> >> >> >> >> >>>> >>>> I don't yet have a concrete heursitic but
>> I'm
>> >> >> >> >> >> >> >> >>>> >>>> working on
>> >> >> >> >> >> it.
>> >> >> >> >> >> >> I'll
>> >> >> >> >> >> >> >> >>>> provide
>> >> >> >> >> >> >> >> >>>> >>>> it here so that you guys can give me a
>> >> feedback on
>> >> >> >> >> >> >> >> >>>> >>>> it.
>> >> >> >> >> >> >> >> >>>> >>>>
>> >> >> >> >> >> >> >> >>>> >>>> What are "locality" features?
>> >> >> >> >> >> >> >> >>>> >>>>
>> >> >> >> >> >> >> >> >>>> >>>> I looked at Bart and other coref tools
>> such as
>> >> >> >> >> >> >> >> >>>> >>>> ArkRef
>> >> >> >> >> and
>> >> >> >> >> >> >> >> >>>> CherryPicker
>> >> >> >> >> >> >> >> >>>> >>>> and
>> >> >> >> >> >> >> >> >>>> >>>> they don't provide such a coreference.
>> >> >> >> >> >> >> >> >>>> >>>>
>> >> >> >> >> >> >> >> >>>> >>>> Cristian
>> >> >> >> >> >> >> >> >>>> >>>>
>> >> >> >> >> >> >> >> >>>> >>>>
>> >> >> >> >> >> >> >> >>>> >>>> 2014-01-30 Rafa Haro <rh...@apache.org>:
>> >> >> >> >> >> >> >> >>>> >>>>
>> >> >> >> >> >> >> >> >>>> >>>> Hi Cristian,
>> >> >> >> >> >> >> >> >>>> >>>>
>> >> >> >> >> >> >> >> >>>> >>>>> Without having more details about your
>> >> concrete
>> >> >> >> >> heuristic,
>> >> >> >> >> >> >> in my
>> >> >> >> >> >> >> >> >>>> honest
>> >> >> >> >> >> >> >> >>>> >>>>> opinion, such approach could produce a
>> lot of
>> >> >> false
>> >> >> >> >> >> >> positives. I
>> >> >> >> >> >> >> >> >>>> don't
>> >> >> >> >> >> >> >> >>>> >>>>> know
>> >> >> >> >> >> >> >> >>>> >>>>> if you are planning to use some "locality"
>> >> >> features
>> >> >> >> >> >> >> >> >>>> >>>>> to
>> >> >> >> >> >> detect
>> >> >> >> >> >> >> >> such
>> >> >> >> >> >> >> >> >>>> >>>>> coreferences but you need to take into
>> account
>> >> >> that
>> >> >> >> >> >> >> >> >>>> >>>>> it
>> >> >> >> >> is
>> >> >> >> >> >> >> quite
>> >> >> >> >> >> >> >> >>>> usual
>> >> >> >> >> >> >> >> >>>> >>>>> that
>> >> >> >> >> >> >> >> >>>> >>>>> coreferenced mentions can occurs even in
>> >> >> different
>> >> >> >> >> >> >> paragraphs.
>> >> >> >> >> >> >> >> >>>> Although
>> >> >> >> >> >> >> >> >>>> >>>>> I'm
>> >> >> >> >> >> >> >> >>>> >>>>> not an expert in Natural Language
>> >> Understanding,
>> >> >> I
>> >> >> >> >> would
>> >> >> >> >> >> say
>> >> >> >> >> >> >> it
>> >> >> >> >> >> >> >> is
>> >> >> >> >> >> >> >> >>>> quite
>> >> >> >> >> >> >> >> >>>> >>>>> difficult to get decent precision/recall
>> rates
>> >> >> for
>> >> >> >> >> >> >> coreferencing
>> >> >> >> >> >> >> >> >>>> using
>> >> >> >> >> >> >> >> >>>> >>>>> fixed rules. Maybe you can give a try to
>> >> others
>> >> >> >> >> >> >> >> >>>> >>>>> tools
>> >> >> >> >> like
>> >> >> >> >> >> >> BART
>> >> >> >> >> >> >> >> (
>> >> >> >> >> >> >> >> >>>> >>>>> http://www.bart-coref.org/).
>> >> >> >> >> >> >> >> >>>> >>>>>
>> >> >> >> >> >> >> >> >>>> >>>>> Cheers,
>> >> >> >> >> >> >> >> >>>> >>>>> Rafa Haro
>> >> >> >> >> >> >> >> >>>> >>>>>
>> >> >> >> >> >> >> >> >>>> >>>>> El 30/01/14 10:33, Cristian Petroaca
>> escribió:
>> >> >> >> >> >> >> >> >>>> >>>>>
>> >> >> >> >> >> >> >> >>>> >>>>>   Hi,
>> >> >> >> >> >> >> >> >>>> >>>>>
>> >> >> >> >> >> >> >> >>>> >>>>>> One of the necessary steps for
>> implementing
>> >> the
>> >> >> >> >> >> >> >> >>>> >>>>>> Event
>> >> >> >> >> >> >> >> extraction
>> >> >> >> >> >> >> >> >>>> Engine
>> >> >> >> >> >> >> >> >>>> >>>>>> feature :
>> >> >> >> >> >> >> https://issues.apache.org/jira/browse/STANBOL-1121is
>> >> >> >> >> >> >> >> >>>> to
>> >> >> >> >> >> >> >> >>>> >>>>>> have
>> >> >> >> >> >> >> >> >>>> >>>>>> coreference resolution in the given text.
>> >> This
>> >> >> is
>> >> >> >> >> >> provided
>> >> >> >> >> >> >> now
>> >> >> >> >> >> >> >> >>>> via the
>> >> >> >> >> >> >> >> >>>> >>>>>> stanford-nlp project but as far as I saw
>> this
>> >> >> >> >> >> >> >> >>>> >>>>>> module
>> >> >> >> >> is
>> >> >> >> >> >> >> >> performing
>> >> >> >> >> >> >> >> >>>> >>>>>> mostly
>> >> >> >> >> >> >> >> >>>> >>>>>> pronomial (He, She) or nominal (Barack
>> Obama
>> >> and
>> >> >> >> >> >> >> >> >>>> >>>>>> Mr.
>> >> >> >> >> >> Obama)
>> >> >> >> >> >> >> >> >>>> coreference
>> >> >> >> >> >> >> >> >>>> >>>>>> resolution.
>> >> >> >> >> >> >> >> >>>> >>>>>>
>> >> >> >> >> >> >> >> >>>> >>>>>> In order to get more coreferences from
>> the
>> >> text
>> >> >> I
>> >> >> >> >> though
>> >> >> >> >> >> of
>> >> >> >> >> >> >> >> >>>> creating
>> >> >> >> >> >> >> >> >>>> >>>>>> some
>> >> >> >> >> >> >> >> >>>> >>>>>> logic that would detect this kind of
>> >> >> coreference :
>> >> >> >> >> >> >> >> >>>> >>>>>> "Apple reaches new profit heights. The
>> >> software
>> >> >> >> >> company
>> >> >> >> >> >> just
>> >> >> >> >> >> >> >> >>>> announced
>> >> >> >> >> >> >> >> >>>> >>>>>> its
>> >> >> >> >> >> >> >> >>>> >>>>>> 2013 earnings."
>> >> >> >> >> >> >> >> >>>> >>>>>> Here "The software company" obviously
>> refers
>> >> to
>> >> >> >> >> "Apple".
>> >> >> >> >> >> >> >> >>>> >>>>>> So I'd like to detect coreferences of
>> Named
>> >> >> >> >> >> >> >> >>>> >>>>>> Entities
>> >> >> >> >> >> which
>> >> >> >> >> >> >> are
>> >> >> >> >> >> >> >> of
>> >> >> >> >> >> >> >> >>>> the
>> >> >> >> >> >> >> >> >>>> >>>>>> rdf:type of the Named Entity , in this
>> case
>> >> >> >> >> >> >> >> >>>> >>>>>> "company"
>> >> >> >> >> and
>> >> >> >> >> >> >> also
>> >> >> >> >> >> >> >> >>>> have
>> >> >> >> >> >> >> >> >>>> >>>>>> attributes which can be found in the
>> dbpedia
>> >> >> >> >> categories
>> >> >> >> >> >> of
>> >> >> >> >> >> >> the
>> >> >> >> >> >> >> >> >>>> named
>> >> >> >> >> >> >> >> >>>> >>>>>> entity, in this case "software".
>> >> >> >> >> >> >> >> >>>> >>>>>>
>> >> >> >> >> >> >> >> >>>> >>>>>> The detection of coreferences such as
>> "The
>> >> >> >> >> >> >> >> >>>> >>>>>> software
>> >> >> >> >> >> >> company" in
>> >> >> >> >> >> >> >> >>>> the
>> >> >> >> >> >> >> >> >>>> >>>>>> text
>> >> >> >> >> >> >> >> >>>> >>>>>> would also be done by either using the
>> new
>> >> Pos
>> >> >> Tag
>> >> >> >> >> Based
>> >> >> >> >> >> >> Phrase
>> >> >> >> >> >> >> >> >>>> >>>>>> extraction
>> >> >> >> >> >> >> >> >>>> >>>>>> Engine (noun phrases) or by using a
>> >> dependency
>> >> >> >> >> >> >> >> >>>> >>>>>> tree of
>> >> >> >> >> >> the
>> >> >> >> >> >> >> >> >>>> sentence and
>> >> >> >> >> >> >> >> >>>> >>>>>> picking up only subjects or objects.
>> >> >> >> >> >> >> >> >>>> >>>>>>
>> >> >> >> >> >> >> >> >>>> >>>>>> At this point I'd like to know if this
>> kind
>> >> of
>> >> >> >> >> >> >> >> >>>> >>>>>> logic
>> >> >> >> >> >> would
>> >> >> >> >> >> >> be
>> >> >> >> >> >> >> >> >>>> useful
>> >> >> >> >> >> >> >> >>>> >>>>>> as a
>> >> >> >> >> >> >> >> >>>> >>>>>> separate Enhancement Engine (in case the
>> >> >> precision
>> >> >> >> >> >> >> >> >>>> >>>>>> and
>> >> >> >> >> >> >> recall
>> >> >> >> >> >> >> >> are
>> >> >> >> >> >> >> >> >>>> good
>> >> >> >> >> >> >> >> >>>> >>>>>> enough) in Stanbol?
>> >> >> >> >> >> >> >> >>>> >>>>>>
>> >> >> >> >> >> >> >> >>>> >>>>>> Thanks,
>> >> >> >> >> >> >> >> >>>> >>>>>> Cristian
>> >> >> >> >> >> >> >> >>>> >>>>>>
>> >> >> >> >> >> >> >> >>>> >>>>>>
>> >> >> >> >> >> >> >> >>>> >>>>>>
>> >> >> >> >> >> >> >> >>>> >>
>> >> >> >> >> >> >> >> >>>>
>> >> >> >> >> >> >> >> >>>>
>> >> >> >> >> >> >> >> >>>>
>> >> >> >> >> >> >> >> >>>> --
>> >> >> >> >> >> >> >> >>>> | Rupert Westenthaler
>> >> >> >> >> rupert.westenthaler@gmail.com
>> >> >> >> >> >> >> >> >>>> | Bodenlehenstraße 11
>> >> >> >> >> >> >> ++43-699-11108907
>> >> >> >> >> >> >> >> >>>> | A-5500 Bischofshofen
>> >> >> >> >> >> >> >> >>>>
>> >> >> >> >> >> >> >> >>>
>> >> >> >> >> >> >> >> >>>
>> >> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >> --
>> >> >> >> >> >> >> >> | Rupert Westenthaler
>> >> >> >> >> >> >> >> rupert.westenthaler@gmail.com
>> >> >> >> >> >> >> >> | Bodenlehenstraße 11
>> >> >> >> >> ++43-699-11108907
>> >> >> >> >> >> >> >> | A-5500 Bischofshofen
>> >> >> >> >> >> >> >>
>> >> >> >> >> >> >>
>> >> >> >> >> >> >>
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> --
>> >> >> >> >> >> >> | Rupert Westenthaler
>> >> >> rupert.westenthaler@gmail.com
>> >> >> >> >> >> >> | Bodenlehenstraße 11
>> >> >> >> >> >> >> ++43-699-11108907
>> >> >> >> >> >> >> | A-5500 Bischofshofen
>> >> >> >> >> >> >>
>> >> >> >> >> >>
>> >> >> >> >> >>
>> >> >> >> >> >>
>> >> >> >> >> >> --
>> >> >> >> >> >> | Rupert Westenthaler
>> >> rupert.westenthaler@gmail.com
>> >> >> >> >> >> | Bodenlehenstraße 11
>> >> >> ++43-699-11108907
>> >> >> >> >> >> | A-5500 Bischofshofen
>> >> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >> --
>> >> >> >> >> | Rupert Westenthaler
>> rupert.westenthaler@gmail.com
>> >> >> >> >> | Bodenlehenstraße 11
>> >> ++43-699-11108907
>> >> >> >> >> | A-5500 Bischofshofen
>> >> >> >> >>
>> >> >> >>
>> >> >> >>
>> >> >> >>
>> >> >> >> --
>> >> >> >> | Rupert Westenthaler             rupert.westenthaler@gmail.com
>> >> >> >> | Bodenlehenstraße 11
>> ++43-699-11108907
>> >> >> >> | A-5500 Bischofshofen
>> >> >> >
>> >> >> >
>> >> >>
>> >> >>
>> >> >>
>> >> >> --
>> >> >> | Rupert Westenthaler             rupert.westenthaler@gmail.com
>> >> >> | Bodenlehenstraße 11                             ++43-699-11108907
>> >> >> | A-5500 Bischofshofen
>> >> >>
>> >>
>> >>
>> >>
>> >> --
>> >> | Rupert Westenthaler             rupert.westenthaler@gmail.com
>> >> | Bodenlehenstraße 11                             ++43-699-11108907
>> >> | A-5500 Bischofshofen
>> >>
>>
>>
>>
>> --
>> | Rupert Westenthaler             rupert.westenthaler@gmail.com
>> | Bodenlehenstraße 11                             ++43-699-11108907
>> | A-5500 Bischofshofen
>>
>
>

Re: Named entity coref resolution based on dbpedia categories and rdf:type

Posted by Cristian Petroaca <cr...@gmail.com>.
That worked. Thanks.

So, there are no exceptions during the startup of the launcher.
The component tab in the felix console shows 6 WeightedChains the first
time, including the default one, but after my changes and a restart there
are only 5 - the default one is missing altogether.


2014-03-24 20:18 GMT+02:00 Rupert Westenthaler <
rupert.westenthaler@gmail.com>:

> Hi Cristian,
>
> I do see the same problem since last Friday. The solution as mentions
> by [1] works for me.
>
>     mvn -Djsse.enableSNIExtension=false {goals}
>
> No Idea why https connections to github do currently cause this. I
> could not find anything related via Google. So I suggest to use the
> system property for now. If this persists for longer we can adapt the
> build files accordingly.
>
> best
> Rupert
>
>
>
>
> [1]
> http://stackoverflow.com/questions/7615645/ssl-handshake-alert-unrecognized-name-error-since-upgrade-to-java-1-7-0
>
> On Mon, Mar 24, 2014 at 7:01 PM, Cristian Petroaca
> <cr...@gmail.com> wrote:
> > I did a clean on the whole project and now I wanted to do another "mvn
> > clean install" but I am getting this :
> >
> > "[INFO]
> > ------------------------------------------------------------------------
> > [ERROR] Failed to execute goal
> > org.apache.maven.plugins:maven-antrun-plugin:1.6:
> > run (download) on project org.apache.stanbol.data.opennlp.lang.es: An
> Ant
> > BuildE
> > xception has occured: The following error occurred while executing this
> > line:
> > [ERROR]
> > C:\Data\Projects\Stanbol\main\data\opennlp\lang\es\download_models.xml:3
> > 3: Failed to copy
> > https://github.com/utcompling/OpenNLP-Models/raw/58ef0c6003140
> > 3e66e47ae35edaf58d3478b67af/models/es/opennlp-es-maxent-pos-es.bin to
> > C:\Data\Pr
> >
> ojects\Stanbol\main\data\opennlp\lang\es\downloads\resources\org\apache\stanbol\
> > data\opennlp\es-pos-maxent.bin due to javax.net.ssl.SSLProtocolException
> > handshake alert : unrecognized_name"
> >
> >
> >
> > 2014-03-20 11:25 GMT+02:00 Rupert Westenthaler <
> > rupert.westenthaler@gmail.com>:
> >
> >> Hi Cristian,
> >>
> >> On Thu, Mar 20, 2014 at 10:00 AM, Cristian Petroaca
> >> <cr...@gmail.com> wrote:
> >> >
> >>
> stanbol.enhancer.chain.weighted.chain=["tika;optional","langdetect","opennlp-sentence","opennlp-token","opennlp-pos","opennlp-ner","dbpediaLinking","entityhubExtraction","dbpedia-dereference","pos-chunker"]
> >> > service.ranking=I"-2147483648"
> >> > stanbol.enhancer.chain.name="default"
> >>
> >> Does look fine to me. Do you see any exception during the startup of
> >> the launcher. Can you check the status of this component in the
> >> component tab of the felix web console [1] (search for
> >> "org.apache.stanbol.enhancer.chain.weighted.impl.WeightedChain"). If
> >> you have multiple you can find the correct one by comparing the
> >> "Properties" with those in the configuration file.
> >>
> >> I guess that the according service is in the 'unsatisfied' as you do
> >> not see it in the web interface. But if this is the case you should
> >> also see the according exception in the log. You can also manually
> >> stop/start the component. In this case the exception should be
> >> re-thrown and you do not need to search the log for it.
> >>
> >> best
> >> Rupert
> >>
> >>
> >> [1] http://localhost:8080/system/console/components
> >>
> >> >
> >> >
> >> >
> >> > 2014-03-20 7:39 GMT+02:00 Rupert Westenthaler <
> >> rupert.westenthaler@gmail.com
> >> >>:
> >> >
> >> >> Hi Cristian,
> >> >>
> >> >> you can not send attachments to the list. Please copy the contents
> >> >> directly to the mail
> >> >>
> >> >> thx
> >> >> Rupert
> >> >>
> >> >> On Wed, Mar 19, 2014 at 9:20 PM, Cristian Petroaca
> >> >> <cr...@gmail.com> wrote:
> >> >> > The config attached.
> >> >> >
> >> >> >
> >> >> > 2014-03-19 9:09 GMT+02:00 Rupert Westenthaler
> >> >> > <ru...@gmail.com>:
> >> >> >
> >> >> >> Hi Cristian,
> >> >> >>
> >> >> >> can you provide the contents of the chain after your
> modifications?
> >> >> >> Would be interesting to test why the chain is no longer active
> after
> >> >> >> the restart.
> >> >> >>
> >> >> >> You can find the config file in the 'stanbol/fileinstall' folder.
> >> >> >>
> >> >> >> best
> >> >> >> Rupert
> >> >> >>
> >> >> >> On Tue, Mar 18, 2014 at 8:24 PM, Cristian Petroaca
> >> >> >> <cr...@gmail.com> wrote:
> >> >> >> > Related to the default chain selection rules : before restart I
> >> had a
> >> >> >> > chain
> >> >> >> > with the name 'default' as in I could access it via
> >> >> >> > enhancer/chain/default.
> >> >> >> > Then I just added another engine to the 'default' chain. I
> assumed
> >> >> that
> >> >> >> > after the restart the chain with the 'default' name would be
> >> >> persisted.
> >> >> >> > So
> >> >> >> > the first rule should have been applied after the restart as
> well.
> >> But
> >> >> >> > instead I cannot reach it via enhancer/chain/default anymore so
> its
> >> >> >> > gone.
> >> >> >> > Anyway, this is not a big deal, it's not blocking me in any
> way, I
> >> >> just
> >> >> >> > wanted to understand where the problem is.
> >> >> >> >
> >> >> >> >
> >> >> >> > 2014-03-18 7:15 GMT+02:00 Rupert Westenthaler
> >> >> >> > <rupert.westenthaler@gmail.com
> >> >> >> >>:
> >> >> >> >
> >> >> >> >> Hi Cristian
> >> >> >> >>
> >> >> >> >> On Mon, Mar 17, 2014 at 9:43 PM, Cristian Petroaca
> >> >> >> >> <cr...@gmail.com> wrote:
> >> >> >> >> > 1. Updated to the latest code and it's gone. Cool
> >> >> >> >> >
> >> >> >> >> > 2. I start the stable launcher -> create a new instance of
> the
> >> >> >> >> > PosChunkerEngine -> add it to the default chain. At this
> point
> >> >> >> >> > everything
> >> >> >> >> > looks good and works ok.
> >> >> >> >> > After I restart the server the default chain is gone and
> >> instead I
> >> >> >> >> > see
> >> >> >> >> this
> >> >> >> >> > in the enhancement chains page : all-active (default, id:
> 149,
> >> >> >> >> > ranking:
> >> >> >> >> 0,
> >> >> >> >> > impl: AllActiveEnginesChain ). all-active did not contain the
> >> >> >> >> > 'default'
> >> >> >> >> > word before the restart.
> >> >> >> >> >
> >> >> >> >>
> >> >> >> >> Please note the default chain selection rules as described at
> [1].
> >> >> You
> >> >> >> >> can also access chains chains under
> '/enhancer/chain/{chain-name}'
> >> >> >> >>
> >> >> >> >> best
> >> >> >> >> Rupert
> >> >> >> >>
> >> >> >> >> [1]
> >> >> >> >>
> >> >> >> >>
> >> >>
> >>
> http://stanbol.staging.apache.org/docs/trunk/components/enhancer/chains/#default-chain
> >> >> >> >>
> >> >> >> >> > It looks like the config files are exactly what I need.
> Thanks.
> >> >> >> >> >
> >> >> >> >> >
> >> >> >> >> > 2014-03-17 9:26 GMT+02:00 Rupert Westenthaler <
> >> >> >> >> rupert.westenthaler@gmail.com
> >> >> >> >> >>:
> >> >> >> >> >
> >> >> >> >> >> On Sat, Mar 15, 2014 at 8:34 PM, Cristian Petroaca
> >> >> >> >> >> <cr...@gmail.com> wrote:
> >> >> >> >> >> > Thanks Rupert.
> >> >> >> >> >> >
> >> >> >> >> >> > A couple more questions/issues :
> >> >> >> >> >> >
> >> >> >> >> >> > 1. Whenever I start the stanbol server I'm seeing this in
> the
> >> >> >> >> >> > console
> >> >> >> >> >> > output :
> >> >> >> >> >> >
> >> >> >> >> >>
> >> >> >> >> >> This should be fixed with STANBOL-1278 [1] [2]
> >> >> >> >> >>
> >> >> >> >> >> > 2. Whenever I restart the server the Weighted Chains get
> >> messed
> >> >> >> >> >> > up. I
> >> >> >> >> >> > usually use the 'default' chain and add my engine to it so
> >> there
> >> >> >> >> >> > are
> >> >> >> >> 11
> >> >> >> >> >> > engines in it. After the restart this chain now contains
> >> around
> >> >> 23
> >> >> >> >> >> engines
> >> >> >> >> >> > in total.
> >> >> >> >> >>
> >> >> >> >> >> I was not able to replicate this. What I tried was
> >> >> >> >> >>
> >> >> >> >> >> (1) start up the stable launcher
> >> >> >> >> >> (2) add an additional engine to the default chain
> >> >> >> >> >> (3) restart the launcher
> >> >> >> >> >>
> >> >> >> >> >> The default chain was not changed after (2) and (3). So I
> would
> >> >> need
> >> >> >> >> >> further information for knowing why this is happening.
> >> >> >> >> >>
> >> >> >> >> >> Generally it is better to create you own chain instance as
> >> >> modifying
> >> >> >> >> >> one that is provided by the default configuration. I would
> also
> >> >> >> >> >> recommend that you keep your test configuration in text
> files
> >> and
> >> >> to
> >> >> >> >> >> copy those to the 'stanbol/fileinstall' folder. Doing so
> >> prevent
> >> >> you
> >> >> >> >> >> from manually entering the configuration after a software
> >> update.
> >> >> >> >> >> The
> >> >> >> >> >> production-mode section [3] provides information on how to
> do
> >> >> that.
> >> >> >> >> >>
> >> >> >> >> >> best
> >> >> >> >> >> Rupert
> >> >> >> >> >>
> >> >> >> >> >> [1] https://issues.apache.org/jira/browse/STANBOL-1278
> >> >> >> >> >> [2] http://svn.apache.org/r1576623
> >> >> >> >> >> [3] http://stanbol.apache.org/docs/trunk/production-mode
> >> >> >> >> >>
> >> >> >> >> >> > ERROR: Bundle org.apache.stanbol.enhancer.engine.topic.web
> >> >> [153]:
> >> >> >> >> Error
> >> >> >> >> >> > starting
> >> >> >> >> >> >
> >> >> >> >> >>
> >> >> >> >>
> >> >> >> >>
> >> >>
> >>
> slinginstall:c:\Data\Projects\Stanbol\main\launchers\stable\target\stanbol\star
> >> >> >> >> >> >
> >> >> >> >> >> >
> >> >>
> tup\35\org.apache.stanbol.enhancer.engine.topic.web-1.0.0-SNAPSHOT.jar
> >> >> >> >> >> > (org.osgi
> >> >> >> >> >> > .framework.BundleException: Unresolved constraint in
> bundle
> >> >> >> >> >> > org.apache.stanbol.e
> >> >> >> >> >> > nhancer.engine.topic.web [153]: Unable to resolve 153.0:
> >> missing
> >> >> >> >> >> > requirement [15
> >> >> >> >> >> > 3.0] package; (&(package=javax.ws.rs
> >> >> >> >> >> )(version>=0.0.0)(!(version>=2.0.0))))
> >> >> >> >> >> > org.osgi.framework.BundleException: Unresolved constraint
> in
> >> >> >> >> >> > bundle
> >> >> >> >> >> > org.apache.s
> >> >> >> >> >> > tanbol.enhancer.engine.topic.web [153]: Unable to resolve
> >> 153.0:
> >> >> >> >> missing
> >> >> >> >> >> > require
> >> >> >> >> >> > ment [153.0] package; (&(package=javax.ws.rs
> >> >> >> >> >> > )(version>=0.0.0)(!(version>=2.0.0))
> >> >> >> >> >> > )
> >> >> >> >> >> >         at
> >> >> >> >> >>
> org.apache.felix.framework.Felix.resolveBundle(Felix.java:3443)
> >> >> >> >> >> >         at
> >> >> >> >> org.apache.felix.framework.Felix.startBundle(Felix.java:1727)
> >> >> >> >> >> >         at
> >> >> >> >> >> >
> >> >> >> >> >> >
> >> >> org.apache.felix.framework.Felix.setActiveStartLevel(Felix.java:1156)
> >> >> >> >> >> >
> >> >> >> >> >> >         at
> >> >> >> >> >> >
> >> >> >> >> >> >
> >> >> org.apache.felix.framework.StartLevelImpl.run(StartLevelImpl.java:264
> >> >> >> >> >> > )
> >> >> >> >> >> >         at java.lang.Thread.run(Unknown Source)
> >> >> >> >> >> >
> >> >> >> >> >> > Despite of this the server starts fine and I can use the
> >> >> enhancer
> >> >> >> >> fine.
> >> >> >> >> >> Do
> >> >> >> >> >> > you guys see this as well?
> >> >> >> >> >> >
> >> >> >> >> >> >
> >> >> >> >> >> > 2. Whenever I restart the server the Weighted Chains get
> >> messed
> >> >> >> >> >> > up. I
> >> >> >> >> >> > usually use the 'default' chain and add my engine to it so
> >> there
> >> >> >> >> >> > are
> >> >> >> >> 11
> >> >> >> >> >> > engines in it. After the restart this chain now contains
> >> around
> >> >> 23
> >> >> >> >> >> engines
> >> >> >> >> >> > in total.
> >> >> >> >> >> >
> >> >> >> >> >> >
> >> >> >> >> >> >
> >> >> >> >> >> >
> >> >> >> >> >> > 2014-03-11 9:47 GMT+02:00 Rupert Westenthaler <
> >> >> >> >> >> rupert.westenthaler@gmail.com
> >> >> >> >> >> >>:
> >> >> >> >> >> >
> >> >> >> >> >> >> Hi Cristian,
> >> >> >> >> >> >>
> >> >> >> >> >> >> NER Annotations are typically available as both
> >> >> >> >> >> >> NlpAnnotations.NER_ANNOTATION and  fise:TextAnnotation
> [1]
> >> in
> >> >> the
> >> >> >> >> >> >> enhancement metadata. As you are already accessing the
> >> >> >> >> >> >> AnayzedText I
> >> >> >> >> >> >> would prefer using the  NlpAnnotations.NER_ANNOTATION.
> >> >> >> >> >> >>
> >> >> >> >> >> >> best
> >> >> >> >> >> >> Rupert
> >> >> >> >> >> >>
> >> >> >> >> >> >> [1]
> >> >> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >>
> >> >> >> >>
> >> >>
> >>
> http://stanbol.apache.org/docs/trunk/components/enhancer/enhancementstructure.html#fisetextannotation
> >> >> >> >> >> >>
> >> >> >> >> >> >> On Mon, Mar 10, 2014 at 10:07 PM, Cristian Petroaca
> >> >> >> >> >> >> <cr...@gmail.com> wrote:
> >> >> >> >> >> >> > Thanks.
> >> >> >> >> >> >> > I assume I should get the Named entities using the same
> >> but
> >> >> >> >> >> >> > with
> >> >> >> >> >> >> > NlpAnnotations.NER_ANNOTATION?
> >> >> >> >> >> >> >
> >> >> >> >> >> >> >
> >> >> >> >> >> >> >
> >> >> >> >> >> >> > 2014-03-10 13:29 GMT+02:00 Rupert Westenthaler <
> >> >> >> >> >> >> > rupert.westenthaler@gmail.com>:
> >> >> >> >> >> >> >
> >> >> >> >> >> >> >> Hallo Cristian,
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> NounPhrases are not added to the RDF enhancement
> results.
> >> >> You
> >> >> >> >> need to
> >> >> >> >> >> >> >> use the AnalyzedText ContentPart [1]
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> here is some demo code you can use in the
> >> computeEnhancement
> >> >> >> >> method
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >>         AnalysedText at =
> >> >> >> >> >> >> >> NlpEngineHelper.getAnalysedText(this,
> >> >> >> >> ci,
> >> >> >> >> >> >> true);
> >> >> >> >> >> >> >>         Iterator<? extends Section> sections =
> >> >> >> >> >> >> >> at.getSentences();
> >> >> >> >> >> >> >>         if(!sections.hasNext()){ //process as single
> >> >> sentence
> >> >> >> >> >> >> >>             sections =
> >> Collections.singleton(at).iterator();
> >> >> >> >> >> >> >>         }
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >>         while(sections.hasNext()){
> >> >> >> >> >> >> >>             Section section = sections.next();
> >> >> >> >> >> >> >>             Iterator<Span> chunks =
> >> >> >> >> >> >> >> section.getEnclosed(EnumSet.of(SpanTypeEnum.Chunk));
> >> >> >> >> >> >> >>             while(chunks.hasNext()){
> >> >> >> >> >> >> >>                 Span chunk = chunks.next();
> >> >> >> >> >> >> >>                 Value<PhraseTag> phrase =
> >> >> >> >> >> >> >> chunk.getAnnotation(NlpAnnotations.PHRASE_ANNOTATION);
> >> >> >> >> >> >> >>                 if(phrase.value().getCategory() ==
> >> >> >> >> >> >> LexicalCategory.Noun){
> >> >> >> >> >> >> >>                     log.info(" - NounPhrase [{},{}]
> {}",
> >> >> new
> >> >> >> >> >> Object[]{
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> chunk.getStart(),chunk.getEnd(),chunk.getSpan()});
> >> >> >> >> >> >> >>                 }
> >> >> >> >> >> >> >>             }
> >> >> >> >> >> >> >>         }
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> hope this helps
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> best
> >> >> >> >> >> >> >> Rupert
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> [1]
> >> >> >> >> >> >> >>
> >> >> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >>
> >> >> >> >>
> >> >>
> >>
> http://stanbol.apache.org/docs/trunk/components/enhancer/nlp/analyzedtext
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> On Sun, Mar 9, 2014 at 6:07 PM, Cristian Petroaca
> >> >> >> >> >> >> >> <cr...@gmail.com> wrote:
> >> >> >> >> >> >> >> > I started to implement the engine and I'm having
> >> problems
> >> >> >> >> >> >> >> > with
> >> >> >> >> >> getting
> >> >> >> >> >> >> >> > results for noun phrases. I modified the "default"
> >> >> weighted
> >> >> >> >> chain
> >> >> >> >> >> to
> >> >> >> >> >> >> also
> >> >> >> >> >> >> >> > include the PosChunkerEngine and ran a sample text :
> >> >> "Angela
> >> >> >> >> Merkel
> >> >> >> >> >> >> >> visted
> >> >> >> >> >> >> >> > China. The german chancellor met with various
> people".
> >> I
> >> >> >> >> expected
> >> >> >> >> >> that
> >> >> >> >> >> >> >> the
> >> >> >> >> >> >> >> > RDF XML output would contain some info about the
> noun
> >> >> >> >> >> >> >> > phrases
> >> >> >> >> but I
> >> >> >> >> >> >> >> cannot
> >> >> >> >> >> >> >> > see any.
> >> >> >> >> >> >> >> > Could you point me to the correct way to generate
> the
> >> noun
> >> >> >> >> phrases?
> >> >> >> >> >> >> >> >
> >> >> >> >> >> >> >> > Thanks,
> >> >> >> >> >> >> >> > Cristian
> >> >> >> >> >> >> >> >
> >> >> >> >> >> >> >> >
> >> >> >> >> >> >> >> > 2014-02-09 14:15 GMT+02:00 Cristian Petroaca <
> >> >> >> >> >> >> >> cristian.petroaca@gmail.com>:
> >> >> >> >> >> >> >> >
> >> >> >> >> >> >> >> >> Opened
> >> >> https://issues.apache.org/jira/browse/STANBOL-1279
> >> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> >> 2014-02-07 10:53 GMT+02:00 Cristian Petroaca <
> >> >> >> >> >> >> >> cristian.petroaca@gmail.com>
> >> >> >> >> >> >> >> >> :
> >> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> >> Hi Rupert,
> >> >> >> >> >> >> >> >>>
> >> >> >> >> >> >> >> >>> The "spatial" dimension is a good idea. I'll also
> >> take a
> >> >> >> >> >> >> >> >>> look
> >> >> >> >> at
> >> >> >> >> >> >> Yago.
> >> >> >> >> >> >> >> >>>
> >> >> >> >> >> >> >> >>> I will create a Jira with what we talked about
> here.
> >> It
> >> >> >> >> >> >> >> >>> will
> >> >> >> >> >> >> probably
> >> >> >> >> >> >> >> >>> have just a draft-like description for now and
> will
> >> be
> >> >> >> >> >> >> >> >>> updated
> >> >> >> >> >> as I
> >> >> >> >> >> >> go
> >> >> >> >> >> >> >> >>> along.
> >> >> >> >> >> >> >> >>>
> >> >> >> >> >> >> >> >>> Thanks,
> >> >> >> >> >> >> >> >>> Cristian
> >> >> >> >> >> >> >> >>>
> >> >> >> >> >> >> >> >>>
> >> >> >> >> >> >> >> >>> 2014-02-06 15:39 GMT+02:00 Rupert Westenthaler <
> >> >> >> >> >> >> >> >>> rupert.westenthaler@gmail.com>:
> >> >> >> >> >> >> >> >>>
> >> >> >> >> >> >> >> >>> Hi Cristian,
> >> >> >> >> >> >> >> >>>>
> >> >> >> >> >> >> >> >>>> definitely an interesting approach. You should
> have
> >> a
> >> >> >> >> >> >> >> >>>> look at
> >> >> >> >> >> Yago2
> >> >> >> >> >> >> >> >>>> [1]. As far as I can remember the Yago taxonomy
> is
> >> much
> >> >> >> >> better
> >> >> >> >> >> >> >> >>>> structured as the one used by dbpedia. Mapping
> >> >> >> >> >> >> >> >>>> suggestions of
> >> >> >> >> >> >> dbpedia
> >> >> >> >> >> >> >> >>>> to concepts in Yago2 is easy as both dbpedia and
> >> yago2
> >> >> do
> >> >> >> >> >> provide
> >> >> >> >> >> >> >> >>>> mappings [2] and [3]
> >> >> >> >> >> >> >> >>>>
> >> >> >> >> >> >> >> >>>> > 2014-02-05 15:39 GMT+02:00 Rafa Haro
> >> >> >> >> >> >> >> >>>> > <rh...@apache.org>:
> >> >> >> >> >> >> >> >>>> >>
> >> >> >> >> >> >> >> >>>> >> "Microsoft posted its 2013 earnings. The
> >> Redmond's
> >> >> >> >> >> >> >> >>>> >> company
> >> >> >> >> >> made
> >> >> >> >> >> >> a
> >> >> >> >> >> >> >> >>>> >> huge profit".
> >> >> >> >> >> >> >> >>>>
> >> >> >> >> >> >> >> >>>> Thats actually a very good example. Spatial
> contexts
> >> >> are
> >> >> >> >> >> >> >> >>>> very
> >> >> >> >> >> >> >> >>>> important as they tend to be often used for
> >> >> referencing.
> >> >> >> >> >> >> >> >>>> So I
> >> >> >> >> >> would
> >> >> >> >> >> >> >> >>>> suggest to specially treat the spatial context.
> For
> >> >> >> >> >> >> >> >>>> spatial
> >> >> >> >> >> >> Entities
> >> >> >> >> >> >> >> >>>> (like a City) this is easy, but even for other
> >> (like a
> >> >> >> >> Person,
> >> >> >> >> >> >> >> >>>> Company) you could use relations to spatial
> entities
> >> >> >> >> >> >> >> >>>> define
> >> >> >> >> >> their
> >> >> >> >> >> >> >> >>>> spatial context. This context could than be used
> to
> >> >> >> >> >> >> >> >>>> correctly
> >> >> >> >> >> link
> >> >> >> >> >> >> >> >>>> "The Redmond's company" to "Microsoft".
> >> >> >> >> >> >> >> >>>>
> >> >> >> >> >> >> >> >>>> In addition I would suggest to use the "spatial"
> >> >> context
> >> >> >> >> >> >> >> >>>> of
> >> >> >> >> each
> >> >> >> >> >> >> >> >>>> entity (basically relation to entities that are
> >> cities,
> >> >> >> >> regions,
> >> >> >> >> >> >> >> >>>> countries) as a separate dimension, because those
> >> are
> >> >> >> >> >> >> >> >>>> very
> >> >> >> >> often
> >> >> >> >> >> >> used
> >> >> >> >> >> >> >> >>>> for coreferences.
> >> >> >> >> >> >> >> >>>>
> >> >> >> >> >> >> >> >>>> [1] http://www.mpi-inf.mpg.de/yago-naga/yago/
> >> >> >> >> >> >> >> >>>> [2]
> >> >> >> >> >> >> >> >>>>
> >> >> http://downloads.dbpedia.org/3.9/links/yago_links.nt.bz2
> >> >> >> >> >> >> >> >>>> [3]
> >> >> >> >> >> >> >> >>>>
> >> >> >> >> >> >> >>
> >> >> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >>
> >> >> >> >>
> >> >>
> >>
> http://www.mpi-inf.mpg.de/yago-naga/yago/download/yago/yagoDBpediaInstances.ttl.7z
> >> >> >> >> >> >> >> >>>>
> >> >> >> >> >> >> >> >>>>
> >> >> >> >> >> >> >> >>>> On Thu, Feb 6, 2014 at 10:33 AM, Cristian
> Petroaca
> >> >> >> >> >> >> >> >>>> <cr...@gmail.com> wrote:
> >> >> >> >> >> >> >> >>>> > There are several dbpedia categories for each
> >> entity,
> >> >> >> >> >> >> >> >>>> > in
> >> >> >> >> this
> >> >> >> >> >> >> case
> >> >> >> >> >> >> >> for
> >> >> >> >> >> >> >> >>>> > Microsoft we have :
> >> >> >> >> >> >> >> >>>> >
> >> >> >> >> >> >> >> >>>> > category:Companies_in_the_NASDAQ-100_Index
> >> >> >> >> >> >> >> >>>> > category:Microsoft
> >> >> >> >> >> >> >> >>>> >
> category:Software_companies_of_the_United_States
> >> >> >> >> >> >> >> >>>> >
> >> >> category:Software_companies_based_in_Washington_(state)
> >> >> >> >> >> >> >> >>>> > category:Companies_established_in_1975
> >> >> >> >> >> >> >> >>>> >
> category:1975_establishments_in_the_United_States
> >> >> >> >> >> >> >> >>>> > category:Companies_based_in_Redmond,_Washington
> >> >> >> >> >> >> >> >>>> >
> >> >> >> >> >> >>
> >> >> >> >> >> >>
> >> >> category:Multinational_companies_headquartered_in_the_United_States
> >> >> >> >> >> >> >> >>>> > category:Cloud_computing_providers
> >> >> >> >> >> >> >> >>>> >
> >> >> category:Companies_in_the_Dow_Jones_Industrial_Average
> >> >> >> >> >> >> >> >>>> >
> >> >> >> >> >> >> >> >>>> > So we also have "Companies based in
> >> >> Redmont,Washington"
> >> >> >> >> which
> >> >> >> >> >> >> could
> >> >> >> >> >> >> >> be
> >> >> >> >> >> >> >> >>>> > matched.
> >> >> >> >> >> >> >> >>>> >
> >> >> >> >> >> >> >> >>>> >
> >> >> >> >> >> >> >> >>>> > There is still other contextual information
> from
> >> >> >> >> >> >> >> >>>> > dbpedia
> >> >> >> >> which
> >> >> >> >> >> >> can
> >> >> >> >> >> >> >> be
> >> >> >> >> >> >> >> >>>> used.
> >> >> >> >> >> >> >> >>>> > For example for an Organization we could also
> >> >> include :
> >> >> >> >> >> >> >> >>>> > dbpprop:industry = Software
> >> >> >> >> >> >> >> >>>> > dbpprop:service = Online Service Providers
> >> >> >> >> >> >> >> >>>> >
> >> >> >> >> >> >> >> >>>> > and for a Person (that's for Barack Obama) :
> >> >> >> >> >> >> >> >>>> >
> >> >> >> >> >> >> >> >>>> > dbpedia-owl:profession:
> >> >> >> >> >> >> >> >>>> >                                dbpedia:Author
> >> >> >> >> >> >> >> >>>> >
> >> >> >> >> >> >> >> >>>> > dbpedia:Constitutional_law
> >> >> >> >> >> >> >> >>>> >                                dbpedia:Lawyer
> >> >> >> >> >> >> >> >>>> >
> >> >> >> >> >> >> >> >>>> > dbpedia:Community_organizing
> >> >> >> >> >> >> >> >>>> >
> >> >> >> >> >> >> >> >>>> > I'd like to continue investigating this as I
> think
> >> >> that
> >> >> >> >> >> >> >> >>>> > it
> >> >> >> >> may
> >> >> >> >> >> >> have
> >> >> >> >> >> >> >> >>>> some
> >> >> >> >> >> >> >> >>>> > value in increasing the number of coreference
> >> >> >> >> >> >> >> >>>> > resolutions
> >> >> >> >> and
> >> >> >> >> >> I'd
> >> >> >> >> >> >> >> like
> >> >> >> >> >> >> >> >>>> to
> >> >> >> >> >> >> >> >>>> > concentrate more on precision rather than
> recall
> >> >> since
> >> >> >> >> >> >> >> >>>> > we
> >> >> >> >> >> already
> >> >> >> >> >> >> >> have
> >> >> >> >> >> >> >> >>>> a
> >> >> >> >> >> >> >> >>>> > set of coreferences detected by the stanford
> nlp
> >> tool
> >> >> >> >> >> >> >> >>>> > and
> >> >> >> >> this
> >> >> >> >> >> >> would
> >> >> >> >> >> >> >> >>>> be as
> >> >> >> >> >> >> >> >>>> > an addition to that (at least this is how I
> would
> >> >> like
> >> >> >> >> >> >> >> >>>> > to
> >> >> >> >> use
> >> >> >> >> >> >> it).
> >> >> >> >> >> >> >> >>>> >
> >> >> >> >> >> >> >> >>>> > Is it ok if I track this by opening a jira? I
> >> could
> >> >> >> >> >> >> >> >>>> > update
> >> >> >> >> it
> >> >> >> >> >> to
> >> >> >> >> >> >> >> show
> >> >> >> >> >> >> >> >>>> my
> >> >> >> >> >> >> >> >>>> > progress and also my conclusions and if it
> turns
> >> out
> >> >> >> >> >> >> >> >>>> > that
> >> >> >> >> it
> >> >> >> >> >> was
> >> >> >> >> >> >> a
> >> >> >> >> >> >> >> bad
> >> >> >> >> >> >> >> >>>> idea
> >> >> >> >> >> >> >> >>>> > then that's the situation at least I'll end up
> >> with
> >> >> >> >> >> >> >> >>>> > more
> >> >> >> >> >> >> knowledge
> >> >> >> >> >> >> >> >>>> about
> >> >> >> >> >> >> >> >>>> > Stanbol in the end :).
> >> >> >> >> >> >> >> >>>> >
> >> >> >> >> >> >> >> >>>> >
> >> >> >> >> >> >> >> >>>> > 2014-02-05 15:39 GMT+02:00 Rafa Haro
> >> >> >> >> >> >> >> >>>> > <rh...@apache.org>:
> >> >> >> >> >> >> >> >>>> >
> >> >> >> >> >> >> >> >>>> >> Hi Cristian,
> >> >> >> >> >> >> >> >>>> >>
> >> >> >> >> >> >> >> >>>> >> The approach sounds nice. I don't want to be
> the
> >> >> >> >> >> >> >> >>>> >> devil's
> >> >> >> >> >> >> advocate
> >> >> >> >> >> >> >> but
> >> >> >> >> >> >> >> >>>> I'm
> >> >> >> >> >> >> >> >>>> >> just not sure about the recall using the
> dbpedia
> >> >> >> >> categories
> >> >> >> >> >> >> >> feature.
> >> >> >> >> >> >> >> >>>> For
> >> >> >> >> >> >> >> >>>> >> example, your sentence could be also
> "Microsoft
> >> >> posted
> >> >> >> >> >> >> >> >>>> >> its
> >> >> >> >> >> 2013
> >> >> >> >> >> >> >> >>>> earnings.
> >> >> >> >> >> >> >> >>>> >> The Redmond's company made a huge profit". So,
> >> maybe
> >> >> >> >> >> including
> >> >> >> >> >> >> more
> >> >> >> >> >> >> >> >>>> >> contextual information from dbpedia could
> >> increase
> >> >> the
> >> >> >> >> recall
> >> >> >> >> >> >> but
> >> >> >> >> >> >> >> of
> >> >> >> >> >> >> >> >>>> course
> >> >> >> >> >> >> >> >>>> >> will reduce the precision.
> >> >> >> >> >> >> >> >>>> >>
> >> >> >> >> >> >> >> >>>> >> Cheers,
> >> >> >> >> >> >> >> >>>> >> Rafa
> >> >> >> >> >> >> >> >>>> >>
> >> >> >> >> >> >> >> >>>> >> El 04/02/14 09:50, Cristian Petroaca escribió:
> >> >> >> >> >> >> >> >>>> >>
> >> >> >> >> >> >> >> >>>> >>  Back with a more detailed description of the
> >> steps
> >> >> >> >> >> >> >> >>>> >> for
> >> >> >> >> >> making
> >> >> >> >> >> >> this
> >> >> >> >> >> >> >> >>>> kind of
> >> >> >> >> >> >> >> >>>> >>> coreference work.
> >> >> >> >> >> >> >> >>>> >>>
> >> >> >> >> >> >> >> >>>> >>> I will be using references to the following
> >> text in
> >> >> >> >> >> >> >> >>>> >>> the
> >> >> >> >> >> steps
> >> >> >> >> >> >> >> below
> >> >> >> >> >> >> >> >>>> in
> >> >> >> >> >> >> >> >>>> >>> order to make things clearer : "Microsoft
> posted
> >> >> its
> >> >> >> >> >> >> >> >>>> >>> 2013
> >> >> >> >> >> >> >> earnings.
> >> >> >> >> >> >> >> >>>> The
> >> >> >> >> >> >> >> >>>> >>> software company made a huge profit."
> >> >> >> >> >> >> >> >>>> >>>
> >> >> >> >> >> >> >> >>>> >>> 1. For every noun phrase in the text which
> has :
> >> >> >> >> >> >> >> >>>> >>>      a. a determinate pos which implies
> >> reference
> >> >> to
> >> >> >> >> >> >> >> >>>> >>> an
> >> >> >> >> >> entity
> >> >> >> >> >> >> >> local
> >> >> >> >> >> >> >> >>>> to
> >> >> >> >> >> >> >> >>>> >>> the
> >> >> >> >> >> >> >> >>>> >>> text, such as "the, this, these") but not
> >> "another,
> >> >> >> >> every",
> >> >> >> >> >> etc
> >> >> >> >> >> >> >> which
> >> >> >> >> >> >> >> >>>> >>> implies a reference to an entity outside of
> the
> >> >> text.
> >> >> >> >> >> >> >> >>>> >>>      b. having at least another noun aside
> from
> >> the
> >> >> >> >> >> >> >> >>>> >>> main
> >> >> >> >> >> >> required
> >> >> >> >> >> >> >> >>>> noun
> >> >> >> >> >> >> >> >>>> >>> which
> >> >> >> >> >> >> >> >>>> >>> further describes it. For example I will not
> >> count
> >> >> >> >> >> >> >> >>>> >>> "The
> >> >> >> >> >> >> company"
> >> >> >> >> >> >> >> as
> >> >> >> >> >> >> >> >>>> being
> >> >> >> >> >> >> >> >>>> >>> a
> >> >> >> >> >> >> >> >>>> >>> legitimate candidate since this could create
> a
> >> lot
> >> >> of
> >> >> >> >> false
> >> >> >> >> >> >> >> >>>> positives by
> >> >> >> >> >> >> >> >>>> >>> considering the double meaning of some words
> >> such
> >> >> as
> >> >> >> >> >> >> >> >>>> >>> "in
> >> >> >> >> the
> >> >> >> >> >> >> >> company
> >> >> >> >> >> >> >> >>>> of
> >> >> >> >> >> >> >> >>>> >>> good people".
> >> >> >> >> >> >> >> >>>> >>> "The software company" is a good candidate
> >> since we
> >> >> >> >> >> >> >> >>>> >>> also
> >> >> >> >> >> have
> >> >> >> >> >> >> >> >>>> "software".
> >> >> >> >> >> >> >> >>>> >>>
> >> >> >> >> >> >> >> >>>> >>> 2. match the nouns in the noun phrase to the
> >> >> contents
> >> >> >> >> >> >> >> >>>> >>> of
> >> >> >> >> the
> >> >> >> >> >> >> >> dbpedia
> >> >> >> >> >> >> >> >>>> >>> categories of each named entity found prior
> to
> >> the
> >> >> >> >> location
> >> >> >> >> >> of
> >> >> >> >> >> >> the
> >> >> >> >> >> >> >> >>>> noun
> >> >> >> >> >> >> >> >>>> >>> phrase in the text.
> >> >> >> >> >> >> >> >>>> >>> The dbpedia categories are in the following
> >> format
> >> >> >> >> >> >> >> >>>> >>> (for
> >> >> >> >> >> >> Microsoft
> >> >> >> >> >> >> >> for
> >> >> >> >> >> >> >> >>>> >>> example) : "Software companies of the United
> >> >> States".
> >> >> >> >> >> >> >> >>>> >>>   So we try to match "software company" with
> >> that.
> >> >> >> >> >> >> >> >>>> >>> First, as you can see, the main noun in the
> >> dbpedia
> >> >> >> >> category
> >> >> >> >> >> >> has a
> >> >> >> >> >> >> >> >>>> plural
> >> >> >> >> >> >> >> >>>> >>> form and it's the same for all categories
> which
> >> I
> >> >> >> >> >> >> >> >>>> >>> saw. I
> >> >> >> >> >> don't
> >> >> >> >> >> >> >> know
> >> >> >> >> >> >> >> >>>> if
> >> >> >> >> >> >> >> >>>> >>> there's an easier way to do this but I
> thought
> >> of
> >> >> >> >> applying a
> >> >> >> >> >> >> >> >>>> lemmatizer on
> >> >> >> >> >> >> >> >>>> >>> the category and the noun phrase in order for
> >> them
> >> >> to
> >> >> >> >> have a
> >> >> >> >> >> >> >> common
> >> >> >> >> >> >> >> >>>> >>> denominator.This also works if the noun
> phrase
> >> >> itself
> >> >> >> >> has a
> >> >> >> >> >> >> plural
> >> >> >> >> >> >> >> >>>> form.
> >> >> >> >> >> >> >> >>>> >>>
> >> >> >> >> >> >> >> >>>> >>> Second, I'll need to use for comparison only
> the
> >> >> >> >> >> >> >> >>>> >>> words in
> >> >> >> >> >> the
> >> >> >> >> >> >> >> >>>> category
> >> >> >> >> >> >> >> >>>> >>> which are themselves nouns and not
> prepositions
> >> or
> >> >> >> >> >> determiners
> >> >> >> >> >> >> >> such
> >> >> >> >> >> >> >> >>>> as "of
> >> >> >> >> >> >> >> >>>> >>> the".This means that I need to pos tag the
> >> >> categories
> >> >> >> >> >> contents
> >> >> >> >> >> >> as
> >> >> >> >> >> >> >> >>>> well.
> >> >> >> >> >> >> >> >>>> >>> I was thinking of running the pos and lemma
> on
> >> the
> >> >> >> >> dbpedia
> >> >> >> >> >> >> >> >>>> categories when
> >> >> >> >> >> >> >> >>>> >>> building the dbpedia backed entity hub and
> >> storing
> >> >> >> >> >> >> >> >>>> >>> them
> >> >> >> >> for
> >> >> >> >> >> >> later
> >> >> >> >> >> >> >> >>>> use - I
> >> >> >> >> >> >> >> >>>> >>> don't know how feasible this is at the
> moment.
> >> >> >> >> >> >> >> >>>> >>>
> >> >> >> >> >> >> >> >>>> >>> After this I can compare each noun in the
> noun
> >> >> phrase
> >> >> >> >> with
> >> >> >> >> >> the
> >> >> >> >> >> >> >> >>>> equivalent
> >> >> >> >> >> >> >> >>>> >>> nouns in the categories and based on the
> number
> >> of
> >> >> >> >> matches I
> >> >> >> >> >> >> can
> >> >> >> >> >> >> >> >>>> create a
> >> >> >> >> >> >> >> >>>> >>> confidence level.
> >> >> >> >> >> >> >> >>>> >>>
> >> >> >> >> >> >> >> >>>> >>> 3. match the noun of the noun phrase with the
> >> >> >> >> >> >> >> >>>> >>> rdf:type
> >> >> >> >> from
> >> >> >> >> >> >> >> dbpedia
> >> >> >> >> >> >> >> >>>> of the
> >> >> >> >> >> >> >> >>>> >>> named entity. If this matches increase the
> >> >> confidence
> >> >> >> >> level.
> >> >> >> >> >> >> >> >>>> >>>
> >> >> >> >> >> >> >> >>>> >>> 4. If there are multiple named entities which
> >> can
> >> >> >> >> >> >> >> >>>> >>> match a
> >> >> >> >> >> >> certain
> >> >> >> >> >> >> >> >>>> noun
> >> >> >> >> >> >> >> >>>> >>> phrase then link the noun phrase with the
> >> closest
> >> >> >> >> >> >> >> >>>> >>> named
> >> >> >> >> >> entity
> >> >> >> >> >> >> >> prior
> >> >> >> >> >> >> >> >>>> to it
> >> >> >> >> >> >> >> >>>> >>> in the text.
> >> >> >> >> >> >> >> >>>> >>>
> >> >> >> >> >> >> >> >>>> >>> What do you think?
> >> >> >> >> >> >> >> >>>> >>>
> >> >> >> >> >> >> >> >>>> >>> Cristian
> >> >> >> >> >> >> >> >>>> >>>
> >> >> >> >> >> >> >> >>>> >>> 2014-01-31 Cristian Petroaca <
> >> >> >> >> cristian.petroaca@gmail.com>:
> >> >> >> >> >> >> >> >>>> >>>
> >> >> >> >> >> >> >> >>>> >>>  Hi Rafa,
> >> >> >> >> >> >> >> >>>> >>>>
> >> >> >> >> >> >> >> >>>> >>>> I don't yet have a concrete heursitic but
> I'm
> >> >> >> >> >> >> >> >>>> >>>> working on
> >> >> >> >> >> it.
> >> >> >> >> >> >> I'll
> >> >> >> >> >> >> >> >>>> provide
> >> >> >> >> >> >> >> >>>> >>>> it here so that you guys can give me a
> >> feedback on
> >> >> >> >> >> >> >> >>>> >>>> it.
> >> >> >> >> >> >> >> >>>> >>>>
> >> >> >> >> >> >> >> >>>> >>>> What are "locality" features?
> >> >> >> >> >> >> >> >>>> >>>>
> >> >> >> >> >> >> >> >>>> >>>> I looked at Bart and other coref tools such
> as
> >> >> >> >> >> >> >> >>>> >>>> ArkRef
> >> >> >> >> and
> >> >> >> >> >> >> >> >>>> CherryPicker
> >> >> >> >> >> >> >> >>>> >>>> and
> >> >> >> >> >> >> >> >>>> >>>> they don't provide such a coreference.
> >> >> >> >> >> >> >> >>>> >>>>
> >> >> >> >> >> >> >> >>>> >>>> Cristian
> >> >> >> >> >> >> >> >>>> >>>>
> >> >> >> >> >> >> >> >>>> >>>>
> >> >> >> >> >> >> >> >>>> >>>> 2014-01-30 Rafa Haro <rh...@apache.org>:
> >> >> >> >> >> >> >> >>>> >>>>
> >> >> >> >> >> >> >> >>>> >>>> Hi Cristian,
> >> >> >> >> >> >> >> >>>> >>>>
> >> >> >> >> >> >> >> >>>> >>>>> Without having more details about your
> >> concrete
> >> >> >> >> heuristic,
> >> >> >> >> >> >> in my
> >> >> >> >> >> >> >> >>>> honest
> >> >> >> >> >> >> >> >>>> >>>>> opinion, such approach could produce a lot
> of
> >> >> false
> >> >> >> >> >> >> positives. I
> >> >> >> >> >> >> >> >>>> don't
> >> >> >> >> >> >> >> >>>> >>>>> know
> >> >> >> >> >> >> >> >>>> >>>>> if you are planning to use some "locality"
> >> >> features
> >> >> >> >> >> >> >> >>>> >>>>> to
> >> >> >> >> >> detect
> >> >> >> >> >> >> >> such
> >> >> >> >> >> >> >> >>>> >>>>> coreferences but you need to take into
> account
> >> >> that
> >> >> >> >> >> >> >> >>>> >>>>> it
> >> >> >> >> is
> >> >> >> >> >> >> quite
> >> >> >> >> >> >> >> >>>> usual
> >> >> >> >> >> >> >> >>>> >>>>> that
> >> >> >> >> >> >> >> >>>> >>>>> coreferenced mentions can occurs even in
> >> >> different
> >> >> >> >> >> >> paragraphs.
> >> >> >> >> >> >> >> >>>> Although
> >> >> >> >> >> >> >> >>>> >>>>> I'm
> >> >> >> >> >> >> >> >>>> >>>>> not an expert in Natural Language
> >> Understanding,
> >> >> I
> >> >> >> >> would
> >> >> >> >> >> say
> >> >> >> >> >> >> it
> >> >> >> >> >> >> >> is
> >> >> >> >> >> >> >> >>>> quite
> >> >> >> >> >> >> >> >>>> >>>>> difficult to get decent precision/recall
> rates
> >> >> for
> >> >> >> >> >> >> coreferencing
> >> >> >> >> >> >> >> >>>> using
> >> >> >> >> >> >> >> >>>> >>>>> fixed rules. Maybe you can give a try to
> >> others
> >> >> >> >> >> >> >> >>>> >>>>> tools
> >> >> >> >> like
> >> >> >> >> >> >> BART
> >> >> >> >> >> >> >> (
> >> >> >> >> >> >> >> >>>> >>>>> http://www.bart-coref.org/).
> >> >> >> >> >> >> >> >>>> >>>>>
> >> >> >> >> >> >> >> >>>> >>>>> Cheers,
> >> >> >> >> >> >> >> >>>> >>>>> Rafa Haro
> >> >> >> >> >> >> >> >>>> >>>>>
> >> >> >> >> >> >> >> >>>> >>>>> El 30/01/14 10:33, Cristian Petroaca
> escribió:
> >> >> >> >> >> >> >> >>>> >>>>>
> >> >> >> >> >> >> >> >>>> >>>>>   Hi,
> >> >> >> >> >> >> >> >>>> >>>>>
> >> >> >> >> >> >> >> >>>> >>>>>> One of the necessary steps for
> implementing
> >> the
> >> >> >> >> >> >> >> >>>> >>>>>> Event
> >> >> >> >> >> >> >> extraction
> >> >> >> >> >> >> >> >>>> Engine
> >> >> >> >> >> >> >> >>>> >>>>>> feature :
> >> >> >> >> >> >> https://issues.apache.org/jira/browse/STANBOL-1121is
> >> >> >> >> >> >> >> >>>> to
> >> >> >> >> >> >> >> >>>> >>>>>> have
> >> >> >> >> >> >> >> >>>> >>>>>> coreference resolution in the given text.
> >> This
> >> >> is
> >> >> >> >> >> provided
> >> >> >> >> >> >> now
> >> >> >> >> >> >> >> >>>> via the
> >> >> >> >> >> >> >> >>>> >>>>>> stanford-nlp project but as far as I saw
> this
> >> >> >> >> >> >> >> >>>> >>>>>> module
> >> >> >> >> is
> >> >> >> >> >> >> >> performing
> >> >> >> >> >> >> >> >>>> >>>>>> mostly
> >> >> >> >> >> >> >> >>>> >>>>>> pronomial (He, She) or nominal (Barack
> Obama
> >> and
> >> >> >> >> >> >> >> >>>> >>>>>> Mr.
> >> >> >> >> >> Obama)
> >> >> >> >> >> >> >> >>>> coreference
> >> >> >> >> >> >> >> >>>> >>>>>> resolution.
> >> >> >> >> >> >> >> >>>> >>>>>>
> >> >> >> >> >> >> >> >>>> >>>>>> In order to get more coreferences from the
> >> text
> >> >> I
> >> >> >> >> though
> >> >> >> >> >> of
> >> >> >> >> >> >> >> >>>> creating
> >> >> >> >> >> >> >> >>>> >>>>>> some
> >> >> >> >> >> >> >> >>>> >>>>>> logic that would detect this kind of
> >> >> coreference :
> >> >> >> >> >> >> >> >>>> >>>>>> "Apple reaches new profit heights. The
> >> software
> >> >> >> >> company
> >> >> >> >> >> just
> >> >> >> >> >> >> >> >>>> announced
> >> >> >> >> >> >> >> >>>> >>>>>> its
> >> >> >> >> >> >> >> >>>> >>>>>> 2013 earnings."
> >> >> >> >> >> >> >> >>>> >>>>>> Here "The software company" obviously
> refers
> >> to
> >> >> >> >> "Apple".
> >> >> >> >> >> >> >> >>>> >>>>>> So I'd like to detect coreferences of
> Named
> >> >> >> >> >> >> >> >>>> >>>>>> Entities
> >> >> >> >> >> which
> >> >> >> >> >> >> are
> >> >> >> >> >> >> >> of
> >> >> >> >> >> >> >> >>>> the
> >> >> >> >> >> >> >> >>>> >>>>>> rdf:type of the Named Entity , in this
> case
> >> >> >> >> >> >> >> >>>> >>>>>> "company"
> >> >> >> >> and
> >> >> >> >> >> >> also
> >> >> >> >> >> >> >> >>>> have
> >> >> >> >> >> >> >> >>>> >>>>>> attributes which can be found in the
> dbpedia
> >> >> >> >> categories
> >> >> >> >> >> of
> >> >> >> >> >> >> the
> >> >> >> >> >> >> >> >>>> named
> >> >> >> >> >> >> >> >>>> >>>>>> entity, in this case "software".
> >> >> >> >> >> >> >> >>>> >>>>>>
> >> >> >> >> >> >> >> >>>> >>>>>> The detection of coreferences such as "The
> >> >> >> >> >> >> >> >>>> >>>>>> software
> >> >> >> >> >> >> company" in
> >> >> >> >> >> >> >> >>>> the
> >> >> >> >> >> >> >> >>>> >>>>>> text
> >> >> >> >> >> >> >> >>>> >>>>>> would also be done by either using the new
> >> Pos
> >> >> Tag
> >> >> >> >> Based
> >> >> >> >> >> >> Phrase
> >> >> >> >> >> >> >> >>>> >>>>>> extraction
> >> >> >> >> >> >> >> >>>> >>>>>> Engine (noun phrases) or by using a
> >> dependency
> >> >> >> >> >> >> >> >>>> >>>>>> tree of
> >> >> >> >> >> the
> >> >> >> >> >> >> >> >>>> sentence and
> >> >> >> >> >> >> >> >>>> >>>>>> picking up only subjects or objects.
> >> >> >> >> >> >> >> >>>> >>>>>>
> >> >> >> >> >> >> >> >>>> >>>>>> At this point I'd like to know if this
> kind
> >> of
> >> >> >> >> >> >> >> >>>> >>>>>> logic
> >> >> >> >> >> would
> >> >> >> >> >> >> be
> >> >> >> >> >> >> >> >>>> useful
> >> >> >> >> >> >> >> >>>> >>>>>> as a
> >> >> >> >> >> >> >> >>>> >>>>>> separate Enhancement Engine (in case the
> >> >> precision
> >> >> >> >> >> >> >> >>>> >>>>>> and
> >> >> >> >> >> >> recall
> >> >> >> >> >> >> >> are
> >> >> >> >> >> >> >> >>>> good
> >> >> >> >> >> >> >> >>>> >>>>>> enough) in Stanbol?
> >> >> >> >> >> >> >> >>>> >>>>>>
> >> >> >> >> >> >> >> >>>> >>>>>> Thanks,
> >> >> >> >> >> >> >> >>>> >>>>>> Cristian
> >> >> >> >> >> >> >> >>>> >>>>>>
> >> >> >> >> >> >> >> >>>> >>>>>>
> >> >> >> >> >> >> >> >>>> >>>>>>
> >> >> >> >> >> >> >> >>>> >>
> >> >> >> >> >> >> >> >>>>
> >> >> >> >> >> >> >> >>>>
> >> >> >> >> >> >> >> >>>>
> >> >> >> >> >> >> >> >>>> --
> >> >> >> >> >> >> >> >>>> | Rupert Westenthaler
> >> >> >> >> rupert.westenthaler@gmail.com
> >> >> >> >> >> >> >> >>>> | Bodenlehenstraße 11
> >> >> >> >> >> >> ++43-699-11108907
> >> >> >> >> >> >> >> >>>> | A-5500 Bischofshofen
> >> >> >> >> >> >> >> >>>>
> >> >> >> >> >> >> >> >>>
> >> >> >> >> >> >> >> >>>
> >> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> --
> >> >> >> >> >> >> >> | Rupert Westenthaler
> >> >> >> >> >> >> >> rupert.westenthaler@gmail.com
> >> >> >> >> >> >> >> | Bodenlehenstraße 11
> >> >> >> >> ++43-699-11108907
> >> >> >> >> >> >> >> | A-5500 Bischofshofen
> >> >> >> >> >> >> >>
> >> >> >> >> >> >>
> >> >> >> >> >> >>
> >> >> >> >> >> >>
> >> >> >> >> >> >> --
> >> >> >> >> >> >> | Rupert Westenthaler
> >> >> rupert.westenthaler@gmail.com
> >> >> >> >> >> >> | Bodenlehenstraße 11
> >> >> >> >> >> >> ++43-699-11108907
> >> >> >> >> >> >> | A-5500 Bischofshofen
> >> >> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >> >> --
> >> >> >> >> >> | Rupert Westenthaler
> >> rupert.westenthaler@gmail.com
> >> >> >> >> >> | Bodenlehenstraße 11
> >> >> ++43-699-11108907
> >> >> >> >> >> | A-5500 Bischofshofen
> >> >> >> >> >>
> >> >> >> >>
> >> >> >> >>
> >> >> >> >>
> >> >> >> >> --
> >> >> >> >> | Rupert Westenthaler
> rupert.westenthaler@gmail.com
> >> >> >> >> | Bodenlehenstraße 11
> >> ++43-699-11108907
> >> >> >> >> | A-5500 Bischofshofen
> >> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >> --
> >> >> >> | Rupert Westenthaler             rupert.westenthaler@gmail.com
> >> >> >> | Bodenlehenstraße 11
> ++43-699-11108907
> >> >> >> | A-5500 Bischofshofen
> >> >> >
> >> >> >
> >> >>
> >> >>
> >> >>
> >> >> --
> >> >> | Rupert Westenthaler             rupert.westenthaler@gmail.com
> >> >> | Bodenlehenstraße 11                             ++43-699-11108907
> >> >> | A-5500 Bischofshofen
> >> >>
> >>
> >>
> >>
> >> --
> >> | Rupert Westenthaler             rupert.westenthaler@gmail.com
> >> | Bodenlehenstraße 11                             ++43-699-11108907
> >> | A-5500 Bischofshofen
> >>
>
>
>
> --
> | Rupert Westenthaler             rupert.westenthaler@gmail.com
> | Bodenlehenstraße 11                             ++43-699-11108907
> | A-5500 Bischofshofen
>

Re: Named entity coref resolution based on dbpedia categories and rdf:type

Posted by Rupert Westenthaler <ru...@gmail.com>.
Hi Cristian,

I have been seeing the same problem since last Friday. The solution
mentioned in [1] works for me:

    mvn -Djsse.enableSNIExtension=false {goals}

No idea why https connections to GitHub currently cause this; I could
not find anything related via Google. So I suggest using the system
property for now. If the problem persists we can adapt the build files
accordingly.
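
If you don't want to append the property to every invocation, here is a
quick sketch of the same workaround set once via MAVEN_OPTS (assuming a
bash-like shell; on Windows 'set MAVEN_OPTS=...' does the same):

    export MAVEN_OPTS="$MAVEN_OPTS -Djsse.enableSNIExtension=false"
    # subsequent builds pick the property up from the environment
    mvn clean install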

best
Rupert




[1] http://stackoverflow.com/questions/7615645/ssl-handshake-alert-unrecognized-name-error-since-upgrade-to-java-1-7-0

On Mon, Mar 24, 2014 at 7:01 PM, Cristian Petroaca
<cr...@gmail.com> wrote:
> I did a clean on the whole project and now I wanted to do another "mvn
> clean install" but I am getting this :
>
> "[INFO]
> ------------------------------------------------------------------------
> [ERROR] Failed to execute goal org.apache.maven.plugins:maven-antrun-plugin:1.6:run
> (download) on project org.apache.stanbol.data.opennlp.lang.es: An Ant
> BuildException has occured: The following error occurred while executing this line:
> [ERROR] C:\Data\Projects\Stanbol\main\data\opennlp\lang\es\download_models.xml:33:
> Failed to copy
> https://github.com/utcompling/OpenNLP-Models/raw/58ef0c60031403e66e47ae35edaf58d3478b67af/models/es/opennlp-es-maxent-pos-es.bin
> to
> C:\Data\Projects\Stanbol\main\data\opennlp\lang\es\downloads\resources\org\apache\stanbol\data\opennlp\es-pos-maxent.bin
> due to javax.net.ssl.SSLProtocolException handshake alert : unrecognized_name"
>
>
>
> 2014-03-20 11:25 GMT+02:00 Rupert Westenthaler <
> rupert.westenthaler@gmail.com>:
>
>> Hi Cristian,
>>
>> On Thu, Mar 20, 2014 at 10:00 AM, Cristian Petroaca
>> <cr...@gmail.com> wrote:
>> >
>> stanbol.enhancer.chain.weighted.chain=["tika;optional","langdetect","opennlp-sentence","opennlp-token","opennlp-pos","opennlp-ner","dbpediaLinking","entityhubExtraction","dbpedia-dereference","pos-chunker"]
>> > service.ranking=I"-2147483648"
>> > stanbol.enhancer.chain.name="default"
>>
>> Does look fine to me. Do you see any exception during the startup of
>> the launcher. Can you check the status of this component in the
>> component tab of the felix web console [1] (search for
>> "org.apache.stanbol.enhancer.chain.weighted.impl.WeightedChain"). If
>> you have multiple you can find the correct one by comparing the
>> "Properties" with those in the configuration file.
>>
>> I guess that the according service is in the 'unsatisfied' as you do
>> not see it in the web interface. But if this is the case you should
>> also see the according exception in the log. You can also manually
>> stop/start the component. In this case the exception should be
>> re-thrown and you do not need to search the log for it.
>>
>> best
>> Rupert
>>
>>
>> [1] http://localhost:8080/system/console/components
>>
>> >
>> >
>> >
>> > 2014-03-20 7:39 GMT+02:00 Rupert Westenthaler <
>> rupert.westenthaler@gmail.com
>> >>:
>> >
>> >> Hi Cristian,
>> >>
>> >> you can not send attachments to the list. Please copy the contents
>> >> directly to the mail
>> >>
>> >> thx
>> >> Rupert
>> >>
>> >> On Wed, Mar 19, 2014 at 9:20 PM, Cristian Petroaca
>> >> <cr...@gmail.com> wrote:
>> >> > The config attached.
>> >> >
>> >> >
>> >> > 2014-03-19 9:09 GMT+02:00 Rupert Westenthaler
>> >> > <ru...@gmail.com>:
>> >> >
>> >> >> Hi Cristian,
>> >> >>
>> >> >> can you provide the contents of the chain after your modifications?
>> >> >> Would be interesting to test why the chain is no longer active after
>> >> >> the restart.
>> >> >>
>> >> >> You can find the config file in the 'stanbol/fileinstall' folder.
>> >> >>
>> >> >> best
>> >> >> Rupert
>> >> >>
>> >> >> On Tue, Mar 18, 2014 at 8:24 PM, Cristian Petroaca
>> >> >> <cr...@gmail.com> wrote:
>> >> >> > Related to the default chain selection rules : before restart I
>> had a
>> >> >> > chain
>> >> >> > with the name 'default' as in I could access it via
>> >> >> > enhancer/chain/default.
>> >> >> > Then I just added another engine to the 'default' chain. I assumed
>> >> that
>> >> >> > after the restart the chain with the 'default' name would be
>> >> persisted.
>> >> >> > So
>> >> >> > the first rule should have been applied after the restart as well.
>> But
>> >> >> > instead I cannot reach it via enhancer/chain/default anymore so its
>> >> >> > gone.
>> >> >> > Anyway, this is not a big deal, it's not blocking me in any way, I
>> >> just
>> >> >> > wanted to understand where the problem is.
>> >> >> >
>> >> >> >
>> >> >> > 2014-03-18 7:15 GMT+02:00 Rupert Westenthaler
>> >> >> > <rupert.westenthaler@gmail.com
>> >> >> >>:
>> >> >> >
>> >> >> >> Hi Cristian
>> >> >> >>
>> >> >> >> On Mon, Mar 17, 2014 at 9:43 PM, Cristian Petroaca
>> >> >> >> <cr...@gmail.com> wrote:
>> >> >> >> > 1. Updated to the latest code and it's gone. Cool
>> >> >> >> >
>> >> >> >> > 2. I start the stable launcher -> create a new instance of the
>> >> >> >> > PosChunkerEngine -> add it to the default chain. At this point
>> >> >> >> > everything
>> >> >> >> > looks good and works ok.
>> >> >> >> > After I restart the server the default chain is gone and
>> instead I
>> >> >> >> > see
>> >> >> >> this
>> >> >> >> > in the enhancement chains page : all-active (default, id: 149,
>> >> >> >> > ranking:
>> >> >> >> 0,
>> >> >> >> > impl: AllActiveEnginesChain ). all-active did not contain the
>> >> >> >> > 'default'
>> >> >> >> > word before the restart.
>> >> >> >> >
>> >> >> >>
>> >> >> >> Please note the default chain selection rules as described at [1].
>> >> You
>> >> >> >> can also access chains chains under '/enhancer/chain/{chain-name}'
>> >> >> >>
>> >> >> >> best
>> >> >> >> Rupert
>> >> >> >>
>> >> >> >> [1]
>> >> >> >>
>> >> >> >>
>> >>
>> http://stanbol.staging.apache.org/docs/trunk/components/enhancer/chains/#default-chain
>> >> >> >>
>> >> >> >> > It looks like the config files are exactly what I need. Thanks.
>> >> >> >> >
>> >> >> >> >
>> >> >> >> > 2014-03-17 9:26 GMT+02:00 Rupert Westenthaler <
>> >> >> >> rupert.westenthaler@gmail.com
>> >> >> >> >>:
>> >> >> >> >
>> >> >> >> >> On Sat, Mar 15, 2014 at 8:34 PM, Cristian Petroaca
>> >> >> >> >> <cr...@gmail.com> wrote:
>> >> >> >> >> > Thanks Rupert.
>> >> >> >> >> >
>> >> >> >> >> > A couple more questions/issues :
>> >> >> >> >> >
>> >> >> >> >> > 1. Whenever I start the stanbol server I'm seeing this in the
>> >> >> >> >> > console
>> >> >> >> >> > output :
>> >> >> >> >> >
>> >> >> >> >>
>> >> >> >> >> This should be fixed with STANBOL-1278 [1] [2]
>> >> >> >> >>
>> >> >> >> >> > 2. Whenever I restart the server the Weighted Chains get
>> messed
>> >> >> >> >> > up. I
>> >> >> >> >> > usually use the 'default' chain and add my engine to it so
>> there
>> >> >> >> >> > are
>> >> >> >> 11
>> >> >> >> >> > engines in it. After the restart this chain now contains
>> around
>> >> 23
>> >> >> >> >> engines
>> >> >> >> >> > in total.
>> >> >> >> >>
>> >> >> >> >> I was not able to replicate this. What I tried was
>> >> >> >> >>
>> >> >> >> >> (1) start up the stable launcher
>> >> >> >> >> (2) add an additional engine to the default chain
>> >> >> >> >> (3) restart the launcher
>> >> >> >> >>
>> >> >> >> >> The default chain was not changed after (2) and (3). So I would
>> >> need
>> >> >> >> >> further information for knowing why this is happening.
>> >> >> >> >>
>> >> >> >> >> Generally it is better to create you own chain instance as
>> >> modifying
>> >> >> >> >> one that is provided by the default configuration. I would also
>> >> >> >> >> recommend that you keep your test configuration in text files
>> and
>> >> to
>> >> >> >> >> copy those to the 'stanbol/fileinstall' folder. Doing so
>> prevent
>> >> you
>> >> >> >> >> from manually entering the configuration after a software
>> update.
>> >> >> >> >> The
>> >> >> >> >> production-mode section [3] provides information on how to do
>> >> that.
>> >> >> >> >>
>> >> >> >> >> best
>> >> >> >> >> Rupert
>> >> >> >> >>
>> >> >> >> >> [1] https://issues.apache.org/jira/browse/STANBOL-1278
>> >> >> >> >> [2] http://svn.apache.org/r1576623
>> >> >> >> >> [3] http://stanbol.apache.org/docs/trunk/production-mode
>> >> >> >> >>
>> >> >> >> >> > ERROR: Bundle org.apache.stanbol.enhancer.engine.topic.web
>> >> [153]:
>> >> >> >> Error
>> >> >> >> >> > starting
>> >> >> >> >> >
>> >> >> >> >>
>> >> >> >>
>> >> >> >>
>> >>
>> slinginstall:c:\Data\Projects\Stanbol\main\launchers\stable\target\stanbol\star
>> >> >> >> >> >
>> >> >> >> >> >
>> >> tup\35\org.apache.stanbol.enhancer.engine.topic.web-1.0.0-SNAPSHOT.jar
>> >> >> >> >> > (org.osgi
>> >> >> >> >> > .framework.BundleException: Unresolved constraint in bundle
>> >> >> >> >> > org.apache.stanbol.e
>> >> >> >> >> > nhancer.engine.topic.web [153]: Unable to resolve 153.0:
>> missing
>> >> >> >> >> > requirement [15
>> >> >> >> >> > 3.0] package; (&(package=javax.ws.rs
>> >> >> >> >> )(version>=0.0.0)(!(version>=2.0.0))))
>> >> >> >> >> > org.osgi.framework.BundleException: Unresolved constraint in
>> >> >> >> >> > bundle
>> >> >> >> >> > org.apache.s
>> >> >> >> >> > tanbol.enhancer.engine.topic.web [153]: Unable to resolve
>> 153.0:
>> >> >> >> missing
>> >> >> >> >> > require
>> >> >> >> >> > ment [153.0] package; (&(package=javax.ws.rs
>> >> >> >> >> > )(version>=0.0.0)(!(version>=2.0.0))
>> >> >> >> >> > )
>> >> >> >> >> >         at
>> >> >> >> >> org.apache.felix.framework.Felix.resolveBundle(Felix.java:3443)
>> >> >> >> >> >         at
>> >> >> >> org.apache.felix.framework.Felix.startBundle(Felix.java:1727)
>> >> >> >> >> >         at
>> >> >> >> >> >
>> >> >> >> >> >
>> >> org.apache.felix.framework.Felix.setActiveStartLevel(Felix.java:1156)
>> >> >> >> >> >
>> >> >> >> >> >         at
>> >> >> >> >> >
>> >> >> >> >> >
>> >> org.apache.felix.framework.StartLevelImpl.run(StartLevelImpl.java:264
>> >> >> >> >> > )
>> >> >> >> >> >         at java.lang.Thread.run(Unknown Source)
>> >> >> >> >> >
>> >> >> >> >> > Despite of this the server starts fine and I can use the
>> >> enhancer
>> >> >> >> fine.
>> >> >> >> >> Do
>> >> >> >> >> > you guys see this as well?
>> >> >> >> >> >
>> >> >> >> >> >
>> >> >> >> >> > 2. Whenever I restart the server the Weighted Chains get
>> messed
>> >> >> >> >> > up. I
>> >> >> >> >> > usually use the 'default' chain and add my engine to it so
>> there
>> >> >> >> >> > are
>> >> >> >> 11
>> >> >> >> >> > engines in it. After the restart this chain now contains
>> around
>> >> 23
>> >> >> >> >> engines
>> >> >> >> >> > in total.
>> >> >> >> >> >
>> >> >> >> >> >
>> >> >> >> >> >
>> >> >> >> >> >
>> >> >> >> >> > 2014-03-11 9:47 GMT+02:00 Rupert Westenthaler <
>> >> >> >> >> rupert.westenthaler@gmail.com
>> >> >> >> >> >>:
>> >> >> >> >> >
>> >> >> >> >> >> Hi Cristian,
>> >> >> >> >> >>
>> >> >> >> >> >> NER Annotations are typically available as both
>> >> >> >> >> >> NlpAnnotations.NER_ANNOTATION and  fise:TextAnnotation [1]
>> in
>> >> the
>> >> >> >> >> >> enhancement metadata. As you are already accessing the
>> >> >> >> >> >> AnayzedText I
>> >> >> >> >> >> would prefer using the  NlpAnnotations.NER_ANNOTATION.
>> >> >> >> >> >>
>> >> >> >> >> >> best
>> >> >> >> >> >> Rupert
>> >> >> >> >> >>
>> >> >> >> >> >> [1]
>> >> >> >> >> >>
>> >> >> >> >>
>> >> >> >>
>> >> >> >>
>> >>
>> http://stanbol.apache.org/docs/trunk/components/enhancer/enhancementstructure.html#fisetextannotation
>> >> >> >> >> >>
>> >> >> >> >> >> On Mon, Mar 10, 2014 at 10:07 PM, Cristian Petroaca
>> >> >> >> >> >> <cr...@gmail.com> wrote:
>> >> >> >> >> >> > Thanks.
>> >> >> >> >> >> > I assume I should get the Named entities using the same
>> but
>> >> >> >> >> >> > with
>> >> >> >> >> >> > NlpAnnotations.NER_ANNOTATION?
>> >> >> >> >> >> >
>> >> >> >> >> >> >
>> >> >> >> >> >> >
>> >> >> >> >> >> > 2014-03-10 13:29 GMT+02:00 Rupert Westenthaler <
>> >> >> >> >> >> > rupert.westenthaler@gmail.com>:
>> >> >> >> >> >> >
>> >> >> >> >> >> >> Hallo Cristian,
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> NounPhrases are not added to the RDF enhancement results.
>> >> You
>> >> >> >> need to
>> >> >> >> >> >> >> use the AnalyzedText ContentPart [1]
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> here is some demo code you can use in the
>> computeEnhancement
>> >> >> >> method
>> >> >> >> >> >> >>
>> >> >> >> >> >> >>         AnalysedText at =
>> >> >> >> >> >> >> NlpEngineHelper.getAnalysedText(this,
>> >> >> >> ci,
>> >> >> >> >> >> true);
>> >> >> >> >> >> >>         Iterator<? extends Section> sections =
>> >> >> >> >> >> >> at.getSentences();
>> >> >> >> >> >> >>         if(!sections.hasNext()){ //process as single
>> >> sentence
>> >> >> >> >> >> >>             sections =
>> Collections.singleton(at).iterator();
>> >> >> >> >> >> >>         }
>> >> >> >> >> >> >>
>> >> >> >> >> >> >>         while(sections.hasNext()){
>> >> >> >> >> >> >>             Section section = sections.next();
>> >> >> >> >> >> >>             Iterator<Span> chunks =
>> >> >> >> >> >> >> section.getEnclosed(EnumSet.of(SpanTypeEnum.Chunk));
>> >> >> >> >> >> >>             while(chunks.hasNext()){
>> >> >> >> >> >> >>                 Span chunk = chunks.next();
>> >> >> >> >> >> >>                 Value<PhraseTag> phrase =
>> >> >> >> >> >> >> chunk.getAnnotation(NlpAnnotations.PHRASE_ANNOTATION);
>> >> >> >> >> >> >>                 if(phrase.value().getCategory() ==
>> >> >> >> >> >> LexicalCategory.Noun){
>> >> >> >> >> >> >>                     log.info(" - NounPhrase [{},{}] {}",
>> >> new
>> >> >> >> >> Object[]{
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> chunk.getStart(),chunk.getEnd(),chunk.getSpan()});
>> >> >> >> >> >> >>                 }
>> >> >> >> >> >> >>             }
>> >> >> >> >> >> >>         }
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> hope this helps
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> best
>> >> >> >> >> >> >> Rupert
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> [1]
>> >> >> >> >> >> >>
>> >> >> >> >> >>
>> >> >> >> >>
>> >> >> >>
>> >> >> >>
>> >>
>> http://stanbol.apache.org/docs/trunk/components/enhancer/nlp/analyzedtext
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> On Sun, Mar 9, 2014 at 6:07 PM, Cristian Petroaca
>> >> >> >> >> >> >> <cr...@gmail.com> wrote:
>> >> >> >> >> >> >> > I started to implement the engine and I'm having
>> problems
>> >> >> >> >> >> >> > with
>> >> >> >> >> getting
>> >> >> >> >> >> >> > results for noun phrases. I modified the "default"
>> >> weighted
>> >> >> >> chain
>> >> >> >> >> to
>> >> >> >> >> >> also
>> >> >> >> >> >> >> > include the PosChunkerEngine and ran a sample text :
>> >> "Angela
>> >> >> >> Merkel
>> >> >> >> >> >> >> visted
>> >> >> >> >> >> >> > China. The german chancellor met with various people".
>> I
>> >> >> >> expected
>> >> >> >> >> that
>> >> >> >> >> >> >> the
>> >> >> >> >> >> >> > RDF XML output would contain some info about the noun
>> >> >> >> >> >> >> > phrases
>> >> >> >> but I
>> >> >> >> >> >> >> cannot
>> >> >> >> >> >> >> > see any.
>> >> >> >> >> >> >> > Could you point me to the correct way to generate the
>> noun
>> >> >> >> phrases?
>> >> >> >> >> >> >> >
>> >> >> >> >> >> >> > Thanks,
>> >> >> >> >> >> >> > Cristian
>> >> >> >> >> >> >> >
>> >> >> >> >> >> >> >
>> >> >> >> >> >> >> > 2014-02-09 14:15 GMT+02:00 Cristian Petroaca <
>> >> >> >> >> >> >> cristian.petroaca@gmail.com>:
>> >> >> >> >> >> >> >
>> >> >> >> >> >> >> >> Opened
>> >> https://issues.apache.org/jira/browse/STANBOL-1279
>> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >> 2014-02-07 10:53 GMT+02:00 Cristian Petroaca <
>> >> >> >> >> >> >> cristian.petroaca@gmail.com>
>> >> >> >> >> >> >> >> :
>> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >> Hi Rupert,
>> >> >> >> >> >> >> >>>
>> >> >> >> >> >> >> >>> The "spatial" dimension is a good idea. I'll also
>> take a
>> >> >> >> >> >> >> >>> look
>> >> >> >> at
>> >> >> >> >> >> Yago.
>> >> >> >> >> >> >> >>>
>> >> >> >> >> >> >> >>> I will create a Jira with what we talked about here.
>> It
>> >> >> >> >> >> >> >>> will
>> >> >> >> >> >> probably
>> >> >> >> >> >> >> >>> have just a draft-like description for now and will
>> be
>> >> >> >> >> >> >> >>> updated
>> >> >> >> >> as I
>> >> >> >> >> >> go
>> >> >> >> >> >> >> >>> along.
>> >> >> >> >> >> >> >>>
>> >> >> >> >> >> >> >>> Thanks,
>> >> >> >> >> >> >> >>> Cristian
>> >> >> >> >> >> >> >>>
>> >> >> >> >> >> >> >>>
>> >> >> >> >> >> >> >>> 2014-02-06 15:39 GMT+02:00 Rupert Westenthaler <
>> >> >> >> >> >> >> >>> rupert.westenthaler@gmail.com>:
>> >> >> >> >> >> >> >>>
>> >> >> >> >> >> >> >>> Hi Cristian,
>> >> >> >> >> >> >> >>>>
>> >> >> >> >> >> >> >>>> definitely an interesting approach. You should have
>> a
>> >> >> >> >> >> >> >>>> look at
>> >> >> >> >> Yago2
>> >> >> >> >> >> >> >>>> [1]. As far as I can remember the Yago taxonomy is
>> much
>> >> >> >> better
>> >> >> >> >> >> >> >>>> structured as the one used by dbpedia. Mapping
>> >> >> >> >> >> >> >>>> suggestions of
>> >> >> >> >> >> dbpedia
>> >> >> >> >> >> >> >>>> to concepts in Yago2 is easy as both dbpedia and
>> yago2
>> >> do
>> >> >> >> >> provide
>> >> >> >> >> >> >> >>>> mappings [2] and [3]
>> >> >> >> >> >> >> >>>>
>> >> >> >> >> >> >> >>>> > 2014-02-05 15:39 GMT+02:00 Rafa Haro
>> >> >> >> >> >> >> >>>> > <rh...@apache.org>:
>> >> >> >> >> >> >> >>>> >>
>> >> >> >> >> >> >> >>>> >> "Microsoft posted its 2013 earnings. The
>> Redmond's
>> >> >> >> >> >> >> >>>> >> company
>> >> >> >> >> made
>> >> >> >> >> >> a
>> >> >> >> >> >> >> >>>> >> huge profit".
>> >> >> >> >> >> >> >>>>
>> >> >> >> >> >> >> >>>> Thats actually a very good example. Spatial contexts
>> >> are
>> >> >> >> >> >> >> >>>> very
>> >> >> >> >> >> >> >>>> important as they tend to be often used for
>> >> referencing.
>> >> >> >> >> >> >> >>>> So I
>> >> >> >> >> would
>> >> >> >> >> >> >> >>>> suggest to specially treat the spatial context. For
>> >> >> >> >> >> >> >>>> spatial
>> >> >> >> >> >> Entities
>> >> >> >> >> >> >> >>>> (like a City) this is easy, but even for other
>> (like a
>> >> >> >> Person,
>> >> >> >> >> >> >> >>>> Company) you could use relations to spatial entities
>> >> >> >> >> >> >> >>>> define
>> >> >> >> >> their
>> >> >> >> >> >> >> >>>> spatial context. This context could than be used to
>> >> >> >> >> >> >> >>>> correctly
>> >> >> >> >> link
>> >> >> >> >> >> >> >>>> "The Redmond's company" to "Microsoft".
>> >> >> >> >> >> >> >>>>
>> >> >> >> >> >> >> >>>> In addition I would suggest to use the "spatial"
>> >> context
>> >> >> >> >> >> >> >>>> of
>> >> >> >> each
>> >> >> >> >> >> >> >>>> entity (basically relation to entities that are
>> cities,
>> >> >> >> regions,
>> >> >> >> >> >> >> >>>> countries) as a separate dimension, because those
>> are
>> >> >> >> >> >> >> >>>> very
>> >> >> >> often
>> >> >> >> >> >> used
>> >> >> >> >> >> >> >>>> for coreferences.
>> >> >> >> >> >> >> >>>>
>> >> >> >> >> >> >> >>>> [1] http://www.mpi-inf.mpg.de/yago-naga/yago/
>> >> >> >> >> >> >> >>>> [2]
>> >> >> >> >> >> >> >>>>
>> >> http://downloads.dbpedia.org/3.9/links/yago_links.nt.bz2
>> >> >> >> >> >> >> >>>> [3]
>> >> >> >> >> >> >> >>>>
>> >> >> >> >> >> >>
>> >> >> >> >> >>
>> >> >> >> >>
>> >> >> >>
>> >> >> >>
>> >>
>> http://www.mpi-inf.mpg.de/yago-naga/yago/download/yago/yagoDBpediaInstances.ttl.7z
>> >> >> >> >> >> >> >>>>
>> >> >> >> >> >> >> >>>>
>> >> >> >> >> >> >> >>>> On Thu, Feb 6, 2014 at 10:33 AM, Cristian Petroaca
>> >> >> >> >> >> >> >>>> <cr...@gmail.com> wrote:
>> >> >> >> >> >> >> >>>> > There are several dbpedia categories for each
>> entity,
>> >> >> >> >> >> >> >>>> > in
>> >> >> >> this
>> >> >> >> >> >> case
>> >> >> >> >> >> >> for
>> >> >> >> >> >> >> >>>> > Microsoft we have :
>> >> >> >> >> >> >> >>>> >
>> >> >> >> >> >> >> >>>> > category:Companies_in_the_NASDAQ-100_Index
>> >> >> >> >> >> >> >>>> > category:Microsoft
>> >> >> >> >> >> >> >>>> > category:Software_companies_of_the_United_States
>> >> >> >> >> >> >> >>>> >
>> >> category:Software_companies_based_in_Washington_(state)
>> >> >> >> >> >> >> >>>> > category:Companies_established_in_1975
>> >> >> >> >> >> >> >>>> > category:1975_establishments_in_the_United_States
>> >> >> >> >> >> >> >>>> > category:Companies_based_in_Redmond,_Washington
>> >> >> >> >> >> >> >>>> >
>> >> >> >> >> >>
>> >> >> >> >> >>
>> >> category:Multinational_companies_headquartered_in_the_United_States
>> >> >> >> >> >> >> >>>> > category:Cloud_computing_providers
>> >> >> >> >> >> >> >>>> >
>> >> category:Companies_in_the_Dow_Jones_Industrial_Average
>> >> >> >> >> >> >> >>>> >
>> >> >> >> >> >> >> >>>> > So we also have "Companies based in
>> >> Redmont,Washington"
>> >> >> >> which
>> >> >> >> >> >> could
>> >> >> >> >> >> >> be
>> >> >> >> >> >> >> >>>> > matched.
>> >> >> >> >> >> >> >>>> >
>> >> >> >> >> >> >> >>>> >
>> >> >> >> >> >> >> >>>> > There is still other contextual information from
>> >> >> >> >> >> >> >>>> > dbpedia
>> >> >> >> which
>> >> >> >> >> >> can
>> >> >> >> >> >> >> be
>> >> >> >> >> >> >> >>>> used.
>> >> >> >> >> >> >> >>>> > For example for an Organization we could also
>> >> include :
>> >> >> >> >> >> >> >>>> > dbpprop:industry = Software
>> >> >> >> >> >> >> >>>> > dbpprop:service = Online Service Providers
>> >> >> >> >> >> >> >>>> >
>> >> >> >> >> >> >> >>>> > and for a Person (that's for Barack Obama) :
>> >> >> >> >> >> >> >>>> >
>> >> >> >> >> >> >> >>>> > dbpedia-owl:profession:
>> >> >> >> >> >> >> >>>> >                                dbpedia:Author
>> >> >> >> >> >> >> >>>> >
>> >> >> >> >> >> >> >>>> > dbpedia:Constitutional_law
>> >> >> >> >> >> >> >>>> >                                dbpedia:Lawyer
>> >> >> >> >> >> >> >>>> >
>> >> >> >> >> >> >> >>>> > dbpedia:Community_organizing
>> >> >> >> >> >> >> >>>> >
>> >> >> >> >> >> >> >>>> > I'd like to continue investigating this as I think
>> >> that
>> >> >> >> >> >> >> >>>> > it
>> >> >> >> may
>> >> >> >> >> >> have
>> >> >> >> >> >> >> >>>> some
>> >> >> >> >> >> >> >>>> > value in increasing the number of coreference
>> >> >> >> >> >> >> >>>> > resolutions
>> >> >> >> and
>> >> >> >> >> I'd
>> >> >> >> >> >> >> like
>> >> >> >> >> >> >> >>>> to
>> >> >> >> >> >> >> >>>> > concentrate more on precision rather than recall
>> >> since
>> >> >> >> >> >> >> >>>> > we
>> >> >> >> >> already
>> >> >> >> >> >> >> have
>> >> >> >> >> >> >> >>>> a
>> >> >> >> >> >> >> >>>> > set of coreferences detected by the stanford nlp
>> tool
>> >> >> >> >> >> >> >>>> > and
>> >> >> >> this
>> >> >> >> >> >> would
>> >> >> >> >> >> >> >>>> be as
>> >> >> >> >> >> >> >>>> > an addition to that (at least this is how I would
>> >> like
>> >> >> >> >> >> >> >>>> > to
>> >> >> >> use
>> >> >> >> >> >> it).
>> >> >> >> >> >> >> >>>> >
>> >> >> >> >> >> >> >>>> > Is it ok if I track this by opening a jira? I
>> could
>> >> >> >> >> >> >> >>>> > update
>> >> >> >> it
>> >> >> >> >> to
>> >> >> >> >> >> >> show
>> >> >> >> >> >> >> >>>> my
>> >> >> >> >> >> >> >>>> > progress and also my conclusions and if it turns
>> out
>> >> >> >> >> >> >> >>>> > that
>> >> >> >> it
>> >> >> >> >> was
>> >> >> >> >> >> a
>> >> >> >> >> >> >> bad
>> >> >> >> >> >> >> >>>> idea
>> >> >> >> >> >> >> >>>> > then that's the situation at least I'll end up
>> with
>> >> >> >> >> >> >> >>>> > more
>> >> >> >> >> >> knowledge
>> >> >> >> >> >> >> >>>> about
>> >> >> >> >> >> >> >>>> > Stanbol in the end :).
>> >> >> >> >> >> >> >>>> >
>> >> >> >> >> >> >> >>>> >



-- 
| Rupert Westenthaler             rupert.westenthaler@gmail.com
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen

Re: Named entity coref resolution based on dbpedia categories and rdf:type

Posted by Cristian Petroaca <cr...@gmail.com>.
I did a clean on the whole project and now I wanted to do another "mvn
clean install", but I am getting this:

"[INFO]
------------------------------------------------------------------------
[ERROR] Failed to execute goal
org.apache.maven.plugins:maven-antrun-plugin:1.6:
run (download) on project org.apache.stanbol.data.opennlp.lang.es: An Ant
BuildE
xception has occured: The following error occurred while executing this
line:
[ERROR]
C:\Data\Projects\Stanbol\main\data\opennlp\lang\es\download_models.xml:3
3: Failed to copy
https://github.com/utcompling/OpenNLP-Models/raw/58ef0c6003140
3e66e47ae35edaf58d3478b67af/models/es/opennlp-es-maxent-pos-es.bin to
C:\Data\Pr
ojects\Stanbol\main\data\opennlp\lang\es\downloads\resources\org\apache\stanbol\
data\opennlp\es-pos-maxent.bin due to javax.net.ssl.SSLProtocolException
handshake alert : unrecognized_name"
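
(Note: the "unrecognized_name" handshake alert is typically the JDK 7 SNI
issue with the server hosting the model files. Assuming that is the cause
here, a workaround that is often suggested is to disable the SNI extension
for the Maven JVM before the build, e.g. on Windows:

    set MAVEN_OPTS=-Djsse.enableSNIExtension=false
    mvn clean install

This is a possible workaround only, not a verified fix for this particular
build.)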



2014-03-20 11:25 GMT+02:00 Rupert Westenthaler <rupert.westenthaler@gmail.com>:

> Hi Cristian,
>
> On Thu, Mar 20, 2014 at 10:00 AM, Cristian Petroaca
> <cr...@gmail.com> wrote:
> >
> stanbol.enhancer.chain.weighted.chain=["tika;optional","langdetect","opennlp-sentence","opennlp-token","opennlp-pos","opennlp-ner","dbpediaLinking","entityhubExtraction","dbpedia-dereference","pos-chunker"]
> > service.ranking=I"-2147483648"
> > stanbol.enhancer.chain.name="default"
>
> That does look fine to me. Do you see any exception during the startup
> of the launcher? Can you check the status of this component in the
> Components tab of the Felix web console [1] (search for
> "org.apache.stanbol.enhancer.chain.weighted.impl.WeightedChain")? If
> there are multiple instances you can find the correct one by comparing
> the "Properties" with those in the configuration file.
>
> I guess that the corresponding service is in the 'unsatisfied' state, as
> you do not see it in the web interface. If this is the case you should
> also see the corresponding exception in the log. You can also manually
> stop/start the component; in that case the exception should be re-thrown
> and you do not need to search the log for it.
>
> best
> Rupert
>
>
> [1] http://localhost:8080/system/console/components
>

Re: Named entity coref resolution based on dbpedia categories and rdf:type

Posted by Rupert Westenthaler <ru...@gmail.com>.
Hi Cristian,

On Thu, Mar 20, 2014 at 10:00 AM, Cristian Petroaca
<cr...@gmail.com> wrote:
> stanbol.enhancer.chain.weighted.chain=["tika;optional","langdetect","opennlp-sentence","opennlp-token","opennlp-pos","opennlp-ner","dbpediaLinking","entityhubExtraction","dbpedia-dereference","pos-chunker"]
> service.ranking=I"-2147483648"
> stanbol.enhancer.chain.name="default"

That does look fine to me. Do you see any exception during the startup of
the launcher? Can you check the status of this component in the Components
tab of the Felix web console [1] (search for
"org.apache.stanbol.enhancer.chain.weighted.impl.WeightedChain")? If there
are multiple instances you can find the correct one by comparing the
"Properties" with those in the configuration file.

I guess that the corresponding service is in the 'unsatisfied' state, as
you do not see it in the web interface. If this is the case you should
also see the corresponding exception in the log. You can also manually
stop/start the component; in that case the exception should be re-thrown
and you do not need to search the log for it.

best
Rupert


[1] http://localhost:8080/system/console/components
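
As a side note: such a chain configuration can also be kept as a plain text
config file in the 'stanbol/fileinstall' folder (see the production-mode
docs), so it does not have to be re-entered after a software update. A
minimal sketch, assuming the WeightedChain PID above and an illustrative
file name such as
org.apache.stanbol.enhancer.chain.weighted.impl.WeightedChain-default.config:

    stanbol.enhancer.chain.name="default"
    stanbol.enhancer.chain.weighted.chain=["tika;optional","langdetect","opennlp-sentence","opennlp-token","opennlp-pos","opennlp-ner","dbpediaLinking","entityhubExtraction","dbpedia-dereference","pos-chunker"]
    service.ranking=I"-2147483648"

The exact file naming convention should be checked against the fileinstall
documentation.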

-- 
| Rupert Westenthaler             rupert.westenthaler@gmail.com
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen

Re: Named entity coref resolution based on dbpedia categories and rdf:type

Posted by Cristian Petroaca <cr...@gmail.com>.
stanbol.enhancer.chain.weighted.chain=["tika;optional","langdetect","opennlp-sentence","opennlp-token","opennlp-pos","opennlp-ner","dbpediaLinking","entityhubExtraction","dbpedia-dereference","pos-chunker"]
service.ranking=I"-2147483648"
stanbol.enhancer.chain.name="default"
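
As a quick sanity check after a restart, something along these lines could
be used to see whether the chain still answers. This is just a sketch: it
assumes the default launcher on localhost:8080 and that the enhancer accepts
plain text POSTed to /enhancer/chain/{chain-name}; the ChainCheck class name
and the sample sentence are made up for illustration.

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class ChainCheck {
    public static void main(String[] args) throws Exception {
        // POST a small text sample to the chain endpoint and print the
        // HTTP status code. Chain name and port are assumptions.
        URL url = new URL("http://localhost:8080/enhancer/chain/default");
        HttpURLConnection con = (HttpURLConnection) url.openConnection();
        con.setRequestMethod("POST");
        con.setDoOutput(true);
        con.setRequestProperty("Content-Type", "text/plain; charset=UTF-8");
        con.setRequestProperty("Accept", "application/rdf+xml");
        byte[] body = "Angela Merkel visited China."
                .getBytes(StandardCharsets.UTF_8);
        try (OutputStream out = con.getOutputStream()) {
            out.write(body);
        }
        // A 200 means the chain answered with enhancement results; a 404
        // would suggest the chain with that name is no longer active.
        System.out.println("HTTP " + con.getResponseCode());
    }
}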



2014-03-20 7:39 GMT+02:00 Rupert Westenthaler <rupert.westenthaler@gmail.com
>:

> Hi Cristian,
>
> you can not send attachments to the list. Please copy the contents
> directly to the mail
>
> thx
> Rupert
>
> On Wed, Mar 19, 2014 at 9:20 PM, Cristian Petroaca
> <cr...@gmail.com> wrote:
> > The config attached.
> >
> >
> > 2014-03-19 9:09 GMT+02:00 Rupert Westenthaler
> > <ru...@gmail.com>:
> >
> >> Hi Cristian,
> >>
> >> can you provide the contents of the chain after your modifications?
> >> Would be interesting to test why the chain is no longer active after
> >> the restart.
> >>
> >> You can find the config file in the 'stanbol/fileinstall' folder.
> >>
> >> best
> >> Rupert
> >>
> >> On Tue, Mar 18, 2014 at 8:24 PM, Cristian Petroaca
> >> <cr...@gmail.com> wrote:
> >> > Related to the default chain selection rules : before restart I had a
> >> > chain
> >> > with the name 'default' as in I could access it via
> >> > enhancer/chain/default.
> >> > Then I just added another engine to the 'default' chain. I assumed
> that
> >> > after the restart the chain with the 'default' name would be
> persisted.
> >> > So
> >> > the first rule should have been applied after the restart as well. But
> >> > instead I cannot reach it via enhancer/chain/default anymore so its
> >> > gone.
> >> > Anyway, this is not a big deal, it's not blocking me in any way, I
> just
> >> > wanted to understand where the problem is.
> >> >
> >> >
> >> > 2014-03-18 7:15 GMT+02:00 Rupert Westenthaler
> >> > <rupert.westenthaler@gmail.com
> >> >>:
> >> >
> >> >> Hi Cristian
> >> >>
> >> >> On Mon, Mar 17, 2014 at 9:43 PM, Cristian Petroaca
> >> >> <cr...@gmail.com> wrote:
> >> >> > 1. Updated to the latest code and it's gone. Cool
> >> >> >
> >> >> > 2. I start the stable launcher -> create a new instance of the
> >> >> > PosChunkerEngine -> add it to the default chain. At this point
> >> >> > everything
> >> >> > looks good and works ok.
> >> >> > After I restart the server the default chain is gone and instead I
> >> >> > see
> >> >> this
> >> >> > in the enhancement chains page : all-active (default, id: 149,
> >> >> > ranking:
> >> >> 0,
> >> >> > impl: AllActiveEnginesChain ). all-active did not contain the
> >> >> > 'default'
> >> >> > word before the restart.
> >> >> >
> >> >>
> >> >> Please note the default chain selection rules as described at [1].
> You
> >> >> can also access chains chains under '/enhancer/chain/{chain-name}'
> >> >>
> >> >> best
> >> >> Rupert
> >> >>
> >> >> [1]
> >> >>
> >> >>
> http://stanbol.staging.apache.org/docs/trunk/components/enhancer/chains/#default-chain
> >> >>
> >> >> > It looks like the config files are exactly what I need. Thanks.
> >> >> >
> >> >> >
> >> >> > 2014-03-17 9:26 GMT+02:00 Rupert Westenthaler <
> >> >> rupert.westenthaler@gmail.com
> >> >> >>:
> >> >> >
> >> >> >> On Sat, Mar 15, 2014 at 8:34 PM, Cristian Petroaca
> >> >> >> <cr...@gmail.com> wrote:
> >> >> >> > Thanks Rupert.
> >> >> >> >
> >> >> >> > A couple more questions/issues :
> >> >> >> >
> >> >> >> > 1. Whenever I start the stanbol server I'm seeing this in the
> >> >> >> > console
> >> >> >> > output :
> >> >> >> >
> >> >> >>
> >> >> >> This should be fixed with STANBOL-1278 [1] [2]
> >> >> >>
> >> >> >> > 2. Whenever I restart the server the Weighted Chains get messed
> >> >> >> > up. I
> >> >> >> > usually use the 'default' chain and add my engine to it so there
> >> >> >> > are
> >> >> 11
> >> >> >> > engines in it. After the restart this chain now contains around
> 23
> >> >> >> engines
> >> >> >> > in total.
> >> >> >>
> >> >> >> I was not able to replicate this. What I tried was
> >> >> >>
> >> >> >> (1) start up the stable launcher
> >> >> >> (2) add an additional engine to the default chain
> >> >> >> (3) restart the launcher
> >> >> >>
> >> >> >> The default chain was not changed after (2) and (3). So I would
> need
> >> >> >> further information for knowing why this is happening.
> >> >> >>
> >> >> >> Generally it is better to create you own chain instance as
> modifying
> >> >> >> one that is provided by the default configuration. I would also
> >> >> >> recommend that you keep your test configuration in text files and
> to
> >> >> >> copy those to the 'stanbol/fileinstall' folder. Doing so prevent
> you
> >> >> >> from manually entering the configuration after a software update.
> >> >> >> The
> >> >> >> production-mode section [3] provides information on how to do
> that.
> >> >> >>
> >> >> >> best
> >> >> >> Rupert
> >> >> >>
> >> >> >> [1] https://issues.apache.org/jira/browse/STANBOL-1278
> >> >> >> [2] http://svn.apache.org/r1576623
> >> >> >> [3] http://stanbol.apache.org/docs/trunk/production-mode
> >> >> >>
> >> >> >> > ERROR: Bundle org.apache.stanbol.enhancer.engine.topic.web
> [153]:
> >> >> Error
> >> >> >> > starting
> >> >> >> >
> >> >> >>
> >> >>
> >> >>
> slinginstall:c:\Data\Projects\Stanbol\main\launchers\stable\target\stanbol\star
> >> >> >> >
> >> >> >> >
> tup\35\org.apache.stanbol.enhancer.engine.topic.web-1.0.0-SNAPSHOT.jar
> >> >> >> > (org.osgi
> >> >> >> > .framework.BundleException: Unresolved constraint in bundle
> >> >> >> > org.apache.stanbol.e
> >> >> >> > nhancer.engine.topic.web [153]: Unable to resolve 153.0: missing
> >> >> >> > requirement [15
> >> >> >> > 3.0] package; (&(package=javax.ws.rs
> >> >> >> )(version>=0.0.0)(!(version>=2.0.0))))
> >> >> >> > org.osgi.framework.BundleException: Unresolved constraint in
> >> >> >> > bundle
> >> >> >> > org.apache.s
> >> >> >> > tanbol.enhancer.engine.topic.web [153]: Unable to resolve 153.0:
> >> >> missing
> >> >> >> > require
> >> >> >> > ment [153.0] package; (&(package=javax.ws.rs
> >> >> >> > )(version>=0.0.0)(!(version>=2.0.0))
> >> >> >> > )
> >> >> >> >         at
> >> >> >> org.apache.felix.framework.Felix.resolveBundle(Felix.java:3443)
> >> >> >> >         at
> >> >> org.apache.felix.framework.Felix.startBundle(Felix.java:1727)
> >> >> >> >         at
> >> >> >> >
> >> >> >> >
> org.apache.felix.framework.Felix.setActiveStartLevel(Felix.java:1156)
> >> >> >> >
> >> >> >> >         at
> >> >> >> >
> >> >> >> >
> org.apache.felix.framework.StartLevelImpl.run(StartLevelImpl.java:264
> >> >> >> > )
> >> >> >> >         at java.lang.Thread.run(Unknown Source)
> >> >> >> >
> >> >> >> > Despite of this the server starts fine and I can use the
> enhancer
> >> >> fine.
> >> >> >> Do
> >> >> >> > you guys see this as well?
> >> >> >> >
> >> >> >> >
> >> >> >> > 2. Whenever I restart the server the Weighted Chains get messed
> >> >> >> > up. I
> >> >> >> > usually use the 'default' chain and add my engine to it so there
> >> >> >> > are
> >> >> 11
> >> >> >> > engines in it. After the restart this chain now contains around
> 23
> >> >> >> engines
> >> >> >> > in total.
> >> >> >> >
> >> >> >> >
> >> >> >> >
> >> >> >> >
> >> >> >> > 2014-03-11 9:47 GMT+02:00 Rupert Westenthaler <
> >> >> >> rupert.westenthaler@gmail.com
> >> >> >> >>:
> >> >> >> >
> >> >> >> >> Hi Cristian,
> >> >> >> >>
> >> >> >> >> NER Annotations are typically available as both
> >> >> >> >> NlpAnnotations.NER_ANNOTATION and  fise:TextAnnotation [1] in
> the
> >> >> >> >> enhancement metadata. As you are already accessing the
> >> >> >> >> AnayzedText I
> >> >> >> >> would prefer using the  NlpAnnotations.NER_ANNOTATION.
> >> >> >> >>
> >> >> >> >> best
> >> >> >> >> Rupert
> >> >> >> >>
> >> >> >> >> [1]
> >> >> >> >>
> >> >> >>
> >> >>
> >> >>
> http://stanbol.apache.org/docs/trunk/components/enhancer/enhancementstructure.html#fisetextannotation
> >> >> >> >>
> >> >> >> >> On Mon, Mar 10, 2014 at 10:07 PM, Cristian Petroaca
> >> >> >> >> <cr...@gmail.com> wrote:
> >> >> >> >> > Thanks.
> >> >> >> >> > I assume I should get the Named entities using the same but
> >> >> >> >> > with
> >> >> >> >> > NlpAnnotations.NER_ANNOTATION?
> >> >> >> >> >
> >> >> >> >> >
> >> >> >> >> >
> >> >> >> >> > 2014-03-10 13:29 GMT+02:00 Rupert Westenthaler <
> >> >> >> >> > rupert.westenthaler@gmail.com>:
> >> >> >> >> >
> >> >> >> >> >> Hallo Cristian,
> >> >> >> >> >>
> >> >> >> >> >> NounPhrases are not added to the RDF enhancement results.
> You
> >> >> need to
> >> >> >> >> >> use the AnalyzedText ContentPart [1]
> >> >> >> >> >>
> >> >> >> >> >> here is some demo code you can use in the computeEnhancement
> >> >> method
> >> >> >> >> >>
> >> >> >> >> >>         AnalysedText at =
> >> >> >> >> >> NlpEngineHelper.getAnalysedText(this,
> >> >> ci,
> >> >> >> >> true);
> >> >> >> >> >>         Iterator<? extends Section> sections =
> >> >> >> >> >> at.getSentences();
> >> >> >> >> >>         if(!sections.hasNext()){ //process as single
> sentence
> >> >> >> >> >>             sections = Collections.singleton(at).iterator();
> >> >> >> >> >>         }
> >> >> >> >> >>
> >> >> >> >> >>         while(sections.hasNext()){
> >> >> >> >> >>             Section section = sections.next();
> >> >> >> >> >>             Iterator<Span> chunks =
> >> >> >> >> >> section.getEnclosed(EnumSet.of(SpanTypeEnum.Chunk));
> >> >> >> >> >>             while(chunks.hasNext()){
> >> >> >> >> >>                 Span chunk = chunks.next();
> >> >> >> >> >>                 Value<PhraseTag> phrase =
> >> >> >> >> >> chunk.getAnnotation(NlpAnnotations.PHRASE_ANNOTATION);
> >> >> >> >> >>                 if(phrase.value().getCategory() ==
> >> >> >> >> LexicalCategory.Noun){
> >> >> >> >> >>                     log.info(" - NounPhrase [{},{}] {}",
> new
> >> >> >> Object[]{
> >> >> >> >> >>
> >> >> >> >> >> chunk.getStart(),chunk.getEnd(),chunk.getSpan()});
> >> >> >> >> >>                 }
> >> >> >> >> >>             }
> >> >> >> >> >>         }
> >> >> >> >> >>
> >> >> >> >> >> hope this helps
> >> >> >> >> >>
> >> >> >> >> >> best
> >> >> >> >> >> Rupert
> >> >> >> >> >>
> >> >> >> >> >> [1]
> >> >> >> >> >>
> >> >> >> >>
> >> >> >>
> >> >>
> >> >>
> http://stanbol.apache.org/docs/trunk/components/enhancer/nlp/analyzedtext
> >> >> >> >> >>
> >> >> >> >> >> On Sun, Mar 9, 2014 at 6:07 PM, Cristian Petroaca
> >> >> >> >> >> <cr...@gmail.com> wrote:
> >> >> >> >> >> > I started to implement the engine and I'm having problems
> >> >> >> >> >> > with
> >> >> >> getting
> >> >> >> >> >> > results for noun phrases. I modified the "default"
> weighted
> >> >> chain
> >> >> >> to
> >> >> >> >> also
> >> >> >> >> >> > include the PosChunkerEngine and ran a sample text :
> "Angela
> >> >> Merkel
> >> >> >> >> >> visted
> >> >> >> >> >> > China. The german chancellor met with various people". I
> >> >> expected
> >> >> >> that
> >> >> >> >> >> the
> >> >> >> >> >> > RDF XML output would contain some info about the noun
> >> >> >> >> >> > phrases
> >> >> but I
> >> >> >> >> >> cannot
> >> >> >> >> >> > see any.
> >> >> >> >> >> > Could you point me to the correct way to generate the noun
> >> >> phrases?
> >> >> >> >> >> >
> >> >> >> >> >> > Thanks,
> >> >> >> >> >> > Cristian
> >> >> >> >> >> >
> >> >> >> >> >> >
> >> >> >> >> >> > 2014-02-09 14:15 GMT+02:00 Cristian Petroaca <
> >> >> >> >> >> cristian.petroaca@gmail.com>:
> >> >> >> >> >> >
> >> >> >> >> >> >> Opened
> https://issues.apache.org/jira/browse/STANBOL-1279
> >> >> >> >> >> >>
> >> >> >> >> >> >>
> >> >> >> >> >> >> 2014-02-07 10:53 GMT+02:00 Cristian Petroaca <
> >> >> >> >> >> cristian.petroaca@gmail.com>
> >> >> >> >> >> >> :
> >> >> >> >> >> >>
> >> >> >> >> >> >> Hi Rupert,
> >> >> >> >> >> >>>
> >> >> >> >> >> >>> The "spatial" dimension is a good idea. I'll also take a
> >> >> >> >> >> >>> look
> >> >> at
> >> >> >> >> Yago.
> >> >> >> >> >> >>>
> >> >> >> >> >> >>> I will create a Jira with what we talked about here. It
> >> >> >> >> >> >>> will
> >> >> >> >> probably
> >> >> >> >> >> >>> have just a draft-like description for now and will be
> >> >> >> >> >> >>> updated
> >> >> >> as I
> >> >> >> >> go
> >> >> >> >> >> >>> along.
> >> >> >> >> >> >>>
> >> >> >> >> >> >>> Thanks,
> >> >> >> >> >> >>> Cristian
> >> >> >> >> >> >>>
> >> >> >> >> >> >>>
> >> >> >> >> >> >>> 2014-02-06 15:39 GMT+02:00 Rupert Westenthaler <
> >> >> >> >> >> >>> rupert.westenthaler@gmail.com>:
> >> >> >> >> >> >>>
> >> >> >> >> >> >>> Hi Cristian,
> >> >> >> >> >> >>>>
> >> >> >> >> >> >>>> definitely an interesting approach. You should have a
> >> >> >> >> >> >>>> look at
> >> >> >> Yago2
> >> >> >> >> >> >>>> [1]. As far as I can remember the Yago taxonomy is much
> >> >> better
> >> >> >> >> >> >>>> structured as the one used by dbpedia. Mapping
> >> >> >> >> >> >>>> suggestions of
> >> >> >> >> dbpedia
> >> >> >> >> >> >>>> to concepts in Yago2 is easy as both dbpedia and yago2
> do
> >> >> >> provide
> >> >> >> >> >> >>>> mappings [2] and [3]
> >> >> >> >> >> >>>>
> >> >> >> >> >> >>>> > 2014-02-05 15:39 GMT+02:00 Rafa Haro
> >> >> >> >> >> >>>> > <rh...@apache.org>:
> >> >> >> >> >> >>>> >>
> >> >> >> >> >> >>>> >> "Microsoft posted its 2013 earnings. The Redmond's
> >> >> >> >> >> >>>> >> company
> >> >> >> made
> >> >> >> >> a
> >> >> >> >> >> >>>> >> huge profit".
> >> >> >> >> >> >>>>
> >> >> >> >> >> >>>> Thats actually a very good example. Spatial contexts
> are
> >> >> >> >> >> >>>> very
> >> >> >> >> >> >>>> important as they tend to be often used for
> referencing.
> >> >> >> >> >> >>>> So I
> >> >> >> would
> >> >> >> >> >> >>>> suggest to specially treat the spatial context. For
> >> >> >> >> >> >>>> spatial
> >> >> >> >> Entities
> >> >> >> >> >> >>>> (like a City) this is easy, but even for other (like a
> >> >> Person,
> >> >> >> >> >> >>>> Company) you could use relations to spatial entities
> >> >> >> >> >> >>>> define
> >> >> >> their
> >> >> >> >> >> >>>> spatial context. This context could than be used to
> >> >> >> >> >> >>>> correctly
> >> >> >> link
> >> >> >> >> >> >>>> "The Redmond's company" to "Microsoft".
> >> >> >> >> >> >>>>
> >> >> >> >> >> >>>> In addition I would suggest to use the "spatial"
> context
> >> >> >> >> >> >>>> of
> >> >> each
> >> >> >> >> >> >>>> entity (basically relation to entities that are cities,
> >> >> regions,
> >> >> >> >> >> >>>> countries) as a separate dimension, because those are
> >> >> >> >> >> >>>> very
> >> >> often
> >> >> >> >> used
> >> >> >> >> >> >>>> for coreferences.
> >> >> >> >> >> >>>>
> >> >> >> >> >> >>>> [1] http://www.mpi-inf.mpg.de/yago-naga/yago/
> >> >> >> >> >> >>>> [2]
> >> >> >> >> >> >>>>
> http://downloads.dbpedia.org/3.9/links/yago_links.nt.bz2
> >> >> >> >> >> >>>> [3]
> >> >> >> >> >> >>>>
> >> >> >> >> >>
> >> >> >> >>
> >> >> >>
> >> >>
> >> >>
> http://www.mpi-inf.mpg.de/yago-naga/yago/download/yago/yagoDBpediaInstances.ttl.7z
> >> >> >> >> >> >>>>
> >> >> >> >> >> >>>>
> >> >> >> >> >> >>>> On Thu, Feb 6, 2014 at 10:33 AM, Cristian Petroaca
> >> >> >> >> >> >>>> <cr...@gmail.com> wrote:
> >> >> >> >> >> >>>> > There are several dbpedia categories for each entity,
> >> >> >> >> >> >>>> > in
> >> >> this
> >> >> >> >> case
> >> >> >> >> >> for
> >> >> >> >> >> >>>> > Microsoft we have :
> >> >> >> >> >> >>>> >
> >> >> >> >> >> >>>> > category:Companies_in_the_NASDAQ-100_Index
> >> >> >> >> >> >>>> > category:Microsoft
> >> >> >> >> >> >>>> > category:Software_companies_of_the_United_States
> >> >> >> >> >> >>>> >
> category:Software_companies_based_in_Washington_(state)
> >> >> >> >> >> >>>> > category:Companies_established_in_1975
> >> >> >> >> >> >>>> > category:1975_establishments_in_the_United_States
> >> >> >> >> >> >>>> > category:Companies_based_in_Redmond,_Washington
> >> >> >> >> >> >>>> >
> >> >> >> >>
> >> >> >> >>
> category:Multinational_companies_headquartered_in_the_United_States
> >> >> >> >> >> >>>> > category:Cloud_computing_providers
> >> >> >> >> >> >>>> >
> category:Companies_in_the_Dow_Jones_Industrial_Average
> >> >> >> >> >> >>>> >
> >> >> >> >> >> >>>> > So we also have "Companies based in
> Redmont,Washington"
> >> >> which
> >> >> >> >> could
> >> >> >> >> >> be
> >> >> >> >> >> >>>> > matched.
> >> >> >> >> >> >>>> >
> >> >> >> >> >> >>>> >
> >> >> >> >> >> >>>> > There is still other contextual information from
> >> >> >> >> >> >>>> > dbpedia
> >> >> which
> >> >> >> >> can
> >> >> >> >> >> be
> >> >> >> >> >> >>>> used.
> >> >> >> >> >> >>>> > For example for an Organization we could also
> include :
> >> >> >> >> >> >>>> > dbpprop:industry = Software
> >> >> >> >> >> >>>> > dbpprop:service = Online Service Providers
> >> >> >> >> >> >>>> >
> >> >> >> >> >> >>>> > and for a Person (that's for Barack Obama) :
> >> >> >> >> >> >>>> >
> >> >> >> >> >> >>>> > dbpedia-owl:profession:
> >> >> >> >> >> >>>> >                                dbpedia:Author
> >> >> >> >> >> >>>> >
> >> >> >> >> >> >>>> > dbpedia:Constitutional_law
> >> >> >> >> >> >>>> >                                dbpedia:Lawyer
> >> >> >> >> >> >>>> >
> >> >> >> >> >> >>>> > dbpedia:Community_organizing
> >> >> >> >> >> >>>> >
> >> >> >> >> >> >>>> > I'd like to continue investigating this as I think
> that
> >> >> >> >> >> >>>> > it
> >> >> may
> >> >> >> >> have
> >> >> >> >> >> >>>> some
> >> >> >> >> >> >>>> > value in increasing the number of coreference
> >> >> >> >> >> >>>> > resolutions
> >> >> and
> >> >> >> I'd
> >> >> >> >> >> like
> >> >> >> >> >> >>>> to
> >> >> >> >> >> >>>> > concentrate more on precision rather than recall
> since
> >> >> >> >> >> >>>> > we
> >> >> >> already
> >> >> >> >> >> have
> >> >> >> >> >> >>>> a
> >> >> >> >> >> >>>> > set of coreferences detected by the stanford nlp tool
> >> >> >> >> >> >>>> > and
> >> >> this
> >> >> >> >> would
> >> >> >> >> >> >>>> be as
> >> >> >> >> >> >>>> > an addition to that (at least this is how I would
> like
> >> >> >> >> >> >>>> > to
> >> >> use
> >> >> >> >> it).
> >> >> >> >> >> >>>> >
> >> >> >> >> >> >>>> > Is it ok if I track this by opening a jira? I could
> >> >> >> >> >> >>>> > update
> >> >> it
> >> >> >> to
> >> >> >> >> >> show
> >> >> >> >> >> >>>> my
> >> >> >> >> >> >>>> > progress and also my conclusions and if it turns out
> >> >> >> >> >> >>>> > that
> >> >> it
> >> >> >> was
> >> >> >> >> a
> >> >> >> >> >> bad
> >> >> >> >> >> >>>> idea
> >> >> >> >> >> >>>> > then that's the situation at least I'll end up with
> >> >> >> >> >> >>>> > more
> >> >> >> >> knowledge
> >> >> >> >> >> >>>> about
> >> >> >> >> >> >>>> > Stanbol in the end :).
> >> >> >> >> >> >>>> >
> >> >> >> >> >> >>>> >
> >> >> >> >> >> >>>> > 2014-02-05 15:39 GMT+02:00 Rafa Haro
> >> >> >> >> >> >>>> > <rh...@apache.org>:
> >> >> >> >> >> >>>> >
> >> >> >> >> >> >>>> >> Hi Cristian,
> >> >> >> >> >> >>>> >>
> >> >> >> >> >> >>>> >> The approach sounds nice. I don't want to be the
> >> >> >> >> >> >>>> >> devil's
> >> >> >> >> advocate
> >> >> >> >> >> but
> >> >> >> >> >> >>>> I'm
> >> >> >> >> >> >>>> >> just not sure about the recall using the dbpedia
> >> >> categories
> >> >> >> >> >> feature.
> >> >> >> >> >> >>>> For
> >> >> >> >> >> >>>> >> example, your sentence could be also "Microsoft
> posted
> >> >> >> >> >> >>>> >> its
> >> >> >> 2013
> >> >> >> >> >> >>>> earnings.
> >> >> >> >> >> >>>> >> The Redmond's company made a huge profit". So, maybe
> >> >> >> including
> >> >> >> >> more
> >> >> >> >> >> >>>> >> contextual information from dbpedia could increase
> the
> >> >> recall
> >> >> >> >> but
> >> >> >> >> >> of
> >> >> >> >> >> >>>> course
> >> >> >> >> >> >>>> >> will reduce the precision.
> >> >> >> >> >> >>>> >>
> >> >> >> >> >> >>>> >> Cheers,
> >> >> >> >> >> >>>> >> Rafa
> >> >> >> >> >> >>>> >>
> >> >> >> >> >> >>>> >> El 04/02/14 09:50, Cristian Petroaca escribió:
> >> >> >> >> >> >>>> >>
> >> >> >> >> >> >>>> >>  Back with a more detailed description of the steps
> >> >> >> >> >> >>>> >> for
> >> >> >> making
> >> >> >> >> this
> >> >> >> >> >> >>>> kind of
> >> >> >> >> >> >>>> >>> coreference work.
> >> >> >> >> >> >>>> >>>
> >> >> >> >> >> >>>> >>> I will be using references to the following text in
> >> >> >> >> >> >>>> >>> the
> >> >> >> steps
> >> >> >> >> >> below
> >> >> >> >> >> >>>> in
> >> >> >> >> >> >>>> >>> order to make things clearer : "Microsoft posted
> its
> >> >> >> >> >> >>>> >>> 2013
> >> >> >> >> >> earnings.
> >> >> >> >> >> >>>> The
> >> >> >> >> >> >>>> >>> software company made a huge profit."
> >> >> >> >> >> >>>> >>>
> >> >> >> >> >> >>>> >>> 1. For every noun phrase in the text which has :
> >> >> >> >> >> >>>> >>>      a. a determinate pos which implies reference
> to
> >> >> >> >> >> >>>> >>> an
> >> >> >> entity
> >> >> >> >> >> local
> >> >> >> >> >> >>>> to
> >> >> >> >> >> >>>> >>> the
> >> >> >> >> >> >>>> >>> text, such as "the, this, these") but not "another,
> >> >> every",
> >> >> >> etc
> >> >> >> >> >> which
> >> >> >> >> >> >>>> >>> implies a reference to an entity outside of the
> text.
> >> >> >> >> >> >>>> >>>      b. having at least another noun aside from the
> >> >> >> >> >> >>>> >>> main
> >> >> >> >> required
> >> >> >> >> >> >>>> noun
> >> >> >> >> >> >>>> >>> which
> >> >> >> >> >> >>>> >>> further describes it. For example I will not count
> >> >> >> >> >> >>>> >>> "The
> >> >> >> >> company"
> >> >> >> >> >> as
> >> >> >> >> >> >>>> being
> >> >> >> >> >> >>>> >>> a

Re: Named entity coref resolution based on dbpedia categories and rdf:type

Posted by Rupert Westenthaler <ru...@gmail.com>.
Hi Cristian,

you cannot send attachments to the list. Please copy the contents
directly into the mail.

thx
Rupert

On Wed, Mar 19, 2014 at 9:20 PM, Cristian Petroaca
<cr...@gmail.com> wrote:
> The config attached.

-- 
| Rupert Westenthaler             rupert.westenthaler@gmail.com
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen

Re: Named entity coref resolution based on dbpedia categories and rdf:type

Posted by Cristian Petroaca <cr...@gmail.com>.
The config attached.


2014-03-19 9:09 GMT+02:00 Rupert Westenthaler <rupert.westenthaler@gmail.com
>:

> Hi Cristian,
>
> can you provide the contents of the chain after your modifications?
> Would be interesting to test why the chain is no longer active after
> the restart.
>
> You can find the config file in the 'stanbol/fileinstall' folder.
>
> best
> Rupert
>
> On Tue, Mar 18, 2014 at 8:24 PM, Cristian Petroaca
> <cr...@gmail.com> wrote:
> > Related to the default chain selection rules : before restart I had a chain
> > with the name 'default' as in I could access it via enhancer/chain/default.
> > Then I just added another engine to the 'default' chain. I assumed that
> > after the restart the chain with the 'default' name would be persisted. So
> > the first rule should have been applied after the restart as well. But
> > instead I cannot reach it via enhancer/chain/default anymore, so it's gone.
> > Anyway, this is not a big deal, it's not blocking me in any way, I just
> > wanted to understand where the problem is.
> >
> >
> > 2014-03-18 7:15 GMT+02:00 Rupert Westenthaler <
> rupert.westenthaler@gmail.com
> >>:
> >
> >> Hi Cristian
> >>
> >> On Mon, Mar 17, 2014 at 9:43 PM, Cristian Petroaca
> >> <cr...@gmail.com> wrote:
> >> > 1. Updated to the latest code and it's gone. Cool
> >> >
> >> > 2. I start the stable launcher -> create a new instance of the
> >> > PosChunkerEngine -> add it to the default chain. At this point everything
> >> > looks good and works ok.
> >> > After I restart the server the default chain is gone and instead I see this
> >> > in the enhancement chains page : all-active (default, id: 149, ranking: 0,
> >> > impl: AllActiveEnginesChain ). all-active did not contain the 'default'
> >> > word before the restart.
> >> >
> >>
> >> Please note the default chain selection rules as described at [1]. You
> >> can also access chains under '/enhancer/chain/{chain-name}'
> >>
> >> best
> >> Rupert
> >>
> >> [1]
> >>
> http://stanbol.staging.apache.org/docs/trunk/components/enhancer/chains/#default-chain
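For illustration, a minimal client-side sketch of posting text to such a chain endpoint; the localhost:8080 base URL, the "default" chain name and the RDF/XML Accept header are assumptions for a local stable launcher rather than details confirmed in this thread:

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;
    import java.util.Scanner;

    public class ChainClient {
        public static void main(String[] args) throws Exception {
            // assumption: local stable launcher on port 8080 and a chain named "default"
            URL url = new URL("http://localhost:8080/enhancer/chain/default");
            HttpURLConnection con = (HttpURLConnection) url.openConnection();
            con.setRequestMethod("POST");
            con.setDoOutput(true);
            con.setRequestProperty("Content-Type", "text/plain; charset=UTF-8");
            con.setRequestProperty("Accept", "application/rdf+xml");
            byte[] body = "Microsoft posted its 2013 earnings. The software company made a huge profit."
                    .getBytes(StandardCharsets.UTF_8);
            try (OutputStream out = con.getOutputStream()) {
                out.write(body);
            }
            // print the enhancement results returned by the chain
            try (Scanner in = new Scanner(con.getInputStream(), "UTF-8")) {
                while (in.hasNextLine()) {
                    System.out.println(in.nextLine());
                }
            }
        }
    }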
> >>
> >> > It looks like the config files are exactly what I need. Thanks.
> >> >
> >> >
> >> > 2014-03-17 9:26 GMT+02:00 Rupert Westenthaler <
> >> rupert.westenthaler@gmail.com
> >> >>:
> >> >
> >> >> On Sat, Mar 15, 2014 at 8:34 PM, Cristian Petroaca
> >> >> <cr...@gmail.com> wrote:
> >> >> > Thanks Rupert.
> >> >> >
> >> >> > A couple more questions/issues :
> >> >> >
> >> >> > 1. Whenever I start the stanbol server I'm seeing this in the
> console
> >> >> > output :
> >> >> >
> >> >>
> >> >> This should be fixed with STANBOL-1278 [1] [2]
> >> >>
> >> >> > 2. Whenever I restart the server the Weighted Chains get messed up. I
> >> >> > usually use the 'default' chain and add my engine to it so there are 11
> >> >> > engines in it. After the restart this chain now contains around 23
> >> >> > engines in total.
> >> >>
> >> >> I was not able to replicate this. What I tried was
> >> >>
> >> >> (1) start up the stable launcher
> >> >> (2) add an additional engine to the default chain
> >> >> (3) restart the launcher
> >> >>
> >> >> The default chain was not changed after (2) and (3). So I would need
> >> >> further information for knowing why this is happening.
> >> >>
> >> >> Generally it is better to create your own chain instance rather than modifying
> >> >> one that is provided by the default configuration. I would also
> >> >> recommend that you keep your test configuration in text files and to
> >> >> copy those to the 'stanbol/fileinstall' folder. Doing so prevents you
> >> >> from manually entering the configuration after a software update. The
> >> >> production-mode section [3] provides information on how to do that.
> >> >>
> >> >> best
> >> >> Rupert
> >> >>
> >> >> [1] https://issues.apache.org/jira/browse/STANBOL-1278
> >> >> [2] http://svn.apache.org/r1576623
> >> >> [3] http://stanbol.apache.org/docs/trunk/production-mode
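For illustration, a sketch of what such a dropped-in chain configuration could look like. The file name (something like org.apache.stanbol.enhancer.chain.weighted.impl.WeightedChain-coref.config placed in 'stanbol/fileinstall'), the property keys and the engine names below are assumptions recalled from the chain documentation, not values from this thread, and should be verified against the production-mode guide [3] before use:

    stanbol.enhancer.chain.name="coref-test"
    stanbol.enhancer.chain.weighted.chain=["langdetect","opennlp-sentence","opennlp-token","opennlp-pos","pos-chunker","coref-engine"]

Here "coref-engine" is purely a placeholder for the coreference engine discussed in this thread.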
> >> >>
> >> >> > ERROR: Bundle org.apache.stanbol.enhancer.engine.topic.web [153]: Error starting
> >> >> > slinginstall:c:\Data\Projects\Stanbol\main\launchers\stable\target\stanbol\startup\35\org.apache.stanbol.enhancer.engine.topic.web-1.0.0-SNAPSHOT.jar
> >> >> > (org.osgi.framework.BundleException: Unresolved constraint in bundle
> >> >> > org.apache.stanbol.enhancer.engine.topic.web [153]: Unable to resolve 153.0: missing
> >> >> > requirement [153.0] package; (&(package=javax.ws.rs)(version>=0.0.0)(!(version>=2.0.0))))
> >> >> > org.osgi.framework.BundleException: Unresolved constraint in bundle
> >> >> > org.apache.stanbol.enhancer.engine.topic.web [153]: Unable to resolve 153.0: missing
> >> >> > requirement [153.0] package; (&(package=javax.ws.rs)(version>=0.0.0)(!(version>=2.0.0)))
> >> >> >         at org.apache.felix.framework.Felix.resolveBundle(Felix.java:3443)
> >> >> >         at org.apache.felix.framework.Felix.startBundle(Felix.java:1727)
> >> >> >         at org.apache.felix.framework.Felix.setActiveStartLevel(Felix.java:1156)
> >> >> >         at org.apache.felix.framework.StartLevelImpl.run(StartLevelImpl.java:264)
> >> >> >         at java.lang.Thread.run(Unknown Source)
> >> >> >
> >> >> > Despite this, the server starts fine and I can use the enhancer fine.
> >> >> > Do you guys see this as well?
> >> >> >
> >> >> >
> >> >> >
> >> >> >
> >> >> >
> >> >> >
> >> >> > 2014-03-11 9:47 GMT+02:00 Rupert Westenthaler <
> >> >> rupert.westenthaler@gmail.com
> >> >> >>:
> >> >> >
> >> >> >> Hi Cristian,
> >> >> >>
> >> >> >> NER Annotations are typically available as both
> >> >> >> NlpAnnotations.NER_ANNOTATION and fise:TextAnnotation [1] in the
> >> >> >> enhancement metadata. As you are already accessing the AnalyzedText I
> >> >> >> would prefer using the NlpAnnotations.NER_ANNOTATION.
> >> >> >>
> >> >> >> best
> >> >> >> Rupert
> >> >> >>
> >> >> >> [1]
> >> >> >>
> >> >>
> >>
> http://stanbol.apache.org/docs/trunk/components/enhancer/enhancementstructure.html#fisetextannotation
> >> >> >>
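A small sketch of that NlpAnnotations.NER_ANNOTATION route, written in the same style as the chunk-iteration demo code quoted further down (it would live in the engine's computeEnhancement method); whether every named entity is exposed as a Chunk span and the exact NerTag accessors are assumptions to double-check against the NLP module API:

        // sketch: collect named-entity chunks from the AnalysedText content part
        AnalysedText at = NlpEngineHelper.getAnalysedText(this, ci, true);
        Iterator<Span> spans = at.getEnclosed(EnumSet.of(SpanTypeEnum.Chunk));
        while(spans.hasNext()){
            Span span = spans.next();
            Value<NerTag> ner = span.getAnnotation(NlpAnnotations.NER_ANNOTATION);
            if(ner != null){ // only chunks that carry a NER annotation
                log.info(" - NamedEntity [{},{}] {} (type: {})", new Object[]{
                        span.getStart(), span.getEnd(), span.getSpan(), ner.value().getType()});
            }
        }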
> >> >> >> On Mon, Mar 10, 2014 at 10:07 PM, Cristian Petroaca
> >> >> >> <cr...@gmail.com> wrote:
> >> >> >> > Thanks.
> >> >> >> > I assume I should get the Named entities using the same but with
> >> >> >> > NlpAnnotations.NER_ANNOTATION?
> >> >> >> >
> >> >> >> >
> >> >> >> >
> >> >> >> > 2014-03-10 13:29 GMT+02:00 Rupert Westenthaler <
> >> >> >> > rupert.westenthaler@gmail.com>:
> >> >> >> >
> >> >> >> >> Hallo Cristian,
> >> >> >> >>
> >> >> >> >> NounPhrases are not added to the RDF enhancement results. You
> >> need to
> >> >> >> >> use the AnalyzedText ContentPart [1]
> >> >> >> >>
> >> >> >> >> here is some demo code you can use in the computeEnhancement
> >> method
> >> >> >> >>
> >> >> >> >>         // get the AnalysedText content part of the ContentItem
> >> >> >> >>         AnalysedText at = NlpEngineHelper.getAnalysedText(this, ci, true);
> >> >> >> >>         Iterator<? extends Section> sections = at.getSentences();
> >> >> >> >>         if(!sections.hasNext()){ //process as single sentence
> >> >> >> >>             sections = Collections.singleton(at).iterator();
> >> >> >> >>         }
> >> >> >> >>
> >> >> >> >>         while(sections.hasNext()){
> >> >> >> >>             Section section = sections.next();
> >> >> >> >>             // iterate over the chunks (phrases) within the sentence
> >> >> >> >>             Iterator<Span> chunks = section.getEnclosed(EnumSet.of(SpanTypeEnum.Chunk));
> >> >> >> >>             while(chunks.hasNext()){
> >> >> >> >>                 Span chunk = chunks.next();
> >> >> >> >>                 Value<PhraseTag> phrase = chunk.getAnnotation(NlpAnnotations.PHRASE_ANNOTATION);
> >> >> >> >>                 // guard against chunks without a phrase annotation
> >> >> >> >>                 if(phrase != null && phrase.value().getCategory() == LexicalCategory.Noun){
> >> >> >> >>                     log.info(" - NounPhrase [{},{}] {}", new Object[]{
> >> >> >> >>                             chunk.getStart(),chunk.getEnd(),chunk.getSpan()});
> >> >> >> >>                 }
> >> >> >> >>             }
> >> >> >> >>         }
> >> >> >> >>
> >> >> >> >> hope this helps
> >> >> >> >>
> >> >> >> >> best
> >> >> >> >> Rupert
> >> >> >> >>
> >> >> >> >> [1]
> >> >> >> >>
> >> >> >>
> >> >>
> >>
> http://stanbol.apache.org/docs/trunk/components/enhancer/nlp/analyzedtext
> >> >> >> >>
> >> >> >> >> On Sun, Mar 9, 2014 at 6:07 PM, Cristian Petroaca
> >> >> >> >> <cr...@gmail.com> wrote:
> >> >> >> >> > I started to implement the engine and I'm having problems with getting
> >> >> >> >> > results for noun phrases. I modified the "default" weighted chain to also
> >> >> >> >> > include the PosChunkerEngine and ran a sample text : "Angela Merkel visited
> >> >> >> >> > China. The German chancellor met with various people". I expected that the
> >> >> >> >> > RDF XML output would contain some info about the noun phrases but I cannot
> >> >> >> >> > see any.
> >> >> >> >> > Could you point me to the correct way to generate the noun phrases?
> >> >> >> >> >
> >> >> >> >> > Thanks,
> >> >> >> >> > Cristian
> >> >> >> >> >
> >> >> >> >> >
> >> >> >> >> > 2014-02-09 14:15 GMT+02:00 Cristian Petroaca <
> >> >> >> >> cristian.petroaca@gmail.com>:
> >> >> >> >> >
> >> >> >> >> >> Opened https://issues.apache.org/jira/browse/STANBOL-1279
> >> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >> >> 2014-02-07 10:53 GMT+02:00 Cristian Petroaca <
> >> >> >> >> cristian.petroaca@gmail.com>
> >> >> >> >> >> :
> >> >> >> >> >>
> >> >> >> >> >> Hi Rupert,
> >> >> >> >> >>>
> >> >> >> >> >>> The "spatial" dimension is a good idea. I'll also take a
> look
> >> at
> >> >> >> Yago.
> >> >> >> >> >>>
> >> >> >> >> >>> I will create a Jira with what we talked about here. It
> will
> >> >> >> probably
> >> >> >> >> >>> have just a draft-like description for now and will be
> updated
> >> >> as I
> >> >> >> go
> >> >> >> >> >>> along.
> >> >> >> >> >>>
> >> >> >> >> >>> Thanks,
> >> >> >> >> >>> Cristian
> >> >> >> >> >>>
> >> >> >> >> >>>
> >> >> >> >> >>> 2014-02-06 15:39 GMT+02:00 Rupert Westenthaler <
> >> >> >> >> >>> rupert.westenthaler@gmail.com>:
> >> >> >> >> >>>
> >> >> >> >> >>> Hi Cristian,
> >> >> >> >> >>>>
> >> >> >> >>>> definitely an interesting approach. You should have a look at Yago2
> >> >> >> >>>> [1]. As far as I can remember the Yago taxonomy is much better
> >> >> >> >>>> structured than the one used by dbpedia. Mapping suggestions of dbpedia
> >> >> >> >>>> to concepts in Yago2 is easy as both dbpedia and yago2 do provide
> >> >> >> >>>> mappings [2] and [3]
> >> >> >> >> >>>>
> >> >> >> >> >>>> > 2014-02-05 15:39 GMT+02:00 Rafa Haro <rharo@apache.org
> >:
> >> >> >> >> >>>> >>
> >> >> >> >> >>>> >> "Microsoft posted its 2013 earnings. The Redmond's
> company
> >> >> made
> >> >> >> a
> >> >> >> >> >>>> >> huge profit".
> >> >> >> >> >>>>
> >> >> >> >>>> That's actually a very good example. Spatial contexts are very
> >> >> >> >>>> important as they tend to be often used for referencing. So I would
> >> >> >> >>>> suggest to specially treat the spatial context. For spatial Entities
> >> >> >> >>>> (like a City) this is easy, but even for others (like a Person,
> >> >> >> >>>> Company) you could use relations to spatial entities to define their
> >> >> >> >>>> spatial context. This context could then be used to correctly link
> >> >> >> >>>> "The Redmond's company" to "Microsoft".
> >> >> >> >> >>>>
> >> >> >> >> >>>> In addition I would suggest to use the "spatial" context
> of
> >> each
> >> >> >> >> >>>> entity (basically relation to entities that are cities,
> >> regions,
> >> >> >> >> >>>> countries) as a separate dimension, because those are very
> >> often
> >> >> >> used
> >> >> >> >> >>>> for coreferences.
> >> >> >> >> >>>>
> >> >> >> >> >>>> [1] http://www.mpi-inf.mpg.de/yago-naga/yago/
> >> >> >> >> >>>> [2]
> http://downloads.dbpedia.org/3.9/links/yago_links.nt.bz2
> >> >> >> >> >>>> [3]
> >> >> >> >> >>>>
> >> >> >> >>
> >> >> >>
> >> >>
> >>
> http://www.mpi-inf.mpg.de/yago-naga/yago/download/yago/yagoDBpediaInstances.ttl.7z
> >> >> >> >> >>>>
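As a rough illustration of this "spatial dimension", a small helper that checks whether a noun phrase mentions one of the place labels associated with a candidate entity; how that label set is built from dbpedia (location relations, categories such as Companies_based_in_Redmond,_Washington, ...) is deliberately left open here and would be the real work:

    import java.util.Locale;
    import java.util.Set;

    public class SpatialContextMatcher {

        /** True if any token of the noun phrase matches one of the entity's
         *  spatial-context labels (labels assumed lower-cased), e.g.
         *  "The Redmond's company" against {"redmond", "washington"}. */
        public static boolean matchesSpatialContext(String nounPhrase, Set<String> spatialLabels) {
            for (String token : nounPhrase.toLowerCase(Locale.ENGLISH).split("[^\\p{L}]+")) {
                if (spatialLabels.contains(token)) {
                    return true;
                }
            }
            return false;
        }
    }

A match like this would only raise the confidence of linking "The Redmond's company" to Microsoft; it is one signal next to the category and rdf:type checks discussed in this thread.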
> >> >> >> >> >>>>
> >> >> >> >> >>>> On Thu, Feb 6, 2014 at 10:33 AM, Cristian Petroaca
> >> >> >> >> >>>> <cr...@gmail.com> wrote:
> >> >> >> >>>> > There are several dbpedia categories for each entity, in this case for
> >> >> >> >>>> > Microsoft we have :
> >> >> >> >> >>>> >
> >> >> >> >> >>>> > category:Companies_in_the_NASDAQ-100_Index
> >> >> >> >> >>>> > category:Microsoft
> >> >> >> >> >>>> > category:Software_companies_of_the_United_States
> >> >> >> >> >>>> > category:Software_companies_based_in_Washington_(state)
> >> >> >> >> >>>> > category:Companies_established_in_1975
> >> >> >> >> >>>> > category:1975_establishments_in_the_United_States
> >> >> >> >> >>>> > category:Companies_based_in_Redmond,_Washington
> >> >> >> >> >>>> >
> >> >> >>
> category:Multinational_companies_headquartered_in_the_United_States
> >> >> >> >> >>>> > category:Cloud_computing_providers
> >> >> >> >> >>>> > category:Companies_in_the_Dow_Jones_Industrial_Average
> >> >> >> >> >>>> >
> >> >> >> >>>> > So we also have "Companies based in Redmond, Washington" which could
> >> >> >> >>>> > be matched.
> >> >> >> >> >>>> >
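For illustration, a sketch of how such category matching could be scored: compare the (lemmatized) nouns of the noun phrase against the tokens of a dbpedia category and turn the overlap into a confidence value. The tiny lemma() stub stands in for the real lemmatizer and POS filtering discussed later in the thread and is only an assumption:

    import java.util.HashSet;
    import java.util.List;
    import java.util.Locale;
    import java.util.Set;

    public class CategoryMatcher {

        /** Fraction of noun-phrase nouns that also occur in the category, e.g. the
         *  nouns ["software", "company"] against "Software companies of the United States". */
        public static double matchConfidence(List<String> phraseNouns, String category) {
            Set<String> categoryTokens = new HashSet<>();
            for (String token : category.toLowerCase(Locale.ENGLISH).split("[\\s_,]+")) {
                categoryTokens.add(lemma(token));
            }
            int matches = 0;
            for (String noun : phraseNouns) {
                if (categoryTokens.contains(lemma(noun.toLowerCase(Locale.ENGLISH)))) {
                    matches++;
                }
            }
            return phraseNouns.isEmpty() ? 0.0 : (double) matches / phraseNouns.size();
        }

        /** Placeholder lemmatizer: strips a plural "-ies"/"-s" ending; the engine
         *  would use the lemmas produced by the NLP pipeline instead. */
        private static String lemma(String token) {
            if (token.endsWith("ies")) {
                return token.substring(0, token.length() - 3) + "y";
            }
            if (token.endsWith("s") && token.length() > 3) {
                return token.substring(0, token.length() - 1);
            }
            return token;
        }
    }

With this, "software company" scores 1.0 against "Software companies of the United States", while it only partially matches less specific categories such as "Companies established in 1975".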
> >> >> >> >> >>>> >
> >> >> >> >>>> > There is still other contextual information from dbpedia which can be
> >> >> >> >>>> > used.
> >> >> >> >>>> > For example for an Organization we could also include :
> >> >> >> >>>> > dbpprop:industry = Software
> >> >> >> >>>> > dbpprop:service = Online Service Providers
> >> >> >> >>>> >
> >> >> >> >>>> > and for a Person (that's for Barack Obama) :
> >> >> >> >>>> >
> >> >> >> >>>> > dbpedia-owl:profession:
> >> >> >> >>>> >     dbpedia:Author
> >> >> >> >>>> >     dbpedia:Constitutional_law
> >> >> >> >>>> >     dbpedia:Lawyer
> >> >> >> >>>> >     dbpedia:Community_organizing
> >> >> >> >> >>>> >
> >> >> >> >>>> > I'd like to continue investigating this as I think that it may have
> >> >> >> >>>> > some value in increasing the number of coreference resolutions, and I'd
> >> >> >> >>>> > like to concentrate more on precision rather than recall since we
> >> >> >> >>>> > already have a set of coreferences detected by the stanford nlp tool
> >> >> >> >>>> > and this would be an addition to that (at least this is how I would
> >> >> >> >>>> > like to use it).
> >> >> >> >>>> >
> >> >> >> >>>> > Is it ok if I track this by opening a jira? I could update it to show
> >> >> >> >>>> > my progress and also my conclusions, and if it turns out that it was a
> >> >> >> >>>> > bad idea then that's the situation; at least I'll end up with more
> >> >> >> >>>> > knowledge about Stanbol in the end :).
> >> >> >> >> >>>> >
> >> >> >> >> >>>> >
> >> >> >> >> >>>> > 2014-02-05 15:39 GMT+02:00 Rafa Haro <rharo@apache.org
> >:
> >> >> >> >> >>>> >
> >> >> >> >> >>>> >> Hi Cristian,
> >> >> >> >> >>>> >>
> >> >> >> >>>> >> The approach sounds nice. I don't want to be the devil's advocate but
> >> >> >> >>>> >> I'm just not sure about the recall using the dbpedia categories feature.
> >> >> >> >>>> >> For example, your sentence could also be "Microsoft posted its 2013
> >> >> >> >>>> >> earnings. The Redmond's company made a huge profit". So, maybe including
> >> >> >> >>>> >> more contextual information from dbpedia could increase the recall but
> >> >> >> >>>> >> of course will reduce the precision.
> >> >> >> >> >>>> >>
> >> >> >> >> >>>> >> Cheers,
> >> >> >> >> >>>> >> Rafa
> >> >> >> >> >>>> >>
> >> >> >> >> >>>> >> El 04/02/14 09:50, Cristian Petroaca escribió:
> >> >> >> >> >>>> >>
> >> >> >> >>>> >>  Back with a more detailed description of the steps for making this
> >> >> >> >>>> >>> kind of coreference work.
> >> >> >> >>>> >>>
> >> >> >> >>>> >>> I will be using references to the following text in the steps below in
> >> >> >> >>>> >>> order to make things clearer : "Microsoft posted its 2013 earnings. The
> >> >> >> >>>> >>> software company made a huge profit."
> >> >> >> >> >>>> >>>
> >> >> >> >>>> >>> 1. For every noun phrase in the text which has :
> >> >> >> >>>> >>>      a. a determinate pos which implies reference to an entity local to the
> >> >> >> >>>> >>> text, such as "the, this, these", but not "another, every", etc which
> >> >> >> >>>> >>> implies a reference to an entity outside of the text.
> >> >> >> >>>> >>>      b. having at least another noun aside from the main required noun which
> >> >> >> >>>> >>> further describes it. For example I will not count "The company" as being a
> >> >> >> >>>> >>> legitimate candidate since this could create a lot of false positives by
> >> >> >> >>>> >>> considering the double meaning of some words such as "in the company of
> >> >> >> >>>> >>> good people".
> >> >> >> >>>> >>> "The software company" is a good candidate since we also have "software".
> >> >> >> >> >>>> >>>
> >> >> >> >> >>>> >>> 2. match the nouns in the noun phrase to the contents
> of
> >> the
> >> >> >> >> dbpedia
> >> >> >> >> >>>> >>> categories of each named entity found prior to the
> >> location
> >> >> of
> >> >> >> the
> >> >> >> >> >>>> noun
> >> >> >> >> >>>> >>> phrase in the text.
> >> >> >> >> >>>> >>> The dbpedia categories are in the following format
> (for
> >> >> >> Microsoft
> >> >> >> >> for
> >> >> >> >> >>>> >>> example) : "Software companies of the United States".
> >> >> >> >> >>>> >>>   So we try to match "software company" with that.
> >> >> >> >> >>>> >>> First, as you can see, the main noun in the dbpedia
> >> category
> >> >> >> has a
> >> >> >> >> >>>> plural
> >> >> >> >> >>>> >>> form and it's the same for all categories which I
> saw. I
> >> >> don't
> >> >> >> >> know
> >> >> >> >> >>>> if
> >> >> >> >> >>>> >>> there's an easier way to do this but I thought of
> >> applying a
> >> >> >> >> >>>> lemmatizer on
> >> >> >> >> >>>> >>> the category and the noun phrase in order for them to
> >> have a
> >> >> >> >> common
> >> >> >> >> >>>> >>> denominator.This also works if the noun phrase itself
> >> has a
> >> >> >> plural
> >> >> >> >> >>>> form.
> >> >> >> >> >>>> >>>
> >> >> >> >> >>>> >>> Second, I'll need to use for comparison only the
> words in
> >> >> the
> >> >> >> >> >>>> category
> >> >> >> >> >>>> >>> which are themselves nouns and not prepositions or
> >> >> determiners
> >> >> >> >> such
> >> >> >> >> >>>> as "of
> >> >> >> >> >>>> >>> the".This means that I need to pos tag the categories
> >> >> contents
> >> >> >> as
> >> >> >> >> >>>> well.
> >> >> >> >> >>>> >>> I was thinking of running the pos and lemma on the
> >> dbpedia
> >> >> >> >> >>>> categories when
> >> >> >> >> >>>> >>> building the dbpedia backed entity hub and storing
> them
> >> for
> >> >> >> later
> >> >> >> >> >>>> use - I
> >> >> >> >> >>>> >>> don't know how feasible this is at the moment.
> >> >> >> >> >>>> >>>
> >> >> >> >> >>>> >>> After this I can compare each noun in the noun phrase
> >> with
> >> >> the
> >> >> >> >> >>>> equivalent
> >> >> >> >> >>>> >>> nouns in the categories and based on the number of
> >> matches I
> >> >> >> can
> >> >> >> >> >>>> create a
> >> >> >> >> >>>> >>> confidence level.
> >> >> >> >> >>>> >>>
> >> >> >> >> >>>> >>> 3. match the noun of the noun phrase with the rdf:type
> >> from
> >> >> >> >> dbpedia
> >> >> >> >> >>>> of the
> >> >> >> >> >>>> >>> named entity. If this matches increase the confidence
> >> level.
> >> >> >> >> >>>> >>>
> >> >> >> >> >>>> >>> 4. If there are multiple named entities which can
> match a
> >> >> >> certain
> >> >> >> >> >>>> noun
> >> >> >> >> >>>> >>> phrase then link the noun phrase with the closest
> named
> >> >> entity
> >> >> >> >> prior
> >> >> >> >> >>>> to it
> >> >> >> >> >>>> >>> in the text.
> >> >> >> >> >>>> >>>
> >> >> >> >> >>>> >>> What do you think?
> >> >> >> >> >>>> >>>
> >> >> >> >> >>>> >>> Cristian
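A rough, purely illustrative sketch of how steps 2 and 3 of the quoted proposal could be put together. The lemma() helper and the stop-word set below are only placeholders for the lemmatizer and POS-based filtering that the NLP chain would actually provide, and the class is not part of any existing Stanbol module:

    import java.util.*;

    // Illustrative only: match a noun phrase ("software company") against a
    // dbpedia category label ("Software companies of the United States") by
    // comparing lemmatized nouns and returning a simple confidence value.
    public class CategoryMatcher {

        // toy lemmatizer placeholder; real code would reuse the lemmas of the NLP chain
        static String lemma(String word) {
            String w = word.toLowerCase(Locale.ENGLISH);
            if (w.endsWith("ies")) return w.substring(0, w.length() - 3) + "y";
            if (w.endsWith("s")) return w.substring(0, w.length() - 1);
            return w;
        }

        // stands in for POS-based removal of prepositions/determiners in the label
        static final Set<String> STOP = new HashSet<>(Arrays.asList("of", "the", "in", "based"));

        // fraction of noun-phrase nouns that also occur (as lemmas) in the category label
        static double match(List<String> nounPhraseNouns, String categoryLabel) {
            Set<String> categoryLemmas = new HashSet<>();
            for (String token : categoryLabel.split("[\\s_,]+")) {
                if (!STOP.contains(token.toLowerCase(Locale.ENGLISH))) {
                    categoryLemmas.add(lemma(token));
                }
            }
            int hits = 0;
            for (String noun : nounPhraseNouns) {
                if (categoryLemmas.contains(lemma(noun))) hits++;
            }
            return nounPhraseNouns.isEmpty() ? 0d : (double) hits / nounPhraseNouns.size();
        }

        public static void main(String[] args) {
            System.out.println(match(Arrays.asList("software", "company"),
                    "Software companies of the United States")); // prints 1.0
        }
    }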
> >> >> >> >> >>>> >>>
> >> >> >> >> >>>> >>> 2014-01-31 Cristian Petroaca <
> >> cristian.petroaca@gmail.com>:
> >> >> >> >> >>>> >>>
> >> >> >> >> >>>> >>>  Hi Rafa,
> >> >> >> >> >>>> >>>>
> >> >> >> >> >>>> >>>> I don't yet have a concrete heursitic but I'm
> working on
> >> >> it.
> >> >> >> I'll
> >> >> >> >> >>>> provide
> >> >> >> >> >>>> >>>> it here so that you guys can give me a feedback on
> it.
> >> >> >> >> >>>> >>>>
> >> >> >> >> >>>> >>>> What are "locality" features?
> >> >> >> >> >>>> >>>>
> >> >> >> >> >>>> >>>> I looked at Bart and other coref tools such as ArkRef
> >> and
> >> >> >> >> >>>> CherryPicker
> >> >> >> >> >>>> >>>> and
> >> >> >> >> >>>> >>>> they don't provide such a coreference.
> >> >> >> >> >>>> >>>>
> >> >> >> >> >>>> >>>> Cristian
> >> >> >> >> >>>> >>>>
> >> >> >> >> >>>> >>>>
> >> >> >> >> >>>> >>>> 2014-01-30 Rafa Haro <rh...@apache.org>:
> >> >> >> >> >>>> >>>>
> >> >> >> >> >>>> >>>> Hi Cristian,
> >> >> >> >> >>>> >>>>
> >> >> >> >> >>>> >>>>> Without having more details about your concrete
> >> heuristic,
> >> >> >> in my
> >> >> >> >> >>>> honest
> >> >> >> >> >>>> >>>>> opinion, such approach could produce a lot of false
> >> >> >> positives. I
> >> >> >> >> >>>> don't
> >> >> >> >> >>>> >>>>> know
> >> >> >> >> >>>> >>>>> if you are planning to use some "locality" features
> to
> >> >> detect
> >> >> >> >> such
> >> >> >> >> >>>> >>>>> coreferences but you need to take into account that
> it
> >> is
> >> >> >> quite
> >> >> >> >> >>>> usual
> >> >> >> >> >>>> >>>>> that
> >> >> >> >> >>>> >>>>> coreferenced mentions can occurs even in different
> >> >> >> paragraphs.
> >> >> >> >> >>>> Although
> >> >> >> >> >>>> >>>>> I'm
> >> >> >> >> >>>> >>>>> not an expert in Natural Language Understanding, I
> >> would
> >> >> say
> >> >> >> it
> >> >> >> >> is
> >> >> >> >> >>>> quite
> >> >> >> >> >>>> >>>>> difficult to get decent precision/recall rates for
> >> >> >> coreferencing
> >> >> >> >> >>>> using
> >> >> >> >> >>>> >>>>> fixed rules. Maybe you can give a try to others
> tools
> >> like
> >> >> >> BART
> >> >> >> >> (
> >> >> >> >> >>>> >>>>> http://www.bart-coref.org/).
> >> >> >> >> >>>> >>>>>
> >> >> >> >> >>>> >>>>> Cheers,
> >> >> >> >> >>>> >>>>> Rafa Haro
> >> >> >> >> >>>> >>>>>
> >> >> >> >> >>>> >>>>> El 30/01/14 10:33, Cristian Petroaca escribió:
> >> >> >> >> >>>> >>>>>
> >> >> >> >> >>>> >>>>>   Hi,
> >> >> >> >> >>>> >>>>>
> >> >> >> >> >>>> >>>>>> One of the necessary steps for implementing the
> Event
> >> >> >> >> extraction
> >> >> >> >> >>>> Engine
> >> >> >> >> >>>> >>>>>> feature :
> >> >> >> https://issues.apache.org/jira/browse/STANBOL-1121is
> >> >> >> >> >>>> to
> >> >> >> >> >>>> >>>>>> have
> >> >> >> >> >>>> >>>>>> coreference resolution in the given text. This is
> >> >> provided
> >> >> >> now
> >> >> >> >> >>>> via the
> >> >> >> >> >>>> >>>>>> stanford-nlp project but as far as I saw this
> module
> >> is
> >> >> >> >> performing
> >> >> >> >> >>>> >>>>>> mostly
> >> >> >> >> >>>> >>>>>> pronomial (He, She) or nominal (Barack Obama and
> Mr.
> >> >> Obama)
> >> >> >> >> >>>> coreference
> >> >> >> >> >>>> >>>>>> resolution.
> >> >> >> >> >>>> >>>>>>
> >> >> >> >> >>>> >>>>>> In order to get more coreferences from the text I
> >> though
> >> >> of
> >> >> >> >> >>>> creating
> >> >> >> >> >>>> >>>>>> some
> >> >> >> >> >>>> >>>>>> logic that would detect this kind of coreference :
> >> >> >> >> >>>> >>>>>> "Apple reaches new profit heights. The software
> >> company
> >> >> just
> >> >> >> >> >>>> announced
> >> >> >> >> >>>> >>>>>> its
> >> >> >> >> >>>> >>>>>> 2013 earnings."
> >> >> >> >> >>>> >>>>>> Here "The software company" obviously refers to
> >> "Apple".
> >> >> >> >> >>>> >>>>>> So I'd like to detect coreferences of Named
> Entities
> >> >> which
> >> >> >> are
> >> >> >> >> of
> >> >> >> >> >>>> the
> >> >> >> >> >>>> >>>>>> rdf:type of the Named Entity , in this case
> "company"
> >> and
> >> >> >> also
> >> >> >> >> >>>> have
> >> >> >> >> >>>> >>>>>> attributes which can be found in the dbpedia
> >> categories
> >> >> of
> >> >> >> the
> >> >> >> >> >>>> named
> >> >> >> >> >>>> >>>>>> entity, in this case "software".
> >> >> >> >> >>>> >>>>>>
> >> >> >> >> >>>> >>>>>> The detection of coreferences such as "The software
> >> >> >> company" in
> >> >> >> >> >>>> the
> >> >> >> >> >>>> >>>>>> text
> >> >> >> >> >>>> >>>>>> would also be done by either using the new Pos Tag
> >> Based
> >> >> >> Phrase
> >> >> >> >> >>>> >>>>>> extraction
> >> >> >> >> >>>> >>>>>> Engine (noun phrases) or by using a dependency
> tree of
> >> >> the
> >> >> >> >> >>>> sentence and
> >> >> >> >> >>>> >>>>>> picking up only subjects or objects.
> >> >> >> >> >>>> >>>>>>
> >> >> >> >> >>>> >>>>>> At this point I'd like to know if this kind of
> logic
> >> >> would
> >> >> >> be
> >> >> >> >> >>>> useful
> >> >> >> >> >>>> >>>>>> as a
> >> >> >> >> >>>> >>>>>> separate Enhancement Engine (in case the precision
> and
> >> >> >> recall
> >> >> >> >> are
> >> >> >> >> >>>> good
> >> >> >> >> >>>> >>>>>> enough) in Stanbol?
> >> >> >> >> >>>> >>>>>>
> >> >> >> >> >>>> >>>>>> Thanks,
> >> >> >> >> >>>> >>>>>> Cristian
> >> >> >> >> >>>> >>>>>>
> >> >> >> >> >>>> >>>>>>
> >> >> >> >> >>>> >>>>>>
> >> >> >> >> >>>> >>
> >> >> >> >> >>>>
> >> >> >> >> >>>>
> >> >> >> >> >>>>
> >> >> >> >> >>>> --
> >> >> >> >> >>>> | Rupert Westenthaler
> >> rupert.westenthaler@gmail.com
> >> >> >> >> >>>> | Bodenlehenstraße 11
> >> >> >> ++43-699-11108907
> >> >> >> >> >>>> | A-5500 Bischofshofen
> >> >> >> >> >>>>
> >> >> >> >> >>>
> >> >> >> >> >>>
> >> >> >> >> >>
> >> >> >> >>
> >> >> >> >>
> >> >> >> >>
> >> >> >> >> --
> >> >> >> >> | Rupert Westenthaler
> rupert.westenthaler@gmail.com
> >> >> >> >> | Bodenlehenstraße 11
> >> ++43-699-11108907
> >> >> >> >> | A-5500 Bischofshofen
> >> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >> --
> >> >> >> | Rupert Westenthaler             rupert.westenthaler@gmail.com
> >> >> >> | Bodenlehenstraße 11
> ++43-699-11108907
> >> >> >> | A-5500 Bischofshofen
> >> >> >>
> >> >>
> >> >>
> >> >>
> >> >> --
> >> >> | Rupert Westenthaler             rupert.westenthaler@gmail.com
> >> >> | Bodenlehenstraße 11                             ++43-699-11108907
> >> >> | A-5500 Bischofshofen
> >> >>
> >>
> >>
> >>
> >> --
> >> | Rupert Westenthaler             rupert.westenthaler@gmail.com
> >> | Bodenlehenstraße 11                             ++43-699-11108907
> >> | A-5500 Bischofshofen
> >>
>
>
>
> --
> | Rupert Westenthaler             rupert.westenthaler@gmail.com
> | Bodenlehenstraße 11                             ++43-699-11108907
> | A-5500 Bischofshofen
>

Re: Named entity coref resolution based on dbpedia categories and rdf:type

Posted by Rupert Westenthaler <ru...@gmail.com>.
Hi Cristian,

Can you provide the contents of the chain after your modifications?
It would be interesting to test why the chain is no longer active after
the restart.

You can find the config file in the 'stanbol/fileinstall' folder.
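For example (writing this from memory, so the factory PID and the property
names should be double checked against the production-mode documentation), a
chain kept in 'stanbol/fileinstall' is just a config file such as

    stanbol/fileinstall/org.apache.stanbol.enhancer.chain.weighted.impl.WeightedChain-mychain.config

with content along the lines of

    stanbol.enhancer.chain.name="mychain"
    stanbol.enhancer.chain.weighted.chain=["langdetect","opennlp-sentence","opennlp-token","opennlp-pos","opennlp-chunker"]

(the engine names are only placeholders for whatever your chain uses). Such a
file survives restarts and software updates, so you do not have to re-enter
the configuration in the Felix console.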

best
Rupert

On Tue, Mar 18, 2014 at 8:24 PM, Cristian Petroaca
<cr...@gmail.com> wrote:
> Related to the default chain selection rules : before restart I had a chain
> with the name 'default' as in I could access it via enhancer/chain/default.
> Then I just added another engine to the 'default' chain. I assumed that
> after the restart the chain with the 'default' name would be persisted. So
> the first rule should have been applied after the restart as well. But
> instead I cannot reach it via enhancer/chain/default anymore so its gone.
> Anyway, this is not a big deal, it's not blocking me in any way, I just
> wanted to understand where the problem is.
>
>
> 2014-03-18 7:15 GMT+02:00 Rupert Westenthaler <rupert.westenthaler@gmail.com
>>:
>
>> Hi Cristian
>>
>> On Mon, Mar 17, 2014 at 9:43 PM, Cristian Petroaca
>> <cr...@gmail.com> wrote:
>> > 1. Updated to the latest code and it's gone. Cool
>> >
>> > 2. I start the stable launcher -> create a new instance of the
>> > PosChunkerEngine -> add it to the default chain. At this point everything
>> > looks good and works ok.
>> > After I restart the server the default chain is gone and instead I see
>> this
>> > in the enhancement chains page : all-active (default, id: 149, ranking:
>> 0,
>> > impl: AllActiveEnginesChain ). all-active did not contain the 'default'
>> > word before the restart.
>> >
>>
>> Please note the default chain selection rules as described at [1]. You
>> can also access chains chains under '/enhancer/chain/{chain-name}'
>>
>> best
>> Rupert
>>
>> [1]
>> http://stanbol.staging.apache.org/docs/trunk/components/enhancer/chains/#default-chain
>>
>> > It looks like the config files are exactly what I need. Thanks.
>> >
>> >
>> > 2014-03-17 9:26 GMT+02:00 Rupert Westenthaler <
>> rupert.westenthaler@gmail.com
>> >>:
>> >
>> >> On Sat, Mar 15, 2014 at 8:34 PM, Cristian Petroaca
>> >> <cr...@gmail.com> wrote:
>> >> > Thanks Rupert.
>> >> >
>> >> > A couple more questions/issues :
>> >> >
>> >> > 1. Whenever I start the stanbol server I'm seeing this in the console
>> >> > output :
>> >> >
>> >>
>> >> This should be fixed with STANBOL-1278 [1] [2]
>> >>
>> >> > 2. Whenever I restart the server the Weighted Chains get messed up. I
>> >> > usually use the 'default' chain and add my engine to it so there are
>> 11
>> >> > engines in it. After the restart this chain now contains around 23
>> >> engines
>> >> > in total.
>> >>
>> >> I was not able to replicate this. What I tried was
>> >>
>> >> (1) start up the stable launcher
>> >> (2) add an additional engine to the default chain
>> >> (3) restart the launcher
>> >>
>> >> The default chain was not changed after (2) and (3). So I would need
>> >> further information for knowing why this is happening.
>> >>
>> >> Generally it is better to create you own chain instance as modifying
>> >> one that is provided by the default configuration. I would also
>> >> recommend that you keep your test configuration in text files and to
>> >> copy those to the 'stanbol/fileinstall' folder. Doing so prevent you
>> >> from manually entering the configuration after a software update. The
>> >> production-mode section [3] provides information on how to do that.
>> >>
>> >> best
>> >> Rupert
>> >>
>> >> [1] https://issues.apache.org/jira/browse/STANBOL-1278
>> >> [2] http://svn.apache.org/r1576623
>> >> [3] http://stanbol.apache.org/docs/trunk/production-mode
>> >>
>> >> > ERROR: Bundle org.apache.stanbol.enhancer.engine.topic.web [153]:
>> Error
>> >> > starting
>> >> >
>> >>
>>  slinginstall:c:\Data\Projects\Stanbol\main\launchers\stable\target\stanbol\star
>> >> > tup\35\org.apache.stanbol.enhancer.engine.topic.web-1.0.0-SNAPSHOT.jar
>> >> > (org.osgi
>> >> > .framework.BundleException: Unresolved constraint in bundle
>> >> > org.apache.stanbol.e
>> >> > nhancer.engine.topic.web [153]: Unable to resolve 153.0: missing
>> >> > requirement [15
>> >> > 3.0] package; (&(package=javax.ws.rs
>> >> )(version>=0.0.0)(!(version>=2.0.0))))
>> >> > org.osgi.framework.BundleException: Unresolved constraint in bundle
>> >> > org.apache.s
>> >> > tanbol.enhancer.engine.topic.web [153]: Unable to resolve 153.0:
>> missing
>> >> > require
>> >> > ment [153.0] package; (&(package=javax.ws.rs
>> >> > )(version>=0.0.0)(!(version>=2.0.0))
>> >> > )
>> >> >         at
>> >> org.apache.felix.framework.Felix.resolveBundle(Felix.java:3443)
>> >> >         at
>> org.apache.felix.framework.Felix.startBundle(Felix.java:1727)
>> >> >         at
>> >> > org.apache.felix.framework.Felix.setActiveStartLevel(Felix.java:1156)
>> >> >
>> >> >         at
>> >> > org.apache.felix.framework.StartLevelImpl.run(StartLevelImpl.java:264
>> >> > )
>> >> >         at java.lang.Thread.run(Unknown Source)
>> >> >
>> >> > Despite of this the server starts fine and I can use the enhancer
>> fine.
>> >> Do
>> >> > you guys see this as well?
>> >> >
>> >> >
>> >> > 2. Whenever I restart the server the Weighted Chains get messed up. I
>> >> > usually use the 'default' chain and add my engine to it so there are
>> 11
>> >> > engines in it. After the restart this chain now contains around 23
>> >> engines
>> >> > in total.
>> >> >
>> >> >
>> >> >
>> >> >
>> >> > 2014-03-11 9:47 GMT+02:00 Rupert Westenthaler <
>> >> rupert.westenthaler@gmail.com
>> >> >>:
>> >> >
>> >> >> Hi Cristian,
>> >> >>
>> >> >> NER Annotations are typically available as both
>> >> >> NlpAnnotations.NER_ANNOTATION and  fise:TextAnnotation [1] in the
>> >> >> enhancement metadata. As you are already accessing the AnayzedText I
>> >> >> would prefer using the  NlpAnnotations.NER_ANNOTATION.
>> >> >>
>> >> >> best
>> >> >> Rupert
>> >> >>
>> >> >> [1]
>> >> >>
>> >>
>> http://stanbol.apache.org/docs/trunk/components/enhancer/enhancementstructure.html#fisetextannotation
>> >> >>
>> >> >> On Mon, Mar 10, 2014 at 10:07 PM, Cristian Petroaca
>> >> >> <cr...@gmail.com> wrote:
>> >> >> > Thanks.
>> >> >> > I assume I should get the Named entities using the same but with
>> >> >> > NlpAnnotations.NER_ANNOTATION?
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >> > 2014-03-10 13:29 GMT+02:00 Rupert Westenthaler <
>> >> >> > rupert.westenthaler@gmail.com>:
>> >> >> >
>> >> >> >> Hallo Cristian,
>> >> >> >>
>> >> >> >> NounPhrases are not added to the RDF enhancement results. You
>> need to
>> >> >> >> use the AnalyzedText ContentPart [1]
>> >> >> >>
>> >> >> >> here is some demo code you can use in the computeEnhancement
>> method
>> >> >> >>
>> >> >> >>         AnalysedText at = NlpEngineHelper.getAnalysedText(this,
>> ci,
>> >> >> true);
>> >> >> >>         Iterator<? extends Section> sections = at.getSentences();
>> >> >> >>         if(!sections.hasNext()){ //process as single sentence
>> >> >> >>             sections = Collections.singleton(at).iterator();
>> >> >> >>         }
>> >> >> >>
>> >> >> >>         while(sections.hasNext()){
>> >> >> >>             Section section = sections.next();
>> >> >> >>             Iterator<Span> chunks =
>> >> >> >> section.getEnclosed(EnumSet.of(SpanTypeEnum.Chunk));
>> >> >> >>             while(chunks.hasNext()){
>> >> >> >>                 Span chunk = chunks.next();
>> >> >> >>                 Value<PhraseTag> phrase =
>> >> >> >> chunk.getAnnotation(NlpAnnotations.PHRASE_ANNOTATION);
>> >> >> >>                 if(phrase.value().getCategory() ==
>> >> >> LexicalCategory.Noun){
>> >> >> >>                     log.info(" - NounPhrase [{},{}] {}", new
>> >> Object[]{
>> >> >> >>
>> >> >> >> chunk.getStart(),chunk.getEnd(),chunk.getSpan()});
>> >> >> >>                 }
>> >> >> >>             }
>> >> >> >>         }
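The quoted snippet above, expanded a bit into a sketch that adds a null check on the phrase annotation and the NlpAnnotations.NER_ANNOTATION lookup that comes up later in this thread. Class and package names are written from memory and should be checked against the enhancer.nlp module:

    // sketch only: same loop as the demo code above, plus null checks and a NER lookup
    AnalysedText at = NlpEngineHelper.getAnalysedText(this, ci, true);
    Iterator<? extends Section> sections = at.getSentences();
    if (!sections.hasNext()) { // no sentence annotations -> process the whole text as one section
        sections = Collections.singleton(at).iterator();
    }
    while (sections.hasNext()) {
        Section section = sections.next();
        Iterator<Span> chunks = section.getEnclosed(EnumSet.of(SpanTypeEnum.Chunk));
        while (chunks.hasNext()) {
            Span chunk = chunks.next();
            // noun phrases: Chunk spans carrying a PHRASE_ANNOTATION of category Noun
            Value<PhraseTag> phrase = chunk.getAnnotation(NlpAnnotations.PHRASE_ANNOTATION);
            if (phrase != null && phrase.value().getCategory() == LexicalCategory.Noun) {
                log.info(" - NounPhrase [{},{}] {}", new Object[]{
                        chunk.getStart(), chunk.getEnd(), chunk.getSpan()});
            }
            // named entity mentions: Chunk spans carrying a NER_ANNOTATION (as far as I remember)
            Value<NerTag> ner = chunk.getAnnotation(NlpAnnotations.NER_ANNOTATION);
            if (ner != null) {
                log.info(" - NamedEntity [{},{}] {} ({})", new Object[]{
                        chunk.getStart(), chunk.getEnd(), chunk.getSpan(), ner.value().getType()});
            }
        }
    }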
>> >> >> >>
>> >> >> >> hope this helps
>> >> >> >>
>> >> >> >> best
>> >> >> >> Rupert
>> >> >> >>
>> >> >> >> [1]
>> >> >> >>
>> >> >>
>> >>
>> http://stanbol.apache.org/docs/trunk/components/enhancer/nlp/analyzedtext
>> >> >> >>
>> >> >> >> On Sun, Mar 9, 2014 at 6:07 PM, Cristian Petroaca
>> >> >> >> <cr...@gmail.com> wrote:
>> >> >> >> > I started to implement the engine and I'm having problems with
>> >> getting
>> >> >> >> > results for noun phrases. I modified the "default" weighted
>> chain
>> >> to
>> >> >> also
>> >> >> >> > include the PosChunkerEngine and ran a sample text : "Angela
>> Merkel
>> >> >> >> visted
>> >> >> >> > China. The german chancellor met with various people". I
>> expected
>> >> that
>> >> >> >> the
>> >> >> >> > RDF XML output would contain some info about the noun phrases
>> but I
>> >> >> >> cannot
>> >> >> >> > see any.
>> >> >> >> > Could you point me to the correct way to generate the noun
>> phrases?
>> >> >> >> >
>> >> >> >> > Thanks,
>> >> >> >> > Cristian
>> >> >> >> >
>> >> >> >> >
>> >> >> >> > 2014-02-09 14:15 GMT+02:00 Cristian Petroaca <
>> >> >> >> cristian.petroaca@gmail.com>:
>> >> >> >> >
>> >> >> >> >> Opened https://issues.apache.org/jira/browse/STANBOL-1279
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >> 2014-02-07 10:53 GMT+02:00 Cristian Petroaca <
>> >> >> >> cristian.petroaca@gmail.com>
>> >> >> >> >> :
>> >> >> >> >>
>> >> >> >> >> Hi Rupert,
>> >> >> >> >>>
>> >> >> >> >>> The "spatial" dimension is a good idea. I'll also take a look
>> at
>> >> >> Yago.
>> >> >> >> >>>
>> >> >> >> >>> I will create a Jira with what we talked about here. It will
>> >> >> probably
>> >> >> >> >>> have just a draft-like description for now and will be updated
>> >> as I
>> >> >> go
>> >> >> >> >>> along.
>> >> >> >> >>>
>> >> >> >> >>> Thanks,
>> >> >> >> >>> Cristian
>> >> >> >> >>>
>> >> >> >> >>>
>> >> >> >> >>> 2014-02-06 15:39 GMT+02:00 Rupert Westenthaler <
>> >> >> >> >>> rupert.westenthaler@gmail.com>:
>> >> >> >> >>>
>> >> >> >> >>> Hi Cristian,
>> >> >> >> >>>>
>> >> >> >> >>>> definitely an interesting approach. You should have a look at
>> >> Yago2
>> >> >> >> >>>> [1]. As far as I can remember the Yago taxonomy is much
>> better
>> >> >> >> >>>> structured as the one used by dbpedia. Mapping suggestions of
>> >> >> dbpedia
>> >> >> >> >>>> to concepts in Yago2 is easy as both dbpedia and yago2 do
>> >> provide
>> >> >> >> >>>> mappings [2] and [3]
>> >> >> >> >>>>
>> >> >> >> >>>> > 2014-02-05 15:39 GMT+02:00 Rafa Haro <rh...@apache.org>:
>> >> >> >> >>>> >>
>> >> >> >> >>>> >> "Microsoft posted its 2013 earnings. The Redmond's company
>> >> made
>> >> >> a
>> >> >> >> >>>> >> huge profit".
>> >> >> >> >>>>
>> >> >> >> >>>> Thats actually a very good example. Spatial contexts are very
>> >> >> >> >>>> important as they tend to be often used for referencing. So I
>> >> would
>> >> >> >> >>>> suggest to specially treat the spatial context. For spatial
>> >> >> Entities
>> >> >> >> >>>> (like a City) this is easy, but even for other (like a
>> Person,
>> >> >> >> >>>> Company) you could use relations to spatial entities define
>> >> their
>> >> >> >> >>>> spatial context. This context could than be used to correctly
>> >> link
>> >> >> >> >>>> "The Redmond's company" to "Microsoft".
>> >> >> >> >>>>
>> >> >> >> >>>> In addition I would suggest to use the "spatial" context of
>> each
>> >> >> >> >>>> entity (basically relation to entities that are cities,
>> regions,
>> >> >> >> >>>> countries) as a separate dimension, because those are very
>> often
>> >> >> used
>> >> >> >> >>>> for coreferences.
>> >> >> >> >>>>
>> >> >> >> >>>> [1] http://www.mpi-inf.mpg.de/yago-naga/yago/
>> >> >> >> >>>> [2] http://downloads.dbpedia.org/3.9/links/yago_links.nt.bz2
>> >> >> >> >>>> [3]
>> >> >> >> >>>>
>> >> >> >>
>> >> >>
>> >>
>> http://www.mpi-inf.mpg.de/yago-naga/yago/download/yago/yagoDBpediaInstances.ttl.7z
>> >> >> >> >>>>
>> >> >> >> >>>>
>> >> >> >> >>>> On Thu, Feb 6, 2014 at 10:33 AM, Cristian Petroaca
>> >> >> >> >>>> <cr...@gmail.com> wrote:
>> >> >> >> >>>> > There are several dbpedia categories for each entity, in
>> this
>> >> >> case
>> >> >> >> for
>> >> >> >> >>>> > Microsoft we have :
>> >> >> >> >>>> >
>> >> >> >> >>>> > category:Companies_in_the_NASDAQ-100_Index
>> >> >> >> >>>> > category:Microsoft
>> >> >> >> >>>> > category:Software_companies_of_the_United_States
>> >> >> >> >>>> > category:Software_companies_based_in_Washington_(state)
>> >> >> >> >>>> > category:Companies_established_in_1975
>> >> >> >> >>>> > category:1975_establishments_in_the_United_States
>> >> >> >> >>>> > category:Companies_based_in_Redmond,_Washington
>> >> >> >> >>>> >
>> >> >> category:Multinational_companies_headquartered_in_the_United_States
>> >> >> >> >>>> > category:Cloud_computing_providers
>> >> >> >> >>>> > category:Companies_in_the_Dow_Jones_Industrial_Average
>> >> >> >> >>>> >
>> >> >> >> >>>> > So we also have "Companies based in Redmont,Washington"
>> which
>> >> >> could
>> >> >> >> be
>> >> >> >> >>>> > matched.
>> >> >> >> >>>> >
>> >> >> >> >>>> >
>> >> >> >> >>>> > There is still other contextual information from dbpedia
>> which
>> >> >> can
>> >> >> >> be
>> >> >> >> >>>> used.
>> >> >> >> >>>> > For example for an Organization we could also include :
>> >> >> >> >>>> > dbpprop:industry = Software
>> >> >> >> >>>> > dbpprop:service = Online Service Providers
>> >> >> >> >>>> >
>> >> >> >> >>>> > and for a Person (that's for Barack Obama) :
>> >> >> >> >>>> >
>> >> >> >> >>>> > dbpedia-owl:profession:
>> >> >> >> >>>> >                                dbpedia:Author
>> >> >> >> >>>> >                                dbpedia:Constitutional_law
>> >> >> >> >>>> >                                dbpedia:Lawyer
>> >> >> >> >>>> >                                dbpedia:Community_organizing
>> >> >> >> >>>> >
>> >> >> >> >>>> > I'd like to continue investigating this as I think that it
>> may
>> >> >> have
>> >> >> >> >>>> some
>> >> >> >> >>>> > value in increasing the number of coreference resolutions
>> and
>> >> I'd
>> >> >> >> like
>> >> >> >> >>>> to
>> >> >> >> >>>> > concentrate more on precision rather than recall since we
>> >> already
>> >> >> >> have
>> >> >> >> >>>> a
>> >> >> >> >>>> > set of coreferences detected by the stanford nlp tool and
>> this
>> >> >> would
>> >> >> >> >>>> be as
>> >> >> >> >>>> > an addition to that (at least this is how I would like to
>> use
>> >> >> it).
>> >> >> >> >>>> >
>> >> >> >> >>>> > Is it ok if I track this by opening a jira? I could update
>> it
>> >> to
>> >> >> >> show
>> >> >> >> >>>> my
>> >> >> >> >>>> > progress and also my conclusions and if it turns out that
>> it
>> >> was
>> >> >> a
>> >> >> >> bad
>> >> >> >> >>>> idea
>> >> >> >> >>>> > then that's the situation at least I'll end up with more
>> >> >> knowledge
>> >> >> >> >>>> about
>> >> >> >> >>>> > Stanbol in the end :).
>> >> >> >> >>>> >
>> >> >> >> >>>> >
>> >> >> >> >>>> > 2014-02-05 15:39 GMT+02:00 Rafa Haro <rh...@apache.org>:
>> >> >> >> >>>> >
>> >> >> >> >>>> >> Hi Cristian,
>> >> >> >> >>>> >>
>> >> >> >> >>>> >> The approach sounds nice. I don't want to be the devil's
>> >> >> advocate
>> >> >> >> but
>> >> >> >> >>>> I'm
>> >> >> >> >>>> >> just not sure about the recall using the dbpedia
>> categories
>> >> >> >> feature.
>> >> >> >> >>>> For
>> >> >> >> >>>> >> example, your sentence could be also "Microsoft posted its
>> >> 2013
>> >> >> >> >>>> earnings.
>> >> >> >> >>>> >> The Redmond's company made a huge profit". So, maybe
>> >> including
>> >> >> more
>> >> >> >> >>>> >> contextual information from dbpedia could increase the
>> recall
>> >> >> but
>> >> >> >> of
>> >> >> >> >>>> course
>> >> >> >> >>>> >> will reduce the precision.
>> >> >> >> >>>> >>
>> >> >> >> >>>> >> Cheers,
>> >> >> >> >>>> >> Rafa
>> >> >> >> >>>> >>
>> >> >> >> >>>> >> El 04/02/14 09:50, Cristian Petroaca escribió:
>> >> >> >> >>>> >>
>> >> >> >> >>>> >>  Back with a more detailed description of the steps for
>> >> making
>> >> >> this
>> >> >> >> >>>> kind of
>> >> >> >> >>>> >>> coreference work.
>> >> >> >> >>>> >>>
>> >> >> >> >>>> >>> I will be using references to the following text in the
>> >> steps
>> >> >> >> below
>> >> >> >> >>>> in
>> >> >> >> >>>> >>> order to make things clearer : "Microsoft posted its 2013
>> >> >> >> earnings.
>> >> >> >> >>>> The
>> >> >> >> >>>> >>> software company made a huge profit."
>> >> >> >> >>>> >>>
>> >> >> >> >>>> >>> 1. For every noun phrase in the text which has :
>> >> >> >> >>>> >>>      a. a determinate pos which implies reference to an
>> >> entity
>> >> >> >> local
>> >> >> >> >>>> to
>> >> >> >> >>>> >>> the
>> >> >> >> >>>> >>> text, such as "the, this, these") but not "another,
>> every",
>> >> etc
>> >> >> >> which
>> >> >> >> >>>> >>> implies a reference to an entity outside of the text.
>> >> >> >> >>>> >>>      b. having at least another noun aside from the main
>> >> >> required
>> >> >> >> >>>> noun
>> >> >> >> >>>> >>> which
>> >> >> >> >>>> >>> further describes it. For example I will not count "The
>> >> >> company"
>> >> >> >> as
>> >> >> >> >>>> being
>> >> >> >> >>>> >>> a
>> >> >> >> >>>> >>> legitimate candidate since this could create a lot of
>> false
>> >> >> >> >>>> positives by
>> >> >> >> >>>> >>> considering the double meaning of some words such as "in
>> the
>> >> >> >> company
>> >> >> >> >>>> of
>> >> >> >> >>>> >>> good people".
>> >> >> >> >>>> >>> "The software company" is a good candidate since we also
>> >> have
>> >> >> >> >>>> "software".
>> >> >> >> >>>> >>>
>> >> >> >> >>>> >>> 2. match the nouns in the noun phrase to the contents of
>> the
>> >> >> >> dbpedia
>> >> >> >> >>>> >>> categories of each named entity found prior to the
>> location
>> >> of
>> >> >> the
>> >> >> >> >>>> noun
>> >> >> >> >>>> >>> phrase in the text.
>> >> >> >> >>>> >>> The dbpedia categories are in the following format (for
>> >> >> Microsoft
>> >> >> >> for
>> >> >> >> >>>> >>> example) : "Software companies of the United States".
>> >> >> >> >>>> >>>   So we try to match "software company" with that.
>> >> >> >> >>>> >>> First, as you can see, the main noun in the dbpedia
>> category
>> >> >> has a
>> >> >> >> >>>> plural
>> >> >> >> >>>> >>> form and it's the same for all categories which I saw. I
>> >> don't
>> >> >> >> know
>> >> >> >> >>>> if
>> >> >> >> >>>> >>> there's an easier way to do this but I thought of
>> applying a
>> >> >> >> >>>> lemmatizer on
>> >> >> >> >>>> >>> the category and the noun phrase in order for them to
>> have a
>> >> >> >> common
>> >> >> >> >>>> >>> denominator.This also works if the noun phrase itself
>> has a
>> >> >> plural
>> >> >> >> >>>> form.
>> >> >> >> >>>> >>>
>> >> >> >> >>>> >>> Second, I'll need to use for comparison only the words in
>> >> the
>> >> >> >> >>>> category
>> >> >> >> >>>> >>> which are themselves nouns and not prepositions or
>> >> determiners
>> >> >> >> such
>> >> >> >> >>>> as "of
>> >> >> >> >>>> >>> the".This means that I need to pos tag the categories
>> >> contents
>> >> >> as
>> >> >> >> >>>> well.
>> >> >> >> >>>> >>> I was thinking of running the pos and lemma on the
>> dbpedia
>> >> >> >> >>>> categories when
>> >> >> >> >>>> >>> building the dbpedia backed entity hub and storing them
>> for
>> >> >> later
>> >> >> >> >>>> use - I
>> >> >> >> >>>> >>> don't know how feasible this is at the moment.
>> >> >> >> >>>> >>>
>> >> >> >> >>>> >>> After this I can compare each noun in the noun phrase
>> with
>> >> the
>> >> >> >> >>>> equivalent
>> >> >> >> >>>> >>> nouns in the categories and based on the number of
>> matches I
>> >> >> can
>> >> >> >> >>>> create a
>> >> >> >> >>>> >>> confidence level.
>> >> >> >> >>>> >>>
>> >> >> >> >>>> >>> 3. match the noun of the noun phrase with the rdf:type
>> from
>> >> >> >> dbpedia
>> >> >> >> >>>> of the
>> >> >> >> >>>> >>> named entity. If this matches increase the confidence
>> level.
>> >> >> >> >>>> >>>
>> >> >> >> >>>> >>> 4. If there are multiple named entities which can match a
>> >> >> certain
>> >> >> >> >>>> noun
>> >> >> >> >>>> >>> phrase then link the noun phrase with the closest named
>> >> entity
>> >> >> >> prior
>> >> >> >> >>>> to it
>> >> >> >> >>>> >>> in the text.
>> >> >> >> >>>> >>>
>> >> >> >> >>>> >>> What do you think?
>> >> >> >> >>>> >>>
>> >> >> >> >>>> >>> Cristian
>> >> >> >> >>>> >>>
>> >> >> >> >>>> >>> 2014-01-31 Cristian Petroaca <
>> cristian.petroaca@gmail.com>:
>> >> >> >> >>>> >>>
>> >> >> >> >>>> >>>  Hi Rafa,
>> >> >> >> >>>> >>>>
>> >> >> >> >>>> >>>> I don't yet have a concrete heursitic but I'm working on
>> >> it.
>> >> >> I'll
>> >> >> >> >>>> provide
>> >> >> >> >>>> >>>> it here so that you guys can give me a feedback on it.
>> >> >> >> >>>> >>>>
>> >> >> >> >>>> >>>> What are "locality" features?
>> >> >> >> >>>> >>>>
>> >> >> >> >>>> >>>> I looked at Bart and other coref tools such as ArkRef
>> and
>> >> >> >> >>>> CherryPicker
>> >> >> >> >>>> >>>> and
>> >> >> >> >>>> >>>> they don't provide such a coreference.
>> >> >> >> >>>> >>>>
>> >> >> >> >>>> >>>> Cristian
>> >> >> >> >>>> >>>>
>> >> >> >> >>>> >>>>
>> >> >> >> >>>> >>>> 2014-01-30 Rafa Haro <rh...@apache.org>:
>> >> >> >> >>>> >>>>
>> >> >> >> >>>> >>>> Hi Cristian,
>> >> >> >> >>>> >>>>
>> >> >> >> >>>> >>>>> Without having more details about your concrete
>> heuristic,
>> >> >> in my
>> >> >> >> >>>> honest
>> >> >> >> >>>> >>>>> opinion, such approach could produce a lot of false
>> >> >> positives. I
>> >> >> >> >>>> don't
>> >> >> >> >>>> >>>>> know
>> >> >> >> >>>> >>>>> if you are planning to use some "locality" features to
>> >> detect
>> >> >> >> such
>> >> >> >> >>>> >>>>> coreferences but you need to take into account that it
>> is
>> >> >> quite
>> >> >> >> >>>> usual
>> >> >> >> >>>> >>>>> that
>> >> >> >> >>>> >>>>> coreferenced mentions can occurs even in different
>> >> >> paragraphs.
>> >> >> >> >>>> Although
>> >> >> >> >>>> >>>>> I'm
>> >> >> >> >>>> >>>>> not an expert in Natural Language Understanding, I
>> would
>> >> say
>> >> >> it
>> >> >> >> is
>> >> >> >> >>>> quite
>> >> >> >> >>>> >>>>> difficult to get decent precision/recall rates for
>> >> >> coreferencing
>> >> >> >> >>>> using
>> >> >> >> >>>> >>>>> fixed rules. Maybe you can give a try to others tools
>> like
>> >> >> BART
>> >> >> >> (
>> >> >> >> >>>> >>>>> http://www.bart-coref.org/).
>> >> >> >> >>>> >>>>>
>> >> >> >> >>>> >>>>> Cheers,
>> >> >> >> >>>> >>>>> Rafa Haro
>> >> >> >> >>>> >>>>>
>> >> >> >> >>>> >>>>> El 30/01/14 10:33, Cristian Petroaca escribió:
>> >> >> >> >>>> >>>>>
>> >> >> >> >>>> >>>>>   Hi,
>> >> >> >> >>>> >>>>>
>> >> >> >> >>>> >>>>>> One of the necessary steps for implementing the Event
>> >> >> >> extraction
>> >> >> >> >>>> Engine
>> >> >> >> >>>> >>>>>> feature :
>> >> >> https://issues.apache.org/jira/browse/STANBOL-1121is
>> >> >> >> >>>> to
>> >> >> >> >>>> >>>>>> have
>> >> >> >> >>>> >>>>>> coreference resolution in the given text. This is
>> >> provided
>> >> >> now
>> >> >> >> >>>> via the
>> >> >> >> >>>> >>>>>> stanford-nlp project but as far as I saw this module
>> is
>> >> >> >> performing
>> >> >> >> >>>> >>>>>> mostly
>> >> >> >> >>>> >>>>>> pronomial (He, She) or nominal (Barack Obama and Mr.
>> >> Obama)
>> >> >> >> >>>> coreference
>> >> >> >> >>>> >>>>>> resolution.
>> >> >> >> >>>> >>>>>>
>> >> >> >> >>>> >>>>>> In order to get more coreferences from the text I
>> though
>> >> of
>> >> >> >> >>>> creating
>> >> >> >> >>>> >>>>>> some
>> >> >> >> >>>> >>>>>> logic that would detect this kind of coreference :
>> >> >> >> >>>> >>>>>> "Apple reaches new profit heights. The software
>> company
>> >> just
>> >> >> >> >>>> announced
>> >> >> >> >>>> >>>>>> its
>> >> >> >> >>>> >>>>>> 2013 earnings."
>> >> >> >> >>>> >>>>>> Here "The software company" obviously refers to
>> "Apple".
>> >> >> >> >>>> >>>>>> So I'd like to detect coreferences of Named Entities
>> >> which
>> >> >> are
>> >> >> >> of
>> >> >> >> >>>> the
>> >> >> >> >>>> >>>>>> rdf:type of the Named Entity , in this case "company"
>> and
>> >> >> also
>> >> >> >> >>>> have
>> >> >> >> >>>> >>>>>> attributes which can be found in the dbpedia
>> categories
>> >> of
>> >> >> the
>> >> >> >> >>>> named
>> >> >> >> >>>> >>>>>> entity, in this case "software".
>> >> >> >> >>>> >>>>>>
>> >> >> >> >>>> >>>>>> The detection of coreferences such as "The software
>> >> >> company" in
>> >> >> >> >>>> the
>> >> >> >> >>>> >>>>>> text
>> >> >> >> >>>> >>>>>> would also be done by either using the new Pos Tag
>> Based
>> >> >> Phrase
>> >> >> >> >>>> >>>>>> extraction
>> >> >> >> >>>> >>>>>> Engine (noun phrases) or by using a dependency tree of
>> >> the
>> >> >> >> >>>> sentence and
>> >> >> >> >>>> >>>>>> picking up only subjects or objects.
>> >> >> >> >>>> >>>>>>
>> >> >> >> >>>> >>>>>> At this point I'd like to know if this kind of logic
>> >> would
>> >> >> be
>> >> >> >> >>>> useful
>> >> >> >> >>>> >>>>>> as a
>> >> >> >> >>>> >>>>>> separate Enhancement Engine (in case the precision and
>> >> >> recall
>> >> >> >> are
>> >> >> >> >>>> good
>> >> >> >> >>>> >>>>>> enough) in Stanbol?
>> >> >> >> >>>> >>>>>>
>> >> >> >> >>>> >>>>>> Thanks,
>> >> >> >> >>>> >>>>>> Cristian
>> >> >> >> >>>> >>>>>>
>> >> >> >> >>>> >>>>>>
>> >> >> >> >>>> >>>>>>
>> >> >> >> >>>> >>
>> >> >> >> >>>>
>> >> >> >> >>>>
>> >> >> >> >>>>
>> >> >> >> >>>> --
>> >> >> >> >>>> | Rupert Westenthaler
>> rupert.westenthaler@gmail.com
>> >> >> >> >>>> | Bodenlehenstraße 11
>> >> >> ++43-699-11108907
>> >> >> >> >>>> | A-5500 Bischofshofen
>> >> >> >> >>>>
>> >> >> >> >>>
>> >> >> >> >>>
>> >> >> >> >>
>> >> >> >>
>> >> >> >>
>> >> >> >>
>> >> >> >> --
>> >> >> >> | Rupert Westenthaler             rupert.westenthaler@gmail.com
>> >> >> >> | Bodenlehenstraße 11
>> ++43-699-11108907
>> >> >> >> | A-5500 Bischofshofen
>> >> >> >>
>> >> >>
>> >> >>
>> >> >>
>> >> >> --
>> >> >> | Rupert Westenthaler             rupert.westenthaler@gmail.com
>> >> >> | Bodenlehenstraße 11                             ++43-699-11108907
>> >> >> | A-5500 Bischofshofen
>> >> >>
>> >>
>> >>
>> >>
>> >> --
>> >> | Rupert Westenthaler             rupert.westenthaler@gmail.com
>> >> | Bodenlehenstraße 11                             ++43-699-11108907
>> >> | A-5500 Bischofshofen
>> >>
>>
>>
>>
>> --
>> | Rupert Westenthaler             rupert.westenthaler@gmail.com
>> | Bodenlehenstraße 11                             ++43-699-11108907
>> | A-5500 Bischofshofen
>>



-- 
| Rupert Westenthaler             rupert.westenthaler@gmail.com
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen

Re: Named entity coref resolution based on dbpedia categories and rdf:type

Posted by Cristian Petroaca <cr...@gmail.com>.
Related to the default chain selection rules: before the restart I had a chain
named 'default', meaning I could access it via enhancer/chain/default.
Then I just added another engine to the 'default' chain. I assumed that
after the restart the chain with the 'default' name would be persisted, so
the first rule should have applied after the restart as well. But
instead I cannot reach it via enhancer/chain/default anymore, so it's gone.
Anyway, this is not a big deal and it's not blocking me in any way; I just
wanted to understand where the problem is.
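
For what it's worth, a quick way to check whether a chain is still registered after a restart (assuming the default launcher port 8080; adjust as needed):

    curl -i -X POST -H "Content-Type: text/plain" \
         -H "Accept: application/rdf+xml" \
         --data "Microsoft posted its 2013 earnings." \
         http://localhost:8080/enhancer/chain/default

A 404 response there means no chain is registered under the 'default' name anymore.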


2014-03-18 7:15 GMT+02:00 Rupert Westenthaler <rupert.westenthaler@gmail.com
>:

> Hi Cristian
>
> On Mon, Mar 17, 2014 at 9:43 PM, Cristian Petroaca
> <cr...@gmail.com> wrote:
> > 1. Updated to the latest code and it's gone. Cool
> >
> > 2. I start the stable launcher -> create a new instance of the
> > PosChunkerEngine -> add it to the default chain. At this point everything
> > looks good and works ok.
> > After I restart the server the default chain is gone and instead I see
> this
> > in the enhancement chains page : all-active (default, id: 149, ranking:
> 0,
> > impl: AllActiveEnginesChain ). all-active did not contain the 'default'
> > word before the restart.
> >
>
> Please note the default chain selection rules as described at [1]. You
> can also access chains chains under '/enhancer/chain/{chain-name}'
>
> best
> Rupert
>
> [1]
> http://stanbol.staging.apache.org/docs/trunk/components/enhancer/chains/#default-chain
>
> > It looks like the config files are exactly what I need. Thanks.
> >
> >
> > 2014-03-17 9:26 GMT+02:00 Rupert Westenthaler <
> rupert.westenthaler@gmail.com
> >>:
> >
> >> On Sat, Mar 15, 2014 at 8:34 PM, Cristian Petroaca
> >> <cr...@gmail.com> wrote:
> >> > Thanks Rupert.
> >> >
> >> > A couple more questions/issues :
> >> >
> >> > 1. Whenever I start the stanbol server I'm seeing this in the console
> >> > output :
> >> >
> >>
> >> This should be fixed with STANBOL-1278 [1] [2]
> >>
> >> > 2. Whenever I restart the server the Weighted Chains get messed up. I
> >> > usually use the 'default' chain and add my engine to it so there are
> 11
> >> > engines in it. After the restart this chain now contains around 23
> >> engines
> >> > in total.
> >>
> >> I was not able to replicate this. What I tried was
> >>
> >> (1) start up the stable launcher
> >> (2) add an additional engine to the default chain
> >> (3) restart the launcher
> >>
> >> The default chain was not changed after (2) and (3). So I would need
> >> further information for knowing why this is happening.
> >>
> >> Generally it is better to create you own chain instance as modifying
> >> one that is provided by the default configuration. I would also
> >> recommend that you keep your test configuration in text files and to
> >> copy those to the 'stanbol/fileinstall' folder. Doing so prevent you
> >> from manually entering the configuration after a software update. The
> >> production-mode section [3] provides information on how to do that.
> >>
> >> best
> >> Rupert
> >>
> >> [1] https://issues.apache.org/jira/browse/STANBOL-1278
> >> [2] http://svn.apache.org/r1576623
> >> [3] http://stanbol.apache.org/docs/trunk/production-mode
> >>
> >> > ERROR: Bundle org.apache.stanbol.enhancer.engine.topic.web [153]:
> Error
> >> > starting
> >> >
> >>
>  slinginstall:c:\Data\Projects\Stanbol\main\launchers\stable\target\stanbol\star
> >> > tup\35\org.apache.stanbol.enhancer.engine.topic.web-1.0.0-SNAPSHOT.jar
> >> > (org.osgi
> >> > .framework.BundleException: Unresolved constraint in bundle
> >> > org.apache.stanbol.e
> >> > nhancer.engine.topic.web [153]: Unable to resolve 153.0: missing
> >> > requirement [15
> >> > 3.0] package; (&(package=javax.ws.rs
> >> )(version>=0.0.0)(!(version>=2.0.0))))
> >> > org.osgi.framework.BundleException: Unresolved constraint in bundle
> >> > org.apache.s
> >> > tanbol.enhancer.engine.topic.web [153]: Unable to resolve 153.0:
> missing
> >> > require
> >> > ment [153.0] package; (&(package=javax.ws.rs
> >> > )(version>=0.0.0)(!(version>=2.0.0))
> >> > )
> >> >         at
> >> org.apache.felix.framework.Felix.resolveBundle(Felix.java:3443)
> >> >         at
> org.apache.felix.framework.Felix.startBundle(Felix.java:1727)
> >> >         at
> >> > org.apache.felix.framework.Felix.setActiveStartLevel(Felix.java:1156)
> >> >
> >> >         at
> >> > org.apache.felix.framework.StartLevelImpl.run(StartLevelImpl.java:264
> >> > )
> >> >         at java.lang.Thread.run(Unknown Source)
> >> >
> >> > Despite of this the server starts fine and I can use the enhancer
> fine.
> >> Do
> >> > you guys see this as well?
> >> >
> >> >
> >> > 2. Whenever I restart the server the Weighted Chains get messed up. I
> >> > usually use the 'default' chain and add my engine to it so there are
> 11
> >> > engines in it. After the restart this chain now contains around 23
> >> engines
> >> > in total.
> >> >
> >> >
> >> >
> >> >
> >> > 2014-03-11 9:47 GMT+02:00 Rupert Westenthaler <
> >> rupert.westenthaler@gmail.com
> >> >>:
> >> >
> >> >> Hi Cristian,
> >> >>
> >> >> NER Annotations are typically available as both
> >> >> NlpAnnotations.NER_ANNOTATION and  fise:TextAnnotation [1] in the
> >> >> enhancement metadata. As you are already accessing the AnayzedText I
> >> >> would prefer using the  NlpAnnotations.NER_ANNOTATION.
> >> >>
> >> >> best
> >> >> Rupert
> >> >>
> >> >> [1]
> >> >>
> >>
> http://stanbol.apache.org/docs/trunk/components/enhancer/enhancementstructure.html#fisetextannotation
> >> >>
> >> >> On Mon, Mar 10, 2014 at 10:07 PM, Cristian Petroaca
> >> >> <cr...@gmail.com> wrote:
> >> >> > Thanks.
> >> >> > I assume I should get the Named entities using the same but with
> >> >> > NlpAnnotations.NER_ANNOTATION?
> >> >> >
> >> >> >
> >> >> >
> >> >> > 2014-03-10 13:29 GMT+02:00 Rupert Westenthaler <
> >> >> > rupert.westenthaler@gmail.com>:
> >> >> >
> >> >> >> Hallo Cristian,
> >> >> >>
> >> >> >> NounPhrases are not added to the RDF enhancement results. You
> need to
> >> >> >> use the AnalyzedText ContentPart [1]
> >> >> >>
> >> >> >> here is some demo code you can use in the computeEnhancement
> method
> >> >> >>
> >> >> >>         AnalysedText at = NlpEngineHelper.getAnalysedText(this,
> ci,
> >> >> true);
> >> >> >>         Iterator<? extends Section> sections = at.getSentences();
> >> >> >>         if(!sections.hasNext()){ //process as single sentence
> >> >> >>             sections = Collections.singleton(at).iterator();
> >> >> >>         }
> >> >> >>
> >> >> >>         while(sections.hasNext()){
> >> >> >>             Section section = sections.next();
> >> >> >>             Iterator<Span> chunks =
> >> >> >> section.getEnclosed(EnumSet.of(SpanTypeEnum.Chunk));
> >> >> >>             while(chunks.hasNext()){
> >> >> >>                 Span chunk = chunks.next();
> >> >> >>                 Value<PhraseTag> phrase =
> >> >> >> chunk.getAnnotation(NlpAnnotations.PHRASE_ANNOTATION);
> >> >> >>                 if(phrase.value().getCategory() ==
> >> >> LexicalCategory.Noun){
> >> >> >>                     log.info(" - NounPhrase [{},{}] {}", new
> >> Object[]{
> >> >> >>
> >> >> >> chunk.getStart(),chunk.getEnd(),chunk.getSpan()});
> >> >> >>                 }
> >> >> >>             }
> >> >> >>         }
> >> >> >>
> >> >> >> hope this helps
> >> >> >>
> >> >> >> best
> >> >> >> Rupert
> >> >> >>
> >> >> >> [1]
> >> >> >>
> >> >>
> >>
> http://stanbol.apache.org/docs/trunk/components/enhancer/nlp/analyzedtext
> >> >> >>
> >> >> >> On Sun, Mar 9, 2014 at 6:07 PM, Cristian Petroaca
> >> >> >> <cr...@gmail.com> wrote:
> >> >> >> > I started to implement the engine and I'm having problems with
> >> getting
> >> >> >> > results for noun phrases. I modified the "default" weighted
> chain
> >> to
> >> >> also
> >> >> >> > include the PosChunkerEngine and ran a sample text : "Angela
> Merkel
> >> >> >> visted
> >> >> >> > China. The german chancellor met with various people". I
> expected
> >> that
> >> >> >> the
> >> >> >> > RDF XML output would contain some info about the noun phrases
> but I
> >> >> >> cannot
> >> >> >> > see any.
> >> >> >> > Could you point me to the correct way to generate the noun
> phrases?
> >> >> >> >
> >> >> >> > Thanks,
> >> >> >> > Cristian
> >> >> >> >
> >> >> >> >
> >> >> >> > 2014-02-09 14:15 GMT+02:00 Cristian Petroaca <
> >> >> >> cristian.petroaca@gmail.com>:
> >> >> >> >
> >> >> >> >> Opened https://issues.apache.org/jira/browse/STANBOL-1279
> >> >> >> >>
> >> >> >> >>
> >> >> >> >> 2014-02-07 10:53 GMT+02:00 Cristian Petroaca <
> >> >> >> cristian.petroaca@gmail.com>
> >> >> >> >> :
> >> >> >> >>
> >> >> >> >> Hi Rupert,
> >> >> >> >>>
> >> >> >> >>> The "spatial" dimension is a good idea. I'll also take a look
> at
> >> >> Yago.
> >> >> >> >>>
> >> >> >> >>> I will create a Jira with what we talked about here. It will
> >> >> probably
> >> >> >> >>> have just a draft-like description for now and will be updated
> >> as I
> >> >> go
> >> >> >> >>> along.
> >> >> >> >>>
> >> >> >> >>> Thanks,
> >> >> >> >>> Cristian
> >> >> >> >>>
> >> >> >> >>>
> >> >> >> >>> 2014-02-06 15:39 GMT+02:00 Rupert Westenthaler <
> >> >> >> >>> rupert.westenthaler@gmail.com>:
> >> >> >> >>>
> >> >> >> >>> Hi Cristian,
> >> >> >> >>>>
> >> >> >> >>>> definitely an interesting approach. You should have a look at
> >> Yago2
> >> >> >> >>>> [1]. As far as I can remember the Yago taxonomy is much
> better
> >> >> >> >>>> structured as the one used by dbpedia. Mapping suggestions of
> >> >> dbpedia
> >> >> >> >>>> to concepts in Yago2 is easy as both dbpedia and yago2 do
> >> provide
> >> >> >> >>>> mappings [2] and [3]
> >> >> >> >>>>
> >> >> >> >>>> > 2014-02-05 15:39 GMT+02:00 Rafa Haro <rh...@apache.org>:
> >> >> >> >>>> >>
> >> >> >> >>>> >> "Microsoft posted its 2013 earnings. The Redmond's company
> >> made
> >> >> a
> >> >> >> >>>> >> huge profit".
> >> >> >> >>>>
> >> >> >> >>>> Thats actually a very good example. Spatial contexts are very
> >> >> >> >>>> important as they tend to be often used for referencing. So I
> >> would
> >> >> >> >>>> suggest to specially treat the spatial context. For spatial
> >> >> Entities
> >> >> >> >>>> (like a City) this is easy, but even for other (like a
> Person,
> >> >> >> >>>> Company) you could use relations to spatial entities define
> >> their
> >> >> >> >>>> spatial context. This context could than be used to correctly
> >> link
> >> >> >> >>>> "The Redmond's company" to "Microsoft".
> >> >> >> >>>>
> >> >> >> >>>> In addition I would suggest to use the "spatial" context of
> each
> >> >> >> >>>> entity (basically relation to entities that are cities,
> regions,
> >> >> >> >>>> countries) as a separate dimension, because those are very
> often
> >> >> used
> >> >> >> >>>> for coreferences.
> >> >> >> >>>>
> >> >> >> >>>> [1] http://www.mpi-inf.mpg.de/yago-naga/yago/
> >> >> >> >>>> [2] http://downloads.dbpedia.org/3.9/links/yago_links.nt.bz2
> >> >> >> >>>> [3]
> >> >> >> >>>>
> >> >> >>
> >> >>
> >>
> http://www.mpi-inf.mpg.de/yago-naga/yago/download/yago/yagoDBpediaInstances.ttl.7z
> >> >> >> >>>>
> >> >> >> >>>>
> >> >> >> >>>> On Thu, Feb 6, 2014 at 10:33 AM, Cristian Petroaca
> >> >> >> >>>> <cr...@gmail.com> wrote:
> >> >> >> >>>> > There are several dbpedia categories for each entity, in
> this
> >> >> case
> >> >> >> for
> >> >> >> >>>> > Microsoft we have :
> >> >> >> >>>> >
> >> >> >> >>>> > category:Companies_in_the_NASDAQ-100_Index
> >> >> >> >>>> > category:Microsoft
> >> >> >> >>>> > category:Software_companies_of_the_United_States
> >> >> >> >>>> > category:Software_companies_based_in_Washington_(state)
> >> >> >> >>>> > category:Companies_established_in_1975
> >> >> >> >>>> > category:1975_establishments_in_the_United_States
> >> >> >> >>>> > category:Companies_based_in_Redmond,_Washington
> >> >> >> >>>> >
> >> >> category:Multinational_companies_headquartered_in_the_United_States
> >> >> >> >>>> > category:Cloud_computing_providers
> >> >> >> >>>> > category:Companies_in_the_Dow_Jones_Industrial_Average
> >> >> >> >>>> >
> >> >> >> >>>> > So we also have "Companies based in Redmont,Washington"
> which
> >> >> could
> >> >> >> be
> >> >> >> >>>> > matched.
> >> >> >> >>>> >
> >> >> >> >>>> >
> >> >> >> >>>> > There is still other contextual information from dbpedia
> which
> >> >> can
> >> >> >> be
> >> >> >> >>>> used.
> >> >> >> >>>> > For example for an Organization we could also include :
> >> >> >> >>>> > dbpprop:industry = Software
> >> >> >> >>>> > dbpprop:service = Online Service Providers
> >> >> >> >>>> >
> >> >> >> >>>> > and for a Person (that's for Barack Obama) :
> >> >> >> >>>> >
> >> >> >> >>>> > dbpedia-owl:profession:
> >> >> >> >>>> >                                dbpedia:Author
> >> >> >> >>>> >                                dbpedia:Constitutional_law
> >> >> >> >>>> >                                dbpedia:Lawyer
> >> >> >> >>>> >                                dbpedia:Community_organizing
> >> >> >> >>>> >
> >> >> >> >>>> > I'd like to continue investigating this as I think that it
> may
> >> >> have
> >> >> >> >>>> some
> >> >> >> >>>> > value in increasing the number of coreference resolutions
> and
> >> I'd
> >> >> >> like
> >> >> >> >>>> to
> >> >> >> >>>> > concentrate more on precision rather than recall since we
> >> already
> >> >> >> have
> >> >> >> >>>> a
> >> >> >> >>>> > set of coreferences detected by the stanford nlp tool and
> this
> >> >> would
> >> >> >> >>>> be as
> >> >> >> >>>> > an addition to that (at least this is how I would like to
> use
> >> >> it).
> >> >> >> >>>> >
> >> >> >> >>>> > Is it ok if I track this by opening a jira? I could update
> it
> >> to
> >> >> >> show
> >> >> >> >>>> my
> >> >> >> >>>> > progress and also my conclusions and if it turns out that
> it
> >> was
> >> >> a
> >> >> >> bad
> >> >> >> >>>> idea
> >> >> >> >>>> > then that's the situation at least I'll end up with more
> >> >> knowledge
> >> >> >> >>>> about
> >> >> >> >>>> > Stanbol in the end :).
> >> >> >> >>>> >
> >> >> >> >>>> >
> >> >> >> >>>> > 2014-02-05 15:39 GMT+02:00 Rafa Haro <rh...@apache.org>:
> >> >> >> >>>> >
> >> >> >> >>>> >> Hi Cristian,
> >> >> >> >>>> >>
> >> >> >> >>>> >> The approach sounds nice. I don't want to be the devil's
> >> >> advocate
> >> >> >> but
> >> >> >> >>>> I'm
> >> >> >> >>>> >> just not sure about the recall using the dbpedia
> categories
> >> >> >> feature.
> >> >> >> >>>> For
> >> >> >> >>>> >> example, your sentence could be also "Microsoft posted its
> >> 2013
> >> >> >> >>>> earnings.
> >> >> >> >>>> >> The Redmond's company made a huge profit". So, maybe
> >> including
> >> >> more
> >> >> >> >>>> >> contextual information from dbpedia could increase the
> recall
> >> >> but
> >> >> >> of
> >> >> >> >>>> course
> >> >> >> >>>> >> will reduce the precision.
> >> >> >> >>>> >>
> >> >> >> >>>> >> Cheers,
> >> >> >> >>>> >> Rafa
> >> >> >> >>>> >>
> >> >> >> >>>> >> El 04/02/14 09:50, Cristian Petroaca escribió:
> >> >> >> >>>> >>
> >> >> >> >>>> >>  Back with a more detailed description of the steps for
> >> making
> >> >> this
> >> >> >> >>>> kind of
> >> >> >> >>>> >>> coreference work.
> >> >> >> >>>> >>>
> >> >> >> >>>> >>> I will be using references to the following text in the
> >> steps
> >> >> >> below
> >> >> >> >>>> in
> >> >> >> >>>> >>> order to make things clearer : "Microsoft posted its 2013
> >> >> >> earnings.
> >> >> >> >>>> The
> >> >> >> >>>> >>> software company made a huge profit."
> >> >> >> >>>> >>>
> >> >> >> >>>> >>> 1. For every noun phrase in the text which has :
> >> >> >> >>>> >>>      a. a determinate pos which implies reference to an
> >> entity
> >> >> >> local
> >> >> >> >>>> to
> >> >> >> >>>> >>> the
> >> >> >> >>>> >>> text, such as "the, this, these") but not "another,
> every",
> >> etc
> >> >> >> which
> >> >> >> >>>> >>> implies a reference to an entity outside of the text.
> >> >> >> >>>> >>>      b. having at least another noun aside from the main
> >> >> required
> >> >> >> >>>> noun
> >> >> >> >>>> >>> which
> >> >> >> >>>> >>> further describes it. For example I will not count "The
> >> >> company"
> >> >> >> as
> >> >> >> >>>> being
> >> >> >> >>>> >>> a
> >> >> >> >>>> >>> legitimate candidate since this could create a lot of
> false
> >> >> >> >>>> positives by
> >> >> >> >>>> >>> considering the double meaning of some words such as "in
> the
> >> >> >> company
> >> >> >> >>>> of
> >> >> >> >>>> >>> good people".
> >> >> >> >>>> >>> "The software company" is a good candidate since we also
> >> have
> >> >> >> >>>> "software".
> >> >> >> >>>> >>>
> >> >> >> >>>> >>> 2. match the nouns in the noun phrase to the contents of
> the
> >> >> >> dbpedia
> >> >> >> >>>> >>> categories of each named entity found prior to the
> location
> >> of
> >> >> the
> >> >> >> >>>> noun
> >> >> >> >>>> >>> phrase in the text.
> >> >> >> >>>> >>> The dbpedia categories are in the following format (for
> >> >> Microsoft
> >> >> >> for
> >> >> >> >>>> >>> example) : "Software companies of the United States".
> >> >> >> >>>> >>>   So we try to match "software company" with that.
> >> >> >> >>>> >>> First, as you can see, the main noun in the dbpedia
> category
> >> >> has a
> >> >> >> >>>> plural
> >> >> >> >>>> >>> form and it's the same for all categories which I saw. I
> >> don't
> >> >> >> know
> >> >> >> >>>> if
> >> >> >> >>>> >>> there's an easier way to do this but I thought of
> applying a
> >> >> >> >>>> lemmatizer on
> >> >> >> >>>> >>> the category and the noun phrase in order for them to
> have a
> >> >> >> common
> >> >> >> >>>> >>> denominator.This also works if the noun phrase itself
> has a
> >> >> plural
> >> >> >> >>>> form.
> >> >> >> >>>> >>>
> >> >> >> >>>> >>> Second, I'll need to use for comparison only the words in
> >> the
> >> >> >> >>>> category
> >> >> >> >>>> >>> which are themselves nouns and not prepositions or
> >> determiners
> >> >> >> such
> >> >> >> >>>> as "of
> >> >> >> >>>> >>> the".This means that I need to pos tag the categories
> >> contents
> >> >> as
> >> >> >> >>>> well.
> >> >> >> >>>> >>> I was thinking of running the pos and lemma on the
> dbpedia
> >> >> >> >>>> categories when
> >> >> >> >>>> >>> building the dbpedia backed entity hub and storing them
> for
> >> >> later
> >> >> >> >>>> use - I
> >> >> >> >>>> >>> don't know how feasible this is at the moment.
> >> >> >> >>>> >>>
> >> >> >> >>>> >>> After this I can compare each noun in the noun phrase
> with
> >> the
> >> >> >> >>>> equivalent
> >> >> >> >>>> >>> nouns in the categories and based on the number of
> matches I
> >> >> can
> >> >> >> >>>> create a
> >> >> >> >>>> >>> confidence level.
> >> >> >> >>>> >>>
> >> >> >> >>>> >>> 3. match the noun of the noun phrase with the rdf:type
> from
> >> >> >> dbpedia
> >> >> >> >>>> of the
> >> >> >> >>>> >>> named entity. If this matches increase the confidence
> level.
> >> >> >> >>>> >>>
> >> >> >> >>>> >>> 4. If there are multiple named entities which can match a
> >> >> certain
> >> >> >> >>>> noun
> >> >> >> >>>> >>> phrase then link the noun phrase with the closest named
> >> entity
> >> >> >> prior
> >> >> >> >>>> to it
> >> >> >> >>>> >>> in the text.
> >> >> >> >>>> >>>
> >> >> >> >>>> >>> What do you think?
> >> >> >> >>>> >>>
> >> >> >> >>>> >>> Cristian
> >> >> >> >>>> >>>
> >> >> >> >>>> >>> 2014-01-31 Cristian Petroaca <
> cristian.petroaca@gmail.com>:
> >> >> >> >>>> >>>
> >> >> >> >>>> >>>  Hi Rafa,
> >> >> >> >>>> >>>>
> >> >> >> >>>> >>>> I don't yet have a concrete heursitic but I'm working on
> >> it.
> >> >> I'll
> >> >> >> >>>> provide
> >> >> >> >>>> >>>> it here so that you guys can give me a feedback on it.
> >> >> >> >>>> >>>>
> >> >> >> >>>> >>>> What are "locality" features?
> >> >> >> >>>> >>>>
> >> >> >> >>>> >>>> I looked at Bart and other coref tools such as ArkRef
> and
> >> >> >> >>>> CherryPicker
> >> >> >> >>>> >>>> and
> >> >> >> >>>> >>>> they don't provide such a coreference.
> >> >> >> >>>> >>>>
> >> >> >> >>>> >>>> Cristian
> >> >> >> >>>> >>>>
> >> >> >> >>>> >>>>
> >> >> >> >>>> >>>> 2014-01-30 Rafa Haro <rh...@apache.org>:
> >> >> >> >>>> >>>>
> >> >> >> >>>> >>>> Hi Cristian,
> >> >> >> >>>> >>>>
> >> >> >> >>>> >>>>> Without having more details about your concrete
> heuristic,
> >> >> in my
> >> >> >> >>>> honest
> >> >> >> >>>> >>>>> opinion, such approach could produce a lot of false
> >> >> positives. I
> >> >> >> >>>> don't
> >> >> >> >>>> >>>>> know
> >> >> >> >>>> >>>>> if you are planning to use some "locality" features to
> >> detect
> >> >> >> such
> >> >> >> >>>> >>>>> coreferences but you need to take into account that it
> is
> >> >> quite
> >> >> >> >>>> usual
> >> >> >> >>>> >>>>> that
> >> >> >> >>>> >>>>> coreferenced mentions can occurs even in different
> >> >> paragraphs.
> >> >> >> >>>> Although
> >> >> >> >>>> >>>>> I'm
> >> >> >> >>>> >>>>> not an expert in Natural Language Understanding, I
> would
> >> say
> >> >> it
> >> >> >> is
> >> >> >> >>>> quite
> >> >> >> >>>> >>>>> difficult to get decent precision/recall rates for
> >> >> coreferencing
> >> >> >> >>>> using
> >> >> >> >>>> >>>>> fixed rules. Maybe you can give a try to others tools
> like
> >> >> BART
> >> >> >> (
> >> >> >> >>>> >>>>> http://www.bart-coref.org/).
> >> >> >> >>>> >>>>>
> >> >> >> >>>> >>>>> Cheers,
> >> >> >> >>>> >>>>> Rafa Haro
> >> >> >> >>>> >>>>>
> >> >> >> >>>> >>>>> El 30/01/14 10:33, Cristian Petroaca escribió:
> >> >> >> >>>> >>>>>
> >> >> >> >>>> >>>>>   Hi,
> >> >> >> >>>> >>>>>
> >> >> >> >>>> >>>>>> One of the necessary steps for implementing the Event
> >> >> >> extraction
> >> >> >> >>>> Engine
> >> >> >> >>>> >>>>>> feature :
> >> >> https://issues.apache.org/jira/browse/STANBOL-1121is
> >> >> >> >>>> to
> >> >> >> >>>> >>>>>> have
> >> >> >> >>>> >>>>>> coreference resolution in the given text. This is
> >> provided
> >> >> now
> >> >> >> >>>> via the
> >> >> >> >>>> >>>>>> stanford-nlp project but as far as I saw this module
> is
> >> >> >> performing
> >> >> >> >>>> >>>>>> mostly
> >> >> >> >>>> >>>>>> pronomial (He, She) or nominal (Barack Obama and Mr.
> >> Obama)
> >> >> >> >>>> coreference
> >> >> >> >>>> >>>>>> resolution.
> >> >> >> >>>> >>>>>>
> >> >> >> >>>> >>>>>> In order to get more coreferences from the text I
> though
> >> of
> >> >> >> >>>> creating
> >> >> >> >>>> >>>>>> some
> >> >> >> >>>> >>>>>> logic that would detect this kind of coreference :
> >> >> >> >>>> >>>>>> "Apple reaches new profit heights. The software
> company
> >> just
> >> >> >> >>>> announced
> >> >> >> >>>> >>>>>> its
> >> >> >> >>>> >>>>>> 2013 earnings."
> >> >> >> >>>> >>>>>> Here "The software company" obviously refers to
> "Apple".
> >> >> >> >>>> >>>>>> So I'd like to detect coreferences of Named Entities
> >> which
> >> >> are
> >> >> >> of
> >> >> >> >>>> the
> >> >> >> >>>> >>>>>> rdf:type of the Named Entity , in this case "company"
> and
> >> >> also
> >> >> >> >>>> have
> >> >> >> >>>> >>>>>> attributes which can be found in the dbpedia
> categories
> >> of
> >> >> the
> >> >> >> >>>> named
> >> >> >> >>>> >>>>>> entity, in this case "software".
> >> >> >> >>>> >>>>>>
> >> >> >> >>>> >>>>>> The detection of coreferences such as "The software
> >> >> company" in
> >> >> >> >>>> the
> >> >> >> >>>> >>>>>> text
> >> >> >> >>>> >>>>>> would also be done by either using the new Pos Tag
> Based
> >> >> Phrase
> >> >> >> >>>> >>>>>> extraction
> >> >> >> >>>> >>>>>> Engine (noun phrases) or by using a dependency tree of
> >> the
> >> >> >> >>>> sentence and
> >> >> >> >>>> >>>>>> picking up only subjects or objects.
> >> >> >> >>>> >>>>>>
> >> >> >> >>>> >>>>>> At this point I'd like to know if this kind of logic
> >> would
> >> >> be
> >> >> >> >>>> useful
> >> >> >> >>>> >>>>>> as a
> >> >> >> >>>> >>>>>> separate Enhancement Engine (in case the precision and
> >> >> recall
> >> >> >> are
> >> >> >> >>>> good
> >> >> >> >>>> >>>>>> enough) in Stanbol?
> >> >> >> >>>> >>>>>>
> >> >> >> >>>> >>>>>> Thanks,
> >> >> >> >>>> >>>>>> Cristian
> >> >> >> >>>> >>>>>>
> >> >> >> >>>> >>>>>>
> >> >> >> >>>> >>>>>>
> >> >> >> >>>> >>
> >> >> >> >>>>
> >> >> >> >>>>
> >> >> >> >>>>
> >> >> >> >>>> --
> >> >> >> >>>> | Rupert Westenthaler
> rupert.westenthaler@gmail.com
> >> >> >> >>>> | Bodenlehenstraße 11
> >> >> ++43-699-11108907
> >> >> >> >>>> | A-5500 Bischofshofen
> >> >> >> >>>>
> >> >> >> >>>
> >> >> >> >>>
> >> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >> --
> >> >> >> | Rupert Westenthaler             rupert.westenthaler@gmail.com
> >> >> >> | Bodenlehenstraße 11
> ++43-699-11108907
> >> >> >> | A-5500 Bischofshofen
> >> >> >>
> >> >>
> >> >>
> >> >>
> >> >> --
> >> >> | Rupert Westenthaler             rupert.westenthaler@gmail.com
> >> >> | Bodenlehenstraße 11                             ++43-699-11108907
> >> >> | A-5500 Bischofshofen
> >> >>
> >>
> >>
> >>
> >> --
> >> | Rupert Westenthaler             rupert.westenthaler@gmail.com
> >> | Bodenlehenstraße 11                             ++43-699-11108907
> >> | A-5500 Bischofshofen
> >>
>
>
>
> --
> | Rupert Westenthaler             rupert.westenthaler@gmail.com
> | Bodenlehenstraße 11                             ++43-699-11108907
> | A-5500 Bischofshofen
>

Re: Named entity coref resolution based on dbpedia categories and rdf:type

Posted by Rupert Westenthaler <ru...@gmail.com>.
Hi Cristian

On Mon, Mar 17, 2014 at 9:43 PM, Cristian Petroaca
<cr...@gmail.com> wrote:
> 1. Updated to the latest code and it's gone. Cool
>
> 2. I start the stable launcher -> create a new instance of the
> PosChunkerEngine -> add it to the default chain. At this point everything
> looks good and works ok.
> After I restart the server the default chain is gone and instead I see this
> in the enhancement chains page : all-active (default, id: 149, ranking: 0,
> impl: AllActiveEnginesChain ). all-active did not contain the 'default'
> word before the restart.
>

Please note the default chain selection rules as described at [1]. You
can also access chains under '/enhancer/chain/{chain-name}'.
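
For example, assuming the launcher runs locally on the default port 8080,
something like the following should send a text to a specific chain
(adjust host, port and chain name as needed):

    curl -X POST -H "Content-Type: text/plain" \
         -H "Accept: application/rdf+xml" \
         --data "Microsoft posted its 2013 earnings." \
         http://localhost:8080/enhancer/chain/default

The Accept header only controls the serialization of the returned
enhancement results.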

best
Rupert

[1] http://stanbol.staging.apache.org/docs/trunk/components/enhancer/chains/#default-chain

> It looks like the config files are exactly what I need. Thanks.
>
>
> 2014-03-17 9:26 GMT+02:00 Rupert Westenthaler <rupert.westenthaler@gmail.com
>>:
>
>> On Sat, Mar 15, 2014 at 8:34 PM, Cristian Petroaca
>> <cr...@gmail.com> wrote:
>> > Thanks Rupert.
>> >
>> > A couple more questions/issues :
>> >
>> > 1. Whenever I start the stanbol server I'm seeing this in the console
>> > output :
>> >
>>
>> This should be fixed with STANBOL-1278 [1] [2]
>>
>> > 2. Whenever I restart the server the Weighted Chains get messed up. I
>> > usually use the 'default' chain and add my engine to it so there are 11
>> > engines in it. After the restart this chain now contains around 23
>> engines
>> > in total.
>>
>> I was not able to replicate this. What I tried was
>>
>> (1) start up the stable launcher
>> (2) add an additional engine to the default chain
>> (3) restart the launcher
>>
>> The default chain was not changed after (2) and (3). So I would need
>> further information for knowing why this is happening.
>>
>> Generally it is better to create you own chain instance as modifying
>> one that is provided by the default configuration. I would also
>> recommend that you keep your test configuration in text files and to
>> copy those to the 'stanbol/fileinstall' folder. Doing so prevent you
>> from manually entering the configuration after a software update. The
>> production-mode section [3] provides information on how to do that.
>>
>> best
>> Rupert
>>
>> [1] https://issues.apache.org/jira/browse/STANBOL-1278
>> [2] http://svn.apache.org/r1576623
>> [3] http://stanbol.apache.org/docs/trunk/production-mode
>>
>> > ERROR: Bundle org.apache.stanbol.enhancer.engine.topic.web [153]: Error
>> > starting
>> >
>>  slinginstall:c:\Data\Projects\Stanbol\main\launchers\stable\target\stanbol\star
>> > tup\35\org.apache.stanbol.enhancer.engine.topic.web-1.0.0-SNAPSHOT.jar
>> > (org.osgi
>> > .framework.BundleException: Unresolved constraint in bundle
>> > org.apache.stanbol.e
>> > nhancer.engine.topic.web [153]: Unable to resolve 153.0: missing
>> > requirement [15
>> > 3.0] package; (&(package=javax.ws.rs
>> )(version>=0.0.0)(!(version>=2.0.0))))
>> > org.osgi.framework.BundleException: Unresolved constraint in bundle
>> > org.apache.s
>> > tanbol.enhancer.engine.topic.web [153]: Unable to resolve 153.0: missing
>> > require
>> > ment [153.0] package; (&(package=javax.ws.rs
>> > )(version>=0.0.0)(!(version>=2.0.0))
>> > )
>> >         at
>> org.apache.felix.framework.Felix.resolveBundle(Felix.java:3443)
>> >         at org.apache.felix.framework.Felix.startBundle(Felix.java:1727)
>> >         at
>> > org.apache.felix.framework.Felix.setActiveStartLevel(Felix.java:1156)
>> >
>> >         at
>> > org.apache.felix.framework.StartLevelImpl.run(StartLevelImpl.java:264
>> > )
>> >         at java.lang.Thread.run(Unknown Source)
>> >
>> > Despite of this the server starts fine and I can use the enhancer fine.
>> Do
>> > you guys see this as well?
>> >
>> >
>> > 2. Whenever I restart the server the Weighted Chains get messed up. I
>> > usually use the 'default' chain and add my engine to it so there are 11
>> > engines in it. After the restart this chain now contains around 23
>> engines
>> > in total.
>> >
>> >
>> >
>> >
>> > 2014-03-11 9:47 GMT+02:00 Rupert Westenthaler <
>> rupert.westenthaler@gmail.com
>> >>:
>> >
>> >> Hi Cristian,
>> >>
>> >> NER Annotations are typically available as both
>> >> NlpAnnotations.NER_ANNOTATION and  fise:TextAnnotation [1] in the
>> >> enhancement metadata. As you are already accessing the AnayzedText I
>> >> would prefer using the  NlpAnnotations.NER_ANNOTATION.
>> >>
>> >> best
>> >> Rupert
>> >>
>> >> [1]
>> >>
>> http://stanbol.apache.org/docs/trunk/components/enhancer/enhancementstructure.html#fisetextannotation
>> >>
>> >> On Mon, Mar 10, 2014 at 10:07 PM, Cristian Petroaca
>> >> <cr...@gmail.com> wrote:
>> >> > Thanks.
>> >> > I assume I should get the Named entities using the same but with
>> >> > NlpAnnotations.NER_ANNOTATION?
>> >> >
>> >> >
>> >> >
>> >> > 2014-03-10 13:29 GMT+02:00 Rupert Westenthaler <
>> >> > rupert.westenthaler@gmail.com>:
>> >> >
>> >> >> Hallo Cristian,
>> >> >>
>> >> >> NounPhrases are not added to the RDF enhancement results. You need to
>> >> >> use the AnalyzedText ContentPart [1]
>> >> >>
>> >> >> here is some demo code you can use in the computeEnhancement method
>> >> >>
>> >> >>         AnalysedText at = NlpEngineHelper.getAnalysedText(this, ci,
>> >> true);
>> >> >>         Iterator<? extends Section> sections = at.getSentences();
>> >> >>         if(!sections.hasNext()){ //process as single sentence
>> >> >>             sections = Collections.singleton(at).iterator();
>> >> >>         }
>> >> >>
>> >> >>         while(sections.hasNext()){
>> >> >>             Section section = sections.next();
>> >> >>             Iterator<Span> chunks =
>> >> >> section.getEnclosed(EnumSet.of(SpanTypeEnum.Chunk));
>> >> >>             while(chunks.hasNext()){
>> >> >>                 Span chunk = chunks.next();
>> >> >>                 Value<PhraseTag> phrase =
>> >> >> chunk.getAnnotation(NlpAnnotations.PHRASE_ANNOTATION);
>> >> >>                 if(phrase.value().getCategory() ==
>> >> LexicalCategory.Noun){
>> >> >>                     log.info(" - NounPhrase [{},{}] {}", new
>> Object[]{
>> >> >>
>> >> >> chunk.getStart(),chunk.getEnd(),chunk.getSpan()});
>> >> >>                 }
>> >> >>             }
>> >> >>         }
>> >> >>
>> >> >> hope this helps
>> >> >>
>> >> >> best
>> >> >> Rupert
>> >> >>
>> >> >> [1]
>> >> >>
>> >>
>> http://stanbol.apache.org/docs/trunk/components/enhancer/nlp/analyzedtext
>> >> >>
>> >> >> On Sun, Mar 9, 2014 at 6:07 PM, Cristian Petroaca
>> >> >> <cr...@gmail.com> wrote:
>> >> >> > I started to implement the engine and I'm having problems with
>> getting
>> >> >> > results for noun phrases. I modified the "default" weighted chain
>> to
>> >> also
>> >> >> > include the PosChunkerEngine and ran a sample text : "Angela Merkel
>> >> >> visted
>> >> >> > China. The german chancellor met with various people". I expected
>> that
>> >> >> the
>> >> >> > RDF XML output would contain some info about the noun phrases but I
>> >> >> cannot
>> >> >> > see any.
>> >> >> > Could you point me to the correct way to generate the noun phrases?
>> >> >> >
>> >> >> > Thanks,
>> >> >> > Cristian
>> >> >> >
>> >> >> >
>> >> >> > 2014-02-09 14:15 GMT+02:00 Cristian Petroaca <
>> >> >> cristian.petroaca@gmail.com>:
>> >> >> >
>> >> >> >> Opened https://issues.apache.org/jira/browse/STANBOL-1279
>> >> >> >>
>> >> >> >>
>> >> >> >> 2014-02-07 10:53 GMT+02:00 Cristian Petroaca <
>> >> >> cristian.petroaca@gmail.com>
>> >> >> >> :
>> >> >> >>
>> >> >> >> Hi Rupert,
>> >> >> >>>
>> >> >> >>> The "spatial" dimension is a good idea. I'll also take a look at
>> >> Yago.
>> >> >> >>>
>> >> >> >>> I will create a Jira with what we talked about here. It will
>> >> probably
>> >> >> >>> have just a draft-like description for now and will be updated
>> as I
>> >> go
>> >> >> >>> along.
>> >> >> >>>
>> >> >> >>> Thanks,
>> >> >> >>> Cristian
>> >> >> >>>
>> >> >> >>>
>> >> >> >>> 2014-02-06 15:39 GMT+02:00 Rupert Westenthaler <
>> >> >> >>> rupert.westenthaler@gmail.com>:
>> >> >> >>>
>> >> >> >>> Hi Cristian,
>> >> >> >>>>
>> >> >> >>>> definitely an interesting approach. You should have a look at
>> Yago2
>> >> >> >>>> [1]. As far as I can remember the Yago taxonomy is much better
>> >> >> >>>> structured as the one used by dbpedia. Mapping suggestions of
>> >> dbpedia
>> >> >> >>>> to concepts in Yago2 is easy as both dbpedia and yago2 do
>> provide
>> >> >> >>>> mappings [2] and [3]
>> >> >> >>>>
>> >> >> >>>> > 2014-02-05 15:39 GMT+02:00 Rafa Haro <rh...@apache.org>:
>> >> >> >>>> >>
>> >> >> >>>> >> "Microsoft posted its 2013 earnings. The Redmond's company
>> made
>> >> a
>> >> >> >>>> >> huge profit".
>> >> >> >>>>
>> >> >> >>>> Thats actually a very good example. Spatial contexts are very
>> >> >> >>>> important as they tend to be often used for referencing. So I
>> would
>> >> >> >>>> suggest to specially treat the spatial context. For spatial
>> >> Entities
>> >> >> >>>> (like a City) this is easy, but even for other (like a Person,
>> >> >> >>>> Company) you could use relations to spatial entities define
>> their
>> >> >> >>>> spatial context. This context could than be used to correctly
>> link
>> >> >> >>>> "The Redmond's company" to "Microsoft".
>> >> >> >>>>
>> >> >> >>>> In addition I would suggest to use the "spatial" context of each
>> >> >> >>>> entity (basically relation to entities that are cities, regions,
>> >> >> >>>> countries) as a separate dimension, because those are very often
>> >> used
>> >> >> >>>> for coreferences.
>> >> >> >>>>
>> >> >> >>>> [1] http://www.mpi-inf.mpg.de/yago-naga/yago/
>> >> >> >>>> [2] http://downloads.dbpedia.org/3.9/links/yago_links.nt.bz2
>> >> >> >>>> [3]
>> >> >> >>>>
>> >> >>
>> >>
>> http://www.mpi-inf.mpg.de/yago-naga/yago/download/yago/yagoDBpediaInstances.ttl.7z
>> >> >> >>>>
>> >> >> >>>>
>> >> >> >>>> On Thu, Feb 6, 2014 at 10:33 AM, Cristian Petroaca
>> >> >> >>>> <cr...@gmail.com> wrote:
>> >> >> >>>> > There are several dbpedia categories for each entity, in this
>> >> case
>> >> >> for
>> >> >> >>>> > Microsoft we have :
>> >> >> >>>> >
>> >> >> >>>> > category:Companies_in_the_NASDAQ-100_Index
>> >> >> >>>> > category:Microsoft
>> >> >> >>>> > category:Software_companies_of_the_United_States
>> >> >> >>>> > category:Software_companies_based_in_Washington_(state)
>> >> >> >>>> > category:Companies_established_in_1975
>> >> >> >>>> > category:1975_establishments_in_the_United_States
>> >> >> >>>> > category:Companies_based_in_Redmond,_Washington
>> >> >> >>>> >
>> >> category:Multinational_companies_headquartered_in_the_United_States
>> >> >> >>>> > category:Cloud_computing_providers
>> >> >> >>>> > category:Companies_in_the_Dow_Jones_Industrial_Average
>> >> >> >>>> >
>> >> >> >>>> > So we also have "Companies based in Redmont,Washington" which
>> >> could
>> >> >> be
>> >> >> >>>> > matched.
>> >> >> >>>> >
>> >> >> >>>> >
>> >> >> >>>> > There is still other contextual information from dbpedia which
>> >> can
>> >> >> be
>> >> >> >>>> used.
>> >> >> >>>> > For example for an Organization we could also include :
>> >> >> >>>> > dbpprop:industry = Software
>> >> >> >>>> > dbpprop:service = Online Service Providers
>> >> >> >>>> >
>> >> >> >>>> > and for a Person (that's for Barack Obama) :
>> >> >> >>>> >
>> >> >> >>>> > dbpedia-owl:profession:
>> >> >> >>>> >                                dbpedia:Author
>> >> >> >>>> >                                dbpedia:Constitutional_law
>> >> >> >>>> >                                dbpedia:Lawyer
>> >> >> >>>> >                                dbpedia:Community_organizing
>> >> >> >>>> >
>> >> >> >>>> > I'd like to continue investigating this as I think that it may
>> >> have
>> >> >> >>>> some
>> >> >> >>>> > value in increasing the number of coreference resolutions and
>> I'd
>> >> >> like
>> >> >> >>>> to
>> >> >> >>>> > concentrate more on precision rather than recall since we
>> already
>> >> >> have
>> >> >> >>>> a
>> >> >> >>>> > set of coreferences detected by the stanford nlp tool and this
>> >> would
>> >> >> >>>> be as
>> >> >> >>>> > an addition to that (at least this is how I would like to use
>> >> it).
>> >> >> >>>> >
>> >> >> >>>> > Is it ok if I track this by opening a jira? I could update it
>> to
>> >> >> show
>> >> >> >>>> my
>> >> >> >>>> > progress and also my conclusions and if it turns out that it
>> was
>> >> a
>> >> >> bad
>> >> >> >>>> idea
>> >> >> >>>> > then that's the situation at least I'll end up with more
>> >> knowledge
>> >> >> >>>> about
>> >> >> >>>> > Stanbol in the end :).
>> >> >> >>>> >
>> >> >> >>>> >
>> >> >> >>>> > 2014-02-05 15:39 GMT+02:00 Rafa Haro <rh...@apache.org>:
>> >> >> >>>> >
>> >> >> >>>> >> Hi Cristian,
>> >> >> >>>> >>
>> >> >> >>>> >> The approach sounds nice. I don't want to be the devil's
>> >> advocate
>> >> >> but
>> >> >> >>>> I'm
>> >> >> >>>> >> just not sure about the recall using the dbpedia categories
>> >> >> feature.
>> >> >> >>>> For
>> >> >> >>>> >> example, your sentence could be also "Microsoft posted its
>> 2013
>> >> >> >>>> earnings.
>> >> >> >>>> >> The Redmond's company made a huge profit". So, maybe
>> including
>> >> more
>> >> >> >>>> >> contextual information from dbpedia could increase the recall
>> >> but
>> >> >> of
>> >> >> >>>> course
>> >> >> >>>> >> will reduce the precision.
>> >> >> >>>> >>
>> >> >> >>>> >> Cheers,
>> >> >> >>>> >> Rafa
>> >> >> >>>> >>
>> >> >> >>>> >> El 04/02/14 09:50, Cristian Petroaca escribió:
>> >> >> >>>> >>
>> >> >> >>>> >>  Back with a more detailed description of the steps for
>> making
>> >> this
>> >> >> >>>> kind of
>> >> >> >>>> >>> coreference work.
>> >> >> >>>> >>>
>> >> >> >>>> >>> I will be using references to the following text in the
>> steps
>> >> >> below
>> >> >> >>>> in
>> >> >> >>>> >>> order to make things clearer : "Microsoft posted its 2013
>> >> >> earnings.
>> >> >> >>>> The
>> >> >> >>>> >>> software company made a huge profit."
>> >> >> >>>> >>>
>> >> >> >>>> >>> 1. For every noun phrase in the text which has :
>> >> >> >>>> >>>      a. a determinate pos which implies reference to an
>> entity
>> >> >> local
>> >> >> >>>> to
>> >> >> >>>> >>> the
>> >> >> >>>> >>> text, such as "the, this, these") but not "another, every",
>> etc
>> >> >> which
>> >> >> >>>> >>> implies a reference to an entity outside of the text.
>> >> >> >>>> >>>      b. having at least another noun aside from the main
>> >> required
>> >> >> >>>> noun
>> >> >> >>>> >>> which
>> >> >> >>>> >>> further describes it. For example I will not count "The
>> >> company"
>> >> >> as
>> >> >> >>>> being
>> >> >> >>>> >>> a
>> >> >> >>>> >>> legitimate candidate since this could create a lot of false
>> >> >> >>>> positives by
>> >> >> >>>> >>> considering the double meaning of some words such as "in the
>> >> >> company
>> >> >> >>>> of
>> >> >> >>>> >>> good people".
>> >> >> >>>> >>> "The software company" is a good candidate since we also
>> have
>> >> >> >>>> "software".
>> >> >> >>>> >>>
>> >> >> >>>> >>> 2. match the nouns in the noun phrase to the contents of the
>> >> >> dbpedia
>> >> >> >>>> >>> categories of each named entity found prior to the location
>> of
>> >> the
>> >> >> >>>> noun
>> >> >> >>>> >>> phrase in the text.
>> >> >> >>>> >>> The dbpedia categories are in the following format (for
>> >> Microsoft
>> >> >> for
>> >> >> >>>> >>> example) : "Software companies of the United States".
>> >> >> >>>> >>>   So we try to match "software company" with that.
>> >> >> >>>> >>> First, as you can see, the main noun in the dbpedia category
>> >> has a
>> >> >> >>>> plural
>> >> >> >>>> >>> form and it's the same for all categories which I saw. I
>> don't
>> >> >> know
>> >> >> >>>> if
>> >> >> >>>> >>> there's an easier way to do this but I thought of applying a
>> >> >> >>>> lemmatizer on
>> >> >> >>>> >>> the category and the noun phrase in order for them to have a
>> >> >> common
>> >> >> >>>> >>> denominator.This also works if the noun phrase itself has a
>> >> plural
>> >> >> >>>> form.
>> >> >> >>>> >>>
>> >> >> >>>> >>> Second, I'll need to use for comparison only the words in
>> the
>> >> >> >>>> category
>> >> >> >>>> >>> which are themselves nouns and not prepositions or
>> determiners
>> >> >> such
>> >> >> >>>> as "of
>> >> >> >>>> >>> the".This means that I need to pos tag the categories
>> contents
>> >> as
>> >> >> >>>> well.
>> >> >> >>>> >>> I was thinking of running the pos and lemma on the dbpedia
>> >> >> >>>> categories when
>> >> >> >>>> >>> building the dbpedia backed entity hub and storing them for
>> >> later
>> >> >> >>>> use - I
>> >> >> >>>> >>> don't know how feasible this is at the moment.
>> >> >> >>>> >>>
>> >> >> >>>> >>> After this I can compare each noun in the noun phrase with
>> the
>> >> >> >>>> equivalent
>> >> >> >>>> >>> nouns in the categories and based on the number of matches I
>> >> can
>> >> >> >>>> create a
>> >> >> >>>> >>> confidence level.
>> >> >> >>>> >>>
>> >> >> >>>> >>> 3. match the noun of the noun phrase with the rdf:type from
>> >> >> dbpedia
>> >> >> >>>> of the
>> >> >> >>>> >>> named entity. If this matches increase the confidence level.
>> >> >> >>>> >>>
>> >> >> >>>> >>> 4. If there are multiple named entities which can match a
>> >> certain
>> >> >> >>>> noun
>> >> >> >>>> >>> phrase then link the noun phrase with the closest named
>> entity
>> >> >> prior
>> >> >> >>>> to it
>> >> >> >>>> >>> in the text.
>> >> >> >>>> >>>
>> >> >> >>>> >>> What do you think?
>> >> >> >>>> >>>
>> >> >> >>>> >>> Cristian
>> >> >> >>>> >>>
>> >> >> >>>> >>> 2014-01-31 Cristian Petroaca <cr...@gmail.com>:
>> >> >> >>>> >>>
>> >> >> >>>> >>>  Hi Rafa,
>> >> >> >>>> >>>>
>> >> >> >>>> >>>> I don't yet have a concrete heursitic but I'm working on
>> it.
>> >> I'll
>> >> >> >>>> provide
>> >> >> >>>> >>>> it here so that you guys can give me a feedback on it.
>> >> >> >>>> >>>>
>> >> >> >>>> >>>> What are "locality" features?
>> >> >> >>>> >>>>
>> >> >> >>>> >>>> I looked at Bart and other coref tools such as ArkRef and
>> >> >> >>>> CherryPicker
>> >> >> >>>> >>>> and
>> >> >> >>>> >>>> they don't provide such a coreference.
>> >> >> >>>> >>>>
>> >> >> >>>> >>>> Cristian
>> >> >> >>>> >>>>
>> >> >> >>>> >>>>
>> >> >> >>>> >>>> 2014-01-30 Rafa Haro <rh...@apache.org>:
>> >> >> >>>> >>>>
>> >> >> >>>> >>>> Hi Cristian,
>> >> >> >>>> >>>>
>> >> >> >>>> >>>>> Without having more details about your concrete heuristic,
>> >> in my
>> >> >> >>>> honest
>> >> >> >>>> >>>>> opinion, such approach could produce a lot of false
>> >> positives. I
>> >> >> >>>> don't
>> >> >> >>>> >>>>> know
>> >> >> >>>> >>>>> if you are planning to use some "locality" features to
>> detect
>> >> >> such
>> >> >> >>>> >>>>> coreferences but you need to take into account that it is
>> >> quite
>> >> >> >>>> usual
>> >> >> >>>> >>>>> that
>> >> >> >>>> >>>>> coreferenced mentions can occurs even in different
>> >> paragraphs.
>> >> >> >>>> Although
>> >> >> >>>> >>>>> I'm
>> >> >> >>>> >>>>> not an expert in Natural Language Understanding, I would
>> say
>> >> it
>> >> >> is
>> >> >> >>>> quite
>> >> >> >>>> >>>>> difficult to get decent precision/recall rates for
>> >> coreferencing
>> >> >> >>>> using
>> >> >> >>>> >>>>> fixed rules. Maybe you can give a try to others tools like
>> >> BART
>> >> >> (
>> >> >> >>>> >>>>> http://www.bart-coref.org/).
>> >> >> >>>> >>>>>
>> >> >> >>>> >>>>> Cheers,
>> >> >> >>>> >>>>> Rafa Haro
>> >> >> >>>> >>>>>
>> >> >> >>>> >>>>> El 30/01/14 10:33, Cristian Petroaca escribió:
>> >> >> >>>> >>>>>
>> >> >> >>>> >>>>>   Hi,
>> >> >> >>>> >>>>>
>> >> >> >>>> >>>>>> One of the necessary steps for implementing the Event
>> >> >> extraction
>> >> >> >>>> Engine
>> >> >> >>>> >>>>>> feature :
>> >> https://issues.apache.org/jira/browse/STANBOL-1121is
>> >> >> >>>> to
>> >> >> >>>> >>>>>> have
>> >> >> >>>> >>>>>> coreference resolution in the given text. This is
>> provided
>> >> now
>> >> >> >>>> via the
>> >> >> >>>> >>>>>> stanford-nlp project but as far as I saw this module is
>> >> >> performing
>> >> >> >>>> >>>>>> mostly
>> >> >> >>>> >>>>>> pronomial (He, She) or nominal (Barack Obama and Mr.
>> Obama)
>> >> >> >>>> coreference
>> >> >> >>>> >>>>>> resolution.
>> >> >> >>>> >>>>>>
>> >> >> >>>> >>>>>> In order to get more coreferences from the text I though
>> of
>> >> >> >>>> creating
>> >> >> >>>> >>>>>> some
>> >> >> >>>> >>>>>> logic that would detect this kind of coreference :
>> >> >> >>>> >>>>>> "Apple reaches new profit heights. The software company
>> just
>> >> >> >>>> announced
>> >> >> >>>> >>>>>> its
>> >> >> >>>> >>>>>> 2013 earnings."
>> >> >> >>>> >>>>>> Here "The software company" obviously refers to "Apple".
>> >> >> >>>> >>>>>> So I'd like to detect coreferences of Named Entities
>> which
>> >> are
>> >> >> of
>> >> >> >>>> the
>> >> >> >>>> >>>>>> rdf:type of the Named Entity , in this case "company" and
>> >> also
>> >> >> >>>> have
>> >> >> >>>> >>>>>> attributes which can be found in the dbpedia categories
>> of
>> >> the
>> >> >> >>>> named
>> >> >> >>>> >>>>>> entity, in this case "software".
>> >> >> >>>> >>>>>>
>> >> >> >>>> >>>>>> The detection of coreferences such as "The software
>> >> company" in
>> >> >> >>>> the
>> >> >> >>>> >>>>>> text
>> >> >> >>>> >>>>>> would also be done by either using the new Pos Tag Based
>> >> Phrase
>> >> >> >>>> >>>>>> extraction
>> >> >> >>>> >>>>>> Engine (noun phrases) or by using a dependency tree of
>> the
>> >> >> >>>> sentence and
>> >> >> >>>> >>>>>> picking up only subjects or objects.
>> >> >> >>>> >>>>>>
>> >> >> >>>> >>>>>> At this point I'd like to know if this kind of logic
>> would
>> >> be
>> >> >> >>>> useful
>> >> >> >>>> >>>>>> as a
>> >> >> >>>> >>>>>> separate Enhancement Engine (in case the precision and
>> >> recall
>> >> >> are
>> >> >> >>>> good
>> >> >> >>>> >>>>>> enough) in Stanbol?
>> >> >> >>>> >>>>>>
>> >> >> >>>> >>>>>> Thanks,
>> >> >> >>>> >>>>>> Cristian
>> >> >> >>>> >>>>>>
>> >> >> >>>> >>>>>>
>> >> >> >>>> >>>>>>
>> >> >> >>>> >>
>> >> >> >>>>
>> >> >> >>>>
>> >> >> >>>>
>> >> >> >>>> --
>> >> >> >>>> | Rupert Westenthaler             rupert.westenthaler@gmail.com
>> >> >> >>>> | Bodenlehenstraße 11
>> >> ++43-699-11108907
>> >> >> >>>> | A-5500 Bischofshofen
>> >> >> >>>>
>> >> >> >>>
>> >> >> >>>
>> >> >> >>
>> >> >>
>> >> >>
>> >> >>
>> >> >> --
>> >> >> | Rupert Westenthaler             rupert.westenthaler@gmail.com
>> >> >> | Bodenlehenstraße 11                             ++43-699-11108907
>> >> >> | A-5500 Bischofshofen
>> >> >>
>> >>
>> >>
>> >>
>> >> --
>> >> | Rupert Westenthaler             rupert.westenthaler@gmail.com
>> >> | Bodenlehenstraße 11                             ++43-699-11108907
>> >> | A-5500 Bischofshofen
>> >>
>>
>>
>>
>> --
>> | Rupert Westenthaler             rupert.westenthaler@gmail.com
>> | Bodenlehenstraße 11                             ++43-699-11108907
>> | A-5500 Bischofshofen
>>



-- 
| Rupert Westenthaler             rupert.westenthaler@gmail.com
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen

Re: Named entity coref resolution based on dbpedia categories and rdf:type

Posted by Cristian Petroaca <cr...@gmail.com>.
1. Updated to the latest code and it's gone. Cool

2. I start the stable launcher -> create a new instance of the
PosChunkerEngine -> add it to the default chain. At this point everything
looks good and works ok.
After I restart the server, the default chain is gone and instead I see this
in the enhancement chains page: all-active (default, id: 149, ranking: 0,
impl: AllActiveEnginesChain). all-active did not contain the 'default'
word before the restart.

It looks like the config files are exactly what I need. Thanks.


2014-03-17 9:26 GMT+02:00 Rupert Westenthaler <rupert.westenthaler@gmail.com
>:

> On Sat, Mar 15, 2014 at 8:34 PM, Cristian Petroaca
> <cr...@gmail.com> wrote:
> > Thanks Rupert.
> >
> > A couple more questions/issues :
> >
> > 1. Whenever I start the stanbol server I'm seeing this in the console
> > output :
> >
>
> This should be fixed with STANBOL-1278 [1] [2]
>
> > 2. Whenever I restart the server the Weighted Chains get messed up. I
> > usually use the 'default' chain and add my engine to it so there are 11
> > engines in it. After the restart this chain now contains around 23
> engines
> > in total.
>
> I was not able to replicate this. What I tried was
>
> (1) start up the stable launcher
> (2) add an additional engine to the default chain
> (3) restart the launcher
>
> The default chain was not changed after (2) and (3). So I would need
> further information for knowing why this is happening.
>
> Generally it is better to create you own chain instance as modifying
> one that is provided by the default configuration. I would also
> recommend that you keep your test configuration in text files and to
> copy those to the 'stanbol/fileinstall' folder. Doing so prevent you
> from manually entering the configuration after a software update. The
> production-mode section [3] provides information on how to do that.
>
> best
> Rupert
>
> [1] https://issues.apache.org/jira/browse/STANBOL-1278
> [2] http://svn.apache.org/r1576623
> [3] http://stanbol.apache.org/docs/trunk/production-mode
>
> > ERROR: Bundle org.apache.stanbol.enhancer.engine.topic.web [153]: Error
> > starting
> >
>  slinginstall:c:\Data\Projects\Stanbol\main\launchers\stable\target\stanbol\star
> > tup\35\org.apache.stanbol.enhancer.engine.topic.web-1.0.0-SNAPSHOT.jar
> > (org.osgi
> > .framework.BundleException: Unresolved constraint in bundle
> > org.apache.stanbol.e
> > nhancer.engine.topic.web [153]: Unable to resolve 153.0: missing
> > requirement [15
> > 3.0] package; (&(package=javax.ws.rs
> )(version>=0.0.0)(!(version>=2.0.0))))
> > org.osgi.framework.BundleException: Unresolved constraint in bundle
> > org.apache.s
> > tanbol.enhancer.engine.topic.web [153]: Unable to resolve 153.0: missing
> > require
> > ment [153.0] package; (&(package=javax.ws.rs
> > )(version>=0.0.0)(!(version>=2.0.0))
> > )
> >         at
> org.apache.felix.framework.Felix.resolveBundle(Felix.java:3443)
> >         at org.apache.felix.framework.Felix.startBundle(Felix.java:1727)
> >         at
> > org.apache.felix.framework.Felix.setActiveStartLevel(Felix.java:1156)
> >
> >         at
> > org.apache.felix.framework.StartLevelImpl.run(StartLevelImpl.java:264
> > )
> >         at java.lang.Thread.run(Unknown Source)
> >
> > Despite of this the server starts fine and I can use the enhancer fine.
> Do
> > you guys see this as well?
> >
> >
> > 2. Whenever I restart the server the Weighted Chains get messed up. I
> > usually use the 'default' chain and add my engine to it so there are 11
> > engines in it. After the restart this chain now contains around 23
> engines
> > in total.
> >
> >
> >
> >
> > 2014-03-11 9:47 GMT+02:00 Rupert Westenthaler <
> rupert.westenthaler@gmail.com
> >>:
> >
> >> Hi Cristian,
> >>
> >> NER Annotations are typically available as both
> >> NlpAnnotations.NER_ANNOTATION and  fise:TextAnnotation [1] in the
> >> enhancement metadata. As you are already accessing the AnayzedText I
> >> would prefer using the  NlpAnnotations.NER_ANNOTATION.
> >>
> >> best
> >> Rupert
> >>
> >> [1]
> >>
> http://stanbol.apache.org/docs/trunk/components/enhancer/enhancementstructure.html#fisetextannotation
> >>
> >> On Mon, Mar 10, 2014 at 10:07 PM, Cristian Petroaca
> >> <cr...@gmail.com> wrote:
> >> > Thanks.
> >> > I assume I should get the Named entities using the same but with
> >> > NlpAnnotations.NER_ANNOTATION?
> >> >
> >> >
> >> >
> >> > 2014-03-10 13:29 GMT+02:00 Rupert Westenthaler <
> >> > rupert.westenthaler@gmail.com>:
> >> >
> >> >> Hallo Cristian,
> >> >>
> >> >> NounPhrases are not added to the RDF enhancement results. You need to
> >> >> use the AnalyzedText ContentPart [1]
> >> >>
> >> >> here is some demo code you can use in the computeEnhancement method
> >> >>
> >> >>         AnalysedText at = NlpEngineHelper.getAnalysedText(this, ci,
> >> true);
> >> >>         Iterator<? extends Section> sections = at.getSentences();
> >> >>         if(!sections.hasNext()){ //process as single sentence
> >> >>             sections = Collections.singleton(at).iterator();
> >> >>         }
> >> >>
> >> >>         while(sections.hasNext()){
> >> >>             Section section = sections.next();
> >> >>             Iterator<Span> chunks =
> >> >> section.getEnclosed(EnumSet.of(SpanTypeEnum.Chunk));
> >> >>             while(chunks.hasNext()){
> >> >>                 Span chunk = chunks.next();
> >> >>                 Value<PhraseTag> phrase =
> >> >> chunk.getAnnotation(NlpAnnotations.PHRASE_ANNOTATION);
> >> >>                 if(phrase.value().getCategory() ==
> >> LexicalCategory.Noun){
> >> >>                     log.info(" - NounPhrase [{},{}] {}", new
> Object[]{
> >> >>
> >> >> chunk.getStart(),chunk.getEnd(),chunk.getSpan()});
> >> >>                 }
> >> >>             }
> >> >>         }
> >> >>
> >> >> hope this helps
> >> >>
> >> >> best
> >> >> Rupert
> >> >>
> >> >> [1]
> >> >>
> >>
> http://stanbol.apache.org/docs/trunk/components/enhancer/nlp/analyzedtext
> >> >>
> >> >> On Sun, Mar 9, 2014 at 6:07 PM, Cristian Petroaca
> >> >> <cr...@gmail.com> wrote:
> >> >> > I started to implement the engine and I'm having problems with
> getting
> >> >> > results for noun phrases. I modified the "default" weighted chain
> to
> >> also
> >> >> > include the PosChunkerEngine and ran a sample text : "Angela Merkel
> >> >> visted
> >> >> > China. The german chancellor met with various people". I expected
> that
> >> >> the
> >> >> > RDF XML output would contain some info about the noun phrases but I
> >> >> cannot
> >> >> > see any.
> >> >> > Could you point me to the correct way to generate the noun phrases?
> >> >> >
> >> >> > Thanks,
> >> >> > Cristian
> >> >> >
> >> >> >
> >> >> > 2014-02-09 14:15 GMT+02:00 Cristian Petroaca <
> >> >> cristian.petroaca@gmail.com>:
> >> >> >
> >> >> >> Opened https://issues.apache.org/jira/browse/STANBOL-1279
> >> >> >>
> >> >> >>
> >> >> >> 2014-02-07 10:53 GMT+02:00 Cristian Petroaca <
> >> >> cristian.petroaca@gmail.com>
> >> >> >> :
> >> >> >>
> >> >> >> Hi Rupert,
> >> >> >>>
> >> >> >>> The "spatial" dimension is a good idea. I'll also take a look at
> >> Yago.
> >> >> >>>
> >> >> >>> I will create a Jira with what we talked about here. It will
> >> probably
> >> >> >>> have just a draft-like description for now and will be updated
> as I
> >> go
> >> >> >>> along.
> >> >> >>>
> >> >> >>> Thanks,
> >> >> >>> Cristian
> >> >> >>>
> >> >> >>>
> >> >> >>> 2014-02-06 15:39 GMT+02:00 Rupert Westenthaler <
> >> >> >>> rupert.westenthaler@gmail.com>:
> >> >> >>>
> >> >> >>> Hi Cristian,
> >> >> >>>>
> >> >> >>>> definitely an interesting approach. You should have a look at
> Yago2
> >> >> >>>> [1]. As far as I can remember the Yago taxonomy is much better
> >> >> >>>> structured as the one used by dbpedia. Mapping suggestions of
> >> dbpedia
> >> >> >>>> to concepts in Yago2 is easy as both dbpedia and yago2 do
> provide
> >> >> >>>> mappings [2] and [3]
> >> >> >>>>
> >> >> >>>> > 2014-02-05 15:39 GMT+02:00 Rafa Haro <rh...@apache.org>:
> >> >> >>>> >>
> >> >> >>>> >> "Microsoft posted its 2013 earnings. The Redmond's company
> made
> >> a
> >> >> >>>> >> huge profit".
> >> >> >>>>
> >> >> >>>> Thats actually a very good example. Spatial contexts are very
> >> >> >>>> important as they tend to be often used for referencing. So I
> would
> >> >> >>>> suggest to specially treat the spatial context. For spatial
> >> Entities
> >> >> >>>> (like a City) this is easy, but even for other (like a Person,
> >> >> >>>> Company) you could use relations to spatial entities define
> their
> >> >> >>>> spatial context. This context could than be used to correctly
> link
> >> >> >>>> "The Redmond's company" to "Microsoft".
> >> >> >>>>
> >> >> >>>> In addition I would suggest to use the "spatial" context of each
> >> >> >>>> entity (basically relation to entities that are cities, regions,
> >> >> >>>> countries) as a separate dimension, because those are very often
> >> used
> >> >> >>>> for coreferences.
> >> >> >>>>
> >> >> >>>> [1] http://www.mpi-inf.mpg.de/yago-naga/yago/
> >> >> >>>> [2] http://downloads.dbpedia.org/3.9/links/yago_links.nt.bz2
> >> >> >>>> [3]
> >> >> >>>>
> >> >>
> >>
> http://www.mpi-inf.mpg.de/yago-naga/yago/download/yago/yagoDBpediaInstances.ttl.7z
> >> >> >>>>
> >> >> >>>>
> >> >> >>>> On Thu, Feb 6, 2014 at 10:33 AM, Cristian Petroaca
> >> >> >>>> <cr...@gmail.com> wrote:
> >> >> >>>> > There are several dbpedia categories for each entity, in this
> >> case
> >> >> for
> >> >> >>>> > Microsoft we have :
> >> >> >>>> >
> >> >> >>>> > category:Companies_in_the_NASDAQ-100_Index
> >> >> >>>> > category:Microsoft
> >> >> >>>> > category:Software_companies_of_the_United_States
> >> >> >>>> > category:Software_companies_based_in_Washington_(state)
> >> >> >>>> > category:Companies_established_in_1975
> >> >> >>>> > category:1975_establishments_in_the_United_States
> >> >> >>>> > category:Companies_based_in_Redmond,_Washington
> >> >> >>>> >
> >> category:Multinational_companies_headquartered_in_the_United_States
> >> >> >>>> > category:Cloud_computing_providers
> >> >> >>>> > category:Companies_in_the_Dow_Jones_Industrial_Average
> >> >> >>>> >
> >> >> >>>> > So we also have "Companies based in Redmont,Washington" which
> >> could
> >> >> be
> >> >> >>>> > matched.
> >> >> >>>> >
> >> >> >>>> >
> >> >> >>>> > There is still other contextual information from dbpedia which
> >> can
> >> >> be
> >> >> >>>> used.
> >> >> >>>> > For example for an Organization we could also include :
> >> >> >>>> > dbpprop:industry = Software
> >> >> >>>> > dbpprop:service = Online Service Providers
> >> >> >>>> >
> >> >> >>>> > and for a Person (that's for Barack Obama) :
> >> >> >>>> >
> >> >> >>>> > dbpedia-owl:profession:
> >> >> >>>> >                                dbpedia:Author
> >> >> >>>> >                                dbpedia:Constitutional_law
> >> >> >>>> >                                dbpedia:Lawyer
> >> >> >>>> >                                dbpedia:Community_organizing
> >> >> >>>> >
> >> >> >>>> > I'd like to continue investigating this as I think that it may
> >> have
> >> >> >>>> some
> >> >> >>>> > value in increasing the number of coreference resolutions and
> I'd
> >> >> like
> >> >> >>>> to
> >> >> >>>> > concentrate more on precision rather than recall since we
> already
> >> >> have
> >> >> >>>> a
> >> >> >>>> > set of coreferences detected by the stanford nlp tool and this
> >> would
> >> >> >>>> be as
> >> >> >>>> > an addition to that (at least this is how I would like to use
> >> it).
> >> >> >>>> >
> >> >> >>>> > Is it ok if I track this by opening a jira? I could update it
> to
> >> >> show
> >> >> >>>> my
> >> >> >>>> > progress and also my conclusions and if it turns out that it
> was
> >> a
> >> >> bad
> >> >> >>>> idea
> >> >> >>>> > then that's the situation at least I'll end up with more
> >> knowledge
> >> >> >>>> about
> >> >> >>>> > Stanbol in the end :).
> >> >> >>>> >
> >> >> >>>> >
> >> >> >>>> > 2014-02-05 15:39 GMT+02:00 Rafa Haro <rh...@apache.org>:
> >> >> >>>> >
> >> >> >>>> >> Hi Cristian,
> >> >> >>>> >>
> >> >> >>>> >> The approach sounds nice. I don't want to be the devil's
> >> advocate
> >> >> but
> >> >> >>>> I'm
> >> >> >>>> >> just not sure about the recall using the dbpedia categories
> >> >> feature.
> >> >> >>>> For
> >> >> >>>> >> example, your sentence could be also "Microsoft posted its
> 2013
> >> >> >>>> earnings.
> >> >> >>>> >> The Redmond's company made a huge profit". So, maybe
> including
> >> more
> >> >> >>>> >> contextual information from dbpedia could increase the recall
> >> but
> >> >> of
> >> >> >>>> course
> >> >> >>>> >> will reduce the precision.
> >> >> >>>> >>
> >> >> >>>> >> Cheers,
> >> >> >>>> >> Rafa
> >> >> >>>> >>
> >> >> >>>> >> El 04/02/14 09:50, Cristian Petroaca escribió:
> >> >> >>>> >>
> >> >> >>>> >>  Back with a more detailed description of the steps for
> making
> >> this
> >> >> >>>> kind of
> >> >> >>>> >>> coreference work.
> >> >> >>>> >>>
> >> >> >>>> >>> I will be using references to the following text in the
> steps
> >> >> below
> >> >> >>>> in
> >> >> >>>> >>> order to make things clearer : "Microsoft posted its 2013
> >> >> earnings.
> >> >> >>>> The
> >> >> >>>> >>> software company made a huge profit."
> >> >> >>>> >>>
> >> >> >>>> >>> 1. For every noun phrase in the text which has :
> >> >> >>>> >>>      a. a determinate pos which implies reference to an
> entity
> >> >> local
> >> >> >>>> to
> >> >> >>>> >>> the
> >> >> >>>> >>> text, such as "the, this, these") but not "another, every",
> etc
> >> >> which
> >> >> >>>> >>> implies a reference to an entity outside of the text.
> >> >> >>>> >>>      b. having at least another noun aside from the main
> >> required
> >> >> >>>> noun
> >> >> >>>> >>> which
> >> >> >>>> >>> further describes it. For example I will not count "The
> >> company"
> >> >> as
> >> >> >>>> being
> >> >> >>>> >>> a
> >> >> >>>> >>> legitimate candidate since this could create a lot of false
> >> >> >>>> positives by
> >> >> >>>> >>> considering the double meaning of some words such as "in the
> >> >> company
> >> >> >>>> of
> >> >> >>>> >>> good people".
> >> >> >>>> >>> "The software company" is a good candidate since we also
> have
> >> >> >>>> "software".
> >> >> >>>> >>>
> >> >> >>>> >>> 2. match the nouns in the noun phrase to the contents of the
> >> >> dbpedia
> >> >> >>>> >>> categories of each named entity found prior to the location
> of
> >> the
> >> >> >>>> noun
> >> >> >>>> >>> phrase in the text.
> >> >> >>>> >>> The dbpedia categories are in the following format (for
> >> Microsoft
> >> >> for
> >> >> >>>> >>> example) : "Software companies of the United States".
> >> >> >>>> >>>   So we try to match "software company" with that.
> >> >> >>>> >>> First, as you can see, the main noun in the dbpedia category
> >> has a
> >> >> >>>> plural
> >> >> >>>> >>> form and it's the same for all categories which I saw. I
> don't
> >> >> know
> >> >> >>>> if
> >> >> >>>> >>> there's an easier way to do this but I thought of applying a
> >> >> >>>> lemmatizer on
> >> >> >>>> >>> the category and the noun phrase in order for them to have a
> >> >> common
> >> >> >>>> >>> denominator.This also works if the noun phrase itself has a
> >> plural
> >> >> >>>> form.
> >> >> >>>> >>>
> >> >> >>>> >>> Second, I'll need to use for comparison only the words in
> the
> >> >> >>>> category
> >> >> >>>> >>> which are themselves nouns and not prepositions or
> determiners
> >> >> such
> >> >> >>>> as "of
> >> >> >>>> >>> the".This means that I need to pos tag the categories
> contents
> >> as
> >> >> >>>> well.
> >> >> >>>> >>> I was thinking of running the pos and lemma on the dbpedia
> >> >> >>>> categories when
> >> >> >>>> >>> building the dbpedia backed entity hub and storing them for
> >> later
> >> >> >>>> use - I
> >> >> >>>> >>> don't know how feasible this is at the moment.
> >> >> >>>> >>>
> >> >> >>>> >>> After this I can compare each noun in the noun phrase with
> the
> >> >> >>>> equivalent
> >> >> >>>> >>> nouns in the categories and based on the number of matches I
> >> can
> >> >> >>>> create a
> >> >> >>>> >>> confidence level.
> >> >> >>>> >>>
> >> >> >>>> >>> 3. match the noun of the noun phrase with the rdf:type from
> >> >> dbpedia
> >> >> >>>> of the
> >> >> >>>> >>> named entity. If this matches increase the confidence level.
> >> >> >>>> >>>
> >> >> >>>> >>> 4. If there are multiple named entities which can match a
> >> certain
> >> >> >>>> noun
> >> >> >>>> >>> phrase then link the noun phrase with the closest named
> entity
> >> >> prior
> >> >> >>>> to it
> >> >> >>>> >>> in the text.
> >> >> >>>> >>>
> >> >> >>>> >>> What do you think?
> >> >> >>>> >>>
> >> >> >>>> >>> Cristian
> >> >> >>>> >>>
> >> >> >>>> >>> 2014-01-31 Cristian Petroaca <cr...@gmail.com>:
> >> >> >>>> >>>
> >> >> >>>> >>>  Hi Rafa,
> >> >> >>>> >>>>
> >> >> >>>> >>>> I don't yet have a concrete heursitic but I'm working on
> it.
> >> I'll
> >> >> >>>> provide
> >> >> >>>> >>>> it here so that you guys can give me a feedback on it.
> >> >> >>>> >>>>
> >> >> >>>> >>>> What are "locality" features?
> >> >> >>>> >>>>
> >> >> >>>> >>>> I looked at Bart and other coref tools such as ArkRef and
> >> >> >>>> CherryPicker
> >> >> >>>> >>>> and
> >> >> >>>> >>>> they don't provide such a coreference.
> >> >> >>>> >>>>
> >> >> >>>> >>>> Cristian
> >> >> >>>> >>>>
> >> >> >>>> >>>>
> >> >> >>>> >>>> 2014-01-30 Rafa Haro <rh...@apache.org>:
> >> >> >>>> >>>>
> >> >> >>>> >>>> Hi Cristian,
> >> >> >>>> >>>>
> >> >> >>>> >>>>> Without having more details about your concrete heuristic,
> >> in my
> >> >> >>>> honest
> >> >> >>>> >>>>> opinion, such approach could produce a lot of false
> >> positives. I
> >> >> >>>> don't
> >> >> >>>> >>>>> know
> >> >> >>>> >>>>> if you are planning to use some "locality" features to
> detect
> >> >> such
> >> >> >>>> >>>>> coreferences but you need to take into account that it is
> >> quite
> >> >> >>>> usual
> >> >> >>>> >>>>> that
> >> >> >>>> >>>>> coreferenced mentions can occurs even in different
> >> paragraphs.
> >> >> >>>> Although
> >> >> >>>> >>>>> I'm
> >> >> >>>> >>>>> not an expert in Natural Language Understanding, I would
> say
> >> it
> >> >> is
> >> >> >>>> quite
> >> >> >>>> >>>>> difficult to get decent precision/recall rates for
> >> coreferencing
> >> >> >>>> using
> >> >> >>>> >>>>> fixed rules. Maybe you can give a try to others tools like
> >> BART
> >> >> (
> >> >> >>>> >>>>> http://www.bart-coref.org/).
> >> >> >>>> >>>>>
> >> >> >>>> >>>>> Cheers,
> >> >> >>>> >>>>> Rafa Haro
> >> >> >>>> >>>>>
> >> >> >>>> >>>>> El 30/01/14 10:33, Cristian Petroaca escribió:
> >> >> >>>> >>>>>
> >> >> >>>> >>>>>   Hi,
> >> >> >>>> >>>>>
> >> >> >>>> >>>>>> One of the necessary steps for implementing the Event
> >> >> extraction
> >> >> >>>> Engine
> >> >> >>>> >>>>>> feature :
> >> https://issues.apache.org/jira/browse/STANBOL-1121is
> >> >> >>>> to
> >> >> >>>> >>>>>> have
> >> >> >>>> >>>>>> coreference resolution in the given text. This is
> provided
> >> now
> >> >> >>>> via the
> >> >> >>>> >>>>>> stanford-nlp project but as far as I saw this module is
> >> >> performing
> >> >> >>>> >>>>>> mostly
> >> >> >>>> >>>>>> pronomial (He, She) or nominal (Barack Obama and Mr.
> Obama)
> >> >> >>>> coreference
> >> >> >>>> >>>>>> resolution.
> >> >> >>>> >>>>>>
> >> >> >>>> >>>>>> In order to get more coreferences from the text I though
> of
> >> >> >>>> creating
> >> >> >>>> >>>>>> some
> >> >> >>>> >>>>>> logic that would detect this kind of coreference :
> >> >> >>>> >>>>>> "Apple reaches new profit heights. The software company
> just
> >> >> >>>> announced
> >> >> >>>> >>>>>> its
> >> >> >>>> >>>>>> 2013 earnings."
> >> >> >>>> >>>>>> Here "The software company" obviously refers to "Apple".
> >> >> >>>> >>>>>> So I'd like to detect coreferences of Named Entities
> which
> >> are
> >> >> of
> >> >> >>>> the
> >> >> >>>> >>>>>> rdf:type of the Named Entity , in this case "company" and
> >> also
> >> >> >>>> have
> >> >> >>>> >>>>>> attributes which can be found in the dbpedia categories
> of
> >> the
> >> >> >>>> named
> >> >> >>>> >>>>>> entity, in this case "software".
> >> >> >>>> >>>>>>
> >> >> >>>> >>>>>> The detection of coreferences such as "The software
> >> company" in
> >> >> >>>> the
> >> >> >>>> >>>>>> text
> >> >> >>>> >>>>>> would also be done by either using the new Pos Tag Based
> >> Phrase
> >> >> >>>> >>>>>> extraction
> >> >> >>>> >>>>>> Engine (noun phrases) or by using a dependency tree of
> the
> >> >> >>>> sentence and
> >> >> >>>> >>>>>> picking up only subjects or objects.
> >> >> >>>> >>>>>>
> >> >> >>>> >>>>>> At this point I'd like to know if this kind of logic
> would
> >> be
> >> >> >>>> useful
> >> >> >>>> >>>>>> as a
> >> >> >>>> >>>>>> separate Enhancement Engine (in case the precision and
> >> recall
> >> >> are
> >> >> >>>> good
> >> >> >>>> >>>>>> enough) in Stanbol?
> >> >> >>>> >>>>>>
> >> >> >>>> >>>>>> Thanks,
> >> >> >>>> >>>>>> Cristian
> >> >> >>>> >>>>>>
> >> >> >>>> >>>>>>
> >> >> >>>> >>>>>>
> >> >> >>>> >>
> >> >> >>>>
> >> >> >>>>
> >> >> >>>>
> >> >> >>>> --
> >> >> >>>> | Rupert Westenthaler             rupert.westenthaler@gmail.com
> >> >> >>>> | Bodenlehenstraße 11
> >> ++43-699-11108907
> >> >> >>>> | A-5500 Bischofshofen
> >> >> >>>>
> >> >> >>>
> >> >> >>>
> >> >> >>
> >> >>
> >> >>
> >> >>
> >> >> --
> >> >> | Rupert Westenthaler             rupert.westenthaler@gmail.com
> >> >> | Bodenlehenstraße 11                             ++43-699-11108907
> >> >> | A-5500 Bischofshofen
> >> >>
> >>
> >>
> >>
> >> --
> >> | Rupert Westenthaler             rupert.westenthaler@gmail.com
> >> | Bodenlehenstraße 11                             ++43-699-11108907
> >> | A-5500 Bischofshofen
> >>
>
>
>
> --
> | Rupert Westenthaler             rupert.westenthaler@gmail.com
> | Bodenlehenstraße 11                             ++43-699-11108907
> | A-5500 Bischofshofen
>

Re: Named entity coref resolution based on dbpedia categories and rdf:type

Posted by Rupert Westenthaler <ru...@gmail.com>.
On Sat, Mar 15, 2014 at 8:34 PM, Cristian Petroaca
<cr...@gmail.com> wrote:
> Thanks Rupert.
>
> A couple more questions/issues :
>
> 1. Whenever I start the stanbol server I'm seeing this in the console
> output :
>

This should be fixed with STANBOL-1278 [1] [2]

> 2. Whenever I restart the server the Weighted Chains get messed up. I
> usually use the 'default' chain and add my engine to it so there are 11
> engines in it. After the restart this chain now contains around 23 engines
> in total.

I was not able to replicate this. What I tried was

(1) start up the stable launcher
(2) add an additional engine to the default chain
(3) restart the launcher

The default chain was not changed after (2) and (3), so I would need
further information to understand why this is happening.

Generally it is better to create your own chain instance rather than
modifying one that is provided by the default configuration. I would also
recommend that you keep your test configuration in text files and copy
those to the 'stanbol/fileinstall' folder. Doing so prevents you from
having to re-enter the configuration manually after a software update. The
production-mode section [3] provides information on how to do that.
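
For example (just a sketch; the exact factory PID and the property names
are best copied from an existing chain configuration in the Felix Web
Console), a file named something like

    org.apache.stanbol.enhancer.chain.weighted.impl.WeightedChain-mychain.config

with content along the lines of

    stanbol.enhancer.chain.name="mychain"
    stanbol.enhancer.chain.weighted.chain=["langdetect","opennlp-pos","opennlp-chunker","myCorefEngine"]

placed in the 'stanbol/fileinstall' folder would re-create the chain after
a restart or update, so it does not have to be entered manually again. The
engine names above are only placeholders for whatever engines you want in
the chain.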

best
Rupert

[1] https://issues.apache.org/jira/browse/STANBOL-1278
[2] http://svn.apache.org/r1576623
[3] http://stanbol.apache.org/docs/trunk/production-mode

> ERROR: Bundle org.apache.stanbol.enhancer.engine.topic.web [153]: Error
> starting
>  slinginstall:c:\Data\Projects\Stanbol\main\launchers\stable\target\stanbol\star
> tup\35\org.apache.stanbol.enhancer.engine.topic.web-1.0.0-SNAPSHOT.jar
> (org.osgi
> .framework.BundleException: Unresolved constraint in bundle
> org.apache.stanbol.e
> nhancer.engine.topic.web [153]: Unable to resolve 153.0: missing
> requirement [15
> 3.0] package; (&(package=javax.ws.rs)(version>=0.0.0)(!(version>=2.0.0))))
> org.osgi.framework.BundleException: Unresolved constraint in bundle
> org.apache.s
> tanbol.enhancer.engine.topic.web [153]: Unable to resolve 153.0: missing
> require
> ment [153.0] package; (&(package=javax.ws.rs
> )(version>=0.0.0)(!(version>=2.0.0))
> )
>         at org.apache.felix.framework.Felix.resolveBundle(Felix.java:3443)
>         at org.apache.felix.framework.Felix.startBundle(Felix.java:1727)
>         at
> org.apache.felix.framework.Felix.setActiveStartLevel(Felix.java:1156)
>
>         at
> org.apache.felix.framework.StartLevelImpl.run(StartLevelImpl.java:264
> )
>         at java.lang.Thread.run(Unknown Source)
>
> Despite of this the server starts fine and I can use the enhancer fine. Do
> you guys see this as well?
>
>
> 2. Whenever I restart the server the Weighted Chains get messed up. I
> usually use the 'default' chain and add my engine to it so there are 11
> engines in it. After the restart this chain now contains around 23 engines
> in total.
>
>
>
>
> 2014-03-11 9:47 GMT+02:00 Rupert Westenthaler <rupert.westenthaler@gmail.com
>>:
>
>> Hi Cristian,
>>
>> NER Annotations are typically available as both
>> NlpAnnotations.NER_ANNOTATION and  fise:TextAnnotation [1] in the
>> enhancement metadata. As you are already accessing the AnayzedText I
>> would prefer using the  NlpAnnotations.NER_ANNOTATION.
>>
>> best
>> Rupert
>>
>> [1]
>> http://stanbol.apache.org/docs/trunk/components/enhancer/enhancementstructure.html#fisetextannotation
>>
>> On Mon, Mar 10, 2014 at 10:07 PM, Cristian Petroaca
>> <cr...@gmail.com> wrote:
>> > Thanks.
>> > I assume I should get the Named entities using the same but with
>> > NlpAnnotations.NER_ANNOTATION?
>> >
>> >
>> >
>> > 2014-03-10 13:29 GMT+02:00 Rupert Westenthaler <
>> > rupert.westenthaler@gmail.com>:
>> >
>> >> Hallo Cristian,
>> >>
>> >> NounPhrases are not added to the RDF enhancement results. You need to
>> >> use the AnalyzedText ContentPart [1]
>> >>
>> >> here is some demo code you can use in the computeEnhancement method
>> >>
>> >>         AnalysedText at = NlpEngineHelper.getAnalysedText(this, ci,
>> true);
>> >>         Iterator<? extends Section> sections = at.getSentences();
>> >>         if(!sections.hasNext()){ //process as single sentence
>> >>             sections = Collections.singleton(at).iterator();
>> >>         }
>> >>
>> >>         while(sections.hasNext()){
>> >>             Section section = sections.next();
>> >>             Iterator<Span> chunks =
>> >> section.getEnclosed(EnumSet.of(SpanTypeEnum.Chunk));
>> >>             while(chunks.hasNext()){
>> >>                 Span chunk = chunks.next();
>> >>                 Value<PhraseTag> phrase =
>> >> chunk.getAnnotation(NlpAnnotations.PHRASE_ANNOTATION);
>> >>                 if(phrase.value().getCategory() ==
>> LexicalCategory.Noun){
>> >>                     log.info(" - NounPhrase [{},{}] {}", new Object[]{
>> >>
>> >> chunk.getStart(),chunk.getEnd(),chunk.getSpan()});
>> >>                 }
>> >>             }
>> >>         }
>> >>
>> >> hope this helps
>> >>
>> >> best
>> >> Rupert
>> >>
>> >> [1]
>> >>
>> http://stanbol.apache.org/docs/trunk/components/enhancer/nlp/analyzedtext
>> >>
>> >> On Sun, Mar 9, 2014 at 6:07 PM, Cristian Petroaca
>> >> <cr...@gmail.com> wrote:
>> >> > I started to implement the engine and I'm having problems with getting
>> >> > results for noun phrases. I modified the "default" weighted chain to
>> also
>> >> > include the PosChunkerEngine and ran a sample text : "Angela Merkel
>> >> visted
>> >> > China. The german chancellor met with various people". I expected that
>> >> the
>> >> > RDF XML output would contain some info about the noun phrases but I
>> >> cannot
>> >> > see any.
>> >> > Could you point me to the correct way to generate the noun phrases?
>> >> >
>> >> > Thanks,
>> >> > Cristian
>> >> >
>> >> >
>> >> > 2014-02-09 14:15 GMT+02:00 Cristian Petroaca <
>> >> cristian.petroaca@gmail.com>:
>> >> >
>> >> >> Opened https://issues.apache.org/jira/browse/STANBOL-1279
>> >> >>
>> >> >>
>> >> >> 2014-02-07 10:53 GMT+02:00 Cristian Petroaca <
>> >> cristian.petroaca@gmail.com>
>> >> >> :
>> >> >>
>> >> >> Hi Rupert,
>> >> >>>
>> >> >>> The "spatial" dimension is a good idea. I'll also take a look at
>> Yago.
>> >> >>>
>> >> >>> I will create a Jira with what we talked about here. It will
>> probably
>> >> >>> have just a draft-like description for now and will be updated as I
>> go
>> >> >>> along.
>> >> >>>
>> >> >>> Thanks,
>> >> >>> Cristian
>> >> >>>
>> >> >>>
>> >> >>> 2014-02-06 15:39 GMT+02:00 Rupert Westenthaler <
>> >> >>> rupert.westenthaler@gmail.com>:
>> >> >>>
>> >> >>> Hi Cristian,
>> >> >>>>
>> >> >>>> definitely an interesting approach. You should have a look at Yago2
>> >> >>>> [1]. As far as I can remember the Yago taxonomy is much better
>> >> >>>> structured as the one used by dbpedia. Mapping suggestions of
>> dbpedia
>> >> >>>> to concepts in Yago2 is easy as both dbpedia and yago2 do provide
>> >> >>>> mappings [2] and [3]
>> >> >>>>
>> >> >>>> > 2014-02-05 15:39 GMT+02:00 Rafa Haro <rh...@apache.org>:
>> >> >>>> >>
>> >> >>>> >> "Microsoft posted its 2013 earnings. The Redmond's company made
>> a
>> >> >>>> >> huge profit".
>> >> >>>>
>> >> >>>> Thats actually a very good example. Spatial contexts are very
>> >> >>>> important as they tend to be often used for referencing. So I would
>> >> >>>> suggest to specially treat the spatial context. For spatial
>> Entities
>> >> >>>> (like a City) this is easy, but even for other (like a Person,
>> >> >>>> Company) you could use relations to spatial entities define their
>> >> >>>> spatial context. This context could than be used to correctly link
>> >> >>>> "The Redmond's company" to "Microsoft".
>> >> >>>>
>> >> >>>> In addition I would suggest to use the "spatial" context of each
>> >> >>>> entity (basically relation to entities that are cities, regions,
>> >> >>>> countries) as a separate dimension, because those are very often
>> used
>> >> >>>> for coreferences.
>> >> >>>>
>> >> >>>> [1] http://www.mpi-inf.mpg.de/yago-naga/yago/
>> >> >>>> [2] http://downloads.dbpedia.org/3.9/links/yago_links.nt.bz2
>> >> >>>> [3]
>> >> >>>>
>> >>
>> http://www.mpi-inf.mpg.de/yago-naga/yago/download/yago/yagoDBpediaInstances.ttl.7z
>> >> >>>>
>> >> >>>>
>> >> >>>> On Thu, Feb 6, 2014 at 10:33 AM, Cristian Petroaca
>> >> >>>> <cr...@gmail.com> wrote:
>> >> >>>> > There are several dbpedia categories for each entity, in this
>> case
>> >> for
>> >> >>>> > Microsoft we have :
>> >> >>>> >
>> >> >>>> > category:Companies_in_the_NASDAQ-100_Index
>> >> >>>> > category:Microsoft
>> >> >>>> > category:Software_companies_of_the_United_States
>> >> >>>> > category:Software_companies_based_in_Washington_(state)
>> >> >>>> > category:Companies_established_in_1975
>> >> >>>> > category:1975_establishments_in_the_United_States
>> >> >>>> > category:Companies_based_in_Redmond,_Washington
>> >> >>>> >
>> category:Multinational_companies_headquartered_in_the_United_States
>> >> >>>> > category:Cloud_computing_providers
>> >> >>>> > category:Companies_in_the_Dow_Jones_Industrial_Average
>> >> >>>> >
>> >> >>>> > So we also have "Companies based in Redmont,Washington" which
>> could
>> >> be
>> >> >>>> > matched.
>> >> >>>> >
>> >> >>>> >
>> >> >>>> > There is still other contextual information from dbpedia which
>> can
>> >> be
>> >> >>>> used.
>> >> >>>> > For example for an Organization we could also include :
>> >> >>>> > dbpprop:industry = Software
>> >> >>>> > dbpprop:service = Online Service Providers
>> >> >>>> >
>> >> >>>> > and for a Person (that's for Barack Obama) :
>> >> >>>> >
>> >> >>>> > dbpedia-owl:profession:
>> >> >>>> >                                dbpedia:Author
>> >> >>>> >                                dbpedia:Constitutional_law
>> >> >>>> >                                dbpedia:Lawyer
>> >> >>>> >                                dbpedia:Community_organizing
>> >> >>>> >
>> >> >>>> > I'd like to continue investigating this as I think that it may
>> have
>> >> >>>> some
>> >> >>>> > value in increasing the number of coreference resolutions and I'd
>> >> like
>> >> >>>> to
>> >> >>>> > concentrate more on precision rather than recall since we already
>> >> have
>> >> >>>> a
>> >> >>>> > set of coreferences detected by the stanford nlp tool and this
>> would
>> >> >>>> be as
>> >> >>>> > an addition to that (at least this is how I would like to use
>> it).
>> >> >>>> >
>> >> >>>> > Is it ok if I track this by opening a jira? I could update it to
>> >> show
>> >> >>>> my
>> >> >>>> > progress and also my conclusions and if it turns out that it was
>> a
>> >> bad
>> >> >>>> idea
>> >> >>>> > then that's the situation at least I'll end up with more
>> knowledge
>> >> >>>> about
>> >> >>>> > Stanbol in the end :).
>> >> >>>> >
>> >> >>>> >
>> >> >>>> > 2014-02-05 15:39 GMT+02:00 Rafa Haro <rh...@apache.org>:
>> >> >>>> >
>> >> >>>> >> Hi Cristian,
>> >> >>>> >>
>> >> >>>> >> The approach sounds nice. I don't want to be the devil's
>> advocate
>> >> but
>> >> >>>> I'm
>> >> >>>> >> just not sure about the recall using the dbpedia categories
>> >> feature.
>> >> >>>> For
>> >> >>>> >> example, your sentence could be also "Microsoft posted its 2013
>> >> >>>> earnings.
>> >> >>>> >> The Redmond's company made a huge profit". So, maybe including
>> more
>> >> >>>> >> contextual information from dbpedia could increase the recall
>> but
>> >> of
>> >> >>>> course
>> >> >>>> >> will reduce the precision.
>> >> >>>> >>
>> >> >>>> >> Cheers,
>> >> >>>> >> Rafa
>> >> >>>> >>
>> >> >>>> >> El 04/02/14 09:50, Cristian Petroaca escribió:
>> >> >>>> >>
>> >> >>>> >>  Back with a more detailed description of the steps for making
>> this
>> >> >>>> kind of
>> >> >>>> >>> coreference work.
>> >> >>>> >>>
>> >> >>>> >>> I will be using references to the following text in the steps
>> >> below
>> >> >>>> in
>> >> >>>> >>> order to make things clearer : "Microsoft posted its 2013
>> >> earnings.
>> >> >>>> The
>> >> >>>> >>> software company made a huge profit."
>> >> >>>> >>>
>> >> >>>> >>> 1. For every noun phrase in the text which has :
>> >> >>>> >>>      a. a determinate pos which implies reference to an entity
>> >> local
>> >> >>>> to
>> >> >>>> >>> the
>> >> >>>> >>> text, such as "the, this, these") but not "another, every", etc
>> >> which
>> >> >>>> >>> implies a reference to an entity outside of the text.
>> >> >>>> >>>      b. having at least another noun aside from the main
>> required
>> >> >>>> noun
>> >> >>>> >>> which
>> >> >>>> >>> further describes it. For example I will not count "The
>> company"
>> >> as
>> >> >>>> being
>> >> >>>> >>> a
>> >> >>>> >>> legitimate candidate since this could create a lot of false
>> >> >>>> positives by
>> >> >>>> >>> considering the double meaning of some words such as "in the
>> >> company
>> >> >>>> of
>> >> >>>> >>> good people".
>> >> >>>> >>> "The software company" is a good candidate since we also have
>> >> >>>> "software".
>> >> >>>> >>>
>> >> >>>> >>> 2. match the nouns in the noun phrase to the contents of the
>> >> dbpedia
>> >> >>>> >>> categories of each named entity found prior to the location of
>> the
>> >> >>>> noun
>> >> >>>> >>> phrase in the text.
>> >> >>>> >>> The dbpedia categories are in the following format (for
>> Microsoft
>> >> for
>> >> >>>> >>> example) : "Software companies of the United States".
>> >> >>>> >>>   So we try to match "software company" with that.
>> >> >>>> >>> First, as you can see, the main noun in the dbpedia category
>> has a
>> >> >>>> plural
>> >> >>>> >>> form and it's the same for all categories which I saw. I don't
>> >> know
>> >> >>>> if
>> >> >>>> >>> there's an easier way to do this but I thought of applying a
>> >> >>>> lemmatizer on
>> >> >>>> >>> the category and the noun phrase in order for them to have a
>> >> common
>> >> >>>> >>> denominator.This also works if the noun phrase itself has a
>> plural
>> >> >>>> form.
>> >> >>>> >>>
>> >> >>>> >>> Second, I'll need to use for comparison only the words in the
>> >> >>>> category
>> >> >>>> >>> which are themselves nouns and not prepositions or determiners
>> >> such
>> >> >>>> as "of
>> >> >>>> >>> the".This means that I need to pos tag the categories contents
>> as
>> >> >>>> well.
>> >> >>>> >>> I was thinking of running the pos and lemma on the dbpedia
>> >> >>>> categories when
>> >> >>>> >>> building the dbpedia backed entity hub and storing them for
>> later
>> >> >>>> use - I
>> >> >>>> >>> don't know how feasible this is at the moment.
>> >> >>>> >>>
>> >> >>>> >>> After this I can compare each noun in the noun phrase with the
>> >> >>>> equivalent
>> >> >>>> >>> nouns in the categories and based on the number of matches I
>> can
>> >> >>>> create a
>> >> >>>> >>> confidence level.
>> >> >>>> >>>
>> >> >>>> >>> 3. match the noun of the noun phrase with the rdf:type from
>> >> dbpedia
>> >> >>>> of the
>> >> >>>> >>> named entity. If this matches increase the confidence level.
>> >> >>>> >>>
>> >> >>>> >>> 4. If there are multiple named entities which can match a
>> certain
>> >> >>>> noun
>> >> >>>> >>> phrase then link the noun phrase with the closest named entity
>> >> prior
>> >> >>>> to it
>> >> >>>> >>> in the text.
>> >> >>>> >>>
>> >> >>>> >>> What do you think?
>> >> >>>> >>>
>> >> >>>> >>> Cristian
>> >> >>>> >>>
>> >> >>>> >>> 2014-01-31 Cristian Petroaca <cr...@gmail.com>:
>> >> >>>> >>>
>> >> >>>> >>>  Hi Rafa,
>> >> >>>> >>>>
>> >> >>>> >>>> I don't yet have a concrete heursitic but I'm working on it.
>> I'll
>> >> >>>> provide
>> >> >>>> >>>> it here so that you guys can give me a feedback on it.
>> >> >>>> >>>>
>> >> >>>> >>>> What are "locality" features?
>> >> >>>> >>>>
>> >> >>>> >>>> I looked at Bart and other coref tools such as ArkRef and
>> >> >>>> CherryPicker
>> >> >>>> >>>> and
>> >> >>>> >>>> they don't provide such a coreference.
>> >> >>>> >>>>
>> >> >>>> >>>> Cristian
>> >> >>>> >>>>
>> >> >>>> >>>>
>> >> >>>> >>>> 2014-01-30 Rafa Haro <rh...@apache.org>:
>> >> >>>> >>>>
>> >> >>>> >>>> Hi Cristian,
>> >> >>>> >>>>
>> >> >>>> >>>>> Without having more details about your concrete heuristic,
>> in my
>> >> >>>> honest
>> >> >>>> >>>>> opinion, such approach could produce a lot of false
>> positives. I
>> >> >>>> don't
>> >> >>>> >>>>> know
>> >> >>>> >>>>> if you are planning to use some "locality" features to detect
>> >> such
>> >> >>>> >>>>> coreferences but you need to take into account that it is
>> quite
>> >> >>>> usual
>> >> >>>> >>>>> that
>> >> >>>> >>>>> coreferenced mentions can occurs even in different
>> paragraphs.
>> >> >>>> Although
>> >> >>>> >>>>> I'm
>> >> >>>> >>>>> not an expert in Natural Language Understanding, I would say
>> it
>> >> is
>> >> >>>> quite
>> >> >>>> >>>>> difficult to get decent precision/recall rates for
>> coreferencing
>> >> >>>> using
>> >> >>>> >>>>> fixed rules. Maybe you can give a try to others tools like
>> BART
>> >> (
>> >> >>>> >>>>> http://www.bart-coref.org/).
>> >> >>>> >>>>>
>> >> >>>> >>>>> Cheers,
>> >> >>>> >>>>> Rafa Haro
>> >> >>>> >>>>>
>> >> >>>> >>>>> El 30/01/14 10:33, Cristian Petroaca escribió:
>> >> >>>> >>>>>
>> >> >>>> >>>>>   Hi,
>> >> >>>> >>>>>
>> >> >>>> >>>>>> One of the necessary steps for implementing the Event
>> >> extraction
>> >> >>>> Engine
>> >> >>>> >>>>>> feature :
>> https://issues.apache.org/jira/browse/STANBOL-1121is
>> >> >>>> to
>> >> >>>> >>>>>> have
>> >> >>>> >>>>>> coreference resolution in the given text. This is provided
>> now
>> >> >>>> via the
>> >> >>>> >>>>>> stanford-nlp project but as far as I saw this module is
>> >> performing
>> >> >>>> >>>>>> mostly
>> >> >>>> >>>>>> pronomial (He, She) or nominal (Barack Obama and Mr. Obama)
>> >> >>>> coreference
>> >> >>>> >>>>>> resolution.
>> >> >>>> >>>>>>
>> >> >>>> >>>>>> In order to get more coreferences from the text I though of
>> >> >>>> creating
>> >> >>>> >>>>>> some
>> >> >>>> >>>>>> logic that would detect this kind of coreference :
>> >> >>>> >>>>>> "Apple reaches new profit heights. The software company just
>> >> >>>> announced
>> >> >>>> >>>>>> its
>> >> >>>> >>>>>> 2013 earnings."
>> >> >>>> >>>>>> Here "The software company" obviously refers to "Apple".
>> >> >>>> >>>>>> So I'd like to detect coreferences of Named Entities which
>> are
>> >> of
>> >> >>>> the
>> >> >>>> >>>>>> rdf:type of the Named Entity , in this case "company" and
>> also
>> >> >>>> have
>> >> >>>> >>>>>> attributes which can be found in the dbpedia categories of
>> the
>> >> >>>> named
>> >> >>>> >>>>>> entity, in this case "software".
>> >> >>>> >>>>>>
>> >> >>>> >>>>>> The detection of coreferences such as "The software
>> company" in
>> >> >>>> the
>> >> >>>> >>>>>> text
>> >> >>>> >>>>>> would also be done by either using the new Pos Tag Based
>> Phrase
>> >> >>>> >>>>>> extraction
>> >> >>>> >>>>>> Engine (noun phrases) or by using a dependency tree of the
>> >> >>>> sentence and
>> >> >>>> >>>>>> picking up only subjects or objects.
>> >> >>>> >>>>>>
>> >> >>>> >>>>>> At this point I'd like to know if this kind of logic would
>> be
>> >> >>>> useful
>> >> >>>> >>>>>> as a
>> >> >>>> >>>>>> separate Enhancement Engine (in case the precision and
>> recall
>> >> are
>> >> >>>> good
>> >> >>>> >>>>>> enough) in Stanbol?
>> >> >>>> >>>>>>
>> >> >>>> >>>>>> Thanks,
>> >> >>>> >>>>>> Cristian
>> >> >>>> >>>>>>
>> >> >>>> >>>>>>
>> >> >>>> >>>>>>
>> >> >>>> >>
>> >> >>>>
>> >> >>>>
>> >> >>>>
>> >> >>>> --
>> >> >>>> | Rupert Westenthaler             rupert.westenthaler@gmail.com
>> >> >>>> | Bodenlehenstraße 11
>> ++43-699-11108907
>> >> >>>> | A-5500 Bischofshofen
>> >> >>>>
>> >> >>>
>> >> >>>
>> >> >>
>> >>
>> >>
>> >>
>> >> --
>> >> | Rupert Westenthaler             rupert.westenthaler@gmail.com
>> >> | Bodenlehenstraße 11                             ++43-699-11108907
>> >> | A-5500 Bischofshofen
>> >>
>>
>>
>>
>> --
>> | Rupert Westenthaler             rupert.westenthaler@gmail.com
>> | Bodenlehenstraße 11                             ++43-699-11108907
>> | A-5500 Bischofshofen
>>



-- 
| Rupert Westenthaler             rupert.westenthaler@gmail.com
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen

Re: Named entity coref resolution based on dbpedia categories and rdf:type

Posted by Cristian Petroaca <cr...@gmail.com>.
Thanks Rupert.

A couple more questions/issues :

1. Whenever I start the stanbol server I'm seeing this in the console
output :

ERROR: Bundle org.apache.stanbol.enhancer.engine.topic.web [153]: Error starting
slinginstall:c:\Data\Projects\Stanbol\main\launchers\stable\target\stanbol\startup\35\org.apache.stanbol.enhancer.engine.topic.web-1.0.0-SNAPSHOT.jar
(org.osgi.framework.BundleException: Unresolved constraint in bundle
org.apache.stanbol.enhancer.engine.topic.web [153]: Unable to resolve 153.0:
missing requirement [153.0] package;
(&(package=javax.ws.rs)(version>=0.0.0)(!(version>=2.0.0))))
org.osgi.framework.BundleException: Unresolved constraint in bundle
org.apache.stanbol.enhancer.engine.topic.web [153]: Unable to resolve 153.0:
missing requirement [153.0] package;
(&(package=javax.ws.rs)(version>=0.0.0)(!(version>=2.0.0)))
        at org.apache.felix.framework.Felix.resolveBundle(Felix.java:3443)
        at org.apache.felix.framework.Felix.startBundle(Felix.java:1727)
        at org.apache.felix.framework.Felix.setActiveStartLevel(Felix.java:1156)
        at org.apache.felix.framework.StartLevelImpl.run(StartLevelImpl.java:264)
        at java.lang.Thread.run(Unknown Source)
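
(Decoding the filter in that message, just my reading of it: the bundle seems to declare an
Import-Package for javax.ws.rs restricted to versions below 2.0, roughly equivalent to

    Import-Package: javax.ws.rs;version="[0.0.0,2.0.0)"

so it would need a pre-2.0 JAX-RS API bundle to be present in order to resolve.)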

Despite this, the server starts fine and I can use the enhancer without problems. Do
you guys see this as well?


2. Whenever I restart the server the Weighted Chains get messed up. I
usually use the 'default' chain and add my engine to it, so there are 11
engines in it. After the restart this chain contains around 23 engines
in total.




2014-03-11 9:47 GMT+02:00 Rupert Westenthaler <rupert.westenthaler@gmail.com
>:

> Hi Cristian,
>
> NER Annotations are typically available as both
> NlpAnnotations.NER_ANNOTATION and  fise:TextAnnotation [1] in the
> enhancement metadata. As you are already accessing the AnayzedText I
> would prefer using the  NlpAnnotations.NER_ANNOTATION.
>
> best
> Rupert
>
> [1]
> http://stanbol.apache.org/docs/trunk/components/enhancer/enhancementstructure.html#fisetextannotation
>
> On Mon, Mar 10, 2014 at 10:07 PM, Cristian Petroaca
> <cr...@gmail.com> wrote:
> > Thanks.
> > I assume I should get the Named entities using the same but with
> > NlpAnnotations.NER_ANNOTATION?
> >
> >
> >
> > 2014-03-10 13:29 GMT+02:00 Rupert Westenthaler <
> > rupert.westenthaler@gmail.com>:
> >
> >> Hallo Cristian,
> >>
> >> NounPhrases are not added to the RDF enhancement results. You need to
> >> use the AnalyzedText ContentPart [1]
> >>
> >> here is some demo code you can use in the computeEnhancement method
> >>
> >>         AnalysedText at = NlpEngineHelper.getAnalysedText(this, ci,
> true);
> >>         Iterator<? extends Section> sections = at.getSentences();
> >>         if(!sections.hasNext()){ //process as single sentence
> >>             sections = Collections.singleton(at).iterator();
> >>         }
> >>
> >>         while(sections.hasNext()){
> >>             Section section = sections.next();
> >>             Iterator<Span> chunks =
> >> section.getEnclosed(EnumSet.of(SpanTypeEnum.Chunk));
> >>             while(chunks.hasNext()){
> >>                 Span chunk = chunks.next();
> >>                 Value<PhraseTag> phrase =
> >> chunk.getAnnotation(NlpAnnotations.PHRASE_ANNOTATION);
> >>                 if(phrase.value().getCategory() ==
> LexicalCategory.Noun){
> >>                     log.info(" - NounPhrase [{},{}] {}", new Object[]{
> >>
> >> chunk.getStart(),chunk.getEnd(),chunk.getSpan()});
> >>                 }
> >>             }
> >>         }
> >>
> >> hope this helps
> >>
> >> best
> >> Rupert
> >>
> >> [1]
> >>
> http://stanbol.apache.org/docs/trunk/components/enhancer/nlp/analyzedtext
> >>
> >> On Sun, Mar 9, 2014 at 6:07 PM, Cristian Petroaca
> >> <cr...@gmail.com> wrote:
> >> > I started to implement the engine and I'm having problems with getting
> >> > results for noun phrases. I modified the "default" weighted chain to
> also
> >> > include the PosChunkerEngine and ran a sample text : "Angela Merkel
> >> visted
> >> > China. The german chancellor met with various people". I expected that
> >> the
> >> > RDF XML output would contain some info about the noun phrases but I
> >> cannot
> >> > see any.
> >> > Could you point me to the correct way to generate the noun phrases?
> >> >
> >> > Thanks,
> >> > Cristian
> >> >
> >> >
> >> > 2014-02-09 14:15 GMT+02:00 Cristian Petroaca <
> >> cristian.petroaca@gmail.com>:
> >> >
> >> >> Opened https://issues.apache.org/jira/browse/STANBOL-1279
> >> >>
> >> >>
> >> >> 2014-02-07 10:53 GMT+02:00 Cristian Petroaca <
> >> cristian.petroaca@gmail.com>
> >> >> :
> >> >>
> >> >> Hi Rupert,
> >> >>>
> >> >>> The "spatial" dimension is a good idea. I'll also take a look at
> Yago.
> >> >>>
> >> >>> I will create a Jira with what we talked about here. It will
> probably
> >> >>> have just a draft-like description for now and will be updated as I
> go
> >> >>> along.
> >> >>>
> >> >>> Thanks,
> >> >>> Cristian
> >> >>>
> >> >>>
> >> >>> 2014-02-06 15:39 GMT+02:00 Rupert Westenthaler <
> >> >>> rupert.westenthaler@gmail.com>:
> >> >>>
> >> >>> Hi Cristian,
> >> >>>>
> >> >>>> definitely an interesting approach. You should have a look at Yago2
> >> >>>> [1]. As far as I can remember the Yago taxonomy is much better
> >> >>>> structured as the one used by dbpedia. Mapping suggestions of
> dbpedia
> >> >>>> to concepts in Yago2 is easy as both dbpedia and yago2 do provide
> >> >>>> mappings [2] and [3]
> >> >>>>
> >> >>>> > 2014-02-05 15:39 GMT+02:00 Rafa Haro <rh...@apache.org>:
> >> >>>> >>
> >> >>>> >> "Microsoft posted its 2013 earnings. The Redmond's company made
> a
> >> >>>> >> huge profit".
> >> >>>>
> >> >>>> Thats actually a very good example. Spatial contexts are very
> >> >>>> important as they tend to be often used for referencing. So I would
> >> >>>> suggest to specially treat the spatial context. For spatial
> Entities
> >> >>>> (like a City) this is easy, but even for other (like a Person,
> >> >>>> Company) you could use relations to spatial entities define their
> >> >>>> spatial context. This context could than be used to correctly link
> >> >>>> "The Redmond's company" to "Microsoft".
> >> >>>>
> >> >>>> In addition I would suggest to use the "spatial" context of each
> >> >>>> entity (basically relation to entities that are cities, regions,
> >> >>>> countries) as a separate dimension, because those are very often
> used
> >> >>>> for coreferences.
> >> >>>>
> >> >>>> [1] http://www.mpi-inf.mpg.de/yago-naga/yago/
> >> >>>> [2] http://downloads.dbpedia.org/3.9/links/yago_links.nt.bz2
> >> >>>> [3]
> >> >>>>
> >>
> http://www.mpi-inf.mpg.de/yago-naga/yago/download/yago/yagoDBpediaInstances.ttl.7z
> >> >>>>
> >> >>>>
> >> >>>> On Thu, Feb 6, 2014 at 10:33 AM, Cristian Petroaca
> >> >>>> <cr...@gmail.com> wrote:
> >> >>>> > There are several dbpedia categories for each entity, in this
> case
> >> for
> >> >>>> > Microsoft we have :
> >> >>>> >
> >> >>>> > category:Companies_in_the_NASDAQ-100_Index
> >> >>>> > category:Microsoft
> >> >>>> > category:Software_companies_of_the_United_States
> >> >>>> > category:Software_companies_based_in_Washington_(state)
> >> >>>> > category:Companies_established_in_1975
> >> >>>> > category:1975_establishments_in_the_United_States
> >> >>>> > category:Companies_based_in_Redmond,_Washington
> >> >>>> >
> category:Multinational_companies_headquartered_in_the_United_States
> >> >>>> > category:Cloud_computing_providers
> >> >>>> > category:Companies_in_the_Dow_Jones_Industrial_Average
> >> >>>> >
> >> >>>> > So we also have "Companies based in Redmont,Washington" which
> could
> >> be
> >> >>>> > matched.
> >> >>>> >
> >> >>>> >
> >> >>>> > There is still other contextual information from dbpedia which
> can
> >> be
> >> >>>> used.
> >> >>>> > For example for an Organization we could also include :
> >> >>>> > dbpprop:industry = Software
> >> >>>> > dbpprop:service = Online Service Providers
> >> >>>> >
> >> >>>> > and for a Person (that's for Barack Obama) :
> >> >>>> >
> >> >>>> > dbpedia-owl:profession:
> >> >>>> >                                dbpedia:Author
> >> >>>> >                                dbpedia:Constitutional_law
> >> >>>> >                                dbpedia:Lawyer
> >> >>>> >                                dbpedia:Community_organizing
> >> >>>> >
> >> >>>> > I'd like to continue investigating this as I think that it may
> have
> >> >>>> some
> >> >>>> > value in increasing the number of coreference resolutions and I'd
> >> like
> >> >>>> to
> >> >>>> > concentrate more on precision rather than recall since we already
> >> have
> >> >>>> a
> >> >>>> > set of coreferences detected by the stanford nlp tool and this
> would
> >> >>>> be as
> >> >>>> > an addition to that (at least this is how I would like to use
> it).
> >> >>>> >
> >> >>>> > Is it ok if I track this by opening a jira? I could update it to
> >> show
> >> >>>> my
> >> >>>> > progress and also my conclusions and if it turns out that it was
> a
> >> bad
> >> >>>> idea
> >> >>>> > then that's the situation at least I'll end up with more
> knowledge
> >> >>>> about
> >> >>>> > Stanbol in the end :).
> >> >>>> >
> >> >>>> >
> >> >>>> > 2014-02-05 15:39 GMT+02:00 Rafa Haro <rh...@apache.org>:
> >> >>>> >
> >> >>>> >> Hi Cristian,
> >> >>>> >>
> >> >>>> >> The approach sounds nice. I don't want to be the devil's
> advocate
> >> but
> >> >>>> I'm
> >> >>>> >> just not sure about the recall using the dbpedia categories
> >> feature.
> >> >>>> For
> >> >>>> >> example, your sentence could be also "Microsoft posted its 2013
> >> >>>> earnings.
> >> >>>> >> The Redmond's company made a huge profit". So, maybe including
> more
> >> >>>> >> contextual information from dbpedia could increase the recall
> but
> >> of
> >> >>>> course
> >> >>>> >> will reduce the precision.
> >> >>>> >>
> >> >>>> >> Cheers,
> >> >>>> >> Rafa
> >> >>>> >>
> >> >>>> >> El 04/02/14 09:50, Cristian Petroaca escribió:
> >> >>>> >>
> >> >>>> >>  Back with a more detailed description of the steps for making
> this
> >> >>>> kind of
> >> >>>> >>> coreference work.
> >> >>>> >>>
> >> >>>> >>> I will be using references to the following text in the steps
> >> below
> >> >>>> in
> >> >>>> >>> order to make things clearer : "Microsoft posted its 2013
> >> earnings.
> >> >>>> The
> >> >>>> >>> software company made a huge profit."
> >> >>>> >>>
> >> >>>> >>> 1. For every noun phrase in the text which has :
> >> >>>> >>>      a. a determinate pos which implies reference to an entity
> >> local
> >> >>>> to
> >> >>>> >>> the
> >> >>>> >>> text, such as "the, this, these") but not "another, every", etc
> >> which
> >> >>>> >>> implies a reference to an entity outside of the text.
> >> >>>> >>>      b. having at least another noun aside from the main
> required
> >> >>>> noun
> >> >>>> >>> which
> >> >>>> >>> further describes it. For example I will not count "The
> company"
> >> as
> >> >>>> being
> >> >>>> >>> a
> >> >>>> >>> legitimate candidate since this could create a lot of false
> >> >>>> positives by
> >> >>>> >>> considering the double meaning of some words such as "in the
> >> company
> >> >>>> of
> >> >>>> >>> good people".
> >> >>>> >>> "The software company" is a good candidate since we also have
> >> >>>> "software".
> >> >>>> >>>
> >> >>>> >>> 2. match the nouns in the noun phrase to the contents of the
> >> dbpedia
> >> >>>> >>> categories of each named entity found prior to the location of
> the
> >> >>>> noun
> >> >>>> >>> phrase in the text.
> >> >>>> >>> The dbpedia categories are in the following format (for
> Microsoft
> >> for
> >> >>>> >>> example) : "Software companies of the United States".
> >> >>>> >>>   So we try to match "software company" with that.
> >> >>>> >>> First, as you can see, the main noun in the dbpedia category
> has a
> >> >>>> plural
> >> >>>> >>> form and it's the same for all categories which I saw. I don't
> >> know
> >> >>>> if
> >> >>>> >>> there's an easier way to do this but I thought of applying a
> >> >>>> lemmatizer on
> >> >>>> >>> the category and the noun phrase in order for them to have a
> >> common
> >> >>>> >>> denominator.This also works if the noun phrase itself has a
> plural
> >> >>>> form.
> >> >>>> >>>
> >> >>>> >>> Second, I'll need to use for comparison only the words in the
> >> >>>> category
> >> >>>> >>> which are themselves nouns and not prepositions or determiners
> >> such
> >> >>>> as "of
> >> >>>> >>> the".This means that I need to pos tag the categories contents
> as
> >> >>>> well.
> >> >>>> >>> I was thinking of running the pos and lemma on the dbpedia
> >> >>>> categories when
> >> >>>> >>> building the dbpedia backed entity hub and storing them for
> later
> >> >>>> use - I
> >> >>>> >>> don't know how feasible this is at the moment.
> >> >>>> >>>
> >> >>>> >>> After this I can compare each noun in the noun phrase with the
> >> >>>> equivalent
> >> >>>> >>> nouns in the categories and based on the number of matches I
> can
> >> >>>> create a
> >> >>>> >>> confidence level.
> >> >>>> >>>
> >> >>>> >>> 3. match the noun of the noun phrase with the rdf:type from
> >> dbpedia
> >> >>>> of the
> >> >>>> >>> named entity. If this matches increase the confidence level.
> >> >>>> >>>
> >> >>>> >>> 4. If there are multiple named entities which can match a
> certain
> >> >>>> noun
> >> >>>> >>> phrase then link the noun phrase with the closest named entity
> >> prior
> >> >>>> to it
> >> >>>> >>> in the text.
> >> >>>> >>>
> >> >>>> >>> What do you think?
> >> >>>> >>>
> >> >>>> >>> Cristian
> >> >>>> >>>
> >> >>>> >>> 2014-01-31 Cristian Petroaca <cr...@gmail.com>:
> >> >>>> >>>
> >> >>>> >>>  Hi Rafa,
> >> >>>> >>>>
> >> >>>> >>>> I don't yet have a concrete heursitic but I'm working on it.
> I'll
> >> >>>> provide
> >> >>>> >>>> it here so that you guys can give me a feedback on it.
> >> >>>> >>>>
> >> >>>> >>>> What are "locality" features?
> >> >>>> >>>>
> >> >>>> >>>> I looked at Bart and other coref tools such as ArkRef and
> >> >>>> CherryPicker
> >> >>>> >>>> and
> >> >>>> >>>> they don't provide such a coreference.
> >> >>>> >>>>
> >> >>>> >>>> Cristian
> >> >>>> >>>>
> >> >>>> >>>>
> >> >>>> >>>> 2014-01-30 Rafa Haro <rh...@apache.org>:
> >> >>>> >>>>
> >> >>>> >>>> Hi Cristian,
> >> >>>> >>>>
> >> >>>> >>>>> Without having more details about your concrete heuristic,
> in my
> >> >>>> honest
> >> >>>> >>>>> opinion, such approach could produce a lot of false
> positives. I
> >> >>>> don't
> >> >>>> >>>>> know
> >> >>>> >>>>> if you are planning to use some "locality" features to detect
> >> such
> >> >>>> >>>>> coreferences but you need to take into account that it is
> quite
> >> >>>> usual
> >> >>>> >>>>> that
> >> >>>> >>>>> coreferenced mentions can occurs even in different
> paragraphs.
> >> >>>> Although
> >> >>>> >>>>> I'm
> >> >>>> >>>>> not an expert in Natural Language Understanding, I would say
> it
> >> is
> >> >>>> quite
> >> >>>> >>>>> difficult to get decent precision/recall rates for
> coreferencing
> >> >>>> using
> >> >>>> >>>>> fixed rules. Maybe you can give a try to others tools like
> BART
> >> (
> >> >>>> >>>>> http://www.bart-coref.org/).
> >> >>>> >>>>>
> >> >>>> >>>>> Cheers,
> >> >>>> >>>>> Rafa Haro
> >> >>>> >>>>>
> >> >>>> >>>>> El 30/01/14 10:33, Cristian Petroaca escribió:
> >> >>>> >>>>>
> >> >>>> >>>>>   Hi,
> >> >>>> >>>>>
> >> >>>> >>>>>> One of the necessary steps for implementing the Event
> >> extraction
> >> >>>> Engine
> >> >>>> >>>>>> feature :
> https://issues.apache.org/jira/browse/STANBOL-1121is
> >> >>>> to
> >> >>>> >>>>>> have
> >> >>>> >>>>>> coreference resolution in the given text. This is provided
> now
> >> >>>> via the
> >> >>>> >>>>>> stanford-nlp project but as far as I saw this module is
> >> performing
> >> >>>> >>>>>> mostly
> >> >>>> >>>>>> pronomial (He, She) or nominal (Barack Obama and Mr. Obama)
> >> >>>> coreference
> >> >>>> >>>>>> resolution.
> >> >>>> >>>>>>
> >> >>>> >>>>>> In order to get more coreferences from the text I though of
> >> >>>> creating
> >> >>>> >>>>>> some
> >> >>>> >>>>>> logic that would detect this kind of coreference :
> >> >>>> >>>>>> "Apple reaches new profit heights. The software company just
> >> >>>> announced
> >> >>>> >>>>>> its
> >> >>>> >>>>>> 2013 earnings."
> >> >>>> >>>>>> Here "The software company" obviously refers to "Apple".
> >> >>>> >>>>>> So I'd like to detect coreferences of Named Entities which
> are
> >> of
> >> >>>> the
> >> >>>> >>>>>> rdf:type of the Named Entity , in this case "company" and
> also
> >> >>>> have
> >> >>>> >>>>>> attributes which can be found in the dbpedia categories of
> the
> >> >>>> named
> >> >>>> >>>>>> entity, in this case "software".
> >> >>>> >>>>>>
> >> >>>> >>>>>> The detection of coreferences such as "The software
> company" in
> >> >>>> the
> >> >>>> >>>>>> text
> >> >>>> >>>>>> would also be done by either using the new Pos Tag Based
> Phrase
> >> >>>> >>>>>> extraction
> >> >>>> >>>>>> Engine (noun phrases) or by using a dependency tree of the
> >> >>>> sentence and
> >> >>>> >>>>>> picking up only subjects or objects.
> >> >>>> >>>>>>
> >> >>>> >>>>>> At this point I'd like to know if this kind of logic would
> be
> >> >>>> useful
> >> >>>> >>>>>> as a
> >> >>>> >>>>>> separate Enhancement Engine (in case the precision and
> recall
> >> are
> >> >>>> good
> >> >>>> >>>>>> enough) in Stanbol?
> >> >>>> >>>>>>
> >> >>>> >>>>>> Thanks,
> >> >>>> >>>>>> Cristian
> >> >>>> >>>>>>
> >> >>>> >>>>>>
> >> >>>> >>>>>>
> >> >>>> >>
> >> >>>>
> >> >>>>
> >> >>>>
> >> >>>> --
> >> >>>> | Rupert Westenthaler             rupert.westenthaler@gmail.com
> >> >>>> | Bodenlehenstraße 11
> ++43-699-11108907
> >> >>>> | A-5500 Bischofshofen
> >> >>>>
> >> >>>
> >> >>>
> >> >>
> >>
> >>
> >>
> >> --
> >> | Rupert Westenthaler             rupert.westenthaler@gmail.com
> >> | Bodenlehenstraße 11                             ++43-699-11108907
> >> | A-5500 Bischofshofen
> >>
>
>
>
> --
> | Rupert Westenthaler             rupert.westenthaler@gmail.com
> | Bodenlehenstraße 11                             ++43-699-11108907
> | A-5500 Bischofshofen
>

Re: Named entity coref resolution based on dbpedia categories and rdf:type

Posted by Rupert Westenthaler <ru...@gmail.com>.
Hi Cristian,

NER annotations are typically available as both
NlpAnnotations.NER_ANNOTATION and fise:TextAnnotation [1] in the
enhancement metadata. As you are already accessing the AnalysedText I
would prefer using the NlpAnnotations.NER_ANNOTATION.
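
For reference, here is a minimal sketch (untested) of how that could look, following
the same pattern as the noun phrase demo code quoted below. It assumes the NER
engines attach NerTag values (org.apache.stanbol.enhancer.nlp.ner.NerTag) to Chunk
spans of the AnalysedText:

        AnalysedText at = NlpEngineHelper.getAnalysedText(this, ci, true);
        //Chunk spans carrying a NER_ANNOTATION represent detected named entities
        Iterator<Span> spans = at.getEnclosed(EnumSet.of(SpanTypeEnum.Chunk));
        while(spans.hasNext()){
            Span span = spans.next();
            Value<NerTag> ner = span.getAnnotation(NlpAnnotations.NER_ANNOTATION);
            if(ner != null){ //not every chunk is a named entity
                log.info(" - NamedEntity [{},{}] {} (type: {})", new Object[]{
                        span.getStart(), span.getEnd(), span.getSpan(),
                        ner.value().getType()});
            }
        }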

best
Rupert

[1] http://stanbol.apache.org/docs/trunk/components/enhancer/enhancementstructure.html#fisetextannotation

On Mon, Mar 10, 2014 at 10:07 PM, Cristian Petroaca
<cr...@gmail.com> wrote:
> Thanks.
> I assume I should get the Named entities using the same but with
> NlpAnnotations.NER_ANNOTATION?
>
>
>
> 2014-03-10 13:29 GMT+02:00 Rupert Westenthaler <
> rupert.westenthaler@gmail.com>:
>
>> Hallo Cristian,
>>
>> NounPhrases are not added to the RDF enhancement results. You need to
>> use the AnalyzedText ContentPart [1]
>>
>> here is some demo code you can use in the computeEnhancement method
>>
>>         AnalysedText at = NlpEngineHelper.getAnalysedText(this, ci, true);
>>         Iterator<? extends Section> sections = at.getSentences();
>>         if(!sections.hasNext()){ //process as single sentence
>>             sections = Collections.singleton(at).iterator();
>>         }
>>
>>         while(sections.hasNext()){
>>             Section section = sections.next();
>>             Iterator<Span> chunks =
>> section.getEnclosed(EnumSet.of(SpanTypeEnum.Chunk));
>>             while(chunks.hasNext()){
>>                 Span chunk = chunks.next();
>>                 Value<PhraseTag> phrase =
>> chunk.getAnnotation(NlpAnnotations.PHRASE_ANNOTATION);
>>                 if(phrase.value().getCategory() == LexicalCategory.Noun){
>>                     log.info(" - NounPhrase [{},{}] {}", new Object[]{
>>
>> chunk.getStart(),chunk.getEnd(),chunk.getSpan()});
>>                 }
>>             }
>>         }
>>
>> hope this helps
>>
>> best
>> Rupert
>>
>> [1]
>> http://stanbol.apache.org/docs/trunk/components/enhancer/nlp/analyzedtext
>>
>> On Sun, Mar 9, 2014 at 6:07 PM, Cristian Petroaca
>> <cr...@gmail.com> wrote:
>> > I started to implement the engine and I'm having problems with getting
>> > results for noun phrases. I modified the "default" weighted chain to also
>> > include the PosChunkerEngine and ran a sample text : "Angela Merkel
>> visted
>> > China. The german chancellor met with various people". I expected that
>> the
>> > RDF XML output would contain some info about the noun phrases but I
>> cannot
>> > see any.
>> > Could you point me to the correct way to generate the noun phrases?
>> >
>> > Thanks,
>> > Cristian
>> >
>> >
>> > 2014-02-09 14:15 GMT+02:00 Cristian Petroaca <
>> cristian.petroaca@gmail.com>:
>> >
>> >> Opened https://issues.apache.org/jira/browse/STANBOL-1279
>> >>
>> >>
>> >> 2014-02-07 10:53 GMT+02:00 Cristian Petroaca <
>> cristian.petroaca@gmail.com>
>> >> :
>> >>
>> >> Hi Rupert,
>> >>>
>> >>> The "spatial" dimension is a good idea. I'll also take a look at Yago.
>> >>>
>> >>> I will create a Jira with what we talked about here. It will probably
>> >>> have just a draft-like description for now and will be updated as I go
>> >>> along.
>> >>>
>> >>> Thanks,
>> >>> Cristian
>> >>>
>> >>>
>> >>> 2014-02-06 15:39 GMT+02:00 Rupert Westenthaler <
>> >>> rupert.westenthaler@gmail.com>:
>> >>>
>> >>> Hi Cristian,
>> >>>>
>> >>>> definitely an interesting approach. You should have a look at Yago2
>> >>>> [1]. As far as I can remember the Yago taxonomy is much better
>> >>>> structured as the one used by dbpedia. Mapping suggestions of dbpedia
>> >>>> to concepts in Yago2 is easy as both dbpedia and yago2 do provide
>> >>>> mappings [2] and [3]
>> >>>>
>> >>>> > 2014-02-05 15:39 GMT+02:00 Rafa Haro <rh...@apache.org>:
>> >>>> >>
>> >>>> >> "Microsoft posted its 2013 earnings. The Redmond's company made a
>> >>>> >> huge profit".
>> >>>>
>> >>>> Thats actually a very good example. Spatial contexts are very
>> >>>> important as they tend to be often used for referencing. So I would
>> >>>> suggest to specially treat the spatial context. For spatial Entities
>> >>>> (like a City) this is easy, but even for other (like a Person,
>> >>>> Company) you could use relations to spatial entities define their
>> >>>> spatial context. This context could than be used to correctly link
>> >>>> "The Redmond's company" to "Microsoft".
>> >>>>
>> >>>> In addition I would suggest to use the "spatial" context of each
>> >>>> entity (basically relation to entities that are cities, regions,
>> >>>> countries) as a separate dimension, because those are very often used
>> >>>> for coreferences.
>> >>>>
>> >>>> [1] http://www.mpi-inf.mpg.de/yago-naga/yago/
>> >>>> [2] http://downloads.dbpedia.org/3.9/links/yago_links.nt.bz2
>> >>>> [3]
>> >>>>
>> http://www.mpi-inf.mpg.de/yago-naga/yago/download/yago/yagoDBpediaInstances.ttl.7z
>> >>>>
>> >>>>
>> >>>> On Thu, Feb 6, 2014 at 10:33 AM, Cristian Petroaca
>> >>>> <cr...@gmail.com> wrote:
>> >>>> > There are several dbpedia categories for each entity, in this case
>> for
>> >>>> > Microsoft we have :
>> >>>> >
>> >>>> > category:Companies_in_the_NASDAQ-100_Index
>> >>>> > category:Microsoft
>> >>>> > category:Software_companies_of_the_United_States
>> >>>> > category:Software_companies_based_in_Washington_(state)
>> >>>> > category:Companies_established_in_1975
>> >>>> > category:1975_establishments_in_the_United_States
>> >>>> > category:Companies_based_in_Redmond,_Washington
>> >>>> > category:Multinational_companies_headquartered_in_the_United_States
>> >>>> > category:Cloud_computing_providers
>> >>>> > category:Companies_in_the_Dow_Jones_Industrial_Average
>> >>>> >
>> >>>> > So we also have "Companies based in Redmont,Washington" which could
>> be
>> >>>> > matched.
>> >>>> >
>> >>>> >
>> >>>> > There is still other contextual information from dbpedia which can
>> be
>> >>>> used.
>> >>>> > For example for an Organization we could also include :
>> >>>> > dbpprop:industry = Software
>> >>>> > dbpprop:service = Online Service Providers
>> >>>> >
>> >>>> > and for a Person (that's for Barack Obama) :
>> >>>> >
>> >>>> > dbpedia-owl:profession:
>> >>>> >                                dbpedia:Author
>> >>>> >                                dbpedia:Constitutional_law
>> >>>> >                                dbpedia:Lawyer
>> >>>> >                                dbpedia:Community_organizing
>> >>>> >
>> >>>> > I'd like to continue investigating this as I think that it may have
>> >>>> some
>> >>>> > value in increasing the number of coreference resolutions and I'd
>> like
>> >>>> to
>> >>>> > concentrate more on precision rather than recall since we already
>> have
>> >>>> a
>> >>>> > set of coreferences detected by the stanford nlp tool and this would
>> >>>> be as
>> >>>> > an addition to that (at least this is how I would like to use it).
>> >>>> >
>> >>>> > Is it ok if I track this by opening a jira? I could update it to
>> show
>> >>>> my
>> >>>> > progress and also my conclusions and if it turns out that it was a
>> bad
>> >>>> idea
>> >>>> > then that's the situation at least I'll end up with more knowledge
>> >>>> about
>> >>>> > Stanbol in the end :).
>> >>>> >
>> >>>> >
>> >>>> > 2014-02-05 15:39 GMT+02:00 Rafa Haro <rh...@apache.org>:
>> >>>> >
>> >>>> >> Hi Cristian,
>> >>>> >>
>> >>>> >> The approach sounds nice. I don't want to be the devil's advocate
>> but
>> >>>> I'm
>> >>>> >> just not sure about the recall using the dbpedia categories
>> feature.
>> >>>> For
>> >>>> >> example, your sentence could be also "Microsoft posted its 2013
>> >>>> earnings.
>> >>>> >> The Redmond's company made a huge profit". So, maybe including more
>> >>>> >> contextual information from dbpedia could increase the recall but
>> of
>> >>>> course
>> >>>> >> will reduce the precision.
>> >>>> >>
>> >>>> >> Cheers,
>> >>>> >> Rafa
>> >>>> >>
>> >>>> >> El 04/02/14 09:50, Cristian Petroaca escribió:
>> >>>> >>
>> >>>> >>  Back with a more detailed description of the steps for making this
>> >>>> kind of
>> >>>> >>> coreference work.
>> >>>> >>>
>> >>>> >>> I will be using references to the following text in the steps
>> below
>> >>>> in
>> >>>> >>> order to make things clearer : "Microsoft posted its 2013
>> earnings.
>> >>>> The
>> >>>> >>> software company made a huge profit."
>> >>>> >>>
>> >>>> >>> 1. For every noun phrase in the text which has :
>> >>>> >>>      a. a determinate pos which implies reference to an entity
>> local
>> >>>> to
>> >>>> >>> the
>> >>>> >>> text, such as "the, this, these") but not "another, every", etc
>> which
>> >>>> >>> implies a reference to an entity outside of the text.
>> >>>> >>>      b. having at least another noun aside from the main required
>> >>>> noun
>> >>>> >>> which
>> >>>> >>> further describes it. For example I will not count "The company"
>> as
>> >>>> being
>> >>>> >>> a
>> >>>> >>> legitimate candidate since this could create a lot of false
>> >>>> positives by
>> >>>> >>> considering the double meaning of some words such as "in the
>> company
>> >>>> of
>> >>>> >>> good people".
>> >>>> >>> "The software company" is a good candidate since we also have
>> >>>> "software".
>> >>>> >>>
>> >>>> >>> 2. match the nouns in the noun phrase to the contents of the
>> dbpedia
>> >>>> >>> categories of each named entity found prior to the location of the
>> >>>> noun
>> >>>> >>> phrase in the text.
>> >>>> >>> The dbpedia categories are in the following format (for Microsoft
>> for
>> >>>> >>> example) : "Software companies of the United States".
>> >>>> >>>   So we try to match "software company" with that.
>> >>>> >>> First, as you can see, the main noun in the dbpedia category has a
>> >>>> plural
>> >>>> >>> form and it's the same for all categories which I saw. I don't
>> know
>> >>>> if
>> >>>> >>> there's an easier way to do this but I thought of applying a
>> >>>> lemmatizer on
>> >>>> >>> the category and the noun phrase in order for them to have a
>> common
>> >>>> >>> denominator.This also works if the noun phrase itself has a plural
>> >>>> form.
>> >>>> >>>
>> >>>> >>> Second, I'll need to use for comparison only the words in the
>> >>>> category
>> >>>> >>> which are themselves nouns and not prepositions or determiners
>> such
>> >>>> as "of
>> >>>> >>> the".This means that I need to pos tag the categories contents as
>> >>>> well.
>> >>>> >>> I was thinking of running the pos and lemma on the dbpedia
>> >>>> categories when
>> >>>> >>> building the dbpedia backed entity hub and storing them for later
>> >>>> use - I
>> >>>> >>> don't know how feasible this is at the moment.
>> >>>> >>>
>> >>>> >>> After this I can compare each noun in the noun phrase with the
>> >>>> equivalent
>> >>>> >>> nouns in the categories and based on the number of matches I can
>> >>>> create a
>> >>>> >>> confidence level.
>> >>>> >>>
>> >>>> >>> 3. match the noun of the noun phrase with the rdf:type from
>> dbpedia
>> >>>> of the
>> >>>> >>> named entity. If this matches increase the confidence level.
>> >>>> >>>
>> >>>> >>> 4. If there are multiple named entities which can match a certain
>> >>>> noun
>> >>>> >>> phrase then link the noun phrase with the closest named entity
>> prior
>> >>>> to it
>> >>>> >>> in the text.
>> >>>> >>>
>> >>>> >>> What do you think?
>> >>>> >>>
>> >>>> >>> Cristian
>> >>>> >>>
>> >>>> >>> 2014-01-31 Cristian Petroaca <cr...@gmail.com>:
>> >>>> >>>
>> >>>> >>>  Hi Rafa,
>> >>>> >>>>
>> >>>> >>>> I don't yet have a concrete heursitic but I'm working on it. I'll
>> >>>> provide
>> >>>> >>>> it here so that you guys can give me a feedback on it.
>> >>>> >>>>
>> >>>> >>>> What are "locality" features?
>> >>>> >>>>
>> >>>> >>>> I looked at Bart and other coref tools such as ArkRef and
>> >>>> CherryPicker
>> >>>> >>>> and
>> >>>> >>>> they don't provide such a coreference.
>> >>>> >>>>
>> >>>> >>>> Cristian
>> >>>> >>>>
>> >>>> >>>>
>> >>>> >>>> 2014-01-30 Rafa Haro <rh...@apache.org>:
>> >>>> >>>>
>> >>>> >>>> Hi Cristian,
>> >>>> >>>>
>> >>>> >>>>> Without having more details about your concrete heuristic, in my
>> >>>> honest
>> >>>> >>>>> opinion, such approach could produce a lot of false positives. I
>> >>>> don't
>> >>>> >>>>> know
>> >>>> >>>>> if you are planning to use some "locality" features to detect
>> such
>> >>>> >>>>> coreferences but you need to take into account that it is quite
>> >>>> usual
>> >>>> >>>>> that
>> >>>> >>>>> coreferenced mentions can occurs even in different paragraphs.
>> >>>> Although
>> >>>> >>>>> I'm
>> >>>> >>>>> not an expert in Natural Language Understanding, I would say it
>> is
>> >>>> quite
>> >>>> >>>>> difficult to get decent precision/recall rates for coreferencing
>> >>>> using
>> >>>> >>>>> fixed rules. Maybe you can give a try to others tools like BART
>> (
>> >>>> >>>>> http://www.bart-coref.org/).
>> >>>> >>>>>
>> >>>> >>>>> Cheers,
>> >>>> >>>>> Rafa Haro
>> >>>> >>>>>
>> >>>> >>>>> El 30/01/14 10:33, Cristian Petroaca escribió:
>> >>>> >>>>>
>> >>>> >>>>>   Hi,
>> >>>> >>>>>
>> >>>> >>>>>> One of the necessary steps for implementing the Event
>> extraction
>> >>>> Engine
>> >>>> >>>>>> feature : https://issues.apache.org/jira/browse/STANBOL-1121is
>> >>>> to
>> >>>> >>>>>> have
>> >>>> >>>>>> coreference resolution in the given text. This is provided now
>> >>>> via the
>> >>>> >>>>>> stanford-nlp project but as far as I saw this module is
>> performing
>> >>>> >>>>>> mostly
>> >>>> >>>>>> pronomial (He, She) or nominal (Barack Obama and Mr. Obama)
>> >>>> coreference
>> >>>> >>>>>> resolution.
>> >>>> >>>>>>
>> >>>> >>>>>> In order to get more coreferences from the text I though of
>> >>>> creating
>> >>>> >>>>>> some
>> >>>> >>>>>> logic that would detect this kind of coreference :
>> >>>> >>>>>> "Apple reaches new profit heights. The software company just
>> >>>> announced
>> >>>> >>>>>> its
>> >>>> >>>>>> 2013 earnings."
>> >>>> >>>>>> Here "The software company" obviously refers to "Apple".
>> >>>> >>>>>> So I'd like to detect coreferences of Named Entities which are
>> of
>> >>>> the
>> >>>> >>>>>> rdf:type of the Named Entity , in this case "company" and also
>> >>>> have
>> >>>> >>>>>> attributes which can be found in the dbpedia categories of the
>> >>>> named
>> >>>> >>>>>> entity, in this case "software".
>> >>>> >>>>>>
>> >>>> >>>>>> The detection of coreferences such as "The software company" in
>> >>>> the
>> >>>> >>>>>> text
>> >>>> >>>>>> would also be done by either using the new Pos Tag Based Phrase
>> >>>> >>>>>> extraction
>> >>>> >>>>>> Engine (noun phrases) or by using a dependency tree of the
>> >>>> sentence and
>> >>>> >>>>>> picking up only subjects or objects.
>> >>>> >>>>>>
>> >>>> >>>>>> At this point I'd like to know if this kind of logic would be
>> >>>> useful
>> >>>> >>>>>> as a
>> >>>> >>>>>> separate Enhancement Engine (in case the precision and recall
>> are
>> >>>> good
>> >>>> >>>>>> enough) in Stanbol?
>> >>>> >>>>>>
>> >>>> >>>>>> Thanks,
>> >>>> >>>>>> Cristian
>> >>>> >>>>>>
>> >>>> >>>>>>
>> >>>> >>>>>>
>> >>>> >>
>> >>>>
>> >>>>
>> >>>>
>> >>>> --
>> >>>> | Rupert Westenthaler             rupert.westenthaler@gmail.com
>> >>>> | Bodenlehenstraße 11                             ++43-699-11108907
>> >>>> | A-5500 Bischofshofen
>> >>>>
>> >>>
>> >>>
>> >>
>>
>>
>>
>> --
>> | Rupert Westenthaler             rupert.westenthaler@gmail.com
>> | Bodenlehenstraße 11                             ++43-699-11108907
>> | A-5500 Bischofshofen
>>



-- 
| Rupert Westenthaler             rupert.westenthaler@gmail.com
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen

Re: Named entity coref resolution based on dbpedia categories and rdf:type

Posted by Cristian Petroaca <cr...@gmail.com>.
Thanks.
I assume I should get the Named Entities using the same approach but with
NlpAnnotations.NER_ANNOTATION?



2014-03-10 13:29 GMT+02:00 Rupert Westenthaler <
rupert.westenthaler@gmail.com>:

> Hallo Cristian,
>
> NounPhrases are not added to the RDF enhancement results. You need to
> use the AnalyzedText ContentPart [1]
>
> here is some demo code you can use in the computeEnhancement method
>
>         AnalysedText at = NlpEngineHelper.getAnalysedText(this, ci, true);
>         Iterator<? extends Section> sections = at.getSentences();
>         if(!sections.hasNext()){ //process as single sentence
>             sections = Collections.singleton(at).iterator();
>         }
>
>         while(sections.hasNext()){
>             Section section = sections.next();
>             Iterator<Span> chunks =
> section.getEnclosed(EnumSet.of(SpanTypeEnum.Chunk));
>             while(chunks.hasNext()){
>                 Span chunk = chunks.next();
>                 Value<PhraseTag> phrase =
> chunk.getAnnotation(NlpAnnotations.PHRASE_ANNOTATION);
>                 if(phrase.value().getCategory() == LexicalCategory.Noun){
>                     log.info(" - NounPhrase [{},{}] {}", new Object[]{
>
> chunk.getStart(),chunk.getEnd(),chunk.getSpan()});
>                 }
>             }
>         }
>
> hope this helps
>
> best
> Rupert
>
> [1]
> http://stanbol.apache.org/docs/trunk/components/enhancer/nlp/analyzedtext
>
> On Sun, Mar 9, 2014 at 6:07 PM, Cristian Petroaca
> <cr...@gmail.com> wrote:
> > I started to implement the engine and I'm having problems with getting
> > results for noun phrases. I modified the "default" weighted chain to also
> > include the PosChunkerEngine and ran a sample text : "Angela Merkel
> visted
> > China. The german chancellor met with various people". I expected that
> the
> > RDF XML output would contain some info about the noun phrases but I
> cannot
> > see any.
> > Could you point me to the correct way to generate the noun phrases?
> >
> > Thanks,
> > Cristian
> >
> >
> > 2014-02-09 14:15 GMT+02:00 Cristian Petroaca <
> cristian.petroaca@gmail.com>:
> >
> >> Opened https://issues.apache.org/jira/browse/STANBOL-1279
> >>
> >>
> >> 2014-02-07 10:53 GMT+02:00 Cristian Petroaca <
> cristian.petroaca@gmail.com>
> >> :
> >>
> >> Hi Rupert,
> >>>
> >>> The "spatial" dimension is a good idea. I'll also take a look at Yago.
> >>>
> >>> I will create a Jira with what we talked about here. It will probably
> >>> have just a draft-like description for now and will be updated as I go
> >>> along.
> >>>
> >>> Thanks,
> >>> Cristian
> >>>
> >>>
> >>> 2014-02-06 15:39 GMT+02:00 Rupert Westenthaler <
> >>> rupert.westenthaler@gmail.com>:
> >>>
> >>> Hi Cristian,
> >>>>
> >>>> definitely an interesting approach. You should have a look at Yago2
> >>>> [1]. As far as I can remember the Yago taxonomy is much better
> >>>> structured as the one used by dbpedia. Mapping suggestions of dbpedia
> >>>> to concepts in Yago2 is easy as both dbpedia and yago2 do provide
> >>>> mappings [2] and [3]
> >>>>
> >>>> > 2014-02-05 15:39 GMT+02:00 Rafa Haro <rh...@apache.org>:
> >>>> >>
> >>>> >> "Microsoft posted its 2013 earnings. The Redmond's company made a
> >>>> >> huge profit".
> >>>>
> >>>> Thats actually a very good example. Spatial contexts are very
> >>>> important as they tend to be often used for referencing. So I would
> >>>> suggest to specially treat the spatial context. For spatial Entities
> >>>> (like a City) this is easy, but even for other (like a Person,
> >>>> Company) you could use relations to spatial entities define their
> >>>> spatial context. This context could than be used to correctly link
> >>>> "The Redmond's company" to "Microsoft".
> >>>>
> >>>> In addition I would suggest to use the "spatial" context of each
> >>>> entity (basically relation to entities that are cities, regions,
> >>>> countries) as a separate dimension, because those are very often used
> >>>> for coreferences.
> >>>>
> >>>> [1] http://www.mpi-inf.mpg.de/yago-naga/yago/
> >>>> [2] http://downloads.dbpedia.org/3.9/links/yago_links.nt.bz2
> >>>> [3]
> >>>>
> http://www.mpi-inf.mpg.de/yago-naga/yago/download/yago/yagoDBpediaInstances.ttl.7z
> >>>>
> >>>>
> >>>> On Thu, Feb 6, 2014 at 10:33 AM, Cristian Petroaca
> >>>> <cr...@gmail.com> wrote:
> >>>> > There are several dbpedia categories for each entity, in this case
> for
> >>>> > Microsoft we have :
> >>>> >
> >>>> > category:Companies_in_the_NASDAQ-100_Index
> >>>> > category:Microsoft
> >>>> > category:Software_companies_of_the_United_States
> >>>> > category:Software_companies_based_in_Washington_(state)
> >>>> > category:Companies_established_in_1975
> >>>> > category:1975_establishments_in_the_United_States
> >>>> > category:Companies_based_in_Redmond,_Washington
> >>>> > category:Multinational_companies_headquartered_in_the_United_States
> >>>> > category:Cloud_computing_providers
> >>>> > category:Companies_in_the_Dow_Jones_Industrial_Average
> >>>> >
> >>>> > So we also have "Companies based in Redmont,Washington" which could
> be
> >>>> > matched.
> >>>> >
> >>>> >
> >>>> > There is still other contextual information from dbpedia which can
> be
> >>>> used.
> >>>> > For example for an Organization we could also include :
> >>>> > dbpprop:industry = Software
> >>>> > dbpprop:service = Online Service Providers
> >>>> >
> >>>> > and for a Person (that's for Barack Obama) :
> >>>> >
> >>>> > dbpedia-owl:profession:
> >>>> >                                dbpedia:Author
> >>>> >                                dbpedia:Constitutional_law
> >>>> >                                dbpedia:Lawyer
> >>>> >                                dbpedia:Community_organizing
> >>>> >
> >>>> > I'd like to continue investigating this as I think that it may have
> >>>> some
> >>>> > value in increasing the number of coreference resolutions and I'd
> like
> >>>> to
> >>>> > concentrate more on precision rather than recall since we already
> have
> >>>> a
> >>>> > set of coreferences detected by the stanford nlp tool and this would
> >>>> be as
> >>>> > an addition to that (at least this is how I would like to use it).
> >>>> >
> >>>> > Is it ok if I track this by opening a jira? I could update it to
> show
> >>>> my
> >>>> > progress and also my conclusions and if it turns out that it was a
> bad
> >>>> idea
> >>>> > then that's the situation at least I'll end up with more knowledge
> >>>> about
> >>>> > Stanbol in the end :).
> >>>> >
> >>>> >
> >>>> > 2014-02-05 15:39 GMT+02:00 Rafa Haro <rh...@apache.org>:
> >>>> >
> >>>> >> Hi Cristian,
> >>>> >>
> >>>> >> The approach sounds nice. I don't want to be the devil's advocate
> but
> >>>> I'm
> >>>> >> just not sure about the recall using the dbpedia categories
> feature.
> >>>> For
> >>>> >> example, your sentence could be also "Microsoft posted its 2013
> >>>> earnings.
> >>>> >> The Redmond's company made a huge profit". So, maybe including more
> >>>> >> contextual information from dbpedia could increase the recall but
> of
> >>>> course
> >>>> >> will reduce the precision.
> >>>> >>
> >>>> >> Cheers,
> >>>> >> Rafa
> >>>> >>
> >>>> >> El 04/02/14 09:50, Cristian Petroaca escribió:
> >>>> >>
> >>>> >>  Back with a more detailed description of the steps for making this
> >>>> kind of
> >>>> >>> coreference work.
> >>>> >>>
> >>>> >>> I will be using references to the following text in the steps
> below
> >>>> in
> >>>> >>> order to make things clearer : "Microsoft posted its 2013
> earnings.
> >>>> The
> >>>> >>> software company made a huge profit."
> >>>> >>>
> >>>> >>> 1. For every noun phrase in the text which has :
> >>>> >>>      a. a determinate pos which implies reference to an entity
> local
> >>>> to
> >>>> >>> the
> >>>> >>> text, such as "the, this, these") but not "another, every", etc
> which
> >>>> >>> implies a reference to an entity outside of the text.
> >>>> >>>      b. having at least another noun aside from the main required
> >>>> noun
> >>>> >>> which
> >>>> >>> further describes it. For example I will not count "The company"
> as
> >>>> being
> >>>> >>> a
> >>>> >>> legitimate candidate since this could create a lot of false
> >>>> positives by
> >>>> >>> considering the double meaning of some words such as "in the
> company
> >>>> of
> >>>> >>> good people".
> >>>> >>> "The software company" is a good candidate since we also have
> >>>> "software".
> >>>> >>>
> >>>> >>> 2. match the nouns in the noun phrase to the contents of the
> dbpedia
> >>>> >>> categories of each named entity found prior to the location of the
> >>>> noun
> >>>> >>> phrase in the text.
> >>>> >>> The dbpedia categories are in the following format (for Microsoft
> for
> >>>> >>> example) : "Software companies of the United States".
> >>>> >>>   So we try to match "software company" with that.
> >>>> >>> First, as you can see, the main noun in the dbpedia category has a
> >>>> plural
> >>>> >>> form and it's the same for all categories which I saw. I don't
> know
> >>>> if
> >>>> >>> there's an easier way to do this but I thought of applying a
> >>>> lemmatizer on
> >>>> >>> the category and the noun phrase in order for them to have a
> common
> >>>> >>> denominator.This also works if the noun phrase itself has a plural
> >>>> form.
> >>>> >>>
> >>>> >>> Second, I'll need to use for comparison only the words in the
> >>>> category
> >>>> >>> which are themselves nouns and not prepositions or determiners
> such
> >>>> as "of
> >>>> >>> the".This means that I need to pos tag the categories contents as
> >>>> well.
> >>>> >>> I was thinking of running the pos and lemma on the dbpedia
> >>>> categories when
> >>>> >>> building the dbpedia backed entity hub and storing them for later
> >>>> use - I
> >>>> >>> don't know how feasible this is at the moment.
> >>>> >>>
> >>>> >>> After this I can compare each noun in the noun phrase with the
> >>>> equivalent
> >>>> >>> nouns in the categories and based on the number of matches I can
> >>>> create a
> >>>> >>> confidence level.
> >>>> >>>
> >>>> >>> 3. match the noun of the noun phrase with the rdf:type from
> dbpedia
> >>>> of the
> >>>> >>> named entity. If this matches increase the confidence level.
> >>>> >>>
> >>>> >>> 4. If there are multiple named entities which can match a certain
> >>>> noun
> >>>> >>> phrase then link the noun phrase with the closest named entity
> prior
> >>>> to it
> >>>> >>> in the text.
> >>>> >>>
> >>>> >>> What do you think?
> >>>> >>>
> >>>> >>> Cristian
> >>>> >>>
> >>>> >>> 2014-01-31 Cristian Petroaca <cr...@gmail.com>:
> >>>> >>>
> >>>> >>>  Hi Rafa,
> >>>> >>>>
> >>>> >>>> I don't yet have a concrete heursitic but I'm working on it. I'll
> >>>> provide
> >>>> >>>> it here so that you guys can give me a feedback on it.
> >>>> >>>>
> >>>> >>>> What are "locality" features?
> >>>> >>>>
> >>>> >>>> I looked at Bart and other coref tools such as ArkRef and
> >>>> CherryPicker
> >>>> >>>> and
> >>>> >>>> they don't provide such a coreference.
> >>>> >>>>
> >>>> >>>> Cristian
> >>>> >>>>
> >>>> >>>>
> >>>> >>>> 2014-01-30 Rafa Haro <rh...@apache.org>:
> >>>> >>>>
> >>>> >>>> Hi Cristian,
> >>>> >>>>
> >>>> >>>>> Without having more details about your concrete heuristic, in my
> >>>> honest
> >>>> >>>>> opinion, such approach could produce a lot of false positives. I
> >>>> don't
> >>>> >>>>> know
> >>>> >>>>> if you are planning to use some "locality" features to detect
> such
> >>>> >>>>> coreferences but you need to take into account that it is quite
> >>>> usual
> >>>> >>>>> that
> >>>> >>>>> coreferenced mentions can occurs even in different paragraphs.
> >>>> Although
> >>>> >>>>> I'm
> >>>> >>>>> not an expert in Natural Language Understanding, I would say it
> is
> >>>> quite
> >>>> >>>>> difficult to get decent precision/recall rates for coreferencing
> >>>> using
> >>>> >>>>> fixed rules. Maybe you can give a try to others tools like BART
> (
> >>>> >>>>> http://www.bart-coref.org/).
> >>>> >>>>>
> >>>> >>>>> Cheers,
> >>>> >>>>> Rafa Haro
> >>>> >>>>>
> >>>> >>>>> El 30/01/14 10:33, Cristian Petroaca escribió:
> >>>> >>>>>
> >>>> >>>>>   Hi,
> >>>> >>>>>
> >>>> >>>>>> One of the necessary steps for implementing the Event
> extraction
> >>>> Engine
> >>>> >>>>>> feature : https://issues.apache.org/jira/browse/STANBOL-1121is
> >>>> to
> >>>> >>>>>> have
> >>>> >>>>>> coreference resolution in the given text. This is provided now
> >>>> via the
> >>>> >>>>>> stanford-nlp project but as far as I saw this module is
> performing
> >>>> >>>>>> mostly
> >>>> >>>>>> pronomial (He, She) or nominal (Barack Obama and Mr. Obama)
> >>>> coreference
> >>>> >>>>>> resolution.
> >>>> >>>>>>
> >>>> >>>>>> In order to get more coreferences from the text I though of
> >>>> creating
> >>>> >>>>>> some
> >>>> >>>>>> logic that would detect this kind of coreference :
> >>>> >>>>>> "Apple reaches new profit heights. The software company just
> >>>> announced
> >>>> >>>>>> its
> >>>> >>>>>> 2013 earnings."
> >>>> >>>>>> Here "The software company" obviously refers to "Apple".
> >>>> >>>>>> So I'd like to detect coreferences of Named Entities which are
> of
> >>>> the
> >>>> >>>>>> rdf:type of the Named Entity , in this case "company" and also
> >>>> have
> >>>> >>>>>> attributes which can be found in the dbpedia categories of the
> >>>> named
> >>>> >>>>>> entity, in this case "software".
> >>>> >>>>>>
> >>>> >>>>>> The detection of coreferences such as "The software company" in
> >>>> the
> >>>> >>>>>> text
> >>>> >>>>>> would also be done by either using the new Pos Tag Based Phrase
> >>>> >>>>>> extraction
> >>>> >>>>>> Engine (noun phrases) or by using a dependency tree of the
> >>>> sentence and
> >>>> >>>>>> picking up only subjects or objects.
> >>>> >>>>>>
> >>>> >>>>>> At this point I'd like to know if this kind of logic would be
> >>>> useful
> >>>> >>>>>> as a
> >>>> >>>>>> separate Enhancement Engine (in case the precision and recall
> are
> >>>> good
> >>>> >>>>>> enough) in Stanbol?
> >>>> >>>>>>
> >>>> >>>>>> Thanks,
> >>>> >>>>>> Cristian
> >>>> >>>>>>
> >>>> >>>>>>
> >>>> >>>>>>
> >>>> >>
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>> | Rupert Westenthaler             rupert.westenthaler@gmail.com
> >>>> | Bodenlehenstraße 11                             ++43-699-11108907
> >>>> | A-5500 Bischofshofen
> >>>>
> >>>
> >>>
> >>
>
>
>
> --
> | Rupert Westenthaler             rupert.westenthaler@gmail.com
> | Bodenlehenstraße 11                             ++43-699-11108907
> | A-5500 Bischofshofen
>

Re: Named entity coref resolution based on dbpedia categories and rdf:type

Posted by Rupert Westenthaler <ru...@gmail.com>.
Hello Cristian,

NounPhrases are not added to the RDF enhancement results. You need to
use the AnalyzedText ContentPart [1]

Here is some demo code you can use in the computeEnhancement method:

        AnalysedText at = NlpEngineHelper.getAnalysedText(this, ci, true);
        Iterator<? extends Section> sections = at.getSentences();
        if(!sections.hasNext()){ //no sentences detected: process the whole text as a single section
            sections = Collections.singleton(at).iterator();
        }

        while(sections.hasNext()){
            Section section = sections.next();
            //iterate over the chunks (phrases) enclosed by this section
            Iterator<Span> chunks = section.getEnclosed(EnumSet.of(SpanTypeEnum.Chunk));
            while(chunks.hasNext()){
                Span chunk = chunks.next();
                Value<PhraseTag> phrase = chunk.getAnnotation(NlpAnnotations.PHRASE_ANNOTATION);
                //not every chunk carries a phrase annotation, so guard against null
                if(phrase != null && phrase.value().getCategory() == LexicalCategory.Noun){
                    log.info(" - NounPhrase [{},{}] {}", new Object[]{
                            chunk.getStart(), chunk.getEnd(), chunk.getSpan()});
                }
            }
        }

hope this helps

best
Rupert

[1] http://stanbol.apache.org/docs/trunk/components/enhancer/nlp/analyzedtext
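
For illustration, here is also a rough, self-contained sketch of the category-matching
step discussed earlier in this thread (comparing the nouns of a noun phrase against the
lemmatized tokens of the entity's dbpedia category labels and deriving a confidence from
the overlap). The lemmatize() stub and the scoring below are purely illustrative
placeholders, not existing Stanbol APIs:

    import java.util.*;

    public class CategoryMatchSketch {

        //very naive lemmatizer stand-in: a real implementation would use the
        //lemmas produced by the NLP chain instead of this plural stripping
        static String lemmatize(String word) {
            String w = word.toLowerCase(Locale.ENGLISH);
            if (w.endsWith("ies")) {
                return w.substring(0, w.length() - 3) + "y"; //companies -> company
            }
            return w.endsWith("s") ? w.substring(0, w.length() - 1) : w;
        }

        //fraction of the noun phrase nouns that also occur (as lemmas) in one of
        //the entity's dbpedia category labels; in the real heuristic only the
        //noun tokens of each category (found via POS tagging) would be kept
        static double matchConfidence(List<String> nounPhraseNouns, List<String> categoryLabels) {
            Set<String> categoryLemmas = new HashSet<String>();
            for (String label : categoryLabels) {
                for (String token : label.split("[_\\s]+")) {
                    categoryLemmas.add(lemmatize(token));
                }
            }
            int matches = 0;
            for (String noun : nounPhraseNouns) {
                if (categoryLemmas.contains(lemmatize(noun))) {
                    matches++;
                }
            }
            return nounPhraseNouns.isEmpty() ? 0.0 : (double) matches / nounPhraseNouns.size();
        }

        public static void main(String[] args) {
            //"The software company" -> nouns: software, company
            List<String> nouns = Arrays.asList("software", "company");
            List<String> categories = Arrays.asList(
                    "Software_companies_of_the_United_States",
                    "Companies_based_in_Redmond,_Washington");
            System.out.println(matchConfidence(nouns, categories)); //prints 1.0
        }
    }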

On Sun, Mar 9, 2014 at 6:07 PM, Cristian Petroaca
<cr...@gmail.com> wrote:
> I started to implement the engine and I'm having problems with getting
> results for noun phrases. I modified the "default" weighted chain to also
> include the PosChunkerEngine and ran a sample text : "Angela Merkel visited
> China. The German chancellor met with various people". I expected that the
> RDF XML output would contain some info about the noun phrases but I cannot
> see any.
> Could you point me to the correct way to generate the noun phrases?
>
> Thanks,
> Cristian
>
>
> 2014-02-09 14:15 GMT+02:00 Cristian Petroaca <cr...@gmail.com>:
>
>> Opened https://issues.apache.org/jira/browse/STANBOL-1279
>>
>>
>> 2014-02-07 10:53 GMT+02:00 Cristian Petroaca <cr...@gmail.com>
>> :
>>
>> Hi Rupert,
>>>
>>> The "spatial" dimension is a good idea. I'll also take a look at Yago.
>>>
>>> I will create a Jira with what we talked about here. It will probably
>>> have just a draft-like description for now and will be updated as I go
>>> along.
>>>
>>> Thanks,
>>> Cristian
>>>
>>>
>>> 2014-02-06 15:39 GMT+02:00 Rupert Westenthaler <
>>> rupert.westenthaler@gmail.com>:
>>>
>>> Hi Cristian,
>>>>
>>>> definitely an interesting approach. You should have a look at Yago2
>>>> [1]. As far as I can remember the Yago taxonomy is much better
>>>> structured as the one used by dbpedia. Mapping suggestions of dbpedia
>>>> to concepts in Yago2 is easy as both dbpedia and yago2 do provide
>>>> mappings [2] and [3]
>>>>
>>>> > 2014-02-05 15:39 GMT+02:00 Rafa Haro <rh...@apache.org>:
>>>> >>
>>>> >> "Microsoft posted its 2013 earnings. The Redmond's company made a
>>>> >> huge profit".
>>>>
>>>> Thats actually a very good example. Spatial contexts are very
>>>> important as they tend to be often used for referencing. So I would
>>>> suggest to specially treat the spatial context. For spatial Entities
>>>> (like a City) this is easy, but even for other (like a Person,
>>>> Company) you could use relations to spatial entities define their
>>>> spatial context. This context could than be used to correctly link
>>>> "The Redmond's company" to "Microsoft".
>>>>
>>>> In addition I would suggest to use the "spatial" context of each
>>>> entity (basically relation to entities that are cities, regions,
>>>> countries) as a separate dimension, because those are very often used
>>>> for coreferences.
>>>>
>>>> [1] http://www.mpi-inf.mpg.de/yago-naga/yago/
>>>> [2] http://downloads.dbpedia.org/3.9/links/yago_links.nt.bz2
>>>> [3]
>>>> http://www.mpi-inf.mpg.de/yago-naga/yago/download/yago/yagoDBpediaInstances.ttl.7z
>>>>
>>>>
>>>> On Thu, Feb 6, 2014 at 10:33 AM, Cristian Petroaca
>>>> <cr...@gmail.com> wrote:
>>>> > There are several dbpedia categories for each entity, in this case for
>>>> > Microsoft we have :
>>>> >
>>>> > category:Companies_in_the_NASDAQ-100_Index
>>>> > category:Microsoft
>>>> > category:Software_companies_of_the_United_States
>>>> > category:Software_companies_based_in_Washington_(state)
>>>> > category:Companies_established_in_1975
>>>> > category:1975_establishments_in_the_United_States
>>>> > category:Companies_based_in_Redmond,_Washington
>>>> > category:Multinational_companies_headquartered_in_the_United_States
>>>> > category:Cloud_computing_providers
>>>> > category:Companies_in_the_Dow_Jones_Industrial_Average
>>>> >
>>>> > So we also have "Companies based in Redmont,Washington" which could be
>>>> > matched.
>>>> >
>>>> >
>>>> > There is still other contextual information from dbpedia which can be
>>>> used.
>>>> > For example for an Organization we could also include :
>>>> > dbpprop:industry = Software
>>>> > dbpprop:service = Online Service Providers
>>>> >
>>>> > and for a Person (that's for Barack Obama) :
>>>> >
>>>> > dbpedia-owl:profession:
>>>> >                                dbpedia:Author
>>>> >                                dbpedia:Constitutional_law
>>>> >                                dbpedia:Lawyer
>>>> >                                dbpedia:Community_organizing
>>>> >
>>>> > I'd like to continue investigating this as I think that it may have
>>>> some
>>>> > value in increasing the number of coreference resolutions and I'd like
>>>> to
>>>> > concentrate more on precision rather than recall since we already have
>>>> a
>>>> > set of coreferences detected by the stanford nlp tool and this would
>>>> be as
>>>> > an addition to that (at least this is how I would like to use it).
>>>> >
>>>> > Is it ok if I track this by opening a jira? I could update it to show
>>>> my
>>>> > progress and also my conclusions and if it turns out that it was a bad
>>>> idea
>>>> > then that's the situation at least I'll end up with more knowledge
>>>> about
>>>> > Stanbol in the end :).
>>>> >
>>>> >
>>>> > 2014-02-05 15:39 GMT+02:00 Rafa Haro <rh...@apache.org>:
>>>> >
>>>> >> Hi Cristian,
>>>> >>
>>>> >> The approach sounds nice. I don't want to be the devil's advocate but
>>>> I'm
>>>> >> just not sure about the recall using the dbpedia categories feature.
>>>> For
>>>> >> example, your sentence could be also "Microsoft posted its 2013
>>>> earnings.
>>>> >> The Redmond's company made a huge profit". So, maybe including more
>>>> >> contextual information from dbpedia could increase the recall but of
>>>> course
>>>> >> will reduce the precision.
>>>> >>
>>>> >> Cheers,
>>>> >> Rafa
>>>> >>
>>>> >> El 04/02/14 09:50, Cristian Petroaca escribió:
>>>> >>
>>>> >>  Back with a more detailed description of the steps for making this
>>>> kind of
>>>> >>> coreference work.
>>>> >>>
>>>> >>> I will be using references to the following text in the steps below
>>>> in
>>>> >>> order to make things clearer : "Microsoft posted its 2013 earnings.
>>>> The
>>>> >>> software company made a huge profit."
>>>> >>>
>>>> >>> 1. For every noun phrase in the text which has :
>>>> >>>      a. a determinate pos which implies reference to an entity local
>>>> to
>>>> >>> the
>>>> >>> text, such as "the, this, these") but not "another, every", etc which
>>>> >>> implies a reference to an entity outside of the text.
>>>> >>>      b. having at least another noun aside from the main required
>>>> noun
>>>> >>> which
>>>> >>> further describes it. For example I will not count "The company" as
>>>> being
>>>> >>> a
>>>> >>> legitimate candidate since this could create a lot of false
>>>> positives by
>>>> >>> considering the double meaning of some words such as "in the company
>>>> of
>>>> >>> good people".
>>>> >>> "The software company" is a good candidate since we also have
>>>> "software".
>>>> >>>
>>>> >>> 2. match the nouns in the noun phrase to the contents of the dbpedia
>>>> >>> categories of each named entity found prior to the location of the
>>>> noun
>>>> >>> phrase in the text.
>>>> >>> The dbpedia categories are in the following format (for Microsoft for
>>>> >>> example) : "Software companies of the United States".
>>>> >>>   So we try to match "software company" with that.
>>>> >>> First, as you can see, the main noun in the dbpedia category has a
>>>> plural
>>>> >>> form and it's the same for all categories which I saw. I don't know
>>>> if
>>>> >>> there's an easier way to do this but I thought of applying a
>>>> lemmatizer on
>>>> >>> the category and the noun phrase in order for them to have a common
>>>> >>> denominator.This also works if the noun phrase itself has a plural
>>>> form.
>>>> >>>
>>>> >>> Second, I'll need to use for comparison only the words in the
>>>> category
>>>> >>> which are themselves nouns and not prepositions or determiners such
>>>> as "of
>>>> >>> the".This means that I need to pos tag the categories contents as
>>>> well.
>>>> >>> I was thinking of running the pos and lemma on the dbpedia
>>>> categories when
>>>> >>> building the dbpedia backed entity hub and storing them for later
>>>> use - I
>>>> >>> don't know how feasible this is at the moment.
>>>> >>>
>>>> >>> After this I can compare each noun in the noun phrase with the
>>>> equivalent
>>>> >>> nouns in the categories and based on the number of matches I can
>>>> create a
>>>> >>> confidence level.
>>>> >>>
>>>> >>> 3. match the noun of the noun phrase with the rdf:type from dbpedia
>>>> of the
>>>> >>> named entity. If this matches increase the confidence level.
>>>> >>>
>>>> >>> 4. If there are multiple named entities which can match a certain
>>>> noun
>>>> >>> phrase then link the noun phrase with the closest named entity prior
>>>> to it
>>>> >>> in the text.
>>>> >>>
>>>> >>> What do you think?
>>>> >>>
>>>> >>> Cristian
>>>> >>>
>>>> >>> 2014-01-31 Cristian Petroaca <cr...@gmail.com>:
>>>> >>>
>>>> >>>  Hi Rafa,
>>>> >>>>
>>>> >>>> I don't yet have a concrete heursitic but I'm working on it. I'll
>>>> provide
>>>> >>>> it here so that you guys can give me a feedback on it.
>>>> >>>>
>>>> >>>> What are "locality" features?
>>>> >>>>
>>>> >>>> I looked at Bart and other coref tools such as ArkRef and
>>>> CherryPicker
>>>> >>>> and
>>>> >>>> they don't provide such a coreference.
>>>> >>>>
>>>> >>>> Cristian
>>>> >>>>
>>>> >>>>
>>>> >>>> 2014-01-30 Rafa Haro <rh...@apache.org>:
>>>> >>>>
>>>> >>>> Hi Cristian,
>>>> >>>>
>>>> >>>>> Without having more details about your concrete heuristic, in my
>>>> honest
>>>> >>>>> opinion, such approach could produce a lot of false positives. I
>>>> don't
>>>> >>>>> know
>>>> >>>>> if you are planning to use some "locality" features to detect such
>>>> >>>>> coreferences but you need to take into account that it is quite
>>>> usual
>>>> >>>>> that
>>>> >>>>> coreferenced mentions can occurs even in different paragraphs.
>>>> Although
>>>> >>>>> I'm
>>>> >>>>> not an expert in Natural Language Understanding, I would say it is
>>>> quite
>>>> >>>>> difficult to get decent precision/recall rates for coreferencing
>>>> using
>>>> >>>>> fixed rules. Maybe you can give a try to others tools like BART (
>>>> >>>>> http://www.bart-coref.org/).
>>>> >>>>>
>>>> >>>>> Cheers,
>>>> >>>>> Rafa Haro
>>>> >>>>>
>>>> >>>>> El 30/01/14 10:33, Cristian Petroaca escribió:
>>>> >>>>>
>>>> >>>>>   Hi,
>>>> >>>>>
>>>> >>>>>> One of the necessary steps for implementing the Event extraction
>>>> Engine
>>>> >>>>>> feature : https://issues.apache.org/jira/browse/STANBOL-1121 is
>>>> to
>>>> >>>>>> have
>>>> >>>>>> coreference resolution in the given text. This is provided now
>>>> via the
>>>> >>>>>> stanford-nlp project but as far as I saw this module is performing
>>>> >>>>>> mostly
>>>> >>>>>> pronomial (He, She) or nominal (Barack Obama and Mr. Obama)
>>>> coreference
>>>> >>>>>> resolution.
>>>> >>>>>>
>>>> >>>>>> In order to get more coreferences from the text I though of
>>>> creating
>>>> >>>>>> some
>>>> >>>>>> logic that would detect this kind of coreference :
>>>> >>>>>> "Apple reaches new profit heights. The software company just
>>>> announced
>>>> >>>>>> its
>>>> >>>>>> 2013 earnings."
>>>> >>>>>> Here "The software company" obviously refers to "Apple".
>>>> >>>>>> So I'd like to detect coreferences of Named Entities which are of
>>>> the
>>>> >>>>>> rdf:type of the Named Entity , in this case "company" and also
>>>> have
>>>> >>>>>> attributes which can be found in the dbpedia categories of the
>>>> named
>>>> >>>>>> entity, in this case "software".
>>>> >>>>>>
>>>> >>>>>> The detection of coreferences such as "The software company" in
>>>> the
>>>> >>>>>> text
>>>> >>>>>> would also be done by either using the new Pos Tag Based Phrase
>>>> >>>>>> extraction
>>>> >>>>>> Engine (noun phrases) or by using a dependency tree of the
>>>> sentence and
>>>> >>>>>> picking up only subjects or objects.
>>>> >>>>>>
>>>> >>>>>> At this point I'd like to know if this kind of logic would be
>>>> useful
>>>> >>>>>> as a
>>>> >>>>>> separate Enhancement Engine (in case the precision and recall are
>>>> good
>>>> >>>>>> enough) in Stanbol?
>>>> >>>>>>
>>>> >>>>>> Thanks,
>>>> >>>>>> Cristian
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>>
>>>> >>
>>>>
>>>>
>>>>
>>>> --
>>>> | Rupert Westenthaler             rupert.westenthaler@gmail.com
>>>> | Bodenlehenstraße 11                             ++43-699-11108907
>>>> | A-5500 Bischofshofen
>>>>
>>>
>>>
>>



-- 
| Rupert Westenthaler             rupert.westenthaler@gmail.com
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen

Re: Named entity coref resolution based on dbpedia categories and rdf:type

Posted by Cristian Petroaca <cr...@gmail.com>.
I started to implement the engine and I'm having problems with getting
results for noun phrases. I modified the "default" weighted chain to also
include the PosChunkerEngine and ran a sample text : "Angela Merkel visited
China. The German chancellor met with various people". I expected that the
RDF XML output would contain some info about the noun phrases but I cannot
see any.
Could you point me to the correct way to generate the noun phrases?

Thanks,
Cristian


2014-02-09 14:15 GMT+02:00 Cristian Petroaca <cr...@gmail.com>:

> Opened https://issues.apache.org/jira/browse/STANBOL-1279
>
>
> 2014-02-07 10:53 GMT+02:00 Cristian Petroaca <cr...@gmail.com>
> :
>
> Hi Rupert,
>>
>> The "spatial" dimension is a good idea. I'll also take a look at Yago.
>>
>> I will create a Jira with what we talked about here. It will probably
>> have just a draft-like description for now and will be updated as I go
>> along.
>>
>> Thanks,
>> Cristian
>>
>>
>> 2014-02-06 15:39 GMT+02:00 Rupert Westenthaler <
>> rupert.westenthaler@gmail.com>:
>>
>> Hi Cristian,
>>>
>>> definitely an interesting approach. You should have a look at Yago2
>>> [1]. As far as I can remember the Yago taxonomy is much better
>>> structured as the one used by dbpedia. Mapping suggestions of dbpedia
>>> to concepts in Yago2 is easy as both dbpedia and yago2 do provide
>>> mappings [2] and [3]
>>>
>>> > 2014-02-05 15:39 GMT+02:00 Rafa Haro <rh...@apache.org>:
>>> >>
>>> >> "Microsoft posted its 2013 earnings. The Redmond's company made a
>>> >> huge profit".
>>>
>>> Thats actually a very good example. Spatial contexts are very
>>> important as they tend to be often used for referencing. So I would
>>> suggest to specially treat the spatial context. For spatial Entities
>>> (like a City) this is easy, but even for other (like a Person,
>>> Company) you could use relations to spatial entities define their
>>> spatial context. This context could than be used to correctly link
>>> "The Redmond's company" to "Microsoft".
>>>
>>> In addition I would suggest to use the "spatial" context of each
>>> entity (basically relation to entities that are cities, regions,
>>> countries) as a separate dimension, because those are very often used
>>> for coreferences.
>>>
>>> [1] http://www.mpi-inf.mpg.de/yago-naga/yago/
>>> [2] http://downloads.dbpedia.org/3.9/links/yago_links.nt.bz2
>>> [3]
>>> http://www.mpi-inf.mpg.de/yago-naga/yago/download/yago/yagoDBpediaInstances.ttl.7z
>>>
>>>
>>> On Thu, Feb 6, 2014 at 10:33 AM, Cristian Petroaca
>>> <cr...@gmail.com> wrote:
>>> > There are several dbpedia categories for each entity, in this case for
>>> > Microsoft we have :
>>> >
>>> > category:Companies_in_the_NASDAQ-100_Index
>>> > category:Microsoft
>>> > category:Software_companies_of_the_United_States
>>> > category:Software_companies_based_in_Washington_(state)
>>> > category:Companies_established_in_1975
>>> > category:1975_establishments_in_the_United_States
>>> > category:Companies_based_in_Redmond,_Washington
>>> > category:Multinational_companies_headquartered_in_the_United_States
>>> > category:Cloud_computing_providers
>>> > category:Companies_in_the_Dow_Jones_Industrial_Average
>>> >
>>> > So we also have "Companies based in Redmont,Washington" which could be
>>> > matched.
>>> >
>>> >
>>> > There is still other contextual information from dbpedia which can be
>>> used.
>>> > For example for an Organization we could also include :
>>> > dbpprop:industry = Software
>>> > dbpprop:service = Online Service Providers
>>> >
>>> > and for a Person (that's for Barack Obama) :
>>> >
>>> > dbpedia-owl:profession:
>>> >                                dbpedia:Author
>>> >                                dbpedia:Constitutional_law
>>> >                                dbpedia:Lawyer
>>> >                                dbpedia:Community_organizing
>>> >
>>> > I'd like to continue investigating this as I think that it may have
>>> some
>>> > value in increasing the number of coreference resolutions and I'd like
>>> to
>>> > concentrate more on precision rather than recall since we already have
>>> a
>>> > set of coreferences detected by the stanford nlp tool and this would
>>> be as
>>> > an addition to that (at least this is how I would like to use it).
>>> >
>>> > Is it ok if I track this by opening a jira? I could update it to show
>>> my
>>> > progress and also my conclusions and if it turns out that it was a bad
>>> idea
>>> > then that's the situation at least I'll end up with more knowledge
>>> about
>>> > Stanbol in the end :).
>>> >
>>> >
>>> > 2014-02-05 15:39 GMT+02:00 Rafa Haro <rh...@apache.org>:
>>> >
>>> >> Hi Cristian,
>>> >>
>>> >> The approach sounds nice. I don't want to be the devil's advocate but
>>> I'm
>>> >> just not sure about the recall using the dbpedia categories feature.
>>> For
>>> >> example, your sentence could be also "Microsoft posted its 2013
>>> earnings.
>>> >> The Redmond's company made a huge profit". So, maybe including more
>>> >> contextual information from dbpedia could increase the recall but of
>>> course
>>> >> will reduce the precision.
>>> >>
>>> >> Cheers,
>>> >> Rafa
>>> >>
>>> >> El 04/02/14 09:50, Cristian Petroaca escribió:
>>> >>
>>> >>  Back with a more detailed description of the steps for making this
>>> kind of
>>> >>> coreference work.
>>> >>>
>>> >>> I will be using references to the following text in the steps below
>>> in
>>> >>> order to make things clearer : "Microsoft posted its 2013 earnings.
>>> The
>>> >>> software company made a huge profit."
>>> >>>
>>> >>> 1. For every noun phrase in the text which has :
>>> >>>      a. a determinate pos which implies reference to an entity local
>>> to
>>> >>> the
>>> >>> text, such as "the, this, these") but not "another, every", etc which
>>> >>> implies a reference to an entity outside of the text.
>>> >>>      b. having at least another noun aside from the main required
>>> noun
>>> >>> which
>>> >>> further describes it. For example I will not count "The company" as
>>> being
>>> >>> a
>>> >>> legitimate candidate since this could create a lot of false
>>> positives by
>>> >>> considering the double meaning of some words such as "in the company
>>> of
>>> >>> good people".
>>> >>> "The software company" is a good candidate since we also have
>>> "software".
>>> >>>
>>> >>> 2. match the nouns in the noun phrase to the contents of the dbpedia
>>> >>> categories of each named entity found prior to the location of the
>>> noun
>>> >>> phrase in the text.
>>> >>> The dbpedia categories are in the following format (for Microsoft for
>>> >>> example) : "Software companies of the United States".
>>> >>>   So we try to match "software company" with that.
>>> >>> First, as you can see, the main noun in the dbpedia category has a
>>> plural
>>> >>> form and it's the same for all categories which I saw. I don't know
>>> if
>>> >>> there's an easier way to do this but I thought of applying a
>>> lemmatizer on
>>> >>> the category and the noun phrase in order for them to have a common
>>> >>> denominator.This also works if the noun phrase itself has a plural
>>> form.
>>> >>>
>>> >>> Second, I'll need to use for comparison only the words in the
>>> category
>>> >>> which are themselves nouns and not prepositions or determiners such
>>> as "of
>>> >>> the".This means that I need to pos tag the categories contents as
>>> well.
>>> >>> I was thinking of running the pos and lemma on the dbpedia
>>> categories when
>>> >>> building the dbpedia backed entity hub and storing them for later
>>> use - I
>>> >>> don't know how feasible this is at the moment.
>>> >>>
>>> >>> After this I can compare each noun in the noun phrase with the
>>> equivalent
>>> >>> nouns in the categories and based on the number of matches I can
>>> create a
>>> >>> confidence level.
>>> >>>
>>> >>> 3. match the noun of the noun phrase with the rdf:type from dbpedia
>>> of the
>>> >>> named entity. If this matches increase the confidence level.
>>> >>>
>>> >>> 4. If there are multiple named entities which can match a certain
>>> noun
>>> >>> phrase then link the noun phrase with the closest named entity prior
>>> to it
>>> >>> in the text.
>>> >>>
>>> >>> What do you think?
>>> >>>
>>> >>> Cristian
>>> >>>
>>> >>> 2014-01-31 Cristian Petroaca <cr...@gmail.com>:
>>> >>>
>>> >>>  Hi Rafa,
>>> >>>>
>>> >>>> I don't yet have a concrete heursitic but I'm working on it. I'll
>>> provide
>>> >>>> it here so that you guys can give me a feedback on it.
>>> >>>>
>>> >>>> What are "locality" features?
>>> >>>>
>>> >>>> I looked at Bart and other coref tools such as ArkRef and
>>> CherryPicker
>>> >>>> and
>>> >>>> they don't provide such a coreference.
>>> >>>>
>>> >>>> Cristian
>>> >>>>
>>> >>>>
>>> >>>> 2014-01-30 Rafa Haro <rh...@apache.org>:
>>> >>>>
>>> >>>> Hi Cristian,
>>> >>>>
>>> >>>>> Without having more details about your concrete heuristic, in my
>>> honest
>>> >>>>> opinion, such approach could produce a lot of false positives. I
>>> don't
>>> >>>>> know
>>> >>>>> if you are planning to use some "locality" features to detect such
>>> >>>>> coreferences but you need to take into account that it is quite
>>> usual
>>> >>>>> that
>>> >>>>> coreferenced mentions can occurs even in different paragraphs.
>>> Although
>>> >>>>> I'm
>>> >>>>> not an expert in Natural Language Understanding, I would say it is
>>> quite
>>> >>>>> difficult to get decent precision/recall rates for coreferencing
>>> using
>>> >>>>> fixed rules. Maybe you can give a try to others tools like BART (
>>> >>>>> http://www.bart-coref.org/).
>>> >>>>>
>>> >>>>> Cheers,
>>> >>>>> Rafa Haro
>>> >>>>>
>>> >>>>> El 30/01/14 10:33, Cristian Petroaca escribió:
>>> >>>>>
>>> >>>>>   Hi,
>>> >>>>>
>>> >>>>>> One of the necessary steps for implementing the Event extraction
>>> Engine
>>> >>>>>> feature : https://issues.apache.org/jira/browse/STANBOL-1121 is
>>> to
>>> >>>>>> have
>>> >>>>>> coreference resolution in the given text. This is provided now
>>> via the
>>> >>>>>> stanford-nlp project but as far as I saw this module is performing
>>> >>>>>> mostly
>>> >>>>>> pronomial (He, She) or nominal (Barack Obama and Mr. Obama)
>>> coreference
>>> >>>>>> resolution.
>>> >>>>>>
>>> >>>>>> In order to get more coreferences from the text I though of
>>> creating
>>> >>>>>> some
>>> >>>>>> logic that would detect this kind of coreference :
>>> >>>>>> "Apple reaches new profit heights. The software company just
>>> announced
>>> >>>>>> its
>>> >>>>>> 2013 earnings."
>>> >>>>>> Here "The software company" obviously refers to "Apple".
>>> >>>>>> So I'd like to detect coreferences of Named Entities which are of
>>> the
>>> >>>>>> rdf:type of the Named Entity , in this case "company" and also
>>> have
>>> >>>>>> attributes which can be found in the dbpedia categories of the
>>> named
>>> >>>>>> entity, in this case "software".
>>> >>>>>>
>>> >>>>>> The detection of coreferences such as "The software company" in
>>> the
>>> >>>>>> text
>>> >>>>>> would also be done by either using the new Pos Tag Based Phrase
>>> >>>>>> extraction
>>> >>>>>> Engine (noun phrases) or by using a dependency tree of the
>>> sentence and
>>> >>>>>> picking up only subjects or objects.
>>> >>>>>>
>>> >>>>>> At this point I'd like to know if this kind of logic would be
>>> useful
>>> >>>>>> as a
>>> >>>>>> separate Enhancement Engine (in case the precision and recall are
>>> good
>>> >>>>>> enough) in Stanbol?
>>> >>>>>>
>>> >>>>>> Thanks,
>>> >>>>>> Cristian
>>> >>>>>>
>>> >>>>>>
>>> >>>>>>
>>> >>
>>>
>>>
>>>
>>> --
>>> | Rupert Westenthaler             rupert.westenthaler@gmail.com
>>> | Bodenlehenstraße 11                             ++43-699-11108907
>>> | A-5500 Bischofshofen
>>>
>>
>>
>

Re: Named entity coref resolution based on dbpedia categories and rdf:type

Posted by Cristian Petroaca <cr...@gmail.com>.
Opened https://issues.apache.org/jira/browse/STANBOL-1279


2014-02-07 10:53 GMT+02:00 Cristian Petroaca <cr...@gmail.com>:

> Hi Rupert,
>
> The "spatial" dimension is a good idea. I'll also take a look at Yago.
>
> I will create a Jira with what we talked about here. It will probably have
> just a draft-like description for now and will be updated as I go along.
>
> Thanks,
> Cristian
>
>
> 2014-02-06 15:39 GMT+02:00 Rupert Westenthaler <
> rupert.westenthaler@gmail.com>:
>
> Hi Cristian,
>>
>> definitely an interesting approach. You should have a look at Yago2
>> [1]. As far as I can remember the Yago taxonomy is much better
>> structured as the one used by dbpedia. Mapping suggestions of dbpedia
>> to concepts in Yago2 is easy as both dbpedia and yago2 do provide
>> mappings [2] and [3]
>>
>> > 2014-02-05 15:39 GMT+02:00 Rafa Haro <rh...@apache.org>:
>> >>
>> >> "Microsoft posted its 2013 earnings. The Redmond's company made a
>> >> huge profit".
>>
>> Thats actually a very good example. Spatial contexts are very
>> important as they tend to be often used for referencing. So I would
>> suggest to specially treat the spatial context. For spatial Entities
>> (like a City) this is easy, but even for other (like a Person,
>> Company) you could use relations to spatial entities define their
>> spatial context. This context could than be used to correctly link
>> "The Redmond's company" to "Microsoft".
>>
>> In addition I would suggest to use the "spatial" context of each
>> entity (basically relation to entities that are cities, regions,
>> countries) as a separate dimension, because those are very often used
>> for coreferences.
>>
>> [1] http://www.mpi-inf.mpg.de/yago-naga/yago/
>> [2] http://downloads.dbpedia.org/3.9/links/yago_links.nt.bz2
>> [3]
>> http://www.mpi-inf.mpg.de/yago-naga/yago/download/yago/yagoDBpediaInstances.ttl.7z
>>
>>
>> On Thu, Feb 6, 2014 at 10:33 AM, Cristian Petroaca
>> <cr...@gmail.com> wrote:
>> > There are several dbpedia categories for each entity, in this case for
>> > Microsoft we have :
>> >
>> > category:Companies_in_the_NASDAQ-100_Index
>> > category:Microsoft
>> > category:Software_companies_of_the_United_States
>> > category:Software_companies_based_in_Washington_(state)
>> > category:Companies_established_in_1975
>> > category:1975_establishments_in_the_United_States
>> > category:Companies_based_in_Redmond,_Washington
>> > category:Multinational_companies_headquartered_in_the_United_States
>> > category:Cloud_computing_providers
>> > category:Companies_in_the_Dow_Jones_Industrial_Average
>> >
>> > So we also have "Companies based in Redmont,Washington" which could be
>> > matched.
>> >
>> >
>> > There is still other contextual information from dbpedia which can be
>> used.
>> > For example for an Organization we could also include :
>> > dbpprop:industry = Software
>> > dbpprop:service = Online Service Providers
>> >
>> > and for a Person (that's for Barack Obama) :
>> >
>> > dbpedia-owl:profession:
>> >                                dbpedia:Author
>> >                                dbpedia:Constitutional_law
>> >                                dbpedia:Lawyer
>> >                                dbpedia:Community_organizing
>> >
>> > I'd like to continue investigating this as I think that it may have some
>> > value in increasing the number of coreference resolutions and I'd like
>> to
>> > concentrate more on precision rather than recall since we already have a
>> > set of coreferences detected by the stanford nlp tool and this would be
>> as
>> > an addition to that (at least this is how I would like to use it).
>> >
>> > Is it ok if I track this by opening a jira? I could update it to show my
>> > progress and also my conclusions and if it turns out that it was a bad
>> idea
>> > then that's the situation at least I'll end up with more knowledge about
>> > Stanbol in the end :).
>> >
>> >
>> > 2014-02-05 15:39 GMT+02:00 Rafa Haro <rh...@apache.org>:
>> >
>> >> Hi Cristian,
>> >>
>> >> The approach sounds nice. I don't want to be the devil's advocate but
>> I'm
>> >> just not sure about the recall using the dbpedia categories feature.
>> For
>> >> example, your sentence could be also "Microsoft posted its 2013
>> earnings.
>> >> The Redmond's company made a huge profit". So, maybe including more
>> >> contextual information from dbpedia could increase the recall but of
>> course
>> >> will reduce the precision.
>> >>
>> >> Cheers,
>> >> Rafa
>> >>
>> >> El 04/02/14 09:50, Cristian Petroaca escribió:
>> >>
>> >>  Back with a more detailed description of the steps for making this
>> kind of
>> >>> coreference work.
>> >>>
>> >>> I will be using references to the following text in the steps below in
>> >>> order to make things clearer : "Microsoft posted its 2013 earnings.
>> The
>> >>> software company made a huge profit."
>> >>>
>> >>> 1. For every noun phrase in the text which has :
>> >>>      a. a determinate pos which implies reference to an entity local
>> to
>> >>> the
>> >>> text, such as "the, this, these") but not "another, every", etc which
>> >>> implies a reference to an entity outside of the text.
>> >>>      b. having at least another noun aside from the main required noun
>> >>> which
>> >>> further describes it. For example I will not count "The company" as
>> being
>> >>> a
>> >>> legitimate candidate since this could create a lot of false positives
>> by
>> >>> considering the double meaning of some words such as "in the company
>> of
>> >>> good people".
>> >>> "The software company" is a good candidate since we also have
>> "software".
>> >>>
>> >>> 2. match the nouns in the noun phrase to the contents of the dbpedia
>> >>> categories of each named entity found prior to the location of the
>> noun
>> >>> phrase in the text.
>> >>> The dbpedia categories are in the following format (for Microsoft for
>> >>> example) : "Software companies of the United States".
>> >>>   So we try to match "software company" with that.
>> >>> First, as you can see, the main noun in the dbpedia category has a
>> plural
>> >>> form and it's the same for all categories which I saw. I don't know if
>> >>> there's an easier way to do this but I thought of applying a
>> lemmatizer on
>> >>> the category and the noun phrase in order for them to have a common
>> >>> denominator.This also works if the noun phrase itself has a plural
>> form.
>> >>>
>> >>> Second, I'll need to use for comparison only the words in the category
>> >>> which are themselves nouns and not prepositions or determiners such
>> as "of
>> >>> the".This means that I need to pos tag the categories contents as
>> well.
>> >>> I was thinking of running the pos and lemma on the dbpedia categories
>> when
>> >>> building the dbpedia backed entity hub and storing them for later use
>> - I
>> >>> don't know how feasible this is at the moment.
>> >>>
>> >>> After this I can compare each noun in the noun phrase with the
>> equivalent
>> >>> nouns in the categories and based on the number of matches I can
>> create a
>> >>> confidence level.
>> >>>
>> >>> 3. match the noun of the noun phrase with the rdf:type from dbpedia
>> of the
>> >>> named entity. If this matches increase the confidence level.
>> >>>
>> >>> 4. If there are multiple named entities which can match a certain noun
>> >>> phrase then link the noun phrase with the closest named entity prior
>> to it
>> >>> in the text.
>> >>>
>> >>> What do you think?
>> >>>
>> >>> Cristian
>> >>>
>> >>> 2014-01-31 Cristian Petroaca <cr...@gmail.com>:
>> >>>
>> >>>  Hi Rafa,
>> >>>>
>> >>>> I don't yet have a concrete heursitic but I'm working on it. I'll
>> provide
>> >>>> it here so that you guys can give me a feedback on it.
>> >>>>
>> >>>> What are "locality" features?
>> >>>>
>> >>>> I looked at Bart and other coref tools such as ArkRef and
>> CherryPicker
>> >>>> and
>> >>>> they don't provide such a coreference.
>> >>>>
>> >>>> Cristian
>> >>>>
>> >>>>
>> >>>> 2014-01-30 Rafa Haro <rh...@apache.org>:
>> >>>>
>> >>>> Hi Cristian,
>> >>>>
>> >>>>> Without having more details about your concrete heuristic, in my
>> honest
>> >>>>> opinion, such approach could produce a lot of false positives. I
>> don't
>> >>>>> know
>> >>>>> if you are planning to use some "locality" features to detect such
>> >>>>> coreferences but you need to take into account that it is quite
>> usual
>> >>>>> that
>> >>>>> coreferenced mentions can occurs even in different paragraphs.
>> Although
>> >>>>> I'm
>> >>>>> not an expert in Natural Language Understanding, I would say it is
>> quite
>> >>>>> difficult to get decent precision/recall rates for coreferencing
>> using
>> >>>>> fixed rules. Maybe you can give a try to others tools like BART (
>> >>>>> http://www.bart-coref.org/).
>> >>>>>
>> >>>>> Cheers,
>> >>>>> Rafa Haro
>> >>>>>
>> >>>>> El 30/01/14 10:33, Cristian Petroaca escribió:
>> >>>>>
>> >>>>>   Hi,
>> >>>>>
>> >>>>>> One of the necessary steps for implementing the Event extraction
>> Engine
>> >>>>>> feature : https://issues.apache.org/jira/browse/STANBOL-1121 is to
>> >>>>>> have
>> >>>>>> coreference resolution in the given text. This is provided now via
>> the
>> >>>>>> stanford-nlp project but as far as I saw this module is performing
>> >>>>>> mostly
>> >>>>>> pronomial (He, She) or nominal (Barack Obama and Mr. Obama)
>> coreference
>> >>>>>> resolution.
>> >>>>>>
>> >>>>>> In order to get more coreferences from the text I though of
>> creating
>> >>>>>> some
>> >>>>>> logic that would detect this kind of coreference :
>> >>>>>> "Apple reaches new profit heights. The software company just
>> announced
>> >>>>>> its
>> >>>>>> 2013 earnings."
>> >>>>>> Here "The software company" obviously refers to "Apple".
>> >>>>>> So I'd like to detect coreferences of Named Entities which are of
>> the
>> >>>>>> rdf:type of the Named Entity , in this case "company" and also have
>> >>>>>> attributes which can be found in the dbpedia categories of the
>> named
>> >>>>>> entity, in this case "software".
>> >>>>>>
>> >>>>>> The detection of coreferences such as "The software company" in the
>> >>>>>> text
>> >>>>>> would also be done by either using the new Pos Tag Based Phrase
>> >>>>>> extraction
>> >>>>>> Engine (noun phrases) or by using a dependency tree of the
>> sentence and
>> >>>>>> picking up only subjects or objects.
>> >>>>>>
>> >>>>>> At this point I'd like to know if this kind of logic would be
>> useful
>> >>>>>> as a
>> >>>>>> separate Enhancement Engine (in case the precision and recall are
>> good
>> >>>>>> enough) in Stanbol?
>> >>>>>>
>> >>>>>> Thanks,
>> >>>>>> Cristian
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>
>>
>>
>>
>> --
>> | Rupert Westenthaler             rupert.westenthaler@gmail.com
>> | Bodenlehenstraße 11                             ++43-699-11108907
>> | A-5500 Bischofshofen
>>
>
>

Re: Named entity coref resolution based on dbpedia categories and rdf:type

Posted by Cristian Petroaca <cr...@gmail.com>.
Hi Rupert,

The "spatial" dimension is a good idea. I'll also take a look at Yago.

I will create a Jira with what we talked about here. It will probably have
just a draft-like description for now and will be updated as I go along.

Thanks,
Cristian


2014-02-06 15:39 GMT+02:00 Rupert Westenthaler <
rupert.westenthaler@gmail.com>:

> Hi Cristian,
>
> definitely an interesting approach. You should have a look at Yago2
> [1]. As far as I can remember the Yago taxonomy is much better
> structured than the one used by dbpedia. Mapping suggestions of dbpedia
> to concepts in Yago2 is easy as both dbpedia and yago2 do provide
> mappings [2] and [3]
>
> > 2014-02-05 15:39 GMT+02:00 Rafa Haro <rh...@apache.org>:
> >>
> >> "Microsoft posted its 2013 earnings. The Redmond's company made a
> >> huge profit".
>
> That's actually a very good example. Spatial contexts are very
> important as they tend to be often used for referencing. So I would
> suggest to specially treat the spatial context. For spatial Entities
> (like a City) this is easy, but even for other (like a Person,
> Company) you could use relations to spatial entities to define their
> spatial context. This context could then be used to correctly link
> "The Redmond's company" to "Microsoft".
>
> In addition I would suggest to use the "spatial" context of each
> entity (basically relation to entities that are cities, regions,
> countries) as a separate dimension, because those are very often used
> for coreferences.
>
> [1] http://www.mpi-inf.mpg.de/yago-naga/yago/
> [2] http://downloads.dbpedia.org/3.9/links/yago_links.nt.bz2
> [3]
> http://www.mpi-inf.mpg.de/yago-naga/yago/download/yago/yagoDBpediaInstances.ttl.7z
>
>
> On Thu, Feb 6, 2014 at 10:33 AM, Cristian Petroaca
> <cr...@gmail.com> wrote:
> > There are several dbpedia categories for each entity, in this case for
> > Microsoft we have :
> >
> > category:Companies_in_the_NASDAQ-100_Index
> > category:Microsoft
> > category:Software_companies_of_the_United_States
> > category:Software_companies_based_in_Washington_(state)
> > category:Companies_established_in_1975
> > category:1975_establishments_in_the_United_States
> > category:Companies_based_in_Redmond,_Washington
> > category:Multinational_companies_headquartered_in_the_United_States
> > category:Cloud_computing_providers
> > category:Companies_in_the_Dow_Jones_Industrial_Average
> >
> > So we also have "Companies based in Redmont,Washington" which could be
> > matched.
> >
> >
> > There is still other contextual information from dbpedia which can be
> used.
> > For example for an Organization we could also include :
> > dbpprop:industry = Software
> > dbpprop:service = Online Service Providers
> >
> > and for a Person (that's for Barack Obama) :
> >
> > dbpedia-owl:profession:
> >                                dbpedia:Author
> >                                dbpedia:Constitutional_law
> >                                dbpedia:Lawyer
> >                                dbpedia:Community_organizing
> >
> > I'd like to continue investigating this as I think that it may have some
> > value in increasing the number of coreference resolutions and I'd like to
> > concentrate more on precision rather than recall since we already have a
> > set of coreferences detected by the stanford nlp tool and this would be
> as
> > an addition to that (at least this is how I would like to use it).
> >
> > Is it ok if I track this by opening a jira? I could update it to show my
> > progress and also my conclusions and if it turns out that it was a bad
> idea
> > then that's the situation at least I'll end up with more knowledge about
> > Stanbol in the end :).
> >
> >
> > 2014-02-05 15:39 GMT+02:00 Rafa Haro <rh...@apache.org>:
> >
> >> Hi Cristian,
> >>
> >> The approach sounds nice. I don't want to be the devil's advocate but
> I'm
> >> just not sure about the recall using the dbpedia categories feature. For
> >> example, your sentence could be also "Microsoft posted its 2013
> earnings.
> >> The Redmond's company made a huge profit". So, maybe including more
> >> contextual information from dbpedia could increase the recall but of
> course
> >> will reduce the precision.
> >>
> >> Cheers,
> >> Rafa
> >>
> >> El 04/02/14 09:50, Cristian Petroaca escribió:
> >>
> >>  Back with a more detailed description of the steps for making this
> kind of
> >>> coreference work.
> >>>
> >>> I will be using references to the following text in the steps below in
> >>> order to make things clearer : "Microsoft posted its 2013 earnings. The
> >>> software company made a huge profit."
> >>>
> >>> 1. For every noun phrase in the text which has :
> >>>      a. a determinate pos which implies reference to an entity local to
> >>> the
> >>> text, such as "the, this, these") but not "another, every", etc which
> >>> implies a reference to an entity outside of the text.
> >>>      b. having at least another noun aside from the main required noun
> >>> which
> >>> further describes it. For example I will not count "The company" as
> being
> >>> a
> >>> legitimate candidate since this could create a lot of false positives
> by
> >>> considering the double meaning of some words such as "in the company of
> >>> good people".
> >>> "The software company" is a good candidate since we also have
> "software".
> >>>
> >>> 2. match the nouns in the noun phrase to the contents of the dbpedia
> >>> categories of each named entity found prior to the location of the noun
> >>> phrase in the text.
> >>> The dbpedia categories are in the following format (for Microsoft for
> >>> example) : "Software companies of the United States".
> >>>   So we try to match "software company" with that.
> >>> First, as you can see, the main noun in the dbpedia category has a
> plural
> >>> form and it's the same for all categories which I saw. I don't know if
> >>> there's an easier way to do this but I thought of applying a
> lemmatizer on
> >>> the category and the noun phrase in order for them to have a common
> >>> denominator.This also works if the noun phrase itself has a plural
> form.
> >>>
> >>> Second, I'll need to use for comparison only the words in the category
> >>> which are themselves nouns and not prepositions or determiners such as
> "of
> >>> the".This means that I need to pos tag the categories contents as well.
> >>> I was thinking of running the pos and lemma on the dbpedia categories
> when
> >>> building the dbpedia backed entity hub and storing them for later use
> - I
> >>> don't know how feasible this is at the moment.
> >>>
> >>> After this I can compare each noun in the noun phrase with the
> equivalent
> >>> nouns in the categories and based on the number of matches I can
> create a
> >>> confidence level.
> >>>
> >>> 3. match the noun of the noun phrase with the rdf:type from dbpedia of
> the
> >>> named entity. If this matches increase the confidence level.
> >>>
> >>> 4. If there are multiple named entities which can match a certain noun
> >>> phrase then link the noun phrase with the closest named entity prior
> to it
> >>> in the text.
> >>>
> >>> What do you think?
> >>>
> >>> Cristian
> >>>
> >>> 2014-01-31 Cristian Petroaca <cr...@gmail.com>:
> >>>
> >>>  Hi Rafa,
> >>>>
> >>>> I don't yet have a concrete heursitic but I'm working on it. I'll
> provide
> >>>> it here so that you guys can give me a feedback on it.
> >>>>
> >>>> What are "locality" features?
> >>>>
> >>>> I looked at Bart and other coref tools such as ArkRef and CherryPicker
> >>>> and
> >>>> they don't provide such a coreference.
> >>>>
> >>>> Cristian
> >>>>
> >>>>
> >>>> 2014-01-30 Rafa Haro <rh...@apache.org>:
> >>>>
> >>>> Hi Cristian,
> >>>>
> >>>>> Without having more details about your concrete heuristic, in my
> honest
> >>>>> opinion, such approach could produce a lot of false positives. I
> don't
> >>>>> know
> >>>>> if you are planning to use some "locality" features to detect such
> >>>>> coreferences but you need to take into account that it is quite usual
> >>>>> that
> >>>>> coreferenced mentions can occurs even in different paragraphs.
> Although
> >>>>> I'm
> >>>>> not an expert in Natural Language Understanding, I would say it is
> quite
> >>>>> difficult to get decent precision/recall rates for coreferencing
> using
> >>>>> fixed rules. Maybe you can give a try to others tools like BART (
> >>>>> http://www.bart-coref.org/).
> >>>>>
> >>>>> Cheers,
> >>>>> Rafa Haro
> >>>>>
> >>>>> El 30/01/14 10:33, Cristian Petroaca escribió:
> >>>>>
> >>>>>   Hi,
> >>>>>
> >>>>>> One of the necessary steps for implementing the Event extraction
> Engine
> >>>>>> feature : https://issues.apache.org/jira/browse/STANBOL-1121 is to
> >>>>>> have
> >>>>>> coreference resolution in the given text. This is provided now via
> the
> >>>>>> stanford-nlp project but as far as I saw this module is performing
> >>>>>> mostly
> >>>>>> pronomial (He, She) or nominal (Barack Obama and Mr. Obama)
> coreference
> >>>>>> resolution.
> >>>>>>
> >>>>>> In order to get more coreferences from the text I though of creating
> >>>>>> some
> >>>>>> logic that would detect this kind of coreference :
> >>>>>> "Apple reaches new profit heights. The software company just
> announced
> >>>>>> its
> >>>>>> 2013 earnings."
> >>>>>> Here "The software company" obviously refers to "Apple".
> >>>>>> So I'd like to detect coreferences of Named Entities which are of
> the
> >>>>>> rdf:type of the Named Entity , in this case "company" and also have
> >>>>>> attributes which can be found in the dbpedia categories of the named
> >>>>>> entity, in this case "software".
> >>>>>>
> >>>>>> The detection of coreferences such as "The software company" in the
> >>>>>> text
> >>>>>> would also be done by either using the new Pos Tag Based Phrase
> >>>>>> extraction
> >>>>>> Engine (noun phrases) or by using a dependency tree of the sentence
> and
> >>>>>> picking up only subjects or objects.
> >>>>>>
> >>>>>> At this point I'd like to know if this kind of logic would be useful
> >>>>>> as a
> >>>>>> separate Enhancement Engine (in case the precision and recall are
> good
> >>>>>> enough) in Stanbol?
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Cristian
> >>>>>>
> >>>>>>
> >>>>>>
> >>
>
>
>
> --
> | Rupert Westenthaler             rupert.westenthaler@gmail.com
> | Bodenlehenstraße 11                             ++43-699-11108907
> | A-5500 Bischofshofen
>

Re: Named entity coref resolution based on dbpedia categories and rdf:type

Posted by Rupert Westenthaler <ru...@gmail.com>.
Hi Cristian,

definitely an interesting approach. You should have a look at Yago2
[1]. As far as I can remember, the Yago taxonomy is much better
structured than the one used by dbpedia. Mapping suggestions of dbpedia
to concepts in Yago2 is easy, as both dbpedia and yago2 provide
mappings [2] and [3].

> 2014-02-05 15:39 GMT+02:00 Rafa Haro <rh...@apache.org>:
>>
>> "Microsoft posted its 2013 earnings. The Redmond's company made a
>> huge profit".

That's actually a very good example. Spatial contexts are very
important, as they tend to be used often for referencing, so I would
suggest treating the spatial context specially. For spatial entities
(like a City) this is easy, but even for others (like a Person or a
Company) you could use relations to spatial entities to define their
spatial context. This context could then be used to correctly link
"The Redmond's company" to "Microsoft".

In addition I would suggest using the "spatial" context of each
entity (basically its relations to entities that are cities, regions or
countries) as a separate dimension, because those are very often used
for coreferences.

[1] http://www.mpi-inf.mpg.de/yago-naga/yago/
[2] http://downloads.dbpedia.org/3.9/links/yago_links.nt.bz2
[3] http://www.mpi-inf.mpg.de/yago-naga/yago/download/yago/yagoDBpediaInstances.ttl.7z
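
To make that a bit more concrete, here is a small, self-contained sketch of how such a
pre-computed spatial context could be checked against a modifier like "Redmond's". The
chosen properties mentioned in the comments and the hard-coded map are purely
illustrative, not existing Stanbol or dbpedia indexing code:

    import java.util.*;

    public class SpatialContextSketch {

        //spatial context labels per entity, e.g. collected at indexing time from
        //properties such as dbpedia-owl:location or the "based in ..." categories
        //(which properties to use is an open question, this map is just an example)
        static final Map<String, Set<String>> SPATIAL_CONTEXT = new HashMap<String, Set<String>>();
        static {
            SPATIAL_CONTEXT.put("dbpedia:Microsoft",
                    new HashSet<String>(Arrays.asList("redmond", "washington", "united states")));
        }

        //true if one of the noun phrase modifiers names a place from the entity's
        //spatial context, e.g. "Redmond's" in "The Redmond's company"
        static boolean spatialMatch(String entityUri, List<String> nounPhraseModifiers) {
            Set<String> places = SPATIAL_CONTEXT.get(entityUri);
            if (places == null) {
                return false;
            }
            for (String modifier : nounPhraseModifiers) {
                //strip a possessive marker such as "Redmond's" -> "redmond"
                String normalized = modifier.toLowerCase(Locale.ENGLISH).replaceAll("'s$", "");
                if (places.contains(normalized)) {
                    return true;
                }
            }
            return false;
        }

        public static void main(String[] args) {
            System.out.println(spatialMatch("dbpedia:Microsoft", Arrays.asList("Redmond's"))); //true
        }
    }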


On Thu, Feb 6, 2014 at 10:33 AM, Cristian Petroaca
<cr...@gmail.com> wrote:
> There are several dbpedia categories for each entity, in this case for
> Microsoft we have :
>
> category:Companies_in_the_NASDAQ-100_Index
> category:Microsoft
> category:Software_companies_of_the_United_States
> category:Software_companies_based_in_Washington_(state)
> category:Companies_established_in_1975
> category:1975_establishments_in_the_United_States
> category:Companies_based_in_Redmond,_Washington
> category:Multinational_companies_headquartered_in_the_United_States
> category:Cloud_computing_providers
> category:Companies_in_the_Dow_Jones_Industrial_Average
>
> So we also have "Companies based in Redmond, Washington" which could be
> matched.
>
>
> There is still other contextual information from dbpedia which can be used.
> For example for an Organization we could also include :
> dbpprop:industry = Software
> dbpprop:service = Online Service Providers
>
> and for a Person (that's for Barack Obama) :
>
> dbpedia-owl:profession:
>                                dbpedia:Author
>                                dbpedia:Constitutional_law
>                                dbpedia:Lawyer
>                                dbpedia:Community_organizing
>
> I'd like to continue investigating this as I think that it may have some
> value in increasing the number of coreference resolutions and I'd like to
> concentrate more on precision rather than recall since we already have a
> set of coreferences detected by the stanford nlp tool and this would be as
> an addition to that (at least this is how I would like to use it).
>
> Is it ok if I track this by opening a jira? I could update it to show my
> progress and also my conclusions and if it turns out that it was a bad idea
> then that's the situation at least I'll end up with more knowledge about
> Stanbol in the end :).
>
>
> 2014-02-05 15:39 GMT+02:00 Rafa Haro <rh...@apache.org>:
>
>> Hi Cristian,
>>
>> The approach sounds nice. I don't want to be the devil's advocate but I'm
>> just not sure about the recall using the dbpedia categories feature. For
>> example, your sentence could be also "Microsoft posted its 2013 earnings.
>> The Redmond's company made a huge profit". So, maybe including more
>> contextual information from dbpedia could increase the recall but of course
>> will reduce the precision.
>>
>> Cheers,
>> Rafa
>>
>> El 04/02/14 09:50, Cristian Petroaca escribió:
>>
>>  Back with a more detailed description of the steps for making this kind of
>>> coreference work.
>>>
>>> I will be using references to the following text in the steps below in
>>> order to make things clearer : "Microsoft posted its 2013 earnings. The
>>> software company made a huge profit."
>>>
>>> 1. For every noun phrase in the text which has :
>>>      a. a determinate pos which implies reference to an entity local to
>>> the
>>> text, such as "the, this, these") but not "another, every", etc which
>>> implies a reference to an entity outside of the text.
>>>      b. having at least another noun aside from the main required noun
>>> which
>>> further describes it. For example I will not count "The company" as being
>>> a
>>> legitimate candidate since this could create a lot of false positives by
>>> considering the double meaning of some words such as "in the company of
>>> good people".
>>> "The software company" is a good candidate since we also have "software".
>>>
>>> 2. match the nouns in the noun phrase to the contents of the dbpedia
>>> categories of each named entity found prior to the location of the noun
>>> phrase in the text.
>>> The dbpedia categories are in the following format (for Microsoft for
>>> example) : "Software companies of the United States".
>>>   So we try to match "software company" with that.
>>> First, as you can see, the main noun in the dbpedia category has a plural
>>> form and it's the same for all categories which I saw. I don't know if
>>> there's an easier way to do this but I thought of applying a lemmatizer on
>>> the category and the noun phrase in order for them to have a common
>>> denominator.This also works if the noun phrase itself has a plural form.
>>>
>>> Second, I'll need to use for comparison only the words in the category
>>> which are themselves nouns and not prepositions or determiners such as "of
>>> the".This means that I need to pos tag the categories contents as well.
>>> I was thinking of running the pos and lemma on the dbpedia categories when
>>> building the dbpedia backed entity hub and storing them for later use - I
>>> don't know how feasible this is at the moment.
>>>
>>> After this I can compare each noun in the noun phrase with the equivalent
>>> nouns in the categories and based on the number of matches I can create a
>>> confidence level.
>>>
>>> 3. match the noun of the noun phrase with the rdf:type from dbpedia of the
>>> named entity. If this matches increase the confidence level.
>>>
>>> 4. If there are multiple named entities which can match a certain noun
>>> phrase then link the noun phrase with the closest named entity prior to it
>>> in the text.
>>>
>>> What do you think?
>>>
>>> Cristian
>>>
>>> 2014-01-31 Cristian Petroaca <cr...@gmail.com>:
>>>
>>>  Hi Rafa,
>>>>
>>>> I don't yet have a concrete heursitic but I'm working on it. I'll provide
>>>> it here so that you guys can give me a feedback on it.
>>>>
>>>> What are "locality" features?
>>>>
>>>> I looked at Bart and other coref tools such as ArkRef and CherryPicker
>>>> and
>>>> they don't provide such a coreference.
>>>>
>>>> Cristian
>>>>
>>>>
>>>> 2014-01-30 Rafa Haro <rh...@apache.org>:
>>>>
>>>> Hi Cristian,
>>>>
>>>>> Without having more details about your concrete heuristic, in my honest
>>>>> opinion, such approach could produce a lot of false positives. I don't
>>>>> know
>>>>> if you are planning to use some "locality" features to detect such
>>>>> coreferences but you need to take into account that it is quite usual
>>>>> that
>>>>> coreferenced mentions can occurs even in different paragraphs. Although
>>>>> I'm
>>>>> not an expert in Natural Language Understanding, I would say it is quite
>>>>> difficult to get decent precision/recall rates for coreferencing using
>>>>> fixed rules. Maybe you can try other tools like BART (
>>>>> http://www.bart-coref.org/).
>>>>>
>>>>> Cheers,
>>>>> Rafa Haro
>>>>>
>>>>> El 30/01/14 10:33, Cristian Petroaca escribió:
>>>>>
>>>>>   Hi,
>>>>>
>>>>>> One of the necessary steps for implementing the Event extraction Engine
>>>>>> feature : https://issues.apache.org/jira/browse/STANBOL-1121 is to
>>>>>> have
>>>>>> coreference resolution in the given text. This is provided now via the
>>>>>> stanford-nlp project but as far as I saw this module is performing
>>>>>> mostly
>>>>>> pronominal (He, She) or nominal (Barack Obama and Mr. Obama) coreference
>>>>>> resolution.
>>>>>>
>>>>>> In order to get more coreferences from the text I thought of creating
>>>>>> some
>>>>>> logic that would detect this kind of coreference :
>>>>>> "Apple reaches new profit heights. The software company just announced
>>>>>> its
>>>>>> 2013 earnings."
>>>>>> Here "The software company" obviously refers to "Apple".
>>>>>> So I'd like to detect coreferences of Named Entities which are of the
>>>>>> rdf:type of the Named Entity , in this case "company" and also have
>>>>>> attributes which can be found in the dbpedia categories of the named
>>>>>> entity, in this case "software".
>>>>>>
>>>>>> The detection of coreferences such as "The software company" in the
>>>>>> text
>>>>>> would also be done by either using the new Pos Tag Based Phrase
>>>>>> extraction
>>>>>> Engine (noun phrases) or by using a dependency tree of the sentence and
>>>>>> picking up only subjects or objects.
>>>>>>
>>>>>> At this point I'd like to know if this kind of logic would be useful
>>>>>> as a
>>>>>> separate Enhancement Engine (in case the precision and recall are good
>>>>>> enough) in Stanbol?
>>>>>>
>>>>>> Thanks,
>>>>>> Cristian
>>>>>>
>>>>>>
>>>>>>
>>



-- 
| Rupert Westenthaler             rupert.westenthaler@gmail.com
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen

Re: Named entity coref resolution based on dbpedia categories and rdf:type

Posted by Cristian Petroaca <cr...@gmail.com>.
There are several dbpedia categories for each entity, in this case for
Microsoft we have :

category:Companies_in_the_NASDAQ-100_Index
category:Microsoft
category:Software_companies_of_the_United_States
category:Software_companies_based_in_Washington_(state)
category:Companies_established_in_1975
category:1975_establishments_in_the_United_States
category:Companies_based_in_Redmond,_Washington
category:Multinational_companies_headquartered_in_the_United_States
category:Cloud_computing_providers
category:Companies_in_the_Dow_Jones_Industrial_Average

So we also have "Companies based in Redmond, Washington" which could be
matched.
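
To make the matching a bit more concrete, here is a rough sketch of
what I have in mind (plain Java, not the actual engine code; the
naiveLemma() below is only a stand-in for the real lemma and POS
information that would come from the NLP chain):

import java.util.*;

public class CategoryMatcher {

    private static final Set<String> STOPWORDS = new HashSet<>(
            Arrays.asList("of", "the", "in", "based", "and"));

    // very naive lemma: lowercase and strip a plural ending
    static String naiveLemma(String word) {
        String w = word.toLowerCase(Locale.ENGLISH);
        if (w.endsWith("ies")) return w.substring(0, w.length() - 3) + "y";
        if (w.endsWith("s") && w.length() > 3) return w.substring(0, w.length() - 1);
        return w;
    }

    // content word lemmas of a noun phrase or of a category label
    static Set<String> contentLemmas(String text) {
        Set<String> lemmas = new HashSet<>();
        for (String token : text.replace('_', ' ').split("[^A-Za-z]+")) {
            if (token.isEmpty() || STOPWORDS.contains(token.toLowerCase(Locale.ENGLISH))) {
                continue;
            }
            lemmas.add(naiveLemma(token));
        }
        return lemmas;
    }

    // fraction of the noun phrase's content words found in the category
    static double confidence(String nounPhrase, String categoryLabel) {
        Set<String> phrase = contentLemmas(nounPhrase);
        Set<String> category = contentLemmas(categoryLabel);
        if (phrase.isEmpty()) return 0.0;
        int matches = 0;
        for (String lemma : phrase) {
            if (category.contains(lemma)) matches++;
        }
        return (double) matches / phrase.size();
    }

    public static void main(String[] args) {
        // 1.0 - both "software" and "company" match
        System.out.println(confidence("The software company",
                "Software_companies_of_the_United_States"));
        // 0.5 - only "company" matches
        System.out.println(confidence("The software company",
                "Companies_based_in_Redmond,_Washington"));
    }
}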


There is still other contextual information from dbpedia which can be used.
For example for an Organization we could also include :
dbpprop:industry = Software
dbpprop:service = Online Service Providers

and for a Person (that's for Barack Obama) :

dbpedia-owl:profession:
                               dbpedia:Author
                               dbpedia:Constitutional_law
                               dbpedia:Lawyer
                               dbpedia:Community_organizing
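
As a sketch of what I mean by including this information (again plain
Java; the property names and values are just the ones from the
examples above, everything else is illustrative): the values of such
properties could simply be folded into the same bag of terms the
categories already provide, so the matching itself stays the same:

import java.util.*;

public class EntityContextTerms {

    // collect lower-cased tokens from all context property values
    public static Set<String> contextTerms(Map<String, List<String>> properties) {
        Set<String> terms = new HashSet<>();
        for (List<String> values : properties.values()) {
            for (String value : values) {
                for (String token : value.replace('_', ' ').split("[^A-Za-z]+")) {
                    if (!token.isEmpty()) {
                        terms.add(token.toLowerCase(Locale.ENGLISH));
                    }
                }
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        Map<String, List<String>> microsoft = new HashMap<>();
        microsoft.put("dcterms:subject", Arrays.asList(
                "Software_companies_of_the_United_States",
                "Companies_based_in_Redmond,_Washington"));
        microsoft.put("dbpprop:industry", Arrays.asList("Software"));
        microsoft.put("dbpprop:service", Arrays.asList("Online Service Providers"));
        // prints the combined term set used for matching noun phrases
        System.out.println(contextTerms(microsoft));
    }
}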

I'd like to continue investigating this as I think it may have some value
in increasing the number of coreference resolutions. I'd like to
concentrate more on precision than on recall, since we already have a set
of coreferences detected by the Stanford NLP tool and this would be an
addition to that (at least this is how I would like to use it).
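
The way I picture that combination (just an illustration; the types and
the threshold value are made up): the category/property based links
would be added on top of the existing coreference set, and only when
they clear a fairly high confidence threshold, so they never override
what the Stanford tool already resolved:

import java.util.*;

public class AdditiveCorefMerger {

    /** candidate link between a noun phrase (by start offset) and a named entity */
    public static final class Link {
        final int nounPhraseStart;
        final String entity;
        final double confidence;
        Link(int nounPhraseStart, String entity, double confidence) {
            this.nounPhraseStart = nounPhraseStart;
            this.entity = entity;
            this.confidence = confidence;
        }
        @Override public String toString() {
            return entity + "@" + nounPhraseStart + " (" + confidence + ")";
        }
    }

    public static List<Link> merge(List<Link> existing, List<Link> candidates,
                                   double threshold) {
        List<Link> merged = new ArrayList<>(existing);
        Set<Integer> covered = new HashSet<>();
        for (Link link : existing) {
            covered.add(link.nounPhraseStart);
        }
        for (Link candidate : candidates) {
            // additive only: skip mentions the existing tool already resolved
            if (candidate.confidence >= threshold
                    && covered.add(candidate.nounPhraseStart)) {
                merged.add(candidate);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<Link> stanford = Arrays.asList(new Link(0, "Microsoft", 1.0));
        List<Link> categoryBased = Arrays.asList(
                new Link(35, "Microsoft", 0.9), new Link(60, "Microsoft", 0.4));
        // keeps only the 0.9 candidate
        System.out.println(merge(stanford, categoryBased, 0.8));
    }
}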

Is it ok if I track this by opening a jira? I could update it to show my
progress and my conclusions, and if it turns out to have been a bad idea,
then so be it - at least I'll end up with more knowledge about Stanbol in
the end :).


2014-02-05 15:39 GMT+02:00 Rafa Haro <rh...@apache.org>:

> Hi Cristian,
>
> The approach sounds nice. I don't want to be the devil's advocate but I'm
> just not sure about the recall using the dbpedia categories feature. For
> example, your sentence could also be "Microsoft posted its 2013 earnings.
> The Redmond's company made a huge profit". So, maybe including more
> contextual information from dbpedia could increase the recall but of course
> will reduce the precision.
>
> Cheers,
> Rafa
>
> El 04/02/14 09:50, Cristian Petroaca escribió:
>
>  Back with a more detailed description of the steps for making this kind of
>> coreference work.
>>
>> I will be using references to the following text in the steps below in
>> order to make things clearer : "Microsoft posted its 2013 earnings. The
>> software company made a huge profit."
>>
>> 1. For every noun phrase in the text which has :
>>      a. a determinate pos which implies reference to an entity local to
>> the
>> text, such as "the, this, these") but not "another, every", etc which
>> implies a reference to an entity outside of the text.
>>      b. having at least another noun aside from the main required noun
>> which
>> further describes it. For example I will not count "The company" as being
>> a
>> legitimate candidate since this could create a lot of false positives by
>> considering the double meaning of some words such as "in the company of
>> good people".
>> "The software company" is a good candidate since we also have "software".
>>
>> 2. match the nouns in the noun phrase to the contents of the dbpedia
>> categories of each named entity found prior to the location of the noun
>> phrase in the text.
>> The dbpedia categories are in the following format (for Microsoft for
>> example) : "Software companies of the United States".
>>   So we try to match "software company" with that.
>> First, as you can see, the main noun in the dbpedia category has a plural
>> form and it's the same for all categories which I saw. I don't know if
>> there's an easier way to do this but I thought of applying a lemmatizer on
>> the category and the noun phrase in order for them to have a common
>> denominator.This also works if the noun phrase itself has a plural form.
>>
>> Second, I'll need to use for comparison only the words in the category
>> which are themselves nouns and not prepositions or determiners such as "of
>> the".This means that I need to pos tag the categories contents as well.
>> I was thinking of running the pos and lemma on the dbpedia categories when
>> building the dbpedia backed entity hub and storing them for later use - I
>> don't know how feasible this is at the moment.
>>
>> After this I can compare each noun in the noun phrase with the equivalent
>> nouns in the categories and based on the number of matches I can create a
>> confidence level.
>>
>> 3. match the noun of the noun phrase with the rdf:type from dbpedia of the
>> named entity. If this matches increase the confidence level.
>>
>> 4. If there are multiple named entities which can match a certain noun
>> phrase then link the noun phrase with the closest named entity prior to it
>> in the text.
>>
>> What do you think?
>>
>> Cristian
>>
>> 2014-01-31 Cristian Petroaca <cr...@gmail.com>:
>>
>>  Hi Rafa,
>>>
>>> I don't yet have a concrete heursitic but I'm working on it. I'll provide
>>> it here so that you guys can give me a feedback on it.
>>>
>>> What are "locality" features?
>>>
>>> I looked at Bart and other coref tools such as ArkRef and CherryPicker
>>> and
>>> they don't provide such a coreference.
>>>
>>> Cristian
>>>
>>>
>>> 2014-01-30 Rafa Haro <rh...@apache.org>:
>>>
>>> Hi Cristian,
>>>
>>>> Without having more details about your concrete heuristic, in my honest
>>>> opinion, such approach could produce a lot of false positives. I don't
>>>> know
>>>> if you are planning to use some "locality" features to detect such
>>>> coreferences but you need to take into account that it is quite usual
>>>> that
>>>> coreferenced mentions can occurs even in different paragraphs. Although
>>>> I'm
>>>> not an expert in Natural Language Understanding, I would say it is quite
>>>> difficult to get decent precision/recall rates for coreferencing using
>>>> fixed rules. Maybe you can try other tools like BART (
>>>> http://www.bart-coref.org/).
>>>>
>>>> Cheers,
>>>> Rafa Haro
>>>>
>>>> El 30/01/14 10:33, Cristian Petroaca escribió:
>>>>
>>>>   Hi,
>>>>
>>>>> One of the necessary steps for implementing the Event extraction Engine
>>>>> feature : https://issues.apache.org/jira/browse/STANBOL-1121 is to
>>>>> have
>>>>> coreference resolution in the given text. This is provided now via the
>>>>> stanford-nlp project but as far as I saw this module is performing
>>>>> mostly
>>>>> pronominal (He, She) or nominal (Barack Obama and Mr. Obama) coreference
>>>>> resolution.
>>>>>
>>>>> In order to get more coreferences from the text I thought of creating
>>>>> some
>>>>> logic that would detect this kind of coreference :
>>>>> "Apple reaches new profit heights. The software company just announced
>>>>> its
>>>>> 2013 earnings."
>>>>> Here "The software company" obviously refers to "Apple".
>>>>> So I'd like to detect coreferences of Named Entities which are of the
>>>>> rdf:type of the Named Entity , in this case "company" and also have
>>>>> attributes which can be found in the dbpedia categories of the named
>>>>> entity, in this case "software".
>>>>>
>>>>> The detection of coreferences such as "The software company" in the
>>>>> text
>>>>> would also be done by either using the new Pos Tag Based Phrase
>>>>> extraction
>>>>> Engine (noun phrases) or by using a dependency tree of the sentence and
>>>>> picking up only subjects or objects.
>>>>>
>>>>> At this point I'd like to know if this kind of logic would be useful
>>>>> as a
>>>>> separate Enhancement Engine (in case the precision and recall are good
>>>>> enough) in Stanbol?
>>>>>
>>>>> Thanks,
>>>>> Cristian
>>>>>
>>>>>
>>>>>
>

Re: Named entity coref resolution based on dbpedia categories and rdf:type

Posted by Rafa Haro <rh...@apache.org>.
Hi Cristian,

The approach sounds nice. I don't want to be the devil's advocate but 
I'm just not sure about the recall using the dbpedia categories feature. 
For example, your sentence could also be "Microsoft posted its 2013 
earnings. The Redmond's company made a huge profit". So, maybe including 
more contextual information from dbpedia could increase the recall but 
of course will reduce the precision.

Cheers,
Rafa

El 04/02/14 09:50, Cristian Petroaca escribió:
> Back with a more detailed description of the steps for making this kind of
> coreference work.
>
> I will be using references to the following text in the steps below in
> order to make things clearer : "Microsoft posted its 2013 earnings. The
> software company made a huge profit."
>
> 1. For every noun phrase in the text which has :
>      a. a determinate pos which implies reference to an entity local to the
> text, such as "the, this, these") but not "another, every", etc which
> implies a reference to an entity outside of the text.
>      b. having at least another noun aside from the main required noun which
> further describes it. For example I will not count "The company" as being a
> legitimate candidate since this could create a lot of false positives by
> considering the double meaning of some words such as "in the company of
> good people".
> "The software company" is a good candidate since we also have "software".
>
> 2. match the nouns in the noun phrase to the contents of the dbpedia
> categories of each named entity found prior to the location of the noun
> phrase in the text.
> The dbpedia categories are in the following format (for Microsoft for
> example) : "Software companies of the United States".
>   So we try to match "software company" with that.
> First, as you can see, the main noun in the dbpedia category has a plural
> form and it's the same for all categories which I saw. I don't know if
> there's an easier way to do this but I thought of applying a lemmatizer on
> the category and the noun phrase in order for them to have a common
> denominator.This also works if the noun phrase itself has a plural form.
>
> Second, I'll need to use for comparison only the words in the category
> which are themselves nouns and not prepositions or determiners such as "of
> the".This means that I need to pos tag the categories contents as well.
> I was thinking of running the pos and lemma on the dbpedia categories when
> building the dbpedia backed entity hub and storing them for later use - I
> don't know how feasible this is at the moment.
>
> After this I can compare each noun in the noun phrase with the equivalent
> nouns in the categories and based on the number of matches I can create a
> confidence level.
>
> 3. match the noun of the noun phrase with the rdf:type from dbpedia of the
> named entity. If this matches increase the confidence level.
>
> 4. If there are multiple named entities which can match a certain noun
> phrase then link the noun phrase with the closest named entity prior to it
> in the text.
>
> What do you think?
>
> Cristian
>
> 2014-01-31 Cristian Petroaca <cr...@gmail.com>:
>
>> Hi Rafa,
>>
>> I don't yet have a concrete heursitic but I'm working on it. I'll provide
>> it here so that you guys can give me a feedback on it.
>>
>> What are "locality" features?
>>
>> I looked at Bart and other coref tools such as ArkRef and CherryPicker and
>> they don't provide such a coreference.
>>
>> Cristian
>>
>>
>> 2014-01-30 Rafa Haro <rh...@apache.org>:
>>
>> Hi Cristian,
>>> Without having more details about your concrete heuristic, in my honest
>>> opinion, such approach could produce a lot of false positives. I don't know
>>> if you are planning to use some "locality" features to detect such
>>> coreferences but you need to take into account that it is quite usual that
>>> coreferenced mentions can occurs even in different paragraphs. Although I'm
>>> not an expert in Natural Language Understanding, I would say it is quite
>>> difficult to get decent precision/recall rates for coreferencing using
>>> fixed rules. Maybe you can try other tools like BART (
>>> http://www.bart-coref.org/).
>>>
>>> Cheers,
>>> Rafa Haro
>>>
>>> El 30/01/14 10:33, Cristian Petroaca escribió:
>>>
>>>   Hi,
>>>> One of the necessary steps for implementing the Event extraction Engine
>>>> feature : https://issues.apache.org/jira/browse/STANBOL-1121 is to have
>>>> coreference resolution in the given text. This is provided now via the
>>>> stanford-nlp project but as far as I saw this module is performing mostly
>>>> pronominal (He, She) or nominal (Barack Obama and Mr. Obama) coreference
>>>> resolution.
>>>>
>>>> In order to get more coreferences from the text I thought of creating some
>>>> logic that would detect this kind of coreference :
>>>> "Apple reaches new profit heights. The software company just announced
>>>> its
>>>> 2013 earnings."
>>>> Here "The software company" obviously refers to "Apple".
>>>> So I'd like to detect coreferences of Named Entities which are of the
>>>> rdf:type of the Named Entity , in this case "company" and also have
>>>> attributes which can be found in the dbpedia categories of the named
>>>> entity, in this case "software".
>>>>
>>>> The detection of coreferences such as "The software company" in the text
>>>> would also be done by either using the new Pos Tag Based Phrase
>>>> extraction
>>>> Engine (noun phrases) or by using a dependency tree of the sentence and
>>>> picking up only subjects or objects.
>>>>
>>>> At this point I'd like to know if this kind of logic would be useful as a
>>>> separate Enhancement Engine (in case the precision and recall are good
>>>> enough) in Stanbol?
>>>>
>>>> Thanks,
>>>> Cristian
>>>>
>>>>