Posted to dev@stanbol.apache.org by Cristian Petroaca <cr...@gmail.com> on 2015/12/15 20:30:08 UTC

Re: Event Extraction Engine

Hi All,

I've defined the rules and ontology schemas.
We will have two types of files: rule files and ontology files.

Ontology files have the .ont extension and contain the following
structure:
Name=Dbpedia
BaseUrl=http://dbpedia.org
#Entities
Acquisition=$BaseUrl/page/Acquisition
Financing=$BaseUrl/page/Financing
Investment=$BaseUrl/page/Investment
Organisation=$BaseUrl/page/Organisation
Person=$BaseUrl/page/Person
Money=$BaseUrl/page/Money

In these files we define variables which point to ontology-specific
entities. These variables are then referenced in the rule files.
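
A minimal sketch of how the engine could load such an .ont file (the class
and method names below are only a suggestion, nothing is implemented yet):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

public class OntologyFileParser {

    /**
     * Reads an .ont file into a map of variable name -> expanded URI.
     * Lines starting with '#' are comments; '$' references are resolved
     * against variables defined earlier in the file (e.g. $BaseUrl).
     */
    public static Map<String, String> parse(String path) throws IOException {
        Map<String, String> vars = new HashMap<>();
        for (String line : Files.readAllLines(Paths.get(path))) {
            line = line.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            int eq = line.indexOf('=');
            if (eq < 0) {
                continue;
            }
            String name = line.substring(0, eq).trim();
            String value = line.substring(eq + 1).trim();
            // expand $Variable references defined earlier in the file
            for (Map.Entry<String, String> entry : vars.entrySet()) {
                value = value.replace("$" + entry.getKey(), entry.getValue());
            }
            vars.put(name, value);
        }
        return vars;
    }
}

With the Dbpedia example above, parse("dbpedia.ont") would map "Acquisition"
to "http://dbpedia.org/page/Acquisition".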

Rule files have the .rl extension and contain the following structure:

{
type=$Acquisition,
trigger=("acquire", Verb),
trigger=("buy", Verb),
agent=($Organisation, NominalSubject, true),
patient=($Organisation, DirectObject, true)
},
{
type=$Financing,
trigger=("raise", Verb),
agent=($Organisation || $Person, NominalSubject, true),
patient=($Money, DirectObject, true),
source=($Organisation || $Person, nmod:from, false)
}

Each event definition is enclosed in curly braces and contains:
* type = points to a variable defined in the .ont files and identifies the
entity type of the event
* trigger = an element that can trigger the event. We need to define the
actual literal and the POS LexicalCategory that the literal has. There can
be multiple triggers.
* thematic relation elements (agent, patient, source, etc.) = the other
elements that make up an event. For each one you must specify the entity
types allowed for the element (defined as variables in the .ont files), the
GrammaticalRelation of the element and whether the element's presence is
mandatory to classify the event correctly. A sketch of a possible in-memory
representation follows below.
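
To make the mapping concrete, here is one possible in-memory representation
of a parsed rule (class and field names are only a sketch of what I have in
mind, not existing code):

import java.util.List;

public class EventRule {

    /** Ontology variable identifying the event type, e.g. $Acquisition. */
    private String type;

    /** Literals plus lexical categories that can trigger the event. */
    private List<Trigger> triggers;

    /** Thematic relation elements such as agent, patient, source. */
    private List<ThematicElement> elements;

    public static class Trigger {
        String literal;          // e.g. "acquire"
        String lexicalCategory;  // e.g. Verb
    }

    public static class ThematicElement {
        String role;                 // agent, patient, source, ...
        List<String> entityTypes;    // e.g. $Organisation, $Person
        String grammaticalRelation;  // e.g. NominalSubject, DirectObject, nmod:from
        boolean mandatory;           // must be present for the event to match
    }
}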

There can be multiple .rl and .ont files for easier management.

All the .rl files will be zipped into one archive and fed to the
EventExtractionEngine via the OSGi console; the same goes for the .ont
files. The EventExtractionEngine will then keep the rules in memory and
perform event extraction per sentence: it first looks for a trigger and,
once one is identified, tries to extract the other thematic relation
elements.
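
As a rough illustration of that processing loop (assuming the EventRule
sketch above plus the usual getters; Sentence, Token and Event are
hypothetical helper types backed by the Stanbol NLP processing results):

import java.util.ArrayList;
import java.util.List;

public class EventExtractionEngine {

    private List<EventRule> rules; // loaded from the zipped .rl files

    public List<Event> extract(Sentence sentence) {
        List<Event> events = new ArrayList<>();
        for (EventRule rule : rules) {
            // 1. look for a trigger literal with the right lexical category
            Token trigger = findTrigger(sentence, rule.getTriggers());
            if (trigger == null) {
                continue;
            }
            // 2. resolve the thematic relation elements around the trigger
            Event event = new Event(rule.getType(), trigger);
            boolean complete = true;
            for (EventRule.ThematicElement element : rule.getElements()) {
                Token filler = findByGrammaticalRelation(sentence, trigger, element);
                if (filler != null) {
                    event.addRole(element.getRole(), filler);
                } else if (element.isMandatory()) {
                    complete = false; // a mandatory element is missing
                    break;
                }
            }
            if (complete) {
                events.add(event);
            }
        }
        return events;
    }

    private Token findTrigger(Sentence sentence, List<EventRule.Trigger> triggers) {
        // TODO: match each trigger literal + LexicalCategory against the tokens
        return null;
    }

    private Token findByGrammaticalRelation(Sentence sentence, Token trigger,
            EventRule.ThematicElement element) {
        // TODO: follow the dependency relation from the trigger (via the
        // Stanford NLP dependency parse) and check that the target token's
        // entity type is one of element.entityTypes
        return null;
    }
}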

What do you think?

Cristian


On Thu, Nov 19, 2015 at 6:59 AM, Dileepa Jayakody <dileepajayakody@gmail.com
> wrote:

> Hi Cristian,
>
> Great stuff!
> I will look into Stanford NLP project to see how we can do that.
>
> Regards,
> Dileepa
>
> On Thu, Nov 19, 2015 at 2:06 AM, Cristian Petroaca <
> cristian.petroaca@gmail.com> wrote:
>
> > I created a git repository which contains the event extraction engine
> here
> > https://github.com/cpetroaca/stanbol-event-extraction-engine. I've
> started
> > working on an event rule schema that will also incorporate a generic
> > ontology definition schema so that one can say that #Person=
> > http://dbpedia.org/Person and then use #Person in the rules. I think
> > that, because Stanbol has access to a dbpedia or yago index, this will
> > be of great value when we want to define events with specific object
> > classes.
> >
> > Dileepa, if you still want to get involved, you can take a look at the
> > Stanbol Stanford NLP project here
> > https://github.com/westei/stanbol-stanfordnlp and figure out how to add
> > Collapsed Dependencies(
> > http://nlp.stanford.edu/software/dependencies_manual.pdf)  to it. We'll
> > need them to sort out the subject, verb and objects.
> >
> > Thanks,
> > Cristian
> >
> > On Mon, Oct 12, 2015 at 3:31 PM, Cristian Petroaca <
> > cristian.petroaca@gmail.com> wrote:
> >
> > > Can we get a separate branch where we can start developing the Event
> > > Extraction engine?
> > >
> > > Thanks
> > >
> > > On Sun, Sep 20, 2015 at 4:26 PM, Cristian Petroaca <
> > > cristian.petroaca@gmail.com> wrote:
> > >
> > >> Sorry, hit send before finishing the mail :).
> > >>
> > >> So, you will disambiguate it using wordnet like this :
> > >>
> > >>
> >
> http://wordnetweb.princeton.edu/perl/webwn?s=attack&sub=Search+WordNet&o2=&o0=1&o8=1&o1=1&o7=&o5=&o9=&o6=&o3=&o4=&h=000000
> > >>
> > >> And then you would have a rule file which would contain something
> like :
> > >> event name= "attack"
> > >> event trigger= wordnet class of type = wordnet id && pos=verb
> > >> agent=dependency_type:nsubj&&entity_type=Person||Location
> > >> patient=dependency_type:dobj&&entity_type=Person||Location
> > >>
> > >> The dependency type points to the Stanford NLP dependency tree relation
> > >> types described here:
> > >> http://nlp.stanford.edu/software/stanford-dependencies.shtml
> > >> The entity_type points to either the NER class or the wordnet class
> for
> > >> the noun in the noun phrase.
> > >>
> > >> This approach was inspired by this paper :
> > >> http://www.surdeanu.info/mihai/papers/acl2015.pdf with the difference
> > >> that I'm using WSD to disambiguate the event trigger.
> > >>
> > >> I'll start doing some experiments with this approach.
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >> On Sun, Sep 20, 2015 at 4:14 PM, Cristian Petroaca <
> > >> cristian.petroaca@gmail.com> wrote:
> > >>
> > >>> Hi Dileepa,
> > >>>
> > >>> I've been thinking more about the approach using a Word Sense
> > >>> Disambiguation tool to classify the verb in the sentence and I think
> > it may
> > >>> be a good approach. The verb seems to be the event trigger and once
> you
> > >>> know its actual meaning (by applying a Wordnet class or some other DB
> > used
> > >>> for WSD) then I think it's quite straightforward to identify the
> > actors in
> > >>> the event (agent, patient, instrument, etc) by applying some user
> > defined
> > >>> rules for that verb class.
> > >>>
> > >>> For example if you have the verb "attack" which can have multiple
> > >>> meanings depending on the context you will disambiguate it using
> > wordnet
> > >>> like this:
> > >>>
> > >>> On Wed, Sep 9, 2015 at 8:33 PM, Dileepa Jayakody <
> > >>> dileepajayakody@gmail.com> wrote:
> > >>>
> > >>>> Hi Cristian,
> > >>>>
> > >>>> Interesting ideas. Let me do some background reading on this, so I
> can
> > >>>> also
> > >>>> participate in the discussion better.
> > >>>>
> > >>>> Thanks,
> > >>>> Dileepa
> > >>>>
> > >>>> On Wed, Sep 9, 2015 at 3:17 PM, Cristian Petroaca <
> > >>>> cristian.petroaca@gmail.com> wrote:
> > >>>>
> > >>>> > Another approach to this would be to use a semantic role labeling
> > >>>> tool [1]
> > >>>> > to determine the type of relation between the subject and object.
> > >>>> >
> > >>>> > Or we could use Word Sense Disambiguation to determine the wordnet
> > >>>> class of
> > >>>> > the verb (this way we have a standard relation definition) and
> based
> > >>>> on
> > >>>> > what relation type it is we can search for the subject and object
> > >>>> using
> > >>>> > dependency tree parsing in Stanford NLP.
> > >>>> >
> > >>>> > These 2 options ensure that we can have a much bigger recall but
> I'm
> > >>>> not
> > >>>> > sure about the precision...
> > >>>> >
> > >>>> > So I think we'll need to first settle on the method of
> implementing
> > >>>> this
> > >>>> > engine before starting anything.
> > >>>> >
> > >>>> > [1] http://cogcomp.cs.illinois.edu/page/demo_view/srl
> > >>>> >
> > >>>> > On Tue, Sep 8, 2015 at 11:45 AM, Cristian Petroaca <
> > >>>> > cristian.petroaca@gmail.com> wrote:
> > >>>> >
> > >>>> > > Hi Dileepa,
> > >>>> > >
> > >>>> > > Unfortunately I did not have the time to work on this at all so
> > >>>> there is
> > >>>> > > no code base. But I'd be happy to start contributing with
> > >>>> something to
> > >>>> > > this engine and I think it would also be very helpful if you
> will
> > >>>> be able
> > >>>> > > to contribute to this as well.
> > >>>> > > I did get a chance to test the Stanford relation extractor which
> > >>>> works
> > >>>> > > fine but it's quite limited to a handful of relation types
> > (live_in,
> > >>>> > > located_in, org_based_in, work_for). So we would need to train
> > other
> > >>>> > models
> > >>>> > > if we want to increase the relation type number.
> > >>>> > > I also think that the Event Extraction Engine should work in
> > >>>> conjunction
> > >>>> > > with any coreference and comention engines we have to increase
> the
> > >>>> > relation
> > >>>> > > count.
> > >>>> > >
> > >>>> > > Regards,
> > >>>> > > Cristian
> > >>>> > >
> > >>>> > > On Tue, Sep 8, 2015 at 11:19 AM, Dileepa Jayakody <
> > >>>> > > dileepajayakody@gmail.com> wrote:
> > >>>> > >
> > >>>> > >> Hi Cristian and all,
> > >>>> > >>
> > >>>> > >> Can I please know the status of this event extraction engine?
> > Event
> > >>>> > >> extraction is a really useful feature for semantic enhancements
> > >>>> and I am
> > >>>> > >> interested in collaborating with this work.
> > >>>> > >> Is there any code base you are currently working on for this
> > engine
> > >>>> > work?
> > >>>> > >>
> > >>>> > >> Thanks,
> > >>>> > >> Dileepa
> > >>>> > >>
> > >>>> > >> On Tue, Feb 17, 2015 at 9:10 PM, Cristian Petroaca <
> > >>>> > >> cristian.petroaca@gmail.com> wrote:
> > >>>> > >>
> > >>>> > >> > Hi Edi,
> > >>>> > >> >
> > >>>> > >> > Thanks for the info. Stanford Relation Extractor sounds very
> > >>>> > >> interesting.
> > >>>> > >> > I'll give it a try.
> > >>>> > >> >
> > >>>> > >> > 2015-02-17 17:00 GMT+02:00 Edi Bice
> <edi_bice@yahoo.com.invalid
> > >>>> >:
> > >>>> > >> >
> > >>>> > >> > > Hi Cristian,
> > >>>> > >> > > Here are a few more resources on Semantic Role/Relationship
> > >>>> > Labeling:
> > >>>> > >> > > 1. FrameNet, VerbNet and WordNet on the data side
> > >>>> > >> > > 2. Shalmaneser, SEMAFOR and Stanford Relation Extractor on
> > >>>> > >> > > the code side
> > >>>> > >> > > The last one links to a great paper which I believe holds
> > great
> > >>>> > >> potential
> > >>>> > >> > > for Stanbol:
> > >>>> > >> > > A Linear Programming Formulation for Global Inference in
> > >>>> Natural
> > >>>> > >> Language
> > >>>> > >> > > Tasks
> > >>>> > >> > >
> > >>>> > >> > >
> > >>>> > >> > >
> > >>>> > >> > >
> > >>>> > >> > > Edi
> > >>>> > >> > >       From: Cristian Petroaca <cristian.petroaca@gmail.com
> >
> > >>>> > >> > >  To: dev@stanbol.apache.org
> > >>>> > >> > >  Sent: Sunday, February 15, 2015 6:34 AM
> > >>>> > >> > >  Subject: Event Extraction Engine
> > >>>> > >> > >
> > >>>> > >> > > Hi All,
> > >>>> > >> > >
> > >>>> > >> > > Quite a while ago I started a discussion on this list about
> > >>>> Event
> > >>>> > >> > > Extraction from text. See
> > >>>> > >> > > https://issues.apache.org/jira/browse/STANBOL-1121
> > >>>> > >> > > .
> > >>>> > >> > >
> > >>>> > >> > > I'd like to get started on the actual work and I have been
> > >>>> thinking
> > >>>> > >> how
> > >>>> > >> > to
> > >>>> > >> > > best approach this and there are some things that I would
> do
> > >>>> > >> differently
> > >>>> > >> > > than what the JIRA describes. I'd like to get your feedback
> on
> > >>>> it.
> > >>>> > >> > >
> > >>>> > >> > > Basically the main approach would be:
> > >>>> > >> > >
> > >>>> > >> > > 1. Detect all NERs and their co-references.
> > >>>> > >> > >
> > >>>> > >> > > 2. Apply semantic role labeling on the sentences where the
> > >>>> above
> > >>>> > >> > mentioned
> > >>>> > >> > > NERs reside.
> > >>>> > >> > > I found some interesting Semantic Role labeling libraries
> > such
> > >>>> as
> > >>>> > >> > > https://code.google.com/p/mate-tools/ or
> > >>>> > >> > > http://cogcomp.cs.illinois.edu/page/software_view/SRL.
> > >>>> > >> > > With this I'll be able to detect the Agent, the Verb
> (action)
> > >>>> and
> > >>>> > the
> > >>>> > >> > > Patient and Instruments.
> > >>>> > >> > >
> > >>>> > >> > > This could be a minimal implementation of the engine. After
> > >>>> that I
> > >>>> > can
> > >>>> > >> > > simply create the event data model as described in the JIRA
> > and
> > >>>> > >> annotate
> > >>>> > >> > > the text.
> > >>>> > >> > > But this does not actually detect what kind of event it is
> or
> > >>>> what
> > >>>> > are
> > >>>> > >> > the
> > >>>> > >> > > event specific roles that the entities have in the
> relation.
> > >>>> > >> > >
> > >>>> > >> > > For example we can have the sentence "Google buys Yahoo for
> > >>>> $100
> > >>>> > >> > million".
> > >>>> > >> > > There is a lot more to be said about this sentence than
> > >>>> simply that
> > >>>> > >> > > "Google" is the agent and "Yahoo" is the Patient. This is
> > >>>> actually
> > >>>> > an
> > >>>> > >> > > acquisition event and "Google" is the buyer and "Yahoo" the
> > >>>> bought
> > >>>> > >> > entity.
> > >>>> > >> > > We would also need to align synonym phrases such as "buy"
> > >>>> > >> > > or "acquire" to a common ontology so that we know that both
> > >>>> > >> > > refer to the same Acquisition event.
> > >>>> > >> > >
> > >>>> > >> > > Having said that, we would add a new step :
> > >>>> > >> > > 3. Try to detect event type and event details.
> > >>>> > >> > >
> > >>>> > >> > > This can be done by either:
> > >>>> > >> > >
> > >>>> > >> > > 3.1 Rule based: hand-written rules which would map a
> > >>>> > >> > > certain sentence structure, such as the name of the verb and
> > >>>> > >> > > the types of the agent and patient entities, to a certain
> > >>>> > >> > > event type.
> > >>>> > >> > > This has the benefit of being easy to build but quite
> > >>>> inflexible.
> > >>>> > >> > >
> > >>>> > >> > > 3.2 Statistical based: train a model which would be able to
> > >>>> classify
> > >>>> > >> an
> > >>>> > >> > > event type based on the features of the sentence such as
> verb
> > >>>> type,
> > >>>> > >> > entity
> > >>>> > >> > > type, role type, etc.. This is the approach described here
> :
> > >>>> > >> > > http://web.stanford.edu/~jurafsky/mintz.pdf.
> > >>>> > >> > > This would be quite hard to build but quite flexible.
> > >>>> > >> > >
> > >>>> > >> > > This 3rd step of detecting event types & details I think
> > would
> > >>>> be
> > >>>> > most
> > >>>> > >> > > efficient for domain specific events. We would have configs
> > >>>> with
> > >>>> > >> several
> > >>>> > >> > > models for several domains available and the user could
> > >>>> > >> > > either use one of the pre-existing models or create a new
> > >>>> > >> > > one.
> > >>>> > >> > >
> > >>>> > >> > > I don't have any practical experience with training models
> or
> > >>>> text
> > >>>> > >> > > classification based on features (but I've been doing a lot
> > of
> > >>>> > >> reading on
> > >>>> > >> > > it) so I'm not sure exactly how feasible what I said at
> point
> > >>>> no 3
> > >>>> > >> > actually
> > >>>> > >> > > is.
> > >>>> > >> > >
> > >>>> > >> > > Regards,
> > >>>> > >> > > Cristian
> > >>>> > >> > >
> > >>>> > >> > >
> > >>>> > >> > >
> > >>>> > >> > >
> > >>>> > >> >
> > >>>> > >>
> > >>>> > >
> > >>>> > >
> > >>>> >
> > >>>>
> > >>>
> > >>>
> > >>
> > >
> >
>