
Posted to user@uima.apache.org by Darren Cruse <da...@gmail.com> on 2010/12/23 18:22:25 UTC

UIMA for extracting book "entities" from tables of contents, etc. as RDF?

Hi guys, I apologize for the newbie question, but I'm quite new to UIMA and
the whole area of information extraction/entity extraction.  I'm hoping
someone can tell me whether UIMA is a proper tool for a project that I've
been working on (with other tools) and having trouble with.


Basically the task is to extract metadata from HTML in the form of RDF,
where the HTML represents books/articles/papers/etc. that typically have an
"outline" or "table of contents", and part of the task involves extracting
the entities "behind" (so to speak) the table of contents.


So e.g. if the "corpus" of html pages are from a book, and the book has
Volume 1 and Volume 2, Volume 1 has Chapters 1-18, Chapter 1 has 6 Sections,
Section 1 has three Parts, etc.  Then my resulting RDF has to model these
things (entities/classes/whatever you'd call them) and understand the
"hierarchy" of what contains what.

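The containment hierarchy described above could be written out as RDF along
these lines.  This is only a minimal sketch in plain Python: the
example.org URIs and the nested-tuple input format are hypothetical, and
using rdf:type plus dcterms:hasPart is just one plausible vocabulary choice,
not something the project prescribes.

```python
# Hypothetical namespaces for illustration; any real project would pick its own.
BASE = "http://example.org/book/"
VOCAB = "http://example.org/vocab#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
HAS_PART = "http://purl.org/dc/terms/hasPart"


def triples(node, parent_uri=None):
    """Yield N-Triples lines for a nested (kind, number, children) tuple."""
    kind, number, children = node
    prefix = BASE if parent_uri is None else parent_uri + "/"
    uri = f"{prefix}{kind.lower()}{number}"
    # One triple stating what kind of thing this is (Volume, Chapter, ...).
    yield f"<{uri}> <{RDF_TYPE}> <{VOCAB}{kind}> ."
    # One triple linking the parent to this node ("what contains what").
    if parent_uri is not None:
        yield f"<{parent_uri}> <{HAS_PART}> <{uri}> ."
    for child in children:
        yield from triples(child, uri)


book = ("Volume", 1, [("Chapter", 1, [("Section", 1, []), ("Section", 2, [])])])
for line in triples(book):
    print(line)
```

Each entity gets one type triple, and every non-root entity gets one
containment triple from its parent.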

The really challenging part is that it's a pretty large volume of material
with many different books/articles/papers/etc., and there is a lot of
variability, as each was authored by different people not following any
particular template.


For example, what I called a "table of contents" is rarely a single page;
more often it's exploded across multiple "outline" pages where e.g. a high
level table of contents page goes to the level of chapter links, and then
each chapter may have its own "outline" breaking down the sections within
that chapter.  Or it might not; different books can differ.  For example, the
pages making up the chapter may just have headings referring to the
titles/names of the sections without being organized into a chapter
"outline" at all.  Yet I'm still responsible for identifying what the
sections are.


Somewhat helpful is that headings often indicate the kind of thing they are,
e.g. "Section 3:  The Life of the Spleen, Wrap-Up".  Not always though, e.g.
I may only get the "The Life of the Spleen, Wrap-Up" part (without "Section
3:" on the front).
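
A heading that does carry its kind on the front could be picked apart roughly
as follows.  This is a sketch in plain Python, not UIMA's RegexAnnotator
configuration; the list of kinds and the function name classify_heading are
assumptions made for illustration.

```python
import re

# Assumed set of heading kinds; extend as the corpus requires.
HEADING = re.compile(
    r"^\s*(Volume|Chapter|Section|Part)\s+(\w+)\s*:\s*(.+?)\s*$",
    re.IGNORECASE,
)


def classify_heading(text):
    """Return (kind, number, title), or None when no kind prefix is present."""
    match = HEADING.match(text)
    if match is None:
        return None
    kind, number, title = match.groups()
    return kind.capitalize(), number, title


print(classify_heading("Section 3:  The Life of the Spleen, Wrap-Up"))
print(classify_heading("The Life of the Spleen, Wrap-Up"))
```

Headings without the prefix come back as None, so they would still need to be
resolved some other way, for instance by matching the bare title against
headings already classified.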


Or I may get both forms in different places in the book, where ideally I
should relate the two references as being the same thing.


Different places can also refer to the same thing with other differences.
Possibly the case of the letters differs, or in this example there could be
one heading with "Wrap-Up" and another with "Wrap Up" (one with the dash, the
other without).
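
One way to relate such variant references is to normalize both headings
before comparing them.  The sketch below is plain Python using the standard
difflib module; the normalization rules and the 0.9 similarity threshold are
assumptions chosen for illustration, not tuned values.

```python
import re
from difflib import SequenceMatcher


def normalize(title):
    """Reduce a heading to a canonical form for comparison."""
    # Drop a leading "Section 3:" style prefix if there is one.
    title = re.sub(
        r"^\s*(volume|chapter|section|part)\s+\w+\s*:\s*",
        "",
        title,
        flags=re.IGNORECASE,
    )
    title = title.replace("-", " ")        # "Wrap-Up" -> "Wrap Up"
    title = re.sub(r"[^\w\s]", "", title)  # strip remaining punctuation
    return " ".join(title.lower().split())  # fold case, collapse whitespace


def same_heading(a, b, threshold=0.9):
    """True when two headings match exactly or nearly after normalization."""
    na, nb = normalize(a), normalize(b)
    return na == nb or SequenceMatcher(None, na, nb).ratio() >= threshold


print(same_heading("Section 3:  The Life of the Spleen, Wrap-Up",
                   "the life of the spleen, wrap up"))
```

Exact equality after normalization covers the case and dash differences
mentioned above; the similarity ratio is a fallback for small typos.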


As far as understanding the relationships between things (i.e. that Chapter 3
contains Sections 1 through 3 and Section 1 contains two "Parts"): where the
things do appear in a "table of contents" or "outline" page, it seems like
the arrangement/formatting of those pages gives the clue as to "what contains
what".  i.e. things "contained" typically follow what they're contained by,
and are often indented (but not necessarily; it can just be that the
"parent" is bolded, and they might not be indented beneath their "parent").

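For the indented case, the "what contains what" inference could be sketched
with a simple stack, as below.  This is plain Python and assumes the outline
has already been reduced to text lines whose leading spaces reflect the
indentation; the bold-only case is not handled, and build_outline is a
hypothetical name.

```python
def build_outline(lines):
    """Return (title, parent_title) pairs inferred from indentation.

    The parent of a line is the nearest earlier line with less indentation.
    """
    stack = []   # (indent, title) for the currently open ancestors
    result = []
    for raw in lines:
        if not raw.strip():
            continue  # skip blank lines
        indent = len(raw) - len(raw.lstrip(" "))
        title = raw.strip()
        # Close every entry that is at the same depth or deeper.
        while stack and stack[-1][0] >= indent:
            stack.pop()
        parent = stack[-1][1] if stack else None
        result.append((title, parent))
        stack.append((indent, title))
    return result


toc = ["Chapter 3", "  Section 1", "    Part A", "    Part B", "  Section 2"]
for title, parent in build_outline(toc):
    print(title, "<-", parent)
```

The same stack idea would work with any other depth signal (heading level,
font weight) substituted for the indent width.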


I apologize for the long-winded description, but hopefully it will help to
clarify my questions, since I'm new to UIMA:


a.  Does it sound like a "UIMA kind of problem"? :)

i.e. These "things" I'm trying to understand like
Volume/Chapter/Section/etc. - would you call those "entities" in the way
I've heard the term "entity extraction"?


b.  And I gave so much detail so I could also ask:  Does this sound like a
straightforward use for UIMA, or does it sound like a *difficult* use for
UIMA?


c.  Regarding b, I can imagine giving UIMA regular expressions to look
for "Chapter (.*): (.*)" kind of stuff, or giving it lists ahead of time,
like the chapters I know the book has (this is the idea of a "Gazetteer",
yes?), but I'm unclear:  does UIMA also address this thing where I'm trying
to understand "what *contains* what"?


d.  i.e. Does UIMA support the need to look at the relationship between
things, e.g. "does this heading follow another heading, and was that other
heading identified as a 'Section', and is this heading indented further to
the right than that one, so I guess this must be a 'Part' within that
'Section'?"  Does UIMA support that kind of thing?  If so, does that have a
name I can search on? :)


e.  Regarding the slight inconsistencies in how things are referenced
(the case being different, a dash being omitted, etc.), I think I've heard
the phrase "fuzzy matching".  I'm guessing that's part of what UIMA
provides?


Thanks for any tips.  I apologize for such a long question; I'd been looking
at the UIMA docs, but I was new enough that I decided I needed to appeal to
those of you with greater experience. :)


(Is there any "Text Extraction for Dummies" kind of introduction
anybody would recommend for a newbie, btw?)


Thanks again,


Darren

Re: UIMA for extracting book "entities" from tables of contents, etc. as RDF?

Posted by Tommaso Teofili <to...@gmail.com>.

Hi Ted,
thanks for your comments!
Regarding the differences between DictionaryAnnotator and ConceptMapper, there
is a previous thread that should help in understanding that comparison [1].

2010/12/27 Ted Pedersen <tp...@d.umn.edu>

> Anyway, assuming that I specify entities using both Regular
> Expressions and Dictionary entries, is there a preferred way to use
> and/or combine the above (or anything else?) The goal at this point is
> simply to identify those entities in text for later downstream
> processing.
>

You probably have to put the "dictionary" analysis engine (be it DA or CM) in
the pipeline along with the RegularExpression Annotator and then combine the
generated annotations inside a third custom annotator or via the
Configurable Feature Extractor.
Note that you can also build named entity recognition blocks using OpenNLP
(see, for example, [2]), with existing models or by creating your own.
Hope this helps.
Cheers,
Tommaso

[1] : http://markmail.org/thread/oyhct2lh4uj2ow2h
[2] : http://sourceforge.net/apps/mediawiki/opennlp/index.php?title=Name_Finder
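
The "combine the generated annotations" step could look roughly like the
sketch below.  It is plain Python rather than UIMA API: annotations are
modelled as hypothetical (begin, end, label) tuples, and preferring the
longest span on overlap is just one possible policy.  In a real pipeline the
same logic would live in the custom annotator, reading both annotation types
from the CAS.

```python
def combine(regex_spans, dict_spans):
    """Merge two lists of (begin, end, label) spans.

    When spans overlap, the longest one wins; ties go to the earlier span.
    """
    candidates = sorted(
        regex_spans + dict_spans,
        key=lambda span: (-(span[1] - span[0]), span[0]),
    )
    kept = []
    for begin, end, label in candidates:
        # Keep the span only if it does not overlap anything already kept.
        if all(end <= b or begin >= e for b, e, _ in kept):
            kept.append((begin, end, label))
    return sorted(kept)


regex_spans = [(0, 9, "Chapter")]
dict_spans = [(0, 7, "Term"), (20, 26, "Term")]
print(combine(regex_spans, dict_spans))
```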




Re: UIMA for extracting book "entities" from tables of contents, etc. as RDF?

Posted by Ted Pedersen <tp...@d.umn.edu>.

BTW, one potential consideration in this is that in addition to
providing a dictionary of terms (as Dictionary Annotator and Concept
Mapper seem to provide), I'm also interested in providing regular
expressions that can be matched in my text. So I will have entities
that I want to identify that might occur in a dictionary, or might be
defined by a regular expression. I guess this must be pretty common,
but I'm wondering if either Dictionary Annotator or Concept Mapper
integrate better with Regular Expression Annotator?

In case I'm not being clear about what I'm referring to...

Regular Expression Annotator
http://uima.apache.org/downloads/sandbox/RegexAnnotatorUserGuide/RegexAnnotatorUserGuide.html#sandbox.regexAnnotator.conceptsFile.concepts

Dictionary Annotator
http://uima.apache.org/downloads/sandbox/DictionaryAnnotatorUserGuide/DictionaryAnnotatorUserGuide.html

Concept Mapper
http://uima.apache.org/downloads/sandbox/ConceptMapperAnnotatorUserGuide/ConceptMapperAnnotatorUserGuide.html

Anyway, assuming that I specify entities using both Regular
Expressions and Dictionary entries, is there a preferred way to use
and/or combine the above (or anything else?) The goal at this point is
simply to identify those entities in text for later downstream
processing.

Thanks!
Ted

On Mon, Dec 27, 2010 at 9:59 AM, Ted Pedersen <tp...@d.umn.edu> wrote:
> Thanks to Tommaso for a very interesting posting, and to Darren for
> the question that generated it.
>
> As a kind of follow-on question to one of the suggestions made by Tommaso....
>
> I'm particularly interested in the functionality provided by Concept
> Mapper, or maybe Dictionary Annotator (that is having the ability to
> create a dictionary and then be able to recognize when a dictionary
> term occurs in my text). From reading over the documentation it seems
> like Concept Mapper and Dictionary Annotator are fairly similar. To be
> honest I don't know much about UIMA, but am trying to learn, so there
> might be some subtleties here I don't see (that would make one want to
> prefer one of these over the other).
>
> Is there a short summary of the differences between Concept Mapper and
> Dictionary Annotator, and does anyone have any strong feelings about
> when you should use one over the other?
>
> Cordially,
> Ted
>


-- 
Ted Pedersen
http://www.d.umn.edu/~tpederse

Re: UIMA for extracting book "entities" from tables of contents, etc. as RDF?

Posted by Ted Pedersen <tp...@d.umn.edu>.

Thanks to Tommaso for a very interesting posting, and to Darren for
the question that generated it.

As a kind of follow-on question to one of the suggestions made by Tommaso....

I'm particularly interested in the functionality provided by Concept
Mapper, or maybe Dictionary Annotator (that is having the ability to
create a dictionary and then be able to recognize when a dictionary
term occurs in my text). From reading over the documentation it seems
like Concept Mapper and Dictionary Annotator are fairly similar. To be
honest I don't know much about UIMA, but am trying to learn, so there
might be some subtleties here I don't see (that would make one want to
prefer one of these over the other).

Is there a short summary of the differences between Concept Mapper and
Dictionary Annotator, and does anyone have any strong feelings about
when you should use one over the other?

Cordially,
Ted

On Mon, Dec 27, 2010 at 2:45 AM, Tommaso Teofili
<to...@gmail.com> wrote:
> Hi Darren,
>
> 2010/12/23 Darren Cruse <da...@gmail.com>
>
>> Hi guys I apologize for a newbie question but I'm quite new to UIMA and the
>> whole area of information extraction/entity extraction.  And I'm hoping
>> someone can tell me if UIMA is a proper tool for a project that I've been
>> working on (with other tools) that I've been having trouble with.
>>
>>
>> Basically the task is to extract meta data from html in the form of RDF.
>>  Where the html represents books/articles/papers/etc. that typically have
>> an
>> "outline" or "table of contents", and part of the task involves extracting
>> the entities "behind" (so to speak) the table of contents.
>>
>
> this is perfectly aligned with UIMA's scope, as it deals with discovering
> hidden knowledge
>
>
>>
>>
>> So e.g. if the "corpus" of html pages are from a book, and the book has
>> Volume 1 and Volume 2, Volume 1 has Chapters 1-18, Chapter 1 has 6
>> Sections,
>> Section 1 has three Parts, etc.  Then my resulting RDF has to model these
>> things (entities/classes/whatever you'd call them) and understand the
>> "hierarchy" of what contains what.
>>
>>
>> The real challenging part is that it's a pretty large volume of material
>> with many different books/articles/papers/etc.  And there is a lot of
>> variability, as each were authored by different people not following any
>> particular template.
>>
>
> On the "large volume of material" topic I think that UIMA-AS [1] can help
> you as you need to scale.
>
>
>>
>>
>> For example what I called a "table of contents" is rarely a single page but
>> more often it's exploded across multiple "outline" pages where e.g. a high
>> level table of contents page goes to the level of chapter links.  And then
>> each chapter may have its own "outline" breaking down the sections within
>> that chapter.  Or it might not, different books can differ.  For example
>> the
>> pages making up the chapter may just have headings referring to the
>> titles/names of the sections without being organized into a chapter
>> "outline" at all.  Yet I'm still responsible for identifying what the
>> sections are.
>>
>>
>> Somewhat helpful is that headings often indicate the kind of thing they
>> are,
>> e.g. "Section 3:  The Life of the Spleen, Wrap-Up".  Not always though,
>> e.g.
>> I may only get the "The Life of the Spleen, Wrap-Up" part (without "Section
>> 3:" on the front).
>>
>>
>> Or I may get both forms in different places in the book, where ideally I
>> should relate the two references as being the same thing.
>>
>>
>> And where different places can refer to the same thing with other
>> differences too.  Possibly the case of the letters differ, or in this
>> example there could be one heading with "Wrap-Up" and another with  "Wrap
>> Up" (one with the dash the other without the dash).
>>
>>
>> As far as understanding the relationships between things i.e. that Chapter
>> 3
>> contains Sections 1 through 3 and Section 1 contains two "Parts", where the
>> things do appear in a "table of contents" or "outline" page, it seems like
>> the arrangement/formatting of those pages give the clue as to "what
>> contains
>> what".  i.e. Things "contained" typically follow what they're contained by,
>> and are often indented (but not necessarily, it can just be that the
>> "parent" is bolded, yet they might not be indented beneath their "parent").
>>
>>
>>
>> Apologize for the long winded description but hopefully it will help to
>> clarify my question since I'm new to UIMA:
>>
>>
>> a.  Does it sound like a "UIMA kind of problem"? :)
>>
>
> I recently worked on a similar use case and yes, I think this sounds like a
> UIMA kind of problem.
> My very abstract advice is to use a bottom-up approach, that is, recognize
> words, then sentences, then sections at first; after that you can "play"
> with sections and understand relationships with chapters and so on.
>
>
>>
>> i.e. These "things" I'm trying to understand like
>> Volume/Chapter/Section/etc. - would you call those "entities" in the way
>> I've heard the term "entity extraction"?
>>
>>
>> b.  And I gave so much detail so I could also ask:  Does this sound like a
>> straightforward use for UIMA, or does it sound like a *difficult* use for
>> UIMA?
>>
>
> it sounds to me like a straightforward use of UIMA, but this doesn't mean
> it'll be that easy :)
>
>
>>
>>
>> c.  Regarding b, I can imagine me giving UIMA regular expressions to look
>> for "Chapter (.*): (.*)" kind of stuff, or giving it lists ahead of time
>> like of the chapters I know the book has (this is the idea of a "Gazetteer"
>> yes?), but I'm unclear:  does UIMA also address this thing where I'm trying
>> to understand "what *contains* what"?
>>
>
> I'd recommend regular expressions as the last thing to rely on, as they are
> not so easy to maintain over time and also not so efficient; however, they
> can really help sometimes.
> I'd go through simple NLP phases such as tokenizing and POS tagging, along
> with "Gazetteers" (see DictionaryAnnotator[2] and ConceptMapper[3]), and
> maybe introduce OpenNLP[4] tools to use chunkers.
>
>
>>
>>
>> d.  i.e. Does UIMA support the need to look at the relationship between
>> things e.g. "does this heading follow another heading, and was that other
>> heading identified as a "Section", and is this heading indented further to
>> the right than that one, so I guess this must be a "Part" within that
>> "Section".  Does UIMA support that kind of thing?  If so does that have a
>> name I can search on? :)
>>
>
> What you have to do to support that in UIMA is define some annotator that
> recognize headings creating, for example, HeadingAnnotations and then use,
> for example, the ConfigurableFeatureExtractor[5] to see what follows what
> and those kind of things.
>
>
>
>>
>>
>> e.  When I mentioned the slight inconsistencies in how things are
>> referenced
>> (the case being different, a dash being omitted, etc). I think I've heard
>> the phrase "fuzzy matching".  I'm guessing that's part of what UIMA
>> provides?
>>
>
> "fuzzy matching" is more likely to be part of IR systems (as Lucene/Solr)
> however you can place your own tokenizer to parse text as you need; in UIMA
> you can get the simple tokenizer and place also the stemmer block
> (SnowballAnnotator[6]) in the pipeline to get "matches" only on radix of a
> word.
>
>
>>
>>
>> Thanks for any tips I apologize for such a long question I'd been looking
>> at
>> the UIMA docs but I was new enough I decided I needed to appeal to those of
>> you with greater experience. :)
>>
>
> Finally regarding RDF there is not an RDF CAS consumer in UIMA but it can be
> simply built using Apache Clerezza UIMA Utils module[7]; I'll write a
> separate email about this as soon as possible.
>
> Thanks to you, hope my small hints can help you.
> Cheers,
> Tommaso
>
> [1] : http://uima.apache.org/doc-uimaas-what.html
> [2] : http://uima.apache.org/sandbox.html#dict.annotator
> [3] : http://uima.apache.org/sandbox.html#concept.mapper.annotator
> [4] : http://incubator.apache.org/opennlp/
> [5] :
> http://uima.apache.org/sandbox.html#configurable.feature.extractor.annotator
> [6] : http://uima.apache.org/sandbox.html#snowball.annotator
> [7] :
> http://svn.apache.org/repos/asf/incubator/clerezza/trunk/org.apache.clerezza.parent/org.apache.clerezza.uima/org.apache.clerezza.uima.utils/
>
>
>
>
>
>>
>>
>> (is there any kind of "Text Extraction for Dummies" kind of introduction
>> anybody would recommend for a newbie btw?)
>>
>>
>> Thanks again,
>>
>>
>> Darren
>>
>



-- 
Ted Pedersen
http://www.d.umn.edu/~tpederse

Re: UIMA for extracting book "entities" from tables of contents, etc. as RDF?

Posted by Tommaso Teofili <to...@gmail.com>.

Hi Darren,

2010/12/23 Darren Cruse <da...@gmail.com>

> Hi guys I apologize for a newbie question but I'm quite new to UIMA and the
> whole area of information extraction/entity extraction.  And I'm hoping
> someone can tell me if UIMA is a proper tool for a project that I've been
> working on (with other tools) that I've been having trouble with.
>
>
> Basically the task is to extract meta data from html in the form of RDF.
>  Where the html represents books/articles/papers/etc. that typically have
> an
> "outline" or "table of contents", and part of the task involves extracting
> the entities "behind" (so to speak) the table of contents.
>

This is perfectly aligned with UIMA's scope, as UIMA deals with discovering
knowledge hidden in unstructured content.


>
>
> So e.g. if the "corpus" of html pages are from a book, and the book has
> Volume 1 and Volume 2, Volume 1 has Chapters 1-18, Chapter 1 has 6
> Sections,
> Section 1 has three Parts, etc.  Then my resulting RDF has to model these
> things (entities/classes/whatever you'd call them) and understand the
> "hierarchy" of what contains what.
>
>
> The real challenging part is that it's a pretty large volume of material
> with many different books/articles/papers/etc.  And there is a lot of
> variability, as each were authored by different people not following any
> particular template.
>

On the "large volume of material" topic, I think UIMA-AS [1] can help you
when you need to scale.


>
>
> For example what I called a "table of contents" is rarely a single page but
> more often it's exploded across multiple "outline" pages where e.g. a high
> level table of contents page goes to the level of chapter links.  And then
> each chapter may have it's own "outline" breaking down the sections within
> that chapter.  Or it might not, different books can differ.  For example
> the
> pages making up the chapter may just have headings referring to the
> titles/names of the sections without being organized into a chapter
> "outline" at all.  Yet I'm still responsible for identifying what the
> sections are.
>
>
> Somewhat helpful is that headings often indicate the kind of thing they
> are,
> e.g. "Section 3:  The Life of the Spleen, Wrap-Up".  Not always though,
> e.g.
> I may only get the "The Life of the Spleen, Wrap-Up" part (without "Section
> 3:" on the front).
>
>
> Or I may get both forms in different places in the book, where ideally I
> should relate the two references as being the same thing.
>
>
> And where different places can refer to the same thing with other
> differences too.  Possibly the case of the letters differ, or in this
> example there could be one heading with "Wrap-Up" and another with  "Wrap
> Up" (one with the dash the other without the dash).
>
>
> As far as understanding the relationships between things i.e. that Chapter
> 3
> contains Sections 1 through 3 and Section 1 contains two "Parts", where the
> things do appear in a "table of contents" or "outline" page, it seems like
> the arrangement/formatting of those pages give the clue as to "what
> contains
> what".  i.e. Things "contained" typically follow what they're contained by,
> and are often indented (but not necessarily, it can just be that the
> "parent" is bolded, yet they might not be indented beneath their "parent").
>
>
>
> Apologize for the long winded description but hopefully it will help to
> clarify my question since I'm new to UIMA:
>
>
> a.  Does it sound like a "UIMA kind of problem"? :)
>

I recently worked on a similar use case and yes, I think this sounds like a
UIMA kind of problem.
My very abstract advice is to use a bottom-up approach: first recognize
words, then sentences, then sections; after that you can "play" with the
sections and work out their relationships to chapters and so on.
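
To make the bottom-up idea concrete, here is a minimal sketch in plain Python
(not the UIMA API). The Annotation class and the annotate_* helpers are
hypothetical names made up for illustration; they only mimic the way UIMA
annotations mark character spans, with a second layer built on the first.

```python
import re
from dataclasses import dataclass


@dataclass
class Annotation:
    kind: str   # e.g. "Token" or "Heading"
    begin: int  # character offset where the span starts
    end: int    # character offset where the span ends (exclusive)


def annotate_tokens(text):
    """First layer: one Token annotation per whitespace-separated word."""
    return [Annotation("Token", m.start(), m.end())
            for m in re.finditer(r"\S+", text)]


def annotate_headings(text, tokens):
    """Second layer: a line is a Heading when its first token is a
    structural word (an assumed, hard-coded list)."""
    starters = {"Volume", "Chapter", "Section", "Part"}
    headings = []
    offset = 0
    for line in text.splitlines(keepends=True):
        line_end = offset + len(line.rstrip("\n"))
        # Reuse the token layer instead of re-scanning the raw text.
        first = next((t for t in tokens if offset <= t.begin < line_end), None)
        if first and text[first.begin:first.end] in starters:
            headings.append(Annotation("Heading", first.begin, line_end))
        offset += len(line)
    return headings


text = "Chapter 3: Organs\nSome body text.\nSection 1: The Spleen\n"
tokens = annotate_tokens(text)
headings = annotate_headings(text, tokens)
covered = [text[h.begin:h.end] for h in headings]
```

Each layer only reads the annotations produced by the layer below it, which
is the point of going bottom-up: the heading step never needs to know how
tokens were found.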


>
> i.e. These "things" I'm trying to understand like
> Volume/Chapter/Section/etc. - would you call those "entities" in the way
> I've heard the term "entity extraction"?
>
>
> b.  And I gave so much detail so I could also ask:  Does this sound like a
> straightforward use for UIMA, or does it sound like a *difficult* use for
> UIMA?
>

It sounds to me like a straightforward use of UIMA, but this doesn't mean
it'll be that easy :)


>
>
> c.  Regarding b, I can imagine me giving UIMA regular expressions to look
> for "Chapter (.*): (.*)" kind of stuff, or giving it lists ahead of time
> like of the chapters I know the book has (this is the idea of a "Gazeteer"
> yes?), but I'm unclear:  does UIMA also address this thing where I'm trying
> to understand "what *contains* what"?
>

I'd recommend regular expressions as the last thing to rely on, as they are
not so easy to maintain over time and also not so efficient; however, they
can really help sometimes.
I'd go through simple NLP phases such as tokenizing and POS tagging, along
with "Gazetteers" (see DictionaryAnnotator[2] and ConceptMapper[3]), and
maybe introduce OpenNLP[4] tools to use chunkers.
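
As a rough illustration of "dictionary first, regular expression last", here
is a small plain-Python sketch. It does not use DictionaryAnnotator or
ConceptMapper; KNOWN_TITLES, HEADING_RE and classify_heading are hypothetical
names, and the dictionary entries are invented sample data.

```python
import re

# Hypothetical gazetteer: titles known ahead of time, keyed in lowercase.
KNOWN_TITLES = {
    "the life of the spleen": "Section",
    "introduction to organs": "Chapter",
}

# Fallback pattern for headings that announce their own kind.
HEADING_RE = re.compile(r"^(Volume|Chapter|Section|Part)\s+(\w+):\s*(.+)$")


def classify_heading(line):
    """Return (kind, title) for a heading line, or None if unrecognized."""
    stripped = line.strip()
    # 1. Dictionary lookup: cheap and easy to maintain as a data file.
    kind = KNOWN_TITLES.get(stripped.lower())
    if kind is not None:
        return kind, stripped
    # 2. Regular expression only when the dictionary has no answer.
    match = HEADING_RE.match(stripped)
    if match:
        return match.group(1), match.group(3)
    return None
```

With this ordering, "The Life of the Spleen" is recognized from the list even
without a "Section 3:" prefix, while "Chapter 3: Organs" still falls through
to the pattern.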


>
>
> d.  i.e. Does UIMA support the need to look at the relationship between
> things e.g. "does this heading follow another heading, and was that other
> heading identified as a "Section", and is this heading indented further to
> the right than that one, so I guess this must be a "Part" within that
> "Section".  Does UIMA support that kind of thing?  If so does that have a
> name I can search on? :)
>

What you have to do to support that in UIMA is define an annotator that
recognizes headings, creating, for example, HeadingAnnotations, and then
use, for example, the ConfigurableFeatureExtractor[5] to see what follows
what and that kind of thing.
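
Once headings are annotated, the "what follows what" step can be sketched as
a simple stack walk. This is plain Python, not the
ConfigurableFeatureExtractor; Node, LEVELS and build_hierarchy are
hypothetical names, and the fixed depth per kind is an assumption (an
indentation depth measured from the page could be used instead).

```python
from dataclasses import dataclass, field

# Assumed nesting depth for each kind of heading.
LEVELS = {"Volume": 0, "Chapter": 1, "Section": 2, "Part": 3}


@dataclass
class Node:
    kind: str
    title: str
    children: list = field(default_factory=list)


def build_hierarchy(headings):
    """Nest (kind, title) pairs, given in document order, under the
    nearest preceding heading that is shallower."""
    root = Node("Book", "root")
    stack = [(-1, root)]  # (level, node) pairs for the currently open path
    for kind, title in headings:
        level = LEVELS[kind]
        # Close every open heading that is at the same depth or deeper.
        while stack[-1][0] >= level:
            stack.pop()
        node = Node(kind, title)
        stack[-1][1].children.append(node)
        stack.append((level, node))
    return root
```

For example, the sequence Chapter 3, Section 1, Part A, Part B, Section 2
yields a chapter with two sections, the first of which contains both parts.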



>
>
> e.  When I mentioned the slight inconsistencies in how things are
> referenced
> (the case being different, a dash being omitted, etc). I think I've heard
> the phrase "fuzzy matching".  I'm guessing that's part of what UIMA
> provides?
>

"fuzzy matching" is more likely to be part of IR systems (as Lucene/Solr)
however, you can plug in your own tokenizer to parse text as you need. In
UIMA you can take the simple tokenizer and also place the stemmer block
(SnowballAnnotator[6]) in the pipeline, so that "matches" are made only on
the stem of a word.
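
For the specific inconsistencies mentioned (case, a missing dash, a missing
"Section 3:" prefix), simple normalization before comparing may already go a
long way. The sketch below is plain Python using the standard difflib module;
it is normalization plus a similarity ratio, not stemming and not the
SnowballAnnotator. The function names and the 0.9 threshold are assumptions.

```python
import re
from difflib import SequenceMatcher


def normalize(title):
    """Lowercase, drop a leading 'Section 3:' style prefix, and treat
    dashes as spaces."""
    title = title.strip().lower()
    title = re.sub(r"^(volume|chapter|section|part)\s+\w+:\s*", "", title)
    title = title.replace("-", " ")
    return re.sub(r"\s+", " ", title).strip()


def same_heading(a, b, threshold=0.9):
    """Treat two headings as the same thing when their normalized forms
    are nearly identical (threshold is an arbitrary starting point)."""
    ratio = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
    return ratio >= threshold
```

The threshold leaves room for small typos; it would need tuning against real
headings, since a value that is too low starts merging distinct sections.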


>
>
> Thanks for any tips I apologize for such a long question I'd been looking
> at
> the UIMA docs but I was new enough I decided I needed to appeal to those of
> you with greater experience. :)
>

Finally, regarding RDF: there is no RDF CAS consumer in UIMA, but one can
easily be built using the Apache Clerezza UIMA Utils module[7]; I'll write a
separate email about this as soon as possible.
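
Until then, here is a minimal sketch of what the RDF output for a
"what contains what" hierarchy could look like. It is plain Python that
writes N-Triples by hand and does not use the Clerezza module. The BASE
namespace and the to_ntriples helper are hypothetical; the Dublin Core
"title" and "hasPart" properties are only one possible vocabulary choice.

```python
DCTERMS = "http://purl.org/dc/terms/"
BASE = "http://example.org/book/"  # hypothetical namespace for the entities


def to_ntriples(containment, titles):
    """Serialize a hierarchy as N-Triples.

    containment: list of (parent_id, child_id) pairs
    titles: dict mapping each id to its human-readable title
    """
    lines = []
    for node_id, title in titles.items():
        # Escape backslashes and double quotes inside the literal.
        escaped = title.replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'<{BASE}{node_id}> <{DCTERMS}title> "{escaped}" .')
    for parent, child in containment:
        lines.append(f'<{BASE}{parent}> <{DCTERMS}hasPart> <{BASE}{child}> .')
    return "\n".join(lines)
```

The ids are assumed to be URI-safe strings such as "chapter3-section1"; a
real consumer would also need to mint stable ids across outline pages.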

Thank you; I hope my small hints can help.
Cheers,
Tommaso

[1] : http://uima.apache.org/doc-uimaas-what.html
[2] : http://uima.apache.org/sandbox.html#dict.annotator
[3] : http://uima.apache.org/sandbox.html#concept.mapper.annotator
[4] : http://incubator.apache.org/opennlp/
[5] : http://uima.apache.org/sandbox.html#configurable.feature.extractor.annotator
[6] : http://uima.apache.org/sandbox.html#snowball.annotator
[7] : http://svn.apache.org/repos/asf/incubator/clerezza/trunk/org.apache.clerezza.parent/org.apache.clerezza.uima/org.apache.clerezza.uima.utils/





>
>
> (is there any kind of "Text Extraction for Dummies" kind of introduction
> anybody would recommend for a newbie btw?)
>
>
> Thanks again,
>
>
> Darren
>