Posted to user@uima.apache.org by Mark Ettinger <je...@gmail.com> on 2009/04/07 03:01:01 UTC

New to NLP and navigating the options.

Hello all,

I am a trained mathematician/computer scientist/programmer jumping into NLP and
excited by the challenge but intimidated by the algorithm and software options.
Specifically, I am at the University of Texas and am charged with putting to good
use our large database of (more-or-less unused) clinical notes.  My strategy is
roughly:

1.  Learn the theory of NLP and Information Extraction.
2.  Understand the publicly available software packages so as to avoid
reinventing the wheel.
3.  Apply #2 to our database and begin experimenting.

My question in this post centers on #2.  Not being a software engineer (though
having lots of scientific programming experience), I am sometimes puzzled by
"frameworks" and "components".  I think of everything as libraries of functions.
Yes, I know this view is outdated.  I can wrap my head around NLP packages like
LingPipe and NLTK but am unclear what a package like UIMA offers over and above
these types of pure libraries.

Given what I've told you about my background (scientist, programmer, but NOT a
software engineer), can someone explain to me how investing the time to learn
UIMA will pay off in the long run?  I've started to dig into the UIMA API but
thought I'd throw this rather basic question out there, hoping no one would
think it too naive for this forum.

Thanks in advance!

Mark Ettinger


Re: New to NLP and navigating the options.

Posted by Marshall Schor <ms...@schor.com>.
Hi Mark,

I don't think this is naive at all.  I'll describe a few areas where UIMA adds
value (this is not a complete list :-) ).

UIMA requires that components declare, in external XML descriptor files,
things such as the "types" they define, and which types they produce and
require as inputs.

This facilitates different "roles" played by different people with
different skills.  For instance, an annotator writer (someone who writes
components) might be skilled in NLP algorithms and how to write them
efficiently.  Someone else might not know these details, but be better
at "assembling" and "configuring" components, perhaps written by
different people independently, to address a particular need.  This
person would use tooling that makes use of the external XML files
mentioned above to do this.  The Component Descriptor Editor tool that
comes with UIMA is an example of this kind of tool.
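As a rough sketch of what such an external descriptor looks like (the type and
class names here are hypothetical, and many elements are omitted for brevity):

```xml
<analysisEngineDescription xmlns="http://uima.apache.org/resourceSpecifier">
  <frameworkImplementation>org.apache.uima.java</frameworkImplementation>
  <primitive>true</primitive>
  <!-- the Java class implementing the annotator -->
  <annotatorImplementationName>org.example.DiagnosisAnnotator</annotatorImplementationName>
  <analysisEngineMetaData>
    <name>Diagnosis Annotator</name>
    <!-- the types this component defines -->
    <typeSystemDescription>
      <types>
        <typeDescription>
          <name>org.example.Diagnosis</name>
          <supertypeName>uima.tcas.Annotation</supertypeName>
        </typeDescription>
      </types>
    </typeSystemDescription>
    <!-- what it requires as input and what it produces -->
    <capabilities>
      <capability>
        <inputs/>
        <outputs>
          <type>org.example.Diagnosis</type>
        </outputs>
      </capability>
    </capabilities>
  </analysisEngineMetaData>
</analysisEngineDescription>
```

Because this information lives outside the code, an assembler (or a tool like
the Component Descriptor Editor) can check that one component's outputs match
the next component's inputs without reading any Java source.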

Another kind of role is facilitated by UIMA-AS - the "deployer" might
have a configured set of components that is running too slowly - and
after some analysis of the time spent in various parts (facilitated
perhaps by using the "framework" facilities that compute this), might
decide to "scale-out" certain steps / components that are the
bottleneck, running those on multiple machines in a cluster.  The
original component writer knows nothing about these various kinds of
possibilities - the framework is "adding value" by keeping these
concerns separated from each other, and providing "services" that
address them.

The first a-ha experience I had with the framework happened in a class: we
were running some sets of components, and then, with a few simple commands,
some people in the class were able to run part of the pipeline as a "service"
on their machines, while others in the class used those components, without
rewriting the components and without any special coding beyond changing some
XML configuration files.  The framework did the work of using current
networking technology to accomplish all this, and the algorithm writer (who
might have been a scientific programmer not keeping up with the rapid changes
in the world of networking components) didn't need to know any of the details
to take advantage of it.

-Marshall


RE: New to NLP and navigating the options.

Posted by jo...@thomsonreuters.com.
Mark,

Your three step plan looks good. Here are some answers:

- NLTK is a toolkit aimed at teaching NLP to university students.
  (and there's an O'Reilly book coming out this summer, which you can read for free
  online at http://www.nltk.org/book )
  -> can help you with (1.)

- LingPipe is a toolkit to actually build (Java) systems for particular NLP tasks.
  -> can help you with (2.+3.)

- In NLP, there are two ways of doing things: either you convert between lots of
  idiosyncratic data formats, or you submit to a framework (UIMA, GATE) that manages
  annotations for you (see also the section on Standards in
  http://nltk.googlecode.com/svn/trunk/doc/book/ch11.html ).
  UIMA is such a framework for metadata (such as annotations of text) that saves you
  writing a lot of conversion code by standardizing one way of storing and handling it.

- the difference between components and frameworks doesn't just confuse you: people
  (wrongly) use the two terms interchangeably, but there is a difference, which is
  discussed in Section 2.6 of a paper of mine:

  Leidner, Jochen L. (2003). Current Issues in Software Engineering for Natural Language
  Processing. Proceedings of the Workshop on Software Engineering and Architecture of
  Language Technology Systems (SEALTS) held at the Joint Conference for Human Language
  Technology and the Annual Meeting of the North American Chapter of the Association for
  Computational Linguistics 2003 (HLT/NAACL'03), Edmonton, Alberta, Canada, pp. 45-50.
  http://www.iccs.inf.ed.ac.uk/~jleidner/documents/Leidner-2003-SEALTS.pdf
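The annotation-management point above rests on the idea of *standoff*
annotation, which UIMA's CAS standardizes: the document text is never
modified, and each annotation is just a typed (begin, end) span over it.  A
minimal sketch, with hypothetical type names (this is not the actual UIMA
API, though UIMA annotations offer an analogous getCoveredText() method):

```java
import java.util.List;

public class StandoffSketch {
    // A typed span over the unchanged document text.
    record Annotation(String type, int begin, int end) {}

    // Recover the text a span covers from its offsets.
    static String coveredText(String text, Annotation a) {
        return text.substring(a.begin(), a.end());
    }

    public static void main(String[] args) {
        String text = "BP 120/80, patient stable.";
        // Annotations live alongside the text, not inside it.
        List<Annotation> annotations = List.of(
            new Annotation("Measurement", 3, 9),   // "120/80"
            new Annotation("Finding", 11, 25));    // "patient stable"
        for (Annotation a : annotations)
            System.out.println(a.type() + ": " + coveredText(text, a));
    }
}
```

Because every component reads and writes spans in this one shared
representation, no pairwise format conversion is needed between components.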

How can you tell a toolkit from a framework? They say a framework is like Hollywood:
"Don't call us, we'll call you."  If it never calls you, it's probably a toolkit. ;-)
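A toy illustration of that inversion of control (all names hypothetical, not
any real library's API): with a toolkit, your code calls the library; with a
framework, you implement an interface and the framework decides when to call
it.

```java
import java.util.List;

public class HollywoodSketch {
    // Toolkit style: a function that you call yourself.
    static String[] tokenize(String text) {
        return text.split("\\s+");
    }

    // Framework style: the framework defines the component interface ...
    interface Annotator {
        void process(StringBuilder doc);
    }

    // ... and owns the control flow, calling *your* code per document.
    static void runPipeline(List<StringBuilder> docs, List<Annotator> annotators) {
        for (StringBuilder doc : docs)
            for (Annotator a : annotators)
                a.process(doc);
    }

    public static void main(String[] args) {
        // Toolkit: we make the call.
        System.out.println(tokenize("clinical notes database").length);

        // Framework: we only supply a component; runPipeline calls it.
        List<StringBuilder> docs = List.of(new StringBuilder("note one"));
        List<Annotator> annotators = List.of(doc -> doc.append(" [annotated]"));
        runPipeline(docs, annotators);
        System.out.println(docs.get(0));
    }
}
```

Because the framework owns the loop, it can later run your Annotator on
another machine or many machines, as with UIMA-AS, without your code changing.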

As far as resources at the University of Texas go, try to connect with Jason Baldridge,
who is a professor at U Texas (he's one of the authors of the OpenNLP package).

Regards
Jochen


--
Dr. Jochen Leidner
Research Scientist

Thomson Reuters 
Research & Development
610 Opperman Drive
Eagan, MN 55123
USA

http://www.ThomsonReuters.com
