Posted to users@opennlp.apache.org by Lance Norskog <go...@gmail.com> on 2012/05/23 05:29:07 UTC

Finding large parts of large sentences

I would like to take a long sentence, let's say 30 words, and find
clauses maybe 10-20 words long that are somewhat self-contained blocks
of text; complete sentences or nearly. These clauses can be
overlapping. What is a good way to use OpenNLP's tools?

The application is for document summarization via LSA. This technique
needs to operate on coherent statements rather than very long
sentences. Some of my test data is riddled with 30-50-word sentences,
and they have overlapping clauses which are coherent statements of the
document themes.

-- 
Lance Norskog
goksron@gmail.com

Re: Finding large parts of large sentences

Posted by Jason Baldridge <ja...@gmail.com>.
What you want is an elementary discourse unit detector. Here's an example
paper that does this:

aclweb.org/anthology-new/N/N03/N03-1030.pdf

You could indeed do something like a sentence detector for this -- it's
just that it is less obvious where you need to make decisions.

Jason

On Fri, May 25, 2012 at 7:09 PM, Lance Norskog <go...@gmail.com> wrote:

> Right. I was thinking of running the Chunker against chunked text.
> Meta-chunking :) The chunker is designed for a small vocabulary while
> NER is designed for a large vocabulary. The Chunker is really slow, and
> meta-chunking would be slower still, I'm sure. Maybe the sentence parser?
>
> The application does not have to be all that correct. If the tree
> parse is (A(B(C),D)), where these are clauses in order, CD would be fine.
> Overlapping sub-sentences is the goal.



-- 
Jason Baldridge
Associate Professor, Department of Linguistics
The University of Texas at Austin
http://www.jasonbaldridge.com
http://twitter.com/jasonbaldridge

Re: Finding large parts of large sentences

Posted by Lance Norskog <go...@gmail.com>.
Right. I was thinking of running the Chunker against chunked text.
Meta-chunking :) The chunker is designed for a small vocabulary while
NER is designed for a large vocabulary. The Chunker is really slow, and
meta-chunking would be slower still, I'm sure. Maybe the sentence parser?

The application does not have to be all that correct. If the tree
parse is (A(B(C),D)), where these are clauses in order, CD would be fine.
Overlapping sub-sentences is the goal.
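
One way to read the "clauses in order" idea is sketched below in plain Java,
with no OpenNLP dependency. It assumes the clauses have already been pulled
out of a parse as an ordered list of strings, and it emits every contiguous
run whose word count falls inside a range. The class name OverlappingRuns,
the sample clauses, and the 8-12 word limits are all illustrative
assumptions, not anything from OpenNLP.

```java
import java.util.ArrayList;
import java.util.List;

class OverlappingRuns {
    /**
     * Returns every contiguous run of clauses whose total word count
     * lies in [minWords, maxWords]. Runs may overlap each other.
     */
    static List<String> runs(List<String> clauses, int minWords, int maxWords) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < clauses.size(); i++) {
            StringBuilder sb = new StringBuilder();
            int words = 0;
            for (int j = i; j < clauses.size(); j++) {
                String clause = clauses.get(j).trim();
                if (j > i) sb.append(' ');
                sb.append(clause);
                words += clause.split("\\s+").length;
                if (words > maxWords) break;   // extending the run only makes it longer
                if (words >= minWords) out.add(sb.toString());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> clauses = List.of(
            "the committee met on Monday",
            "because the budget was late",
            "and it approved the plan");
        // Hypothetical limits; the thread suggests roughly 10-20 words.
        for (String run : runs(clauses, 8, 12)) {
            System.out.println(run);
        }
    }
}
```

With three five-word clauses and these limits, the first two clauses form
one run and the last two form another, so the middle clause appears in both.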

On Wed, May 23, 2012 at 12:04 AM, Svetoslav Marinov
<sv...@findwise.com> wrote:
> Take the longest NP chunks? There are NP chunker models for English.
> The results from the English NP chunker are quite granular, so maybe the
> length (about 30 words) should steer this.
>
> Alternatively, you can use the parser and take the longest NPs there
> that are children of a VP. Maybe also start with the very basic NP
> VP NP construction from the parse tree. This should, hopefully, give
> meaningful clauses.
>
> Another, probably weird, idea is to mimic an NER system: feed the
> output of a POS tagger to a RegEx NER finder, so that your regex
> works on POS sequences (e.g. DT JJ* NP).
>
> Hope this helps.
>
> Best,
>
> Svetoslav



-- 
Lance Norskog
goksron@gmail.com

Re: Finding large parts of large sentences

Posted by Svetoslav Marinov <sv...@findwise.com>.
Take the longest NP chunks? There are NP chunker models for English.
The results from the English NP chunker are quite granular, so maybe the
length (about 30 words) should steer this.
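
A minimal sketch of picking the longest NP chunks, in plain Java. It assumes
the sentence has already been tokenized and chunked, with one B-NP / I-NP /
other tag per token, which is the tag style the OpenNLP chunker returns. The
class name LongestNpChunks and the sample sentence are illustrative only.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class LongestNpChunks {
    /** Groups B-NP/I-NP tagged tokens into NP chunks, longest (in words) first. */
    static List<String> npChunks(String[] tokens, String[] chunkTags) {
        List<String> chunks = new ArrayList<>();
        StringBuilder current = null;
        for (int i = 0; i < tokens.length; i++) {
            String tag = chunkTags[i];
            if (tag.equals("B-NP")) {
                if (current != null) chunks.add(current.toString());
                current = new StringBuilder(tokens[i]);      // start a new NP
            } else if (tag.equals("I-NP") && current != null) {
                current.append(' ').append(tokens[i]);       // continue the NP
            } else {
                if (current != null) chunks.add(current.toString());
                current = null;                              // outside any NP
            }
        }
        if (current != null) chunks.add(current.toString());
        // Stable sort, so chunks of equal length keep their sentence order.
        chunks.sort(Comparator.comparingInt((String c) -> c.split(" ").length).reversed());
        return chunks;
    }

    public static void main(String[] args) {
        String[] tokens = {"The", "old", "stone", "bridge", "crosses", "a", "river"};
        String[] tags = {"B-NP", "I-NP", "I-NP", "I-NP", "B-VP", "B-NP", "I-NP"};
        for (String chunk : npChunks(tokens, tags)) {
            System.out.println(chunk);
        }
    }
}
```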

Alternatively, you can use the parser and take the longest NPs there
that are children of a VP. Maybe also start with the very basic NP
VP NP construction from the parse tree. This should, hopefully, give
meaningful clauses.
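
The parse-tree route can be sketched as follows, again in plain Java. The
code assumes a well-formed Penn Treebank style bracketed string, the kind a
constituency parser emits, and collects every constituent with a chosen
label whose word count is within a range. Nested constituents overlap by
construction. The class name ClauseExtractor, the label set, the limits, and
the sample parse are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class ClauseExtractor {
    static final class Node {
        final String label;
        String word;                                  // set only on leaf (POS) nodes
        final List<Node> children = new ArrayList<>();
        Node(String label) { this.label = label; }

        void collectWords(List<String> out) {
            if (word != null) out.add(word);
            for (Node c : children) c.collectWords(out);
        }
    }

    /** Parses a bracketed tree such as "(S (NP (DT The) (NN dog)) ...)". Input must be well formed. */
    static Node parse(String bracketed) {
        String[] toks = bracketed.replace("(", " ( ").replace(")", " ) ").trim().split("\\s+");
        return parseNode(toks, new int[] {0});
    }

    private static Node parseNode(String[] t, int[] pos) {
        pos[0]++;                                     // skip "("
        Node node = new Node(t[pos[0]++]);            // constituent or POS label
        while (!t[pos[0]].equals(")")) {
            if (t[pos[0]].equals("(")) node.children.add(parseNode(t, pos));
            else node.word = t[pos[0]++];             // the word under a POS label
        }
        pos[0]++;                                     // skip ")"
        return node;
    }

    /** Adds, in pre-order, the text of each node with a wanted label and min..max words. */
    static void collect(Node n, Set<String> labels, int min, int max, List<String> out) {
        List<String> words = new ArrayList<>();
        n.collectWords(words);
        if (labels.contains(n.label) && words.size() >= min && words.size() <= max) {
            out.add(String.join(" ", words));
        }
        for (Node c : n.children) collect(c, labels, min, max, out);
    }

    public static void main(String[] args) {
        String tree = "(S (NP (DT The) (NN parser)) (VP (VBZ finds) (SBAR (IN that) "
            + "(S (NP (JJ long) (NNS sentences)) (VP (VBP contain) "
            + "(NP (JJ smaller) (NNS clauses)))))))";
        List<String> out = new ArrayList<>();
        collect(parse(tree), Set.of("S", "SBAR"), 4, 7, out);   // hypothetical limits
        out.forEach(System.out::println);
    }
}
```

Swapping the label set for NP, or checking that a node's parent is a VP,
would give the "longest NPs under a VP" variant.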

Another, probably weird, idea is to mimic an NER system: feed the
output of a POS tagger to a RegEx NER finder, so that your regex
works on POS sequences (e.g. DT JJ* NP).
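
A rough sketch of matching a regex over POS sequences, using only
java.util.regex rather than any OpenNLP class. It assumes Penn Treebank tags
(so the noun head is written NN here), joins the tags into one string, and
maps each match back to token indexes. The class name PosPatternFinder, the
pattern, and the sample tags are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PosPatternFinder {
    // Hypothetical pattern: optional determiner, any adjectives, one or more nouns.
    // Every tag is followed by one space; \b stops "DT " matching inside "WDT ".
    static final Pattern NP_LIKE =
        Pattern.compile("\\b(?:DT )?(?:JJ[RS]? )*(?:NNP?S? )+");

    /** Returns {start, end} token index pairs (end exclusive) for each match. */
    static List<int[]> find(String[] tags, Pattern pattern) {
        StringBuilder sb = new StringBuilder();
        int[] starts = new int[tags.length + 1];      // char offset where each tag begins
        for (int i = 0; i < tags.length; i++) {
            starts[i] = sb.length();
            sb.append(tags[i]).append(' ');
        }
        starts[tags.length] = sb.length();

        List<int[]> spans = new ArrayList<>();
        Matcher m = pattern.matcher(sb);
        while (m.find()) {
            // Offsets are strictly increasing, so binary search recovers token indexes.
            int from = Arrays.binarySearch(starts, m.start());
            int to = Arrays.binarySearch(starts, m.end());
            if (from >= 0 && to > from) spans.add(new int[] {from, to});
        }
        return spans;
    }

    public static void main(String[] args) {
        String[] tokens = {"The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"};
        String[] tags = {"DT", "JJ", "JJ", "NN", "VBZ", "IN", "DT", "JJ", "NN"};
        for (int[] span : find(tags, NP_LIKE)) {
            System.out.println(String.join(" ", Arrays.copyOfRange(tokens, span[0], span[1])));
        }
    }
}
```

Longer patterns over the same tag string (for example NP-like, verb, NP-like)
are one way to approximate the NP VP NP construction without a full parse.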

Hope this helps.

Best,

Svetoslav


On 2012-05-23 05:29, "Lance Norskog" <go...@gmail.com> wrote:

>I would like to take a long sentence, let's say 30 words, and find
>clauses maybe 10-20 words long that are somewhat self-contained blocks
>of text; complete sentences or nearly. These clauses can be
>overlapping. What is a good way to use OpenNLP's tools?
>
>The application is for document summarization via LSA. This technique
>needs to operate on coherent statements rather than very long
>sentences. Some of my test data is riddled with 30-50-word sentences,
>and they have overlapping clauses which are coherent statements of the
>document themes.
>
>-- 
>Lance Norskog
>goksron@gmail.com
>