You are viewing a plain text version of this content. The canonical link for it is here.

Posted to java-user@lucene.apache.org by Dave Kor <da...@gmail.com> on 2006/01/04 07:34:12 UTC

Good representation for part-of-speech, chunk, sentence boundary tags?

Hi,

  I would like to associate information (or labels) with each word or a
range of words in a document. Information such as this word is a noun, that
word is a verb, this period marks the end of a sentence, "kick the bucket"
is a contiguous phrase, "white house" is a location and so on. I am seeking
a good representation for such information so that they can be easily stored
as additional fields in a lucene document, and easily recovered after a
search. For the more technically inclined, this would allow me to store
part-of-speech tags, chunk tags, sentence boundary markers and parse trees
for every indexed document.

  These additional information will enable Lucene to perform additional
post-processing on retrieved documents for various purposes such as
information extraction, summarization, question answering, etc... Is there
any available api? If not, I would appreciate any suggestions and tips on
how such information can best be stored in a Lucene document.


Regards,
Dave.

Re: Good representation for part-of-speech, chunk, sentence boundary tags?

Posted by Grant Ingersoll <gs...@syr.edu>.

We tend to store this information externally in XML (or in a stored 
field), but that is probably more of an artifact of our other uses of 
the information outside of Lucene than any shortcoming in Lucene.

A few of us on the list had some ideas on sentence boundary storage 
here: 
http://www.mail-archive.com/java-user@lucene.apache.org/msg03694.html.  
You could do similar things with some of the other information I 
suspect.  Don't know a lot about how it works, but I think you might be 
able to leverage the fact that you can store multiple tokens at the same 
position.  For instance, you may be able to add the POS of label at the 
same position as the term.  Perhaps Erik could add whether this works or not

-Grant

Erik Hatcher wrote:

>
> On Jan 4, 2006, at 7:53 AM, Paul Elschot wrote:
>
>> On Wednesday 04 January 2006 07:34, Dave Kor wrote:
>>
>>> Hi,
>>>
>>>   I would like to associate information (or labels) with each word  
>>> or a
>>> range of words in a document. Information such as this word is a  
>>> noun, that
>>> word is a verb, this period marks the end of a sentence, "kick the  
>>> bucket"
>>> is a contiguous phrase, "white house" is a location and so on. I  am 
>>> seeking
>>> a good representation for such information so that they can be  
>>> easily stored
>>> as additional fields in a lucene document, and easily recovered  
>>> after a
>>> search. For the more technically inclined, this would allow me to  
>>> store
>>> part-of-speech tags, chunk tags, sentence boundary markers and  
>>> parse trees
>>> for every indexed document.
>>>
>>>   These additional information will enable Lucene to perform  
>>> additional
>>> post-processing on retrieved documents for various purposes such as
>>> information extraction, summarization, question answering, etc...  
>>> Is there
>>> any available api? If not, I would appreciate any suggestions and  
>>> tips on
>>> how such information can best be stored in a Lucene document.
>>
>>
>> Basically, the index information available in Lucene is the Term,  
>> which is a
>> combination of a field name and a token. For these Lucene indexes
>> document presence and all positions within a document.  Lucene also
>> indexes the field length as a norm.
>> By using one ore more extra fields the tags and sentence boundary  
>> markers
>> can be easily indexed at their positions. To search these have a  
>> look at the
>> span package.
>> In case you want to search for tokens combined with some (part of  
>> speech)
>> tag, and the tokens and their tags are in different fields, the  span 
>> package
>> is not sufficient, because it does not allow position search over  
>> different
>> fields.
>
>
> Paul - I'm interested in this topic myself.  Suppose the "text" field  
> is indexed but also entities are detected like names and places.   
> Suppose I'd like a query that was "all names that have the initials  
> EH in the text field" (where we could identify EH names by doing a  
> SpanRegexQuery for "E.* H.*".
>
> I've been pondering whether it makes sense for Lucene to be enhanced  
> to carry over a Token's type into the index such that it could factor  
> into the query also.
>
> Thoughts?
>
>     Erik
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

-- 
------------------------------------------------------------------- 
Grant Ingersoll 
Sr. Software Engineer 
Center for Natural Language Processing 
Syracuse University 
School of Information Studies 
337 Hinds Hall 
Syracuse, NY 13244 

http://www.cnlp.org 
Voice:  315-443-5484 
Fax: 315-443-6886 


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Good representation for part-of-speech, chunk, sentence boundary tags?

Posted by Paul Elschot <pa...@xs4all.nl>.

On Wednesday 04 January 2006 14:14, Erik Hatcher wrote:
> 
> On Jan 4, 2006, at 7:53 AM, Paul Elschot wrote:
> 
> > On Wednesday 04 January 2006 07:34, Dave Kor wrote:
> >> Hi,
> >>
...
> >>
> >>   These additional information will enable Lucene to perform  
> >> additional
> >> post-processing on retrieved documents for various purposes such as
> >> information extraction, summarization, question answering, etc...  
> >> Is there
> >> any available api? If not, I would appreciate any suggestions and  
> >> tips on
> >> how such information can best be stored in a Lucene document.
> >
> > Basically, the index information available in Lucene is the Term,  
> > which is a
> > combination of a field name and a token. For these Lucene indexes
> > document presence and all positions within a document.  Lucene also
> > indexes the field length as a norm.
> > By using one ore more extra fields the tags and sentence boundary  
> > markers
> > can be easily indexed at their positions. To search these have a  
> > look at the
> > span package.
> > In case you want to search for tokens combined with some (part of  
> > speech)
> > tag, and the tokens and their tags are in different fields, the  
> > span package
> > is not sufficient, because it does not allow position search over  
> > different
> > fields.
> 
> Paul - I'm interested in this topic myself.  Suppose the "text" field  
> is indexed but also entities are detected like names and places.   
> Suppose I'd like a query that was "all names that have the initials  
> EH in the text field" (where we could identify EH names by doing a  
> SpanRegexQuery for "E.* H.*".
> 
> I've been pondering whether it makes sense for Lucene to be enhanced  
> to carry over a Token's type into the index such that it could factor  
> into the query also.
> 
> Thoughts?

When the token type is a (small) integer, it could be stored in
addition to the indexed position as additional information in the
proximity index.
The document level  index could also be extended to
to be able to find the documents containing this typed token.

Part of speech tags would map nicely to this token type.
In the (recent) past there has been talk of structured fields and
sentence/paragraph level fields, but I don't recall details of
intended implementations.

Sentence/paragraph level fields are probably better handled by
extra proximity indexes for the same field, ie. without repeating
the term index. For this the document level index would need
a pointer into each proximity index, currently it has only one
pointer into the only proximity index.
When there are only a few types/tags, each of them could be
mapped to a level like this.

So there are (at least) two independent possibilities:
- store type/tag info in the proximity index, and maybe allow
  searching this on document level.
- add more proximity indexes for the same field.

Regards,
Paul Elschot

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Good representation for part-of-speech, chunk, sentence boundary tags?

Posted by Erik Hatcher <er...@ehatchersolutions.com>.

On Jan 4, 2006, at 7:53 AM, Paul Elschot wrote:

> On Wednesday 04 January 2006 07:34, Dave Kor wrote:
>> Hi,
>>
>>   I would like to associate information (or labels) with each word  
>> or a
>> range of words in a document. Information such as this word is a  
>> noun, that
>> word is a verb, this period marks the end of a sentence, "kick the  
>> bucket"
>> is a contiguous phrase, "white house" is a location and so on. I  
>> am seeking
>> a good representation for such information so that they can be  
>> easily stored
>> as additional fields in a lucene document, and easily recovered  
>> after a
>> search. For the more technically inclined, this would allow me to  
>> store
>> part-of-speech tags, chunk tags, sentence boundary markers and  
>> parse trees
>> for every indexed document.
>>
>>   These additional information will enable Lucene to perform  
>> additional
>> post-processing on retrieved documents for various purposes such as
>> information extraction, summarization, question answering, etc...  
>> Is there
>> any available api? If not, I would appreciate any suggestions and  
>> tips on
>> how such information can best be stored in a Lucene document.
>
> Basically, the index information available in Lucene is the Term,  
> which is a
> combination of a field name and a token. For these Lucene indexes
> document presence and all positions within a document.  Lucene also
> indexes the field length as a norm.
> By using one ore more extra fields the tags and sentence boundary  
> markers
> can be easily indexed at their positions. To search these have a  
> look at the
> span package.
> In case you want to search for tokens combined with some (part of  
> speech)
> tag, and the tokens and their tags are in different fields, the  
> span package
> is not sufficient, because it does not allow position search over  
> different
> fields.

Paul - I'm interested in this topic myself.  Suppose the "text" field  
is indexed but also entities are detected like names and places.   
Suppose I'd like a query that was "all names that have the initials  
EH in the text field" (where we could identify EH names by doing a  
SpanRegexQuery for "E.* H.*".

I've been pondering whether it makes sense for Lucene to be enhanced  
to carry over a Token's type into the index such that it could factor  
into the query also.

Thoughts?

	Erik


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Good representation for part-of-speech, chunk, sentence boundary tags?

Posted by Paul Elschot <pa...@xs4all.nl>.

On Wednesday 04 January 2006 07:34, Dave Kor wrote:
> Hi,
> 
>   I would like to associate information (or labels) with each word or a
> range of words in a document. Information such as this word is a noun, that
> word is a verb, this period marks the end of a sentence, "kick the bucket"
> is a contiguous phrase, "white house" is a location and so on. I am seeking
> a good representation for such information so that they can be easily stored
> as additional fields in a lucene document, and easily recovered after a
> search. For the more technically inclined, this would allow me to store
> part-of-speech tags, chunk tags, sentence boundary markers and parse trees
> for every indexed document.
> 
>   These additional information will enable Lucene to perform additional
> post-processing on retrieved documents for various purposes such as
> information extraction, summarization, question answering, etc... Is there
> any available api? If not, I would appreciate any suggestions and tips on
> how such information can best be stored in a Lucene document.

Basically, the index information available in Lucene is the Term, which is a
combination of a field name and a token. For these Lucene indexes
document presence and all positions within a document.  Lucene also
indexes the field length as a norm.
By using one ore more extra fields the tags and sentence boundary markers
can be easily indexed at their positions. To search these have a look at the
span package.
In case you want to search for tokens combined with some (part of speech)
tag, and the tokens and their tags are in different fields, the span package
is not sufficient, because it does not allow position search over different
fields.
One use for positions as sentence boundary markers is to leave
gaps at the sentence boundaries. This can safely be done when the
slop (allowed distance) in the queries is always smaller than this gap.

Parse trees carry (much) more information, and these will not be easy to map
to Lucene, but it all depends on the searches you want to support.

Regards,
Paul Elschot

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org