Posted to dev@lucene.apache.org by Shane O'Sullivan <sh...@gmail.com> on 2005/10/10 15:37:43 UTC

Adding generic payloads to a Term's posting list

Hi,

To the best of my knowledge, it is not possible to add generic data to a
Term's posting list. By this I mean information that is defined by the
search engine, not by Lucene itself. Lucene already adds some data to the
posting lists, such as the term's position within a document, but there are
many other useful types of information that could be attached to a term.

Some examples: in XML documents, the depth of a tag in the document, or font
information, such as whether the term appeared in a header or in the main
body of text.
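
As a rough illustration of the kind of data meant here, the sketch below is a toy in-memory posting list, not Lucene code. The class name, the method names and the two-byte payload encoding (tag depth, header flag) are all made up for this example; the point is only that each occurrence of a term carries opaque bytes whose meaning the application defines.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy in-memory index (not Lucene): every term occurrence carries an opaque payload. */
class PayloadPostingsSketch {

    /** One occurrence of a term: document id, token position, application-defined bytes. */
    static final class Posting {
        final int docId;
        final int position;
        final byte[] payload; // interpreted by the application, never by the index

        Posting(int docId, int position, byte[] payload) {
            this.docId = docId;
            this.position = position;
            this.payload = payload;
        }
    }

    // term text -> postings, kept in insertion order
    private final Map<String, List<Posting>> postings = new LinkedHashMap<>();

    void add(String term, int docId, int position, byte[] payload) {
        postings.computeIfAbsent(term, t -> new ArrayList<>())
                .add(new Posting(docId, position, payload));
    }

    List<Posting> postingsFor(String term) {
        List<Posting> list = postings.get(term);
        return list == null ? new ArrayList<>() : list;
    }

    public static void main(String[] args) {
        PayloadPostingsSketch index = new PayloadPostingsSketch();
        // Hypothetical encoding: byte 0 = XML tag depth, byte 1 = 1 if the term was in a header.
        index.add("lucene", 0, 4, new byte[] {2, 1});
        index.add("lucene", 0, 17, new byte[] {5, 0});

        for (Posting p : index.postingsFor("lucene")) {
            System.out.println("doc=" + p.docId + " pos=" + p.position
                    + " depth=" + p.payload[0] + " header=" + (p.payload[1] == 1));
        }
    }
}
```

Keeping the payload as raw bytes means the index never needs to know whether it holds a depth, a font flag or something else; only the code that wrote it has to agree with the code that reads it.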

Are there any plans to add such functionality to the API? If not, where
would be the appropriate place to implement these changes? I presume the
TermInfosWriter and TermInfosReader would have to be altered, as well as the
classes which call them. Could this be done without modifying the index in
such a way that standard Lucene could no longer read it?

Thanks

Shane

RE: Adding generic payloads to a Term's posting list

Posted by Grant Ingersoll <gs...@syr.edu>.
From my understanding, I don't think there has been any work on this, except
the idea put forth by Doug and others.

Contributions are definitely welcome...



---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: Adding generic payloads to a Term's posting list

Posted by Shane O'Sullivan <sh...@gmail.com>.
This is precisely what I am looking for. Does anyone know if this work is
going into Lucene 2.0?

Shane

Re: Adding generic payloads to a Term's posting list

Posted by jian chen <ch...@gmail.com>.
Hi,

I have been studying the Lucene indexing code for a bit. I am not sure I
understand the problem scope completely, but storing extra information
using TermInfosWriter may not solve the problem.

For the example of XML document tag depth, could that be a separate field?
A Lucene term is a combination of (field, termText), so depth could be a
field: even if two XML tags are the same, they are still treated as separate
terms when their depths are different.
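
A minimal sketch of that idea, using a toy in-memory map rather than Lucene classes. The `Term` class here only mimics the (field, termText) pairing, and the `depthN` field-naming convention is an assumption made up for this example:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.TreeSet;

/** Toy index (not Lucene): a term is a (field, text) pair and depth is encoded in the field. */
class DepthAsFieldSketch {

    /** Two terms are equal only if both field and text match. */
    static final class Term {
        final String field;
        final String text;

        Term(String field, String text) {
            this.field = field;
            this.text = text;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Term)) return false;
            Term other = (Term) o;
            return field.equals(other.field) && text.equals(other.text);
        }

        @Override
        public int hashCode() {
            return Objects.hash(field, text);
        }
    }

    private final Map<Term, Set<Integer>> docsByTerm = new HashMap<>();

    /** Hypothetical naming convention: one field per tag depth. */
    static String depthField(int depth) {
        return "depth" + depth;
    }

    void add(String field, String text, int docId) {
        docsByTerm.computeIfAbsent(new Term(field, text), t -> new TreeSet<>()).add(docId);
    }

    Set<Integer> docs(String field, String text) {
        Set<Integer> docs = docsByTerm.get(new Term(field, text));
        return docs == null ? Collections.<Integer>emptySet() : docs;
    }

    public static void main(String[] args) {
        DepthAsFieldSketch index = new DepthAsFieldSketch();
        index.add(depthField(1), "title", 0); // <title> at depth 1 in document 0
        index.add(depthField(3), "title", 1); // <title> at depth 3 in document 1

        // Same tag text, different depth: two distinct terms with separate postings.
        System.out.println(index.docs(depthField(1), "title"));
        System.out.println(index.docs(depthField(3), "title"));
    }
}
```

One consequence of this scheme, at least in the toy version: finding a tag regardless of depth means querying every depth field, whereas data attached to the posting itself would leave the term unchanged.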

This is what I could think about so far.

Jian

RE: Adding generic payloads to a Term's posting list

Posted by Grant Ingersoll <gs...@syr.edu>.
http://wiki.apache.org/jakarta-lucene/Lucene2Whiteboard 

See item #11 of API changes. It may be along the lines of what you are
interested in, although I don't know if anyone has even attempted a design
of it. I would also like to see this, plus the ability to store info at
higher levels in the Index, such as Field (not on a per-token basis),
Document (info about the document that spans its fields) and Index (such as
coreference information). Alas, no time...

-Grant
