Posted to java-user@lucene.apache.org by wgggfiy <wu...@qq.com> on 2012/11/18 18:09:09 UTC

what is the offsets and payload in DocsAndPositionsEnum for ??

I'm now studying Lucene 4.0.
1. What are startOffset and endOffset for? Is there a code example?
2. What is a payload? I know only a little about it: it can be used for
things like font weight or an enclosing XML tag.
3. I have an item like (lucene, 350, 450, 33.2, 2), where 350 and 450 are the
start and end offsets of the term 'lucene', 33.2 is a score, and 2 is some id.
My question is how I can get it indexed.
My first idea is to implement my own postings list format, but is it possible
to do it with startOffset, endOffset and a payload?
Thanks.

wgggfiy



--
View this message in context: http://lucene.472066.n3.nabble.com/what-is-the-offsets-and-payload-in-DocsAndPositionsEnum-for-tp4020933.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: what is the offsets and payload in DocsAndPositionsEnum for ??

Posted by Ian Lea <ia...@gmail.com>.

Well, according to the javadoc, PayloadTermQuery "factors in the value
of the payload located at each of the positions where the Term
occurs".

Have you read some of the info available from Google by searching for
"lucene payloads"?


--
Ian.


On Fri, Nov 23, 2012 at 8:32 AM, wgggfiy <wu...@qq.com> wrote:
> After I finish "packing your information into a payload",
> is there some method to search with that information?
> What is "PayloadTermQuery" for?
> Thanks
>
>
>
> -----
> --------------------------
> Email: wuqiu.main@qq.com
> --------------------------
> --
> View this message in context: http://lucene.472066.n3.nabble.com/what-is-the-offsets-and-payload-in-DocsAndPositionsEnum-for-tp4020933p4021981.html
>

Re: what is the offsets and payload in DocsAndPositionsEnum for ??

Posted by wgggfiy <wu...@qq.com>.

After I finish "packing your information into a payload",
is there some method to search with that information?
What is "PayloadTermQuery" for?
Thanks



-----
--------------------------
Email: wuqiu.main@qq.com
--------------------------
--
View this message in context: http://lucene.472066.n3.nabble.com/what-is-the-offsets-and-payload-in-DocsAndPositionsEnum-for-tp4020933p4021981.html

Re: What is "flexible indexing" in Lucene 4.0 if it's not the ability to make new postings codecs?

Posted by wgggfiy <wu...@qq.com>.

Thanks very much!
LingPipe and GATE are very useful, and new to me,
but aren't they too large a tool for implementing a custom posting item like
class TestPostingItem
{
        int termId;
        long startOffset;
        long endOffset;
        float score;
        int segId;
        long timeStamp;
}
?



-----
--------------------------
Email: wuqiu.main@qq.com
--------------------------
--
View this message in context: http://lucene.472066.n3.nabble.com/what-is-the-offsets-and-payload-in-DocsAndPositionsEnum-for-tp4020933p4026571.html

Re: What is "flexible indexing" in Lucene 4.0 if it's not the ability to make new postings codecs?

Posted by Carsten Schnober <sc...@ids-mannheim.de>.

Am 18.12.2012 12:36, schrieb Michael McCandless:
> On Thu, Dec 13, 2012 at 8:32 AM, Carsten Schnober
> <sc...@ids-mannheim.de> wrote:


>> This is a relatively easy example, but how would you deal with e.g.
>> annotations that include multiple tokens (as in spans), such as chunks,
>> or relations between tokens (and token spans), as in the coreference
>> links example given by Steven above?
> 
> I think you'd do something like what SynonymFilter does for
> multi-token synonyms.
> 
> Eg a synonym for "wireless network" -> wifi would insert a new token
> ("wifi"), overlapped on wireless.
> 
> Lucene doesn't store the end span, but if this is really important for
> your use case, you could add a payload to that wifi token that would
> encode the number of positions that the inserted token spans (2 in
> this case), and then the information would be present in the index.
> 
> You'd still need to do something custom at read/search time to decode
> this end position and do something interesting with it ...

Thanks for the pointer!
I'm still puzzled whether there is an optimal way to encode
(labelled) relations between tokens or even spans; the latter part would
probably lead back to the synonym-like solution.
Best,
Carsten

-- 
Institut für Deutsche Sprache | http://www.ids-mannheim.de
Projekt KorAP                 | http://korap.ids-mannheim.de
Tel. +49-(0)621-43740789      | schnober@ids-mannheim.de
Korpusanalyseplattform der nächsten Generation
Next Generation Corpus Analysis Platform

Re: What is "flexible indexing" in Lucene 4.0 if it's not the ability to make new postings codecs?

Posted by Michael McCandless <lu...@mikemccandless.com>.

On Thu, Dec 13, 2012 at 8:32 AM, Carsten Schnober
<sc...@ids-mannheim.de> wrote:
> Am 13.12.2012 12:27, schrieb Michael McCandless:
>
>>> For example:
>>>  - part of speech of a token.
>>>  - syntactic parse subtree (over a span).
>>>  - semantically normalized phrase (to canonical text or ontological code).
>>>  - semantic group (of a span).
>>>  - coreference link.
>>
>> So for example part-of-speech is a per-Token-position attribute.
>>
>> Today the easiest way to handle this is to encode these attributes
>> into a Payload, which is straightforward (make a custom TokenFilter
>> that creates the payload).
>>
>> At search time you would then use e.g. PayloadTermQuery to decode the
>> Payload and do something with it to alter how the query is being
>> scored.
>
> This is a relatively easy example, but how would you deal with e.g.
> annotations that include multiple tokens (as in spans), such as chunks,
> or relations between tokens (and token spans), as in the coreference
> links example given by Steven above?

I think you'd do something like what SynonymFilter does for
multi-token synonyms.

Eg a synonym for "wireless network" -> wifi would insert a new token
("wifi"), overlapped on wireless.

Lucene doesn't store the end span, but if this is really important for
your use case, you could add a payload to that wifi token that would
encode the number of positions that the inserted token spans (2 in
this case), and then the information would be present in the index.

You'd still need to do something custom at read/search time to decode
this end position and do something interesting with it ...
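A minimal sketch of the span-length idea, using only the JDK rather than Lucene's attribute classes: the injected token keeps its start position and carries a one-byte payload holding the number of positions it spans. SpanToken and its methods are hypothetical names for illustration; in a real TokenFilter the byte would be set through the token's payload attribute, and the decoding would live in custom search-time code.

```java
// Hypothetical token record for illustration only (not a Lucene class).
final class SpanToken {
    final String term;
    final int position;    // position of the first token the span covers
    final byte[] payload;  // one unsigned byte: number of positions spanned

    SpanToken(String term, int position, int spanLength) {
        if (spanLength < 1 || spanLength > 255) {
            throw new IllegalArgumentException("span length must fit in one byte: " + spanLength);
        }
        this.term = term;
        this.position = position;
        this.payload = new byte[] { (byte) spanLength };
    }

    // Decode the span length back out of the payload byte.
    int spanLength() {
        return payload[0] & 0xFF;
    }

    // Exclusive end position: the position of the first token after the span.
    int endPosition() {
        return position + spanLength();
    }

    public static void main(String[] args) {
        // "wireless network" -> "wifi": injected at position 0, spanning 2 positions.
        SpanToken wifi = new SpanToken("wifi", 0, 2);
        System.out.println(wifi.term + " covers positions " + wifi.position
                + " up to (not including) " + wifi.endPosition());
    }
}
```

A single byte limits spans to 255 positions; a variable-length integer would lift that limit at the cost of slightly more decoding work.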

Mike McCandless

http://blog.mikemccandless.com

Re: What is "flexible indexing" in Lucene 4.0 if it's not the ability to make new postings codecs?

Posted by Carsten Schnober <sc...@ids-mannheim.de>.

Am 13.12.2012 12:27, schrieb Michael McCandless:

>> For example:
>>  - part of speech of a token.
>>  - syntactic parse subtree (over a span).
>>  - semantically normalized phrase (to canonical text or ontological code).
>>  - semantic group (of a span).
>>  - coreference link.
> 
> So for example part-of-speech is a per-Token-position attribute.
> 
> Today the easiest way to handle this is to encode these attributes
> into a Payload, which is straightforward (make a custom TokenFilter
> that creates the payload).
> 
> At search time you would then use e.g. PayloadTermQuery to decode the
> Payload and do something with it to alter how the query is being
> scored.

This is a relatively easy example, but how would you deal with e.g.
annotations that include multiple tokens (as in spans), such as chunks,
or relations between tokens (and token spans), as in the coreference
links example given by Steven above?
Best,
Carsten

-- 
Institut für Deutsche Sprache | http://www.ids-mannheim.de
Projekt KorAP                 | http://korap.ids-mannheim.de
Tel. +49-(0)621-43740789      | schnober@ids-mannheim.de
Korpusanalyseplattform der nächsten Generation
Next Generation Corpus Analysis Platform

Re: More about storing NLP-type stuff in the index

Posted by Michael Sokolov <so...@ifactory.com>.

On 1/3/2013 6:16 PM, Wu, Stephen T., Ph.D. wrote:
> I think we've been saying that if we put something in a Payload, it will be
> indexed.  From what I understand of the indexing format, that means that
> what you put in the Payload will be stored in the Lucene index... But it
> won't *itself* be indexed & optimized for search.
>
> That's good, but can we build inverted indices on the contents of the
> Payloads (or the Attributes) as well?
>   Ex1: Say I put semantic role labels like ARG0 into my index. Say my search
> is looking for all instances of ARG0.
>   Ex2: Say I add payloads to terms indicating that they're named entities
> belonging to a semantic group.  Then say my query looks for all instances of
> the "Medications" semantic group.
>
> It's almost like just putting these things in different fields, with the
> exception that the things in different fields need to be linked so you know
> what the original text was.  Maybe the linking can be done via Payloads
> (offsets in the original text)?  If I want to store multiple things at the
> same startOffset then I just use something like SynonymFilter?
>
I've been working on a different but (in a way) related problem: 
indexing text in XML documents.  In that case, we want to associate the 
names of enclosing elements with each term so that it's possible to 
search for (say) "ermine" in the context /doc/title as distinct from 
"ermine" in the context of //paragraph, or something like that.  Anyway 
what I've done doesn't use payloads.  I index two fields that are 
relevant to this: a full text field, which is just the usual text index 
(per document), and then an element-text field which indexes each term 
as a concatenation of the element name and the term value, so: 
title:ermine, doc:ermine, and paragraph:ermine would be typical terms.  
I index all of the enclosing element names for each word at the same 
position (like synonym filter does). This relies on a magical character 
(":") that isn't allowed to appear in any tokens, which is too bad, but 
not terribly restrictive.

Something like this might work for you.  The prefixing also has the nice 
feature that when you enumerate terms, they are ordered first by prefix: 
of course you could flip the order if it were more interesting to list 
all "contexts" for a word rather than all words in a context (or with 
some POS tag).
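As a rough illustration of the prefixing scheme described above, here is a plain-JDK sketch that builds the element-text terms for one word and shows how they sort. ElementTerms and termsFor are hypothetical names, not part of Lucene; the TreeSet stands in for the term dictionary, and its String ordering matches the index's term order only for ASCII terms like these.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Hypothetical helper for illustration only (not a Lucene class).
final class ElementTerms {
    static final char SEP = ':';  // the "magical" character that tokens may not contain

    // One term per enclosing element; all of them would be indexed at the same position.
    static List<String> termsFor(String word, List<String> enclosingElements) {
        if (word.indexOf(SEP) >= 0) {
            throw new IllegalArgumentException("token must not contain '" + SEP + "': " + word);
        }
        List<String> terms = new ArrayList<>();
        for (String element : enclosingElements) {
            terms.add(element + SEP + word);
        }
        return terms;
    }

    public static void main(String[] args) {
        TreeSet<String> dictionary = new TreeSet<>();
        dictionary.addAll(termsFor("ermine", Arrays.asList("doc", "title")));
        dictionary.addAll(termsFor("ermine", Arrays.asList("doc", "paragraph")));
        dictionary.addAll(termsFor("stoat", Arrays.asList("doc", "paragraph")));
        // Enumeration groups the terms by element name first, then by word.
        for (String term : dictionary) {
            System.out.println(term);
        }
    }
}
```

Swapping the concatenation to word + SEP + element would instead group all contexts of a word together, which is the flipped order mentioned above.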

-Mike

More about storing NLP-type stuff in the index

Posted by "Wu, Stephen T., Ph.D." <Wu...@mayo.edu>.

I think we've been saying that if we put something in a Payload, it will be
indexed.  From what I understand of the indexing format, that means that
what you put in the Payload will be stored in the Lucene index... But it
won't *itself* be indexed & optimized for search.

That's good, but can we build inverted indices on the contents of the
Payloads (or the Attributes) as well?
 Ex1: Say I put semantic role labels like ARG0 into my index. Say my search
is looking for all instances of ARG0.
 Ex2: Say I add payloads to terms indicating that they're named entities
belonging to a semantic group.  Then say my query looks for all instances of
the "Medications" semantic group.

It's almost like just putting these things in different fields, with the
exception that the things in different fields need to be linked so you know
what the original text was.  Maybe the linking can be done via Payloads
(offsets in the original text)?  If I want to store multiple things at the
same startOffset then I just use something like SynonymFilter?

stephen

On 12/21/12 6:45 AM, "Michael McCandless" <lu...@mikemccandless.com> wrote:

> On Thu, Dec 20, 2012 at 3:54 PM, Wu, Stephen T., Ph.D.
> <Wu...@mayo.edu> wrote:
>>> If you stuff the end of the span into the payload you'd have to create
>>> a custom variant of PhraseQuery to properly match based on the end
>>> span.
>> 
>> How different is this from the functionality already available through
>> SpanQuery?
> 
> Good question!
> 
> I think the difference would be index-time (payload encoding span-end
> + new Query) vs search time (SpanQuery)?
> 
> Ie, with the former (index-time) you'd have a TokenFilter spotting the
> spans and encoding them into the index, and with the latter all
> spotting happens at search time?
> 
> So net/net I guess (?) the results would be the same, but performance
> should be faster if you do it index-time?
> 
> Mike McCandless
> 
> http://blog.mikemccandless.com
> 

Re: What is "flexible indexing" in Lucene 4.0 if it's not the ability to make new postings codecs?

Posted by Michael McCandless <lu...@mikemccandless.com>.

On Thu, Dec 20, 2012 at 3:54 PM, Wu, Stephen T., Ph.D.
<Wu...@mayo.edu> wrote:
>> If you stuff the end of the span into the payload you'd have to create
>> a custom variant of PhraseQuery to properly match based on the end
>> span.
>
> How different is this from the functionality already available through
> SpanQuery?

Good question!

I think the difference would be index-time (payload encoding span-end
+ new Query) vs search time (SpanQuery)?

Ie, with the former (index-time) you'd have a TokenFilter spotting the
spans and encoding them into the index, and with the latter all
spotting happens at search time?

So net/net I guess (?) the results would be the same, but performance
should be faster if you do it index-time?

Mike McCandless

http://blog.mikemccandless.com

Re: What is "flexible indexing" in Lucene 4.0 if it's not the ability to make new postings codecs?

Posted by "Wu, Stephen T., Ph.D." <Wu...@mayo.edu>.

> If you stuff the end of the span into the payload you'd have to create
> a custom variant of PhraseQuery to properly match based on the end
> span.

How different is this from the functionality already available through
SpanQuery?

stephen


Re: What is "flexible indexing" in Lucene 4.0 if it's not the ability to make new postings codecs?

Posted by Michael McCandless <lu...@mikemccandless.com>.

On Thu, Dec 13, 2012 at 10:09 AM, Glen Newton <gl...@gmail.com> wrote:
>> Unfortunately, Lucene doesn't properly index
>> spans (it records the start position but not the end position), so
>> that limits what kind of matching you can do at search time.
>
> If this could be fixed (i.e. indexing the _end_ of a span) I think all
> the things that I want to do, and the things that can now be done in
> GATE very easily, would be possible using Mike's suggested method.

What would you use the end of the span for?

For example, do you need to do the equivalent of an end-of-span-aware
PhraseQuery?

Ie, so that if the document is "wireless network is down", and I apply
the synonym "wireless network" -> "wifi" at indexing time, then the
end-span-aware-PhraseQuery would match "wifi is down" (unlike today).

If you stuff the end of the span into the payload you'd have to create
a custom variant of PhraseQuery to properly match based on the end
span.
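To make the "wifi is down" case concrete, here is a toy, plain-JDK sketch of phrase matching over (term, position, span length) triples. It is not how PhraseQuery is implemented; EndAwarePhrase, Tok and matches are hypothetical names, and the span length is assumed to have been decoded from a payload already. With useSpanEnd set to false every token is treated as one position wide, which mirrors the behavior described as "today".

```java
import java.util.Arrays;
import java.util.List;

// Toy model for illustration only (not Lucene's PhraseQuery).
final class EndAwarePhrase {
    static final class Tok {
        final String term;
        final int pos;  // start position
        final int len;  // number of positions spanned (1 for ordinary tokens)

        Tok(String term, int pos, int len) {
            this.term = term;
            this.pos = pos;
            this.len = len;
        }
    }

    // True if the phrase terms occur back to back, where the next term must
    // start at the previous token's end (pos + len) or, without span ends, at pos + 1.
    static boolean matches(List<Tok> doc, List<String> phrase, boolean useSpanEnd) {
        if (phrase.isEmpty()) {
            return false;
        }
        for (Tok first : doc) {
            if (first.term.equals(phrase.get(0))
                    && matchesFrom(doc, phrase, 1, nextPos(first, useSpanEnd), useSpanEnd)) {
                return true;
            }
        }
        return false;
    }

    private static boolean matchesFrom(List<Tok> doc, List<String> phrase, int i,
                                       int expectedPos, boolean useSpanEnd) {
        if (i == phrase.size()) {
            return true;
        }
        for (Tok t : doc) {
            if (t.pos == expectedPos && t.term.equals(phrase.get(i))
                    && matchesFrom(doc, phrase, i + 1, nextPos(t, useSpanEnd), useSpanEnd)) {
                return true;
            }
        }
        return false;
    }

    private static int nextPos(Tok t, boolean useSpanEnd) {
        return t.pos + (useSpanEnd ? t.len : 1);
    }

    public static void main(String[] args) {
        // "wireless network is down" with "wifi" injected over the first two positions.
        List<Tok> doc = Arrays.asList(
                new Tok("wireless", 0, 1), new Tok("wifi", 0, 2),
                new Tok("network", 1, 1), new Tok("is", 2, 1), new Tok("down", 3, 1));
        List<String> phrase = Arrays.asList("wifi", "is", "down");
        System.out.println("ignoring span end: " + matches(doc, phrase, false));
        System.out.println("using span end:    " + matches(doc, phrase, true));
    }
}
```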

Mike McCandless

http://blog.mikemccandless.com

Re: What is "flexible indexing" in Lucene 4.0 if it's not the ability to make new postings codecs?

Posted by "Wu, Stephen T., Ph.D." <Wu...@mayo.edu>.

That would be really nice. Full standoff annotations open a lot of doors.

If we had them, though, I'm not sure exactly which of Mike's methods you'd
use?  I thought payloads were completely token-based and could not be
attached to spans regardless.  And the SynonymFilter is really to mimic the
behavior of multiple tokens/span... (though maybe you could add the other
tokens in as "synonyms" and then skip the tokens you added...?).
Mike, is all this stuff possible if we can just index the ends of spans?

stephen


On 12/13/12 9:09 AM, "Glen Newton" <gl...@gmail.com> wrote:

>> Unfortunately, Lucene doesn't properly index
>> spans (it records the start position but not the end position), so
>> that limits what kind of matching you can do at search time.
> 
> If this could be fixed (i.e. indexing the _end_ of a span) I think all
> the things that I want to do, and the things that can now be done in
> GATE very easily, would be possible using Mike's suggested method.
> 
> 
> -Glen
> 
> On Thu, Dec 13, 2012 at 6:27 AM, Michael McCandless
> <lu...@mikemccandless.com> wrote:
>> On Wed, Dec 12, 2012 at 3:02 PM, Wu, Stephen T., Ph.D.
>> <Wu...@mayo.edu> wrote:
>>>>> Is there any (preliminary) code checked in somewhere that I can look at,
>>>>> that would help me understand the practical issues that would need to be
>>>>> addressed?
>>>> 
>>>> Maybe we can make this more concrete: what new attribute are you
>>>> needing to record in the postings and access at search time?
>>> 
>>> For example:
>>>  - part of speech of a token.
>>>  - syntactic parse subtree (over a span).
>>>  - semantically normalized phrase (to canonical text or ontological code).
>>>  - semantic group (of a span).
>>>  - coreference link.
>> 
>> So for example part-of-speech is a per-Token-position attribute.
>> 
>> Today the easiest way to handle this is to encode these attributes
>> into a Payload, which is straightforward (make a custom TokenFilter
>> that creates the payload).
>> 
>> At search time you would then use e.g. PayloadTermQuery to decode the
>> Payload and do something with it to alter how the query is being
>> scored.
>> 
>> For the span-like attributes (eg a syntactic parse, semantically
>> normalized phrase) I think you'd need to do something like
>> SynonymFilter in your analysis, i.e. insert new tokens at the position
>> where the span started.  Unfortunately, Lucene doesn't properly index
>> spans (it records the start position but not the end position), so
>> that limits what kind of matching you can do at search time.
>> 
>> Mike McCandless
>> 
>> http://blog.mikemccandless.com
>> 
> 
> 


Re: What is "flexible indexing" in Lucene 4.0 if it's not the ability to make new postings codecs?

Posted by Glen Newton <gl...@gmail.com>.

> Unfortunately, Lucene doesn't properly index
> spans (it records the start position but not the end position), so
> that limits what kind of matching you can do at search time.

If this could be fixed (i.e. indexing the _end_ of a span) I think all
the things that I want to do, and the things that can now be done in
GATE very easily, would be possible using Mike's suggested method.


-Glen

On Thu, Dec 13, 2012 at 6:27 AM, Michael McCandless
<lu...@mikemccandless.com> wrote:
> On Wed, Dec 12, 2012 at 3:02 PM, Wu, Stephen T., Ph.D.
> <Wu...@mayo.edu> wrote:
>>>> Is there any (preliminary) code checked in somewhere that I can look at,
>>>> that would help me understand the practical issues that would need to be
>>>> addressed?
>>>
>>> Maybe we can make this more concrete: what new attribute are you
>>> needing to record in the postings and access at search time?
>>
>> For example:
>>  - part of speech of a token.
>>  - syntactic parse subtree (over a span).
>>  - semantically normalized phrase (to canonical text or ontological code).
>>  - semantic group (of a span).
>>  - coreference link.
>
> So for example part-of-speech is a per-Token-position attribute.
>
> Today the easiest way to handle this is to encode these attributes
> into a Payload, which is straightforward (make a custom TokenFilter
> that creates the payload).
>
> At search time you would then use e.g. PayloadTermQuery to decode the
> Payload and do something with it to alter how the query is being
> scored.
>
> For the span-like attributes (eg a syntactic parse, semantically
> normalized phrase) I think you'd need to do something like
> SynonymFilter in your analysis, i.e. insert new tokens at the position
> where the span started.  Unfortunately, Lucene doesn't properly index
> spans (it records the start position but not the end position), so
> that limits what kind of matching you can do at search time.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>



-- 
-
http://zzzoot.blogspot.com/
-

Re: What is "flexible indexing" in Lucene 4.0 if it's not the ability to make new postings codecs?

Posted by Michael McCandless <lu...@mikemccandless.com>.

On Wed, Dec 12, 2012 at 3:02 PM, Wu, Stephen T., Ph.D.
<Wu...@mayo.edu> wrote:
>>> Is there any (preliminary) code checked in somewhere that I can look at,
>>> that would help me understand the practical issues that would need to be
>>> addressed?
>>
>> Maybe we can make this more concrete: what new attribute are you
>> needing to record in the postings and access at search time?
>
> For example:
>  - part of speech of a token.
>  - syntactic parse subtree (over a span).
>  - semantically normalized phrase (to canonical text or ontological code).
>  - semantic group (of a span).
>  - coreference link.

So for example part-of-speech is a per-Token-position attribute.

Today the easiest way to handle this is to encode these attributes
into a Payload, which is straightforward (make a custom TokenFilter
that creates the payload).

At search time you would then use e.g. PayloadTermQuery to decode the
Payload and do something with it to alter how the query is being
scored.
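As a rough sketch of the "encode into a payload" step, using only the JDK: the item from the start of the thread (score 33.2 and id 2 for the term 'lucene') can be packed into a fixed 8-byte array. PostingPayload and its methods are hypothetical names for illustration; in Lucene 4.0 the resulting bytes would be wrapped in a BytesRef and set through PayloadAttribute inside the custom TokenFilter, and search-time code would decode the same layout.

```java
import java.nio.ByteBuffer;

// Hypothetical helper for illustration only (not a Lucene class): packs a float
// score and an int id into the bytes that would become one position's payload.
final class PostingPayload {
    static final int SIZE = Float.BYTES + Integer.BYTES;  // 4 + 4 bytes

    // Layout (big-endian, ByteBuffer's default): [score: 4 bytes][id: 4 bytes]
    static byte[] encode(float score, int id) {
        return ByteBuffer.allocate(SIZE).putFloat(score).putInt(id).array();
    }

    static float decodeScore(byte[] payload) {
        return ByteBuffer.wrap(payload).getFloat(0);
    }

    static int decodeId(byte[] payload) {
        return ByteBuffer.wrap(payload).getInt(Float.BYTES);
    }

    public static void main(String[] args) {
        byte[] payload = encode(33.2f, 2);
        System.out.println(payload.length + " bytes, score=" + decodeScore(payload)
                + ", id=" + decodeId(payload));
    }
}
```

The start and end offsets (350, 450) are deliberately left out of the payload here, on the assumption that they would be recorded as ordinary token offsets instead.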

For the span-like attributes (eg a syntactic parse, semantically
normalized phrase) I think you'd need to do something like
SynonymFilter in your analysis, i.e. insert new tokens at the position
where the span started.  Unfortunately, Lucene doesn't properly index
spans (it records the start position but not the end position), so
that limits what kind of matching you can do at search time.

Mike McCandless

http://blog.mikemccandless.com

Re: What is "flexible indexing" in Lucene 4.0 if it's not the ability to make new postings codecs?

Posted by Glen Newton <gl...@gmail.com>.

+10

These are the kind of things you can do in GATE[1] using annotations[2].
A VERY useful feature.

-Glen

[1]http://gate.ac.uk
[2]http://gate.ac.uk/wiki/jape-repository/annotations.html

On Wed, Dec 12, 2012 at 3:02 PM, Wu, Stephen T., Ph.D.
<Wu...@mayo.edu> wrote:
>>> Is there any (preliminary) code checked in somewhere that I can look at,
>>> that would help me understand the practical issues that would need to be
>>> addressed?
>>
>> Maybe we can make this more concrete: what new attribute are you
>> needing to record in the postings and access at search time?
>
> For example:
>  - part of speech of a token.
>  - syntactic parse subtree (over a span).
>  - semantically normalized phrase (to canonical text or ontological code).
>  - semantic group (of a span).
>  - coreference link.
>
> stephen
>
>



-- 
-
http://zzzoot.blogspot.com/
-

Re: What is "flexible indexing" in Lucene 4.0 if it's not the ability to make new postings codecs?

Posted by Michael McCandless <lu...@mikemccandless.com>.

On Wed, Dec 12, 2012 at 9:08 PM, lukai <lu...@gmail.com> wrote:
> Do we have any plan to decouple the indexing process?
>
> Lucene was designed for search, but judging by the questions people ask in
> this thread, their needs sometimes go beyond search. For example, we might
> want to customize our scoring function based on payloads. Sometimes I don't
> need to store TF/IDF information: we can pre-calculate features and store
> them in the system, yet I still have to store the extra TF/IDF information.
> And sometimes I think we want to load the whole postings into memory to
> speed things up. In such cases, we really want to customize the
> functionality/process of the inverted index.

Much of this can already be done with Lucene.  Eg, plug in your own
Similarity to get custom scoring (and we already have a bunch of
standard models ... TF/IDF (default), BM25, DFR, language models,
etc.).  Use MemoryPostingsFormat to pull everything into RAM.
Customize other parts of the index using your own Codec.

> The main problem is that the
> implementation is highly coupled with the indexing chain, so it's not easy
> to write a new one. Do we have a plan to make the indexing chain easier to
> change?
>
> Flexible index chain logic, flexible codecs format.

The indexing chain, which is inside IndexWriter and processes each
document into temporary RAM structures and then writes a new segment
via the Codec API, can in fact be changed, but it's extremely expert
and the APIs are not documented (you must read the source code to work
through it).

That said, customizing the chain is rarely really necessary ...
typically existing pluggability (payloads, Sims, custom codec) can
solve most problems.

Mike McCandless

http://blog.mikemccandless.com

Re: What is "flexible indexing" in Lucene 4.0 if it's not the ability to make new postings codecs?

Posted by lukai <lu...@gmail.com>.

Do we have any plan to decouple the indexing process?

Lucene was designed for search, but judging by the questions people ask in
this thread, their needs sometimes go beyond search. For example, we might
want to customize our scoring function based on payloads. Sometimes I don't
need to store TF/IDF information: we can pre-calculate features and store
them in the system, yet I still have to store the extra TF/IDF information.
And sometimes I think we want to load the whole postings into memory to
speed things up. In such cases, we really want to customize the
functionality/process of the inverted index. The main problem is that the
implementation is highly coupled with the indexing chain, so it's not easy
to write a new one. Do we have a plan to make the indexing chain easier to
change?

Flexible index chain logic, flexible codecs format.

Thanks,

On Fri, Nov 30, 2012 at 10:02 AM, Michael McCandless <
lucene@mikemccandless.com> wrote:

> On Fri, Nov 30, 2012 at 12:25 PM, Wu, Stephen T., Ph.D.
> <Wu...@mayo.edu> wrote:
> > Is there any (preliminary) code checked in somewhere that I can look at,
> > that would help me understand the practical issues that would need to be
> > addressed?
> >
> > If I understand you correctly, it's a little different from what's
> > happening in your blog posts:
> >
> > http://blog.mikemccandless.com/2012/07/building-new-lucene-postings-format.html
> >
> > http://blog.mikemccandless.com/2012/08/lucenes-new-blockpostingsformat-thanks.html
> >
> > Those posts deal with making your own codec, but not about changing
> > what's stored in the postings?  I guess I misunderstood "postings format"
> > before.
>
> I don't know of any examples of adding an entirely new attribute to
> the postings, except via payloads.
>
> All the examples we have are of Codecs/PostingsFormats/etc. storing
> all the usual attributes (term & its stats (docFreq/totalTermFreq),
> doc, freq, position, offsets, payload) in "interesting" ways.
>
> Maybe we can make this more concrete: what new attribute are you
> needing to record in the postings and access at search time?
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>

Re: What is "flexible indexing" in Lucene 4.0 if it's not the ability to make new postings codecs?

Posted by SUJIT PAL <su...@comcast.net>.

Hi Glen,

I don't believe you can attach a single payload to multiple tokens. What I did for a similar requirement was to combine the tokens into a single "_"-delimited token and attach the payload to it. For example:

The Big Bad Wolf huffed and puffed and blew the house of the Three Little Pigs down.

Now assume "Big Bad Wolf" and "Three Little Pigs" are spans to which I would like to attach payloads. I run the tokens through a custom tokenizer that produces:

The Big_Bad_Wolf$payload1 huffed and puffed and blew the house of the Three_Little_Pigs$payload2 down.

In my case this makes sense, i.e., I can treat the span as a single unit. Not sure about your use case.
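
A minimal sketch of the joining step described above, using only the JDK. The class SpanJoiner, its Span type, and the '$' delimiter are hypothetical names chosen for illustration; this is not the custom tokenizer itself, and it assumes the spans are already known as non-overlapping token ranges.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; it is not part of Lucene.
final class SpanJoiner {

    // A span covering tokens [start, end) that should carry one payload.
    static final class Span {
        final int start;
        final int end;
        final String payload;

        Span(int start, int end, String payload) {
            if (start < 0 || end <= start) {
                throw new IllegalArgumentException("span must cover at least one token");
            }
            this.start = start;
            this.end = end;
            this.payload = payload;
        }
    }

    // Rewrites each span as one "_"-joined token followed by delimiter + payload.
    // Assumes the spans do not overlap; tokens outside any span pass through unchanged.
    static String join(String[] tokens, List<Span> spans, char delimiter) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < tokens.length) {
            Span hit = null;
            for (Span s : spans) {
                if (s.start == i && s.end <= tokens.length) {
                    hit = s;
                    break;
                }
            }
            if (hit == null) {
                out.add(tokens[i]);
                i++;
            } else {
                String joined = String.join("_", Arrays.copyOfRange(tokens, hit.start, hit.end));
                out.add(joined + delimiter + hit.payload);
                i = hit.end; // skip past the tokens consumed by the span
            }
        }
        return String.join(" ", out);
    }
}
```

With spans (1, 4, "payload1") and (13, 16, "payload2") over the whitespace-split sentence above, join produces the rewritten sentence shown. A delimiter-aware filter (Lucene ships a DelimitedPayloadTokenFilter in its analysis module) could then presumably split the payload off each joined token, but that wiring is not shown here.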

HTH
Sujit

On Dec 13, 2012, at 2:08 PM, Glen Newton wrote:

> Cool! Sounds great!  :-)
> 
> Any pointers to a (Lucene) example that attaches a payload to a
> start..end span that is more than one token?
> 
> thanks,
> -Glen
> 
> On Thu, Dec 13, 2012 at 5:03 PM, Lance Norskog <go...@gmail.com> wrote:
>> I should not have added that note. The Opennlp patch gives a concrete
>> example of adding an annotation to text.
>> 
>> 
>> On 12/13/2012 01:54 PM, Glen Newton wrote:
>>> 
>>> It is not clear this is exactly what is needed/being discussed.
>>> 
>>> From the issue:
>>> "We are also planning a Tokenizer/TokenFilter that can put parts of
>>> speech as either payloads (PartOfSpeechAttribute?) on a token or at
>>> the same position."
>>> 
>>> This adds it to a token, not a span. 'same position' does not suggest
>>> it also records the end position.
>>> 
>>> -Glen
>>> 
>>> On Thu, Dec 13, 2012 at 4:45 PM, Lance Norskog <go...@gmail.com> wrote:
>>>> 
>>>> Parts-of-speech is available now, in the indexer.
>>>> 
>>>> LUCENE-2899 adds OpenNLP to the Lucene&Solr codebase. It does
>>>> parts-of-speech, chunking and Named Entity Recognition. OpenNLP is an
>>>> Apache
>>>> project for natural-language processing.
>>>> 
>>>> Some parts are in Solr that could be in Lucene.
>>>> 
>>>> https://issues.apache.org/jira/browse/lucene-2899
>>>> 
>>>> 
>>>> On 12/12/2012 12:02 PM, Wu, Stephen T., Ph.D. wrote:
>>>>>>> 
>>>>>>> Is there any (preliminary) code checked in somewhere that I can look
>>>>>>> at,
>>>>>>> that would help me understand the practical issues that would need to
>>>>>>> be
>>>>>>> addressed?
>>>>>> 
>>>>>> Maybe we can make this more concrete: what new attribute are you
>>>>>> needing to record in the postings and access at search time?
>>>>> 
>>>>> For example:
>>>>>   - part of speech of a token.
>>>>>   - syntactic parse subtree (over a span).
>>>>>   - semantically normalized phrase (to canonical text or ontological
>>>>> code).
>>>>>   - semantic group (of a span).
>>>>>   - coreference link.
>>>>> 
>>>>> stephen
>>>>> 
>>>>> 
>>>>> 
>>> 
>>> 
>> 
>> 
>> 
> 
> 
> 
> -- 
> -
> http://zzzoot.blogspot.com/
> -
> 
> 



Re: What is "flexible indexing" in Lucene 4.0 if it's not the ability to make new postings codecs?

Posted by Glen Newton <gl...@gmail.com>.

Cool! Sounds great!  :-)

Any pointers to a (Lucene) example that attaches a payload to a
start..end span that is more than one token?

thanks,
-Glen

On Thu, Dec 13, 2012 at 5:03 PM, Lance Norskog <go...@gmail.com> wrote:
> I should not have added that note. The Opennlp patch gives a concrete
> example of adding an annotation to text.
>
>
> On 12/13/2012 01:54 PM, Glen Newton wrote:
>>
>> It is not clear this is exactly what is needed/being discussed.
>>
>>  From the issue:
>> "We are also planning a Tokenizer/TokenFilter that can put parts of
>> speech as either payloads (PartOfSpeechAttribute?) on a token or at
>> the same position."
>>
>> This adds it to a token, not a span. 'same position' does not suggest
>> it also records the end position.
>>
>> -Glen
>>
>> On Thu, Dec 13, 2012 at 4:45 PM, Lance Norskog <go...@gmail.com> wrote:
>>>
>>> Parts-of-speech is available now, in the indexer.
>>>
>>> LUCENE-2899 adds OpenNLP to the Lucene&Solr codebase. It does
>>> parts-of-speech, chunking and Named Entity Recognition. OpenNLP is an
>>> Apache
>>> project for natural-language processing.
>>>
>>> Some parts are in Solr that could be in Lucene.
>>>
>>> https://issues.apache.org/jira/browse/lucene-2899
>>>
>>>
>>> On 12/12/2012 12:02 PM, Wu, Stephen T., Ph.D. wrote:
>>>>>>
>>>>>> Is there any (preliminary) code checked in somewhere that I can look
>>>>>> at,
>>>>>> that would help me understand the practical issues that would need to
>>>>>> be
>>>>>> addressed?
>>>>>
>>>>> Maybe we can make this more concrete: what new attribute are you
>>>>> needing to record in the postings and access at search time?
>>>>
>>>> For example:
>>>>    - part of speech of a token.
>>>>    - syntactic parse subtree (over a span).
>>>>    - semantically normalized phrase (to canonical text or ontological
>>>> code).
>>>>    - semantic group (of a span).
>>>>    - coreference link.
>>>>
>>>> stephen
>>>>
>>>>
>>>>
>>
>>
>
>
>



-- 
-
http://zzzoot.blogspot.com/
-


Re: What is "flexible indexing" in Lucene 4.0 if it's not the ability to make new postings codecs?

Posted by Lance Norskog <go...@gmail.com>.

I should not have added that note. The OpenNLP patch gives a concrete 
example of adding an annotation to text.

On 12/13/2012 01:54 PM, Glen Newton wrote:
> It is not clear this is exactly what is needed/being discussed.
>
>  From the issue:
> "We are also planning a Tokenizer/TokenFilter that can put parts of
> speech as either payloads (PartOfSpeechAttribute?) on a token or at
> the same position."
>
> This adds it to a token, not a span. 'same position' does not suggest
> it also records the end position.
>
> -Glen
>
> On Thu, Dec 13, 2012 at 4:45 PM, Lance Norskog <go...@gmail.com> wrote:
>> Parts-of-speech is available now, in the indexer.
>>
>> LUCENE-2899 adds OpenNLP to the Lucene&Solr codebase. It does
>> parts-of-speech, chunking and Named Entity Recognition. OpenNLP is an Apache
>> project for natural-language processing.
>>
>> Some parts are in Solr that could be in Lucene.
>>
>> https://issues.apache.org/jira/browse/lucene-2899
>>
>>
>> On 12/12/2012 12:02 PM, Wu, Stephen T., Ph.D. wrote:
>>>>> Is there any (preliminary) code checked in somewhere that I can look at,
>>>>> that would help me understand the practical issues that would need to be
>>>>> addressed?
>>>> Maybe we can make this more concrete: what new attribute are you
>>>> needing to record in the postings and access at search time?
>>> For example:
>>>    - part of speech of a token.
>>>    - syntactic parse subtree (over a span).
>>>    - semantically normalized phrase (to canonical text or ontological
>>> code).
>>>    - semantic group (of a span).
>>>    - coreference link.
>>>
>>> stephen
>>>
>>>
>>>
>
>



Re: What is "flexible indexing" in Lucene 4.0 if it's not the ability to make new postings codecs?

Posted by Glen Newton <gl...@gmail.com>.

It is not clear this is exactly what is needed/being discussed.

From the issue:
"We are also planning a Tokenizer/TokenFilter that can put parts of
speech as either payloads (PartOfSpeechAttribute?) on a token or at
the same position."

This adds it to a token, not a span. 'same position' does not suggest
it also records the end position.

-Glen

On Thu, Dec 13, 2012 at 4:45 PM, Lance Norskog <go...@gmail.com> wrote:
> Parts-of-speech is available now, in the indexer.
>
> LUCENE-2899 adds OpenNLP to the Lucene&Solr codebase. It does
> parts-of-speech, chunking and Named Entity Recognition. OpenNLP is an Apache
> project for natural-language processing.
>
> Some parts are in Solr that could be in Lucene.
>
> https://issues.apache.org/jira/browse/lucene-2899
>
>
> On 12/12/2012 12:02 PM, Wu, Stephen T., Ph.D. wrote:
>>>>
>>>> Is there any (preliminary) code checked in somewhere that I can look at,
>>>> that would help me understand the practical issues that would need to be
>>>> addressed?
>>>
>>> Maybe we can make this more concrete: what new attribute are you
>>> needing to record in the postings and access at search time?
>>
>> For example:
>>   - part of speech of a token.
>>   - syntactic parse subtree (over a span).
>>   - semantically normalized phrase (to canonical text or ontological
>> code).
>>   - semantic group (of a span).
>>   - coreference link.
>>
>> stephen
>>
>>
>>
>



-- 
-
http://zzzoot.blogspot.com/
-


Re: What is "flexible indexing" in Lucene 4.0 if it's not the ability to make new postings codecs?

Posted by Lance Norskog <go...@gmail.com>.

Part-of-speech tagging is available now, in the indexer.

LUCENE-2899 adds OpenNLP to the Lucene & Solr codebase. It does 
part-of-speech tagging, chunking, and Named Entity Recognition. OpenNLP is an 
Apache project for natural-language processing.

Some parts are in Solr that could be in Lucene.

https://issues.apache.org/jira/browse/lucene-2899

On 12/12/2012 12:02 PM, Wu, Stephen T., Ph.D. wrote:
>>> Is there any (preliminary) code checked in somewhere that I can look at,
>>> that would help me understand the practical issues that would need to be
>>> addressed?
>> Maybe we can make this more concrete: what new attribute are you
>> needing to record in the postings and access at search time?
> For example:
>   - part of speech of a token.
>   - syntactic parse subtree (over a span).
>   - semantically normalized phrase (to canonical text or ontological code).
>   - semantic group (of a span).
>   - coreference link.
>
> stephen
>
>
>

Re: What is "flexible indexing" in Lucene 4.0 if it's not the ability to make new postings codecs?

Posted by "Wu, Stephen T., Ph.D." <Wu...@mayo.edu>.

>> Is there any (preliminary) code checked in somewhere that I can look at,
>> that would help me understand the practical issues that would need to be
>> addressed?
> 
> Maybe we can make this more concrete: what new attribute are you
> needing to record in the postings and access at search time?

For example: 
 - part of speech of a token.
 - syntactic parse subtree (over a span).
 - semantically normalized phrase (to canonical text or ontological code).
 - semantic group (of a span).
 - coreference link.
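
One way a span-level annotation from this list might be squeezed into a per-token payload is to put it on the first token of the span and record how many tokens the span covers. This is only a sketch under that assumption: the class SpanAnnotation, the one-byte type code, and the five-byte layout are all hypothetical and not something Lucene defines.

```java
import java.nio.ByteBuffer;

// Hypothetical payload layout: 1 byte annotation type + 4 bytes span length in tokens.
final class SpanAnnotation {
    final byte type;           // illustrative code, e.g. 1 = semantic group
    final int lengthInTokens;  // number of positions covered, starting at the carrying token

    SpanAnnotation(byte type, int lengthInTokens) {
        if (lengthInTokens < 1) {
            throw new IllegalArgumentException("a span covers at least one token");
        }
        this.type = type;
        this.lengthInTokens = lengthInTokens;
    }

    // Serializes the annotation into the bytes that would be stored as the payload.
    byte[] toPayload() {
        return ByteBuffer.allocate(1 + Integer.BYTES).put(type).putInt(lengthInTokens).array();
    }

    // Reads an annotation back from payload bytes that begin at the given offset.
    static SpanAnnotation fromPayload(byte[] bytes, int offset) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        return new SpanAnnotation(buf.get(offset), buf.getInt(offset + 1));
    }

    // Exclusive end position of the span, given the position of the carrying token.
    int endPosition(int startPosition) {
        return startPosition + lengthInTokens;
    }
}
```

The trade-off is that the stored bytes are opaque to Lucene: any query that needs the span's end position has to decode the payload itself.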

stephen



Re: What is "flexible indexing" in Lucene 4.0 if it's not the ability to make new postings codecs?

Posted by Michael McCandless <lu...@mikemccandless.com>.

On Fri, Nov 30, 2012 at 12:25 PM, Wu, Stephen T., Ph.D.
<Wu...@mayo.edu> wrote:
> Is there any (preliminary) code checked in somewhere that I can look at,
> that would help me understand the practical issues that would need to be
> addressed?
>
> If I understand you correctly, it's a little different from what's happening
> in your blog posts:
> http://blog.mikemccandless.com/2012/07/building-new-lucene-postings-format.html
> http://blog.mikemccandless.com/2012/08/lucenes-new-blockpostingsformat-thanks.html
> Those posts deal with making your own codec, but not about changing what's
> stored in the postings?  I guess I misunderstood "postings format" before.

I don't know of any examples of adding an entirely new attribute to
the postings, except via payloads.

All the examples we have are of Codecs/PostingsFormats/etc. storing
all the usual attributes (term & its stats (docFreq/totalTermFreq),
doc, freq, position, offsets, payload) in "interesting" ways.

Maybe we can make this more concrete: what new attribute are you
needing to record in the postings and access at search time?

Mike McCandless

http://blog.mikemccandless.com


Re: What is "flexible indexing" in Lucene 4.0 if it's not the ability to make new postings codecs?

Posted by Jack Krupansky <ja...@basetechnology.com>.

"I will probably have to implement my own datastructure and 
parser/tokenizer/stemmer"

Why? I mean, I think the point of the Lucene architecture is that the codec 
level is completely independent of the analysis level.

The end result of analysis is a value to be stored from the application 
perspective, a "logical value" so to speak, but NOT the bit sequence, the 
"physical value" so to speak, that the codec will actually store.

So, go ahead and have your own codec that does whatever it wants with 
values, but the input for storage and query should be the output of a 
standard Lucene analyzer.

-- Jack Krupansky

-----Original Message----- 
From: Johannes.Lichtenberger
Sent: Friday, November 30, 2012 10:15 AM
To: java-user@lucene.apache.org
Cc: Michael McCandless
Subject: Re: What is "flexible indexing" in Lucene 4.0 if it's not the 
ability to make new postings codecs?

On 11/28/2012 01:11 AM, Michael McCandless wrote:
> Flexible indexing is the ability to make your own codec, which
> controls the reading and writing of all index parts (postings, stored
> fields, term vectors, deleted docs, etc.).
>
> So for example if you want to store some postings as a bit set instead
> of the block format that's the default coming up in 4.1, that's easy
> to do.
>
> But what is less easy (as I described below) is changing what is
> actually stored in the postings, eg adding a new per-position
> attribute.
>
> The original goal was to allow arbitrary attributes beyond the known
> docs/freqs/positions/offsets that Lucene supports today, so that you
> could easily make new application-dependent per-term, per-doc,
> per-position things, pull them from the analyzer, save them to the
> index, and access them from an IndexReader / query, but while some
> APIs do expose this, it's not very well explored yet (eg, you'd have
> to make a custom indexing chain to get the attributes "through"
> IndexWriter down to your codec).  It would be great to make progress
> making this easier, so ideas are very welcome :)

Regarding my question/thread: is it also possible to change the backend
system? I'd like to use Lucene for a versioned DBMS, so I would need
the ability to serialize/deserialize the bytes in my backend, where
keys/values are stored in pages (for instance in an upcoming B+-tree, or
in simple "unordered" pages via a record-ID/record mapping). But since no
one has suggested anything so far, and I also asked a year ago or so,
after implementing the B+-tree I will probably have to implement my own
data structure and parser/tokenizer/stemmer... :-(

kind regards,
Johannes


Re: What is "flexible indexing" in Lucene 4.0 if it's not the ability to make new postings codecs?

Posted by "Johannes.Lichtenberger" <Jo...@uni-konstanz.de>.

On 11/28/2012 01:11 AM, Michael McCandless wrote:
> Flexible indexing is the ability to make your own codec, which
> controls the reading and writing of all index parts (postings, stored
> fields, term vectors, deleted docs, etc.).
>
> So for example if you want to store some postings as a bit set instead
> of the block format that's the default coming up in 4.1, that's easy
> to do.
>
> But what is less easy (as I described below) is changing what is
> actually stored in the postings, eg adding a new per-position
> attribute.
>
> The original goal was to allow arbitrary attributes beyond the known
> docs/freqs/positions/offsets that Lucene supports today, so that you
> could easily make new application-dependent per-term, per-doc,
> per-position things, pull them from the analyzer, save them to the
> index, and access them from an IndexReader / query, but while some
> APIs do expose this, it's not very well explored yet (eg, you'd have
> to make a custom indexing chain to get the attributes "through"
> IndexWriter down to your codec).  It would be great to make progress
> making this easier, so ideas are very welcome :)

Regarding my question/thread: is it also possible to change the backend 
system? I'd like to use Lucene for a versioned DBMS, so I would need 
the ability to serialize/deserialize the bytes in my backend, where 
keys/values are stored in pages (for instance in an upcoming B+-tree, or 
in simple "unordered" pages via a record-ID/record mapping). But since no 
one has suggested anything so far, and I also asked a year ago or so, 
after implementing the B+-tree I will probably have to implement my own 
data structure and parser/tokenizer/stemmer... :-(

kind regards,
Johannes



Re: What is "flexible indexing" in Lucene 4.0 if it's not the ability to make new postings codecs?

Posted by "Wu, Stephen T., Ph.D." <Wu...@mayo.edu>.

Is there any (preliminary) code checked in somewhere that I can look at,
that would help me understand the practical issues that would need to be
addressed?

If I understand you correctly, it's a little different from what's happening
in your blog posts:
http://blog.mikemccandless.com/2012/07/building-new-lucene-postings-format.html
http://blog.mikemccandless.com/2012/08/lucenes-new-blockpostingsformat-thanks.html
Those posts deal with making your own codec, but not about changing what's
stored in the postings?  I guess I misunderstood "postings format" before.

stephen

> Flexible indexing is the ability to make your own codec, which
> controls the reading and writing of all index parts (postings, stored
> fields, term vectors, deleted docs, etc.).
> 
> So for example if you want to store some postings as a bit set instead
> of the block format that's the default coming up in 4.1, that's easy
> to do.
> 
> But what is less easy (as I described below) is changing what is
> actually stored in the postings, eg adding a new per-position
> attribute.
> 
> The original goal was to allow arbitrary attributes beyond the known
> docs/freqs/positions/offsets that Lucene supports today, so that you
> could easily make new application-dependent per-term, per-doc,
> per-position things, pull them from the analyzer, save them to the
> index, and access them from an IndexReader / query, but while some
> APIs do expose this, it's not very well explored yet (eg, you'd have
> to make a custom indexing chain to get the attributes "through"
> IndexWriter down to your codec).  It would be great to make progress
> making this easier, so ideas are very welcome :)
> 
> Mike McCandless
> 
> http://blog.mikemccandless.com
> 
> On Tue, Nov 27, 2012 at 3:37 PM, Wu, Stephen T., Ph.D.
> <Wu...@mayo.edu> wrote:
>> Following up on a previous question...
>> What is "flexible indexing" in Lucene 4.0?  We assumed it was the ability to
>> easily make new postings formats/codecs -- but a response below says that
>> would be "tricky"?
>> 
>> stephen
>> 
>> 
>> On 11/27/12 11:48 AM, "David Causse" <dc...@spotter.com> wrote:
>> 
>>> Hi,
>>> 
>>> We use payloads, but we can't use the whole Lucene API.
>>> For example, we use them to run relation queries such as:
>>> 
>>> @quote(@speaker(obama) @discourse(health))
>>> 
>>> This searches for all documents that contain a quote by Obama talking about
>>> health.
>>> We encode linguistic information (standoff annotations) inside payloads
>>> and use a custom search API to query the index.
>>> I didn't find a convenient way to attach my code to the Lucene
>>> Query/Scorer/Weight API. As with SpanQuery, you have to rewrite the whole
>>> Query stack.
>>> In short, if you want payloads to do more than boost a term, chances are
>>> you'll need to rewrite a big part of the query
>>> stack.
>>> 
>>> 
>>> Le 27/11/2012 16:59, Wu, Stephen T., Ph.D. a écrit :
>>>> I think we're looking at doing something related.  I haven't explored the
>>>> Enums or know how to make a postings codec... But what is "flexible
>>>> indexing" in Lucene 4.0 if it's not the ability to make new postings
>>>> codecs?
>>>> 
>>>> We're trying to incorporate attributes onto terms/spans in indexes.  We'd
>>>> also like to try out some interesting ways to score things that go beyond
>>>> just tokens.
>>>> 
>>>> We were considering using Attributes instead of Payloads, because it seems
>>>> like using Payloads ties you to a particular kind of scoring -- just a
>>>> weight on a token.  Can Payloads be used for more general scoring
>>>> functions?
>>>> E.g., considering a span of text alongside multiple Payloads?
>>>> 
>>>> Does it make sense to move outside of Payloads here?
>>>> 
>>>> Thanks!
>>>> 
>>>> stephen
>>>> 
>>>> 
>>>> 
>>>> 
>>>> On 11/19/12 8:14 AM, "Michael McCandless" <lu...@mikemccandless.com>
>>>> wrote:
>>>> 
>>>>> A new postings format would be tricky because you have new attributes
>>>>> you want to index.
>>>>> 
>>>>> The DocsAndPositionsEnum does have an attributes source, but this is
>>>>> not well explored, and there are known problems (they can't be easily
>>>>> merged in the composite reader case).
>>>>> 
>>>>> So that's why I suggested packing your information into a payload ...
>>>>> 
>>>>> Mike McCandless
>>>>> 
>>>>> http://blog.mikemccandless.com
>>>>> 
>>>>> On Sun, Nov 18, 2012 at 8:33 PM, wgggfiy <wu...@qq.com> wrote:
>>>>>> thx, mike.
>>>>>> about the third question: is "encode them all into the payload" better than
>>>>>> "a new postings format with the codec"?
>>>>>> I mean replacing the original posting item (position, startOffset, endOffset,
>>>>>> payload) with my own inverted item, such as
>>>>>> class TestPostingItem
>>>>>> {
>>>>>>          int termId;
>>>>>>          long startOffset;
>>>>>>          long endOffset;
>>>>>>          float score;
>>>>>>          int segId;
>>>>>>          long timeStamp;
>>>>>> }
>>>>>> ?
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> --
>>>>>> View this message in context:
>>>>>> http://lucene.472066.n3.nabble.com/what-is-the-offsets-and-payload-in-DocsAndPositionsEnum-for-tp4020933p4020968.html
>>>>>> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>> 
>> 
>> 
>> 
> 
> 



Re: What is "flexible indexing" in Lucene 4.0 if it's not the ability to make new postings codecs?

Posted by Michael McCandless <lu...@mikemccandless.com>.

Flexible indexing is the ability to make your own codec, which
controls the reading and writing of all index parts (postings, stored
fields, term vectors, deleted docs, etc.).

So for example if you want to store some postings as a bit set instead
of the block format that's the default coming up in 4.1, that's easy
to do.

But what is less easy (as I described below) is changing what is
actually stored in the postings, e.g., adding a new per-position
attribute.

The original goal was to allow arbitrary attributes beyond the known
docs/freqs/positions/offsets that Lucene supports today, so that you
could easily make new application-dependent per-term, per-doc, or
per-position things, pull them from the analyzer, save them to the
index, and access them from an IndexReader / query.  Some APIs do
expose this, but it's not very well explored yet (e.g., you'd have
to make a custom indexing chain to get the attributes "through"
IndexWriter down to your codec).  It would be great to make progress
on making this easier, so ideas are very welcome :)
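
For the payload route suggested earlier in the thread (quoted below), here is a minimal sketch of packing the extra per-position fields from the TestPostingItem example into a fixed-width byte array. The class name PostingPayload and the 16-byte big-endian layout are assumptions made for illustration. Offsets are left out because they are already among the usual postings attributes, and the decoders take an offset on the assumption that the reader hands back a slice of a larger byte array.

```java
import java.nio.ByteBuffer;

// Hypothetical fixed-width layout: float score | int segId | long timeStamp.
final class PostingPayload {
    static final int SIZE = Float.BYTES + Integer.BYTES + Long.BYTES; // 4 + 4 + 8 = 16 bytes

    // Packs the three fields into the bytes that would be stored as the payload.
    static byte[] encode(float score, int segId, long timeStamp) {
        return ByteBuffer.allocate(SIZE)
                .putFloat(score)
                .putInt(segId)
                .putLong(timeStamp)
                .array();
    }

    // Each decoder reads one field from payload bytes that begin at the given offset.
    static float score(byte[] bytes, int offset) {
        return ByteBuffer.wrap(bytes).getFloat(offset);
    }

    static int segId(byte[] bytes, int offset) {
        return ByteBuffer.wrap(bytes).getInt(offset + Float.BYTES);
    }

    static long timeStamp(byte[] bytes, int offset) {
        return ByteBuffer.wrap(bytes).getLong(offset + Float.BYTES + Integer.BYTES);
    }
}
```

A fixed-width layout keeps decoding trivial at search time; a variable-length encoding would save space at the cost of more decoding work per position.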

Mike McCandless

http://blog.mikemccandless.com

On Tue, Nov 27, 2012 at 3:37 PM, Wu, Stephen T., Ph.D.
<Wu...@mayo.edu> wrote:
> Following up on a previous question...
> What is "flexible indexing" in Lucene 4.0?  We assumed it was the ability to
> easily make new postings formats/codecs -- but a response below says that
> would be "tricky"?
>
> stephen
>
>
> On 11/27/12 11:48 AM, "David Causse" <dc...@spotter.com> wrote:
>
>> Hi,
>>
>> We use payloads, but we can't use the whole Lucene API.
>> For example, we use them to run relation queries such as:
>>
>> @quote(@speaker(obama) @discourse(health))
>>
>> This searches for all documents that contain a quote by Obama talking about
>> health.
>> We encode linguistic information (standoff annotations) inside payloads
>> and use a custom search API to query the index.
>> I didn't find a convenient way to attach my code to the Lucene
>> Query/Scorer/Weight API. As with SpanQuery, you have to rewrite the whole
>> Query stack.
>> In short, if you want payloads to do more than boost a term, chances are
>> you'll need to rewrite a big part of the query
>> stack.
>>
>>
>> Le 27/11/2012 16:59, Wu, Stephen T., Ph.D. a écrit :
>>> I think we're looking at doing something related.  I haven't explored the
>>> Enums or know how to make a postings codec... But what is "flexible
>>> indexing" in Lucene 4.0 if it's not the ability to make new postings codecs?
>>>
>>> We're trying to incorporate attributes onto terms/spans in indexes.  We'd
>>> also like to try out some interesting ways to score things that go beyond
>>> just tokens.
>>>
>>> We were considering using Attributes instead of Payloads, because it seems
>>> like using Payloads ties you to a particular kind of scoring -- just a
>>> weight on a token.  Can Payloads be used for more general scoring functions?
>>> E.g., considering a span of text alongside multiple Payloads?
>>>
>>> Does it make sense to move outside of Payloads here?
>>>
>>> Thanks!
>>>
>>> stephen
>>>
>>>
>>>
>>>
>>> On 11/19/12 8:14 AM, "Michael McCandless" <lu...@mikemccandless.com> wrote:
>>>
>>>> A new postings format would be tricky because you have new attributes
>>>> you want to index.
>>>>
>>>> The DocsAndPositionsEnum does have an attributes source, but this is
>>>> not well explored, and there are known problems (they can't be easily
>>>> merged in the composite reader case).
>>>>
>>>> So that's why I suggested packing your information into a payload ...
>>>>
>>>> Mike McCandless
>>>>
>>>> http://blog.mikemccandless.com
>>>>
>>>> On Sun, Nov 18, 2012 at 8:33 PM, wgggfiy <wu...@qq.com> wrote:
>>>>> thx, mike.
>>>>> about the third question: is "encode them all into the payload" better than
>>>>> "a new postings format with the codec"?
>>>>> I mean replacing the original posting item (position, startOffset, endOffset,
>>>>> payload) with my own inverted item, such as
>>>>> class TestPostingItem
>>>>> {
>>>>>          int termId;
>>>>>          long startOffset;
>>>>>          long endOffset;
>>>>>          float score;
>>>>>          int segId;
>>>>>          long timeStamp;
>>>>> }
>>>>> ?
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> View this message in context:
>>>>> http://lucene.472066.n3.nabble.com/what-is-the-offsets-and-payload-in-DocsAndPositionsEnum-for-tp4020933p4020968.html
>>>>> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>>>>>
>>>>>
>>>>
>>>
>>>
>>>
>>>
>>
>
>
>


What is "flexible indexing" in Lucene 4.0 if it's not the ability to make new postings codecs?

Posted by "Wu, Stephen T., Ph.D." <Wu...@mayo.edu>.

Following up on a previous question...
What is "flexible indexing" in Lucene 4.0?  We assumed it was the ability to
easily make new postings formats/codecs -- but a response below says that
would be "tricky"?

stephen


On 11/27/12 11:48 AM, "David Causse" <dc...@spotter.com> wrote:

> Hi,
> 
> We use payloads, but we can't use the whole Lucene API.
> For example, we use them to run relation queries such as:
> 
> @quote(@speaker(obama) @discourse(health))
> 
> This searches for all documents that contain a quote by Obama talking about
> health.
> We encode linguistic information (standoff annotations) inside payloads
> and use a custom search API to query the index.
> I didn't find a convenient way to attach my code to the Lucene
> Query/Scorer/Weight API. As with SpanQuery, you have to rewrite the whole
> Query stack.
> In short, if you want payloads to do more than boost a term, chances are
> you'll need to rewrite a big part of the query
> stack.
> 
> 
> Le 27/11/2012 16:59, Wu, Stephen T., Ph.D. a écrit :
>> I think we're looking at doing something related.  I haven't explored the
>> Enums or know how to make a postings codec... But what is "flexible
>> indexing" in Lucene 4.0 if it's not the ability to make new postings codecs?
>> 
>> We're trying to incorporate attributes onto terms/spans in indexes.  We'd
>> also like to try out some interesting ways to score things that go beyond
>> just tokens.
>> 
>> We were considering using Attributes instead of Payloads, because it seems
>> like using Payloads ties you to a particular kind of scoring -- just a
>> weight on a token.  Can Payloads be used for more general scoring functions?
>> E.g., considering a span of text alongside multiple Payloads?
>> 
>> Does it make sense to move outside of Payloads here?
>> 
>> Thanks!
>> 
>> stephen
>> 
>> 
>> 
>> 
>> On 11/19/12 8:14 AM, "Michael McCandless" <lu...@mikemccandless.com> wrote:
>> 
>>> A new postings format would be tricky because you have new attributes
>>> you want to index.
>>> 
>>> The DocsAndPositionsEnum does have an attributes source, but this is
>>> not well explored, and there are known problems (they can't be easily
>>> merged in the composite reader case).
>>> 
>>> So that's why I suggested packing your information into a payload ...
>>> 
>>> Mike McCandless
>>> 
>>> http://blog.mikemccandless.com
>>> 
>>> On Sun, Nov 18, 2012 at 8:33 PM, wgggfiy <wu...@qq.com> wrote:
>>>> thx, mike.
>>>> about the 3th question, "encode them all into the payload" is better than
>>>> "a new postings format with the codec" ??
>>>> I mean replace the orginal posting item (position, startOffset, endOffset,
>>>> payload) with my own inverted item such as
>>>> class TestPostingItem
>>>> {
>>>>          int termId;
>>>>          long startOffset;
>>>>          long endOffset;
>>>>          float score;
>>>>          int segId;
>>>>          long timeStamp;
>>>> }
>>>> ?
>>>> 
>>>> 
>>>> 
>>>> 
>>>> --
>>>> View this message in context:
>>>> http://lucene.472066.n3.nabble.com/what-is-the-offsets-and-payload-in-DocsA
>>>> nd
>>>> PositionsEnum-for-tp4020933p4020968.html
>>>> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>>>> 
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>> 
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>> 
>> 
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>> 
>> 
>> 
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: what is the offsets and payload in DocsAndPositionsEnum for ??

Posted by David Causse <dc...@spotter.com>.

Hi,

We use payloads, but we can't use the whole Lucene API with them.
For example, we use them for relation queries such as:

@quote(@speaker(obama) @discourse(health))

This searches for all documents that contain a quote by Obama talking
about health.
We encode linguistic information (standoff annotations) inside payloads
and use a custom search API to query the index.
I didn't find a convenient way to attach my code to Lucene's
Query/Scorer/Weight API. As with SpanQuery, you have to rewrite the whole
Query stack.
In short, if you want payloads to do more than boost a term, chances are
you'll need to rewrite a big part of the query stack.
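
To illustrate what such a relation query has to check, here is a minimal
plain-Java sketch. It is only an assumption about how annotations could be
laid out, not the actual encoding or search API described above, and it
uses no Lucene classes. AnnotatedToken, its role and quoteId fields, and
quotesBySpeakerAbout are hypothetical names: each token is assumed to carry
(decoded from its payload) an annotation role and the id of the enclosing
quote.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class RelationQuerySketch {
    /** One decoded posting: a term, its annotation role, and the id of the enclosing quote (all hypothetical). */
    static final class AnnotatedToken {
        final String term;
        final String role;
        final int quoteId;

        AnnotatedToken(String term, String role, int quoteId) {
            this.term = term;
            this.role = role;
            this.quoteId = quoteId;
        }
    }

    /** Returns the ids of quotes that contain the speaker term and the discourse term together. */
    static Set<Integer> quotesBySpeakerAbout(List<AnnotatedToken> tokens, String speaker, String topic) {
        Set<Integer> withSpeaker = new HashSet<>();
        Set<Integer> withTopic = new HashSet<>();
        for (AnnotatedToken t : tokens) {
            if (t.role.equals("speaker") && t.term.equals(speaker)) {
                withSpeaker.add(t.quoteId);
            }
            if (t.role.equals("discourse") && t.term.equals(topic)) {
                withTopic.add(t.quoteId);
            }
        }
        // Keep only quotes where both conditions hold.
        withSpeaker.retainAll(withTopic);
        return withSpeaker;
    }

    public static void main(String[] args) {
        List<AnnotatedToken> tokens = new ArrayList<>();
        tokens.add(new AnnotatedToken("obama", "speaker", 1));
        tokens.add(new AnnotatedToken("health", "discourse", 1));
        tokens.add(new AnnotatedToken("obama", "speaker", 2));
        tokens.add(new AnnotatedToken("taxes", "discourse", 2));
        // Roughly @quote(@speaker(obama) @discourse(health))
        System.out.println(quotesBySpeakerAbout(tokens, "obama", "health"));
    }
}
```

The point of the sketch is that the match depends on a relation between
payloads of different tokens, which is why a per-term payload boost is not
enough and custom query code is needed.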



-- 
David Causse
Spotter
http://www.spotter.com/



Re: what is the offsets and payload in DocsAndPositionsEnum for ??

Posted by "Wu, Stephen T., Ph.D." <Wu...@mayo.edu>.

I think we're looking at doing something related.  I haven't explored the
Enums and don't know how to make a postings codec... But what is "flexible
indexing" in Lucene 4.0 if it's not the ability to make new postings codecs?

We're trying to incorporate attributes onto terms/spans in indexes.  We'd
also like to try out some interesting ways to score things that go beyond
just tokens. 

We were considering using Attributes instead of Payloads, because it seems
like using Payloads ties you to a particular kind of scoring -- just a
weight on a token.  Can Payloads be used for more general scoring functions?
E.g., considering a span of text alongside multiple Payloads?

Does it make sense to move outside of Payloads here?

Thanks!

stephen


Re: what is the offsets and payload in DocsAndPositionsEnum for ??

Posted by Michael McCandless <lu...@mikemccandless.com>.

A new postings format would be tricky because you have new attributes
you want to index.

The DocsAndPositionsEnum does have an attributes source, but this is
not well explored, and there are known problems (they can't be easily
merged in the composite reader case).

So that's why I suggested packing your information into a payload ...

Mike McCandless

http://blog.mikemccandless.com


Re: what is the offsets and payload in DocsAndPositionsEnum for ??

Posted by wgggfiy <wu...@qq.com>.

Thanks, Mike.
About the 3rd question: is "encode them all into the payload" better than
"a new postings format with the codec"?
I mean, could I replace the original posting item (position, startOffset,
endOffset, payload) with my own inverted item such as
class TestPostingItem
{
        int termId;
        long startOffset;
        long endOffset;
        float score;
        int segId;
        long timeStamp;
}
?





Re: what is the offsets and payload in DocsAndPositionsEnum for ??

Posted by Michael McCandless <lu...@mikemccandless.com>.

On Sun, Nov 18, 2012 at 12:09 PM, wgggfiy <wu...@qq.com> wrote:
> I'm now studying lucene 4.0.
> 1, what is the startOffset and endOffset for ? is there a code example ?

These are set by the analyzer, to the start and end character offset
for this token (using the OffsetAttribute).  The offsets are used for
highlighting.
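
To make the offsets concrete, here is a minimal plain-Java sketch. It does
not use Lucene classes: Token, tokenize and highlight are hypothetical
names, and a whitespace splitter stands in for a real analyzer. It records
the start and end character offset of each token, then uses only those
offsets to mark a hit in the original text, which is the job offsets do for
highlighting.

```java
import java.util.ArrayList;
import java.util.List;

public class OffsetSketch {
    /** A token with its start (inclusive) and end (exclusive) character offsets in the source text. */
    static final class Token {
        final String text;
        final int startOffset;
        final int endOffset;

        Token(String text, int startOffset, int endOffset) {
            this.text = text;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }
    }

    /** Splits on whitespace and records offsets, standing in for what an analyzer reports per token. */
    static List<Token> tokenize(String input) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            while (i < input.length() && Character.isWhitespace(input.charAt(i))) {
                i++;
            }
            int start = i;
            while (i < input.length() && !Character.isWhitespace(input.charAt(i))) {
                i++;
            }
            if (i > start) {
                tokens.add(new Token(input.substring(start, i), start, i));
            }
        }
        return tokens;
    }

    /** Wraps each occurrence of the term in [brackets], locating it only through the stored offsets. */
    static String highlight(String input, List<Token> tokens, String term) {
        StringBuilder out = new StringBuilder();
        int last = 0;
        for (Token t : tokens) {
            if (t.text.equalsIgnoreCase(term)) {
                out.append(input, last, t.startOffset);
                out.append('[').append(input, t.startOffset, t.endOffset).append(']');
                last = t.endOffset;
            }
        }
        return out.append(input, last, input.length()).toString();
    }

    public static void main(String[] args) {
        String text = "I am studying lucene offsets";
        // prints: I am studying [lucene] offsets
        System.out.println(highlight(text, tokenize(text), "lucene"));
    }
}
```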

> 2, what is payload ? I know just a little about it, and it can be used for
> things like font weight, or XML enclosing tag.

It's an arbitrary per-token-position byte[] that you set during
analysis (using the PayloadAttribute).

> 3, I have an item like (lucene, 350, 450, 33.2, 2), where 350,450 is the
> offset of the term 'lucene', and 33.2 is a score, and 2 is some id, my
> question is how I can make it indexed ?
> my first idea is to implement my own posting list format, but is it possible
> to make it with the startOffset, endOffset and payload ?

You should probably encode them all into the payload; Lucene requires
that the offsets are "in order".
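
As a rough sketch of what "encode them all into the payload" could look
like for the item (lucene, 350, 450, 33.2, 2): the fixed 16-byte layout and
the names below (PayloadPackingSketch, pack, and the accessors) are
assumptions for illustration, not a Lucene API. Only the JDK's ByteBuffer
is used. In Lucene 4.0 the resulting byte[] would typically be wrapped in a
BytesRef and set through the PayloadAttribute during analysis; that wiring
is left out here.

```java
import java.nio.ByteBuffer;

public class PayloadPackingSketch {
    // Assumed layout: int startOffset, int endOffset, float score, int id (big-endian).
    static final int PAYLOAD_BYTES = 4 + 4 + 4 + 4;

    /** Packs the custom per-position values into one byte[] suitable for use as a payload. */
    static byte[] pack(int startOffset, int endOffset, float score, int id) {
        return ByteBuffer.allocate(PAYLOAD_BYTES)
                .putInt(startOffset)
                .putInt(endOffset)
                .putFloat(score)
                .putInt(id)
                .array();
    }

    // Accessors read each field back from its fixed position in the payload.
    static int startOffset(byte[] payload) {
        return ByteBuffer.wrap(payload).getInt(0);
    }

    static int endOffset(byte[] payload) {
        return ByteBuffer.wrap(payload).getInt(4);
    }

    static float score(byte[] payload) {
        return ByteBuffer.wrap(payload).getFloat(8);
    }

    static int id(byte[] payload) {
        return ByteBuffer.wrap(payload).getInt(12);
    }

    public static void main(String[] args) {
        byte[] payload = pack(350, 450, 33.2f, 2);
        System.out.println(payload.length + " bytes: " + startOffset(payload) + ", "
                + endOffset(payload) + ", " + score(payload) + ", " + id(payload));
    }
}
```

Because these values live in the payload rather than in Lucene's own offset
fields, they are not subject to the "in order" requirement on offsets.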

Mike McCandless

http://blog.mikemccandless.com
