Posted to dev@lucene.apache.org by Michael Busch <bu...@gmail.com> on 2006/12/20 15:19:18 UTC
Payloads
Hi all,
currently it is not possible to add generic payloads to a posting list.
However, this feature would be useful for various use cases. Some examples:
- XML search
to index XML documents and allow structured search (e.g. XPath), it is
necessary to store the depth of the terms
- part-of-speech
payloads can be used to store the part of speech of a term occurrence
- term boost
for terms that occur e.g. in bold font a payload containing a boost
value can be stored
- ...
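To make the term-boost case concrete, a per-occurrence boost could be packed into the byte[] that the proposed API carries. A minimal sketch; the BoostPayload class and its method names are hypothetical, not part of the patch:

```java
import java.nio.ByteBuffer;

// Hypothetical helper: packs a per-occurrence boost into the 4-byte
// array that a Token payload would carry, and unpacks it again on the
// search side. Only the byte[] itself is part of the proposal.
public class BoostPayload {

    public static byte[] encode(float boost) {
        return ByteBuffer.allocate(4).putFloat(boost).array();
    }

    public static float decode(byte[] payload) {
        return ByteBuffer.wrap(payload).getFloat();
    }

    public static void main(String[] args) {
        byte[] payload = encode(2.5f); // e.g. boost for a term in bold font
        System.out.println(payload.length); // 4 bytes per occurrence
        System.out.println(decode(payload));
    }
}
```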
The payloads feature has been requested and discussed several times,
e.g. in
- http://www.gossamer-threads.com/lists/lucene/java-dev/29465
- http://www.gossamer-threads.com/lists/lucene/java-dev/37409
In the latter thread I proposed a design a couple of months ago that
adds to Lucene the possibility of storing variable-length payloads
inline in the posting list of a term. However, this design had some
drawbacks: the already complex Field API was extended, and the payload
encoding was not optimal in terms of disk space. Furthermore, overall
Lucene runtime performance suffered due to the growth of the .prx file.
In the meantime the patch LUCENE-687 (Lazy skipping on proximity file)
was committed, which reduces the number of reads and seeks on the .prx
file and thus minimizes the performance degradation of a bigger .prx
file. Also, LUCENE-695 (Improve BufferedIndexInput.readBytes()
performance) was committed, which speeds up reading mid-size chunks of
bytes; this is beneficial for payloads that are bigger than just a few
bytes.
Some weeks ago I started working on an improved design which I would
like to propose now. The new design simplifies the API extensions (the
Field API remains unchanged) and uses less disk space in most use cases.
Now there are only two classes that get new methods:
- Token.setPayload()
Use this method to add arbitrary metadata to a Token in the form of a
byte[] array.
- TermPositions.getPayload()
Use this method to retrieve the payload of a term occurrence.
The implementation is very flexible: the user does not have to enable
payloads explicitly for a field and can add payloads to all, some, or no
Tokens. Due to the improved encoding, those use cases are handled
efficiently in terms of disk space.
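The actual encoding will be documented in the JIRA issue; purely as an assumed illustration of how variable-length payloads can be stored compactly, each payload could be prefixed with its length as a VInt (Lucene's variable-length integer), so small payloads cost only one extra byte:

```java
import java.io.ByteArrayOutputStream;

// Illustrative sketch only -- not the file format of the patch. A VInt
// stores 7 bits per byte and uses the high bit as a continuation flag,
// so lengths below 128 need a single prefix byte.
public class VIntPayloads {

    static void writeVInt(ByteArrayOutputStream out, int i) {
        while ((i & ~0x7F) != 0) {
            out.write((i & 0x7F) | 0x80); // low 7 bits, continuation set
            i >>>= 7;
        }
        out.write(i); // final byte, continuation clear
    }

    // Serializes one payload as [VInt length][payload bytes].
    public static byte[] withLengthPrefix(byte[] payload) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeVInt(out, payload.length);
        out.write(payload, 0, payload.length);
        return out.toByteArray();
    }
}
```

Under this sketch a zero-length payload costs one byte; the patch's real encoding may well be cleverer (for instance, sharing the length across equal-length payloads), which is exactly the kind of detail the JIRA issue will spell out.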
Another thing I would like to point out is that this feature is
backwards compatible, meaning that the file format only changes if the
user explicitly adds payloads to the index. If no payloads are used, all
data structures remain unchanged.
I'm going to open a new JIRA issue soon containing the patch and details
about implementation and file format changes.
One more comment: It is a rather big patch and this is the initial
version, so I'm sure there will be a lot of discussion. I would like to
encourage people who consider this feature useful to try it out and
give me some feedback about possible improvements.
Best regards,
- Michael
---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org
Re: Payloads
Posted by Michael Busch <bu...@gmail.com>.
Grant Ingersoll wrote:
> Couldn't agree more. This is good progress.
>
> I like the payloads patch, but I would like to see the lazy prox
> stream (Lucene 761) stuff done (or at least details given on it) so
> that we can hook this into Similarity so that it can be hooked into
> scoring. For 761 and the payload stuff, we need to make sure we do
> some benchmarking tests (see Doron's latest contribution under
> contrib/Benchmark for some cool tools to help w/ benchmarking)
>
> If you can do 761, I can then merge the two and then I can put up a
> patch for review that hooks in the scoring/Similarity idea that I
> _think_ will work and will allow a payload scoring factor to be
> calculated into the TermScorer and will be backward compatible and
> would allow people to score payloads w/o having to change very much.
>
> -Grant
Yep, makes sense, Grant. I'm going to work on 761 in the next few days...
Re: Payloads
Posted by Grant Ingersoll <gs...@apache.org>.
Couldn't agree more. This is good progress.
I like the payloads patch, but I would like to see the lazy prox
stream (Lucene 761) stuff done (or at least details given on it) so
that we can hook this into Similarity so that it can be hooked into
scoring. For 761 and the payload stuff, we need to make sure we do
some benchmarking tests (see Doron's latest contribution under
contrib/Benchmark for some cool tools to help w/ benchmarking)
If you can do 761, I can then merge the two and then I can put up a
patch for review that hooks in the scoring/Similarity idea that I
_think_ will work and will allow a payload scoring factor to be
calculated into the TermScorer and will be backward compatible and
would allow people to score payloads w/o having to change very much.
-Grant
On Jan 18, 2007, at 11:31 AM, Michael Busch wrote:
> Grant Ingersoll wrote:
>> Just to put in two cents: the Flexible Indexing thread has also
>> talked about the notion of being able to store arbitrary data at:
>> token, field, doc and Index level.
>>
>> -Grant
>>
>
> Yes I agree that this should be the long-term goal. The payload
> feature is just a first step in the direction of a flexible index
> format. I think it makes sense to add new functions incrementally,
> as long as we try to only extend the API in a way, so that it is
> compatible with the long-term goal, as Doug suggested already.
> After the payload patch is committed we can work on a more
> sophisticated per-doc-metadata solution. Until then we can use
> payloads for that use case. Flexible indexing is very complex and
> progress is progress... :-)
>
>
--------------------------
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org
Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ
Re: Payloads
Posted by Marvin Humphrey <ma...@rectangular.com>.
On Jan 18, 2007, at 8:59 AM, Grant Ingersoll wrote:
> I think one thing that would really bolster the flex. indexing
> format changes would be to have someone write another
> implementation for it so that we can iron out any interface details
> that may be needed. For instance, maybe the Kino merge model?
Workin' on it. Subversion trunk for KS uses a unified postings
format, including per-position boost. Today I'm attempting to adapt
PostingsWriter, SegTermDocs (soon to be renamed PostingList) and the
scorers to deal with any logical combination of store_field_boost,
store_freq, store_position, and store_boost. I may need to work up
the position-aware coordinator for BooleanScorer before long, because
that will also have to tolerate multiple postings formats.
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
Re: Payloads
Posted by Grant Ingersoll <gs...@apache.org>.
I agree (and this has been discussed on this very thread in the past;
see Doug's comments). I would love to have someone take a look at
the flexible indexing patch that was submitted. I have looked a
little at it, but it is going to need more than just me, since it is a
big change (although it is backwards compatible, I believe). It needs
to be benchmarked, tested in threads, etc., so it may be a while before
we get to the flexible format. Thus, it _may_ make sense to put in
payloads first, mark them as "developer beware" in the comments, and
let them be tested in the real world.
I think one thing that would really bolster the flex. indexing format
changes would be to have someone write another implementation for it
so that we can iron out any interface details that may be needed.
For instance, maybe the Kino merge model?
-Grant
On Jan 18, 2007, at 11:45 AM, Marvin Humphrey wrote:
>
> On Jan 18, 2007, at 8:31 AM, Michael Busch wrote:
>
>> I think it makes sense to add new functions incrementally, as long
>> as we try to only extend the API in a way, so that it is
>> compatible with the long-term goal, as Doug suggested already.
>> After the payload patch is committed we can work on a more
>> sophisticated per-doc-metadata solution. Until then we can use
>> payloads for that use case.
>
> I respectfully disagree with this plan.
>
> APIs are forever, implementations are ephemeral.
>
> By making a public API available for one aspect of the flexible
> indexing format, we limit our ability to change our minds about how
> that API should look later when we discover a more harmonious
> solution.
>
> If we're going to go the incremental route, IMO any API should be
> marked as experimental, or better, made private so that we can toy
> with it "in-house" on Lucene's innards, auditioning the changes
> before finalizing the API.
>
> Marvin Humphrey
> Rectangular Research
> http://www.rectangular.com/
>
>
>
>
Re: Payloads
Posted by Michael Busch <bu...@gmail.com>.
Marvin Humphrey wrote:
>
> On Jan 18, 2007, at 8:31 AM, Michael Busch wrote:
>
>> I think it makes sense to add new functions incrementally, as long as
>> we try to only extend the API in a way, so that it is compatible with
>> the long-term goal, as Doug suggested already. After the payload
>> patch is committed we can work on a more sophisticated
>> per-doc-metadata solution. Until then we can use payloads for that
>> use case.
>
I think my comment was a bit confusing. The main intention of payloads
is to store per-term metadata. However, with the workaround Nadav
suggested it is also possible to use them for per-doc metadata, by
simply storing only one token per document in a special field. This
solution works but is probably not the nicest. Still, why not use this
workaround, as long as the payloads patch does not introduce a
per-doc-metadata API that would have to be removed or changed once we
come up with a dedicated implementation for that use case? With the
payloads patch I tried to keep the API changes as simple as possible
(changes are only made to Token and TermPositions). These changes are
under discussion in this thread, with the intention of making them
compatible with the flexible-indexing API. I couldn't agree more that
the API has to be well planned, and I'd love to see your comments about
the API extensions I suggested, Marvin.
> I respectfully disagree with this plan.
>
> APIs are forever, implementations are ephemeral.
>
> By making a public API available for one aspect of the flexible
> indexing format, we limit our ability to change our minds about how
> that API should look later when we discover a more harmonious solution.
>
> If we're going to go the incremental route, IMO any API should be
> marked as experimental, or better, made private so that we can toy
> with it "in-house" on Lucene's innards, auditioning the changes before
> finalizing the API.
>
> Marvin Humphrey
> Rectangular Research
> http://www.rectangular.com/
>
I certainly agree with your suggestion to mark new APIs as experimental.
Then people would know that the API may change in the future and could
use it in their apps at their own risk. At the same time we would
benefit from valuable feedback from those users, which would help us
perfect the API. The idea of a flexible index format is already about a
year old, I think, and at least in Java-Lucene no progress has been made
yet. So I'm all for the incremental approach, while carefully marking
new APIs as experimental.
Re: Payloads
Posted by Marvin Humphrey <ma...@rectangular.com>.
On Jan 18, 2007, at 8:31 AM, Michael Busch wrote:
> I think it makes sense to add new functions incrementally, as long
> as we try to only extend the API in a way, so that it is compatible
> with the long-term goal, as Doug suggested already. After the
> payload patch is committed we can work on a more sophisticated per-
> doc-metadata solution. Until then we can use payloads for that use
> case.
I respectfully disagree with this plan.
APIs are forever, implementations are ephemeral.
By making a public API available for one aspect of the flexible
indexing format, we limit our ability to change our minds about how
that API should look later when we discover a more harmonious solution.
If we're going to go the incremental route, IMO any API should be
marked as experimental, or better, made private so that we can toy
with it "in-house" on Lucene's innards, auditioning the changes
before finalizing the API.
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
Re: Payloads
Posted by Michael Busch <bu...@gmail.com>.
Grant Ingersoll wrote:
> Just to put in two cents: the Flexible Indexing thread has also talked
> about the notion of being able to store arbitrary data at: token,
> field, doc and Index level.
>
> -Grant
>
Yes I agree that this should be the long-term goal. The payload feature
is just a first step in the direction of a flexible index format. I
think it makes sense to add new functions incrementally, as long as we
try to only extend the API in a way, so that it is compatible with the
long-term goal, as Doug suggested already. After the payload patch is
committed we can work on a more sophisticated per-doc-metadata solution.
Until then we can use payloads for that use case. Flexible indexing is
very complex and progress is progress... :-)
Re: Payloads
Posted by Grant Ingersoll <gs...@apache.org>.
Just to put in two cents: the Flexible Indexing thread has also
talked about the notion of being able to store arbitrary data at:
token, field, doc and Index level.
-Grant
On Jan 18, 2007, at 11:01 AM, Nadav Har'El wrote:
> On Thu, Jan 18, 2007, Michael Busch wrote about "Re: Payloads":
>> As you pointed out it is still possible to have per-doc payloads. You
>> need an analyzer which adds just one Token with payload to a specific
>> field for each doc. I understand that this code would be quite
>> ugly on
>> the app side. A more elegant solution might be LUCENE-580. With that
>> patch you are able to add pre-analyzed fields (i. e. TokenStreams)
>> to a
>> Document without having to use an analyzer. You could use a
>> TokenStream
>
> Thanks, this sounds like a good idea.
>
> In fact, I could live with something even simpler: I want to be able
> to create a Field with a single token (with its payload). If I need
> more
> than one of these tokens with payloads, I can just add several
> fields with
> the same name (this should work, although the description of
> LUCENE-580
> suggests that it might have a bug in this area).
>
> I'll add a comment about this use-case to LUCENE-580.
>
> --
> Nadav Har'El                   | Thursday, Jan 18 2007, 28 Tevet 5767
> IBM Haifa Research Lab         |-----------------------------------------
>                                |If glory comes after death, I'm not in a
> http://nadav.harel.org.il      |hurry. (Latin proverb)
>
>
Re: Payloads
Posted by Michael Busch <bu...@gmail.com>.
Nadav Har'El wrote:
> On Thu, Jan 18, 2007, Michael Busch wrote about "Re: Payloads":
>
>> As you pointed out it is still possible to have per-doc payloads. You
>> need an analyzer which adds just one Token with payload to a specific
>> field for each doc. I understand that this code would be quite ugly on
>> the app side. A more elegant solution might be LUCENE-580. With that
>> patch you are able to add pre-analyzed fields (i. e. TokenStreams) to a
>> Document without having to use an analyzer. You could use a TokenStream
>>
>
> Thanks, this sounds like a good idea.
>
> In fact, I could live with something even simpler: I want to be able
> to create a Field with a single token (with its payload). If I need more
> than one of these tokens with payloads, I can just add several fields with
> the same name (this should work, although the description of LUCENE-580
> suggests that it might have a bug in this area).
>
> I'll add a comment about this use-case to LUCENE-580.
>
>
Yes, for your use case it would indeed make sense to just add a single
Token to a field. But there are other use cases that would benefit from
580, e.g. using UIMA as a parser. UIMA does not work per field; it
materializes the tokens of all fields in a CAS. So the indexer can't
call the parser per field; the parsing has to be done before indexing.
It would therefore make sense to do the parsing and then add
TokenStreams for the different fields to the Document that simply
iterate through the CAS. This is of course also possible by adding
multiple Field instances containing single Tokens to a Document, but
performance would suffer: each Token would be wrapped in a Field object
and then held in a list in the Document.
Re: Payloads
Posted by Nadav Har'El <ny...@math.technion.ac.il>.
On Thu, Jan 18, 2007, Michael Busch wrote about "Re: Payloads":
> As you pointed out it is still possible to have per-doc payloads. You
> need an analyzer which adds just one Token with payload to a specific
> field for each doc. I understand that this code would be quite ugly on
> the app side. A more elegant solution might be LUCENE-580. With that
> patch you are able to add pre-analyzed fields (i. e. TokenStreams) to a
> Document without having to use an analyzer. You could use a TokenStream
Thanks, this sounds like a good idea.
In fact, I could live with something even simpler: I want to be able
to create a Field with a single token (with its payload). If I need more
than one of these tokens with payloads, I can just add several fields with
the same name (this should work, although the description of LUCENE-580
suggests that it might have a bug in this area).
I'll add a comment about this use-case to LUCENE-580.
--
Nadav Har'El | Thursday, Jan 18 2007, 28 Tevet 5767
IBM Haifa Research Lab |-----------------------------------------
|If glory comes after death, I'm not in a
http://nadav.harel.org.il |hurry. (Latin proverb)
Re: Payloads
Posted by Michael Busch <bu...@gmail.com>.
Nadav Har'El wrote:
>
> Hi Michael,
>
> For some uses (e.g., faceted search), one wants to add a payload to each
> document, not per position for some text field. In the faceted search example,
> we could use payloads to encode the list of facets that each document
> belongs to. For this, with the old API, you could have added a fixed term
> to an untokenized field, and add a payload to that entire untokenized field.
>
> With the new API, it seems doing this is much more difficult and requires
> writing some sort of new Analyzer - one that will do the regular analysis
> that I want for the regular fields, and add the payload to the one specific
> field that lists the facets.
> Am I understanding correctly? Or am I missing a better way to do this?
>
> Thanks,
> Nadav.
>
>
Hi Nadav,
you are referring to the first design I proposed in
http://www.gossamer-threads.com/lists/lucene/java-dev/37409
In that design I indeed had a method
public Field(String name, String value, Store store, Index index,
TermVector termVector, Payload payload);
which makes it easy to add a Payload without having to implement an
Analyzer. The Field API is already complex; that's the reason I removed
this method in the new payloads version. And in this thread we're also
discussing making the Token API more flexible, so that it will be easier
to add more functionality in the future.
As you pointed out it is still possible to have per-doc payloads. You
need an analyzer which adds just one Token with payload to a specific
field for each doc. I understand that this code would be quite ugly on
the app side. A more elegant solution might be LUCENE-580. With that
patch you are able to add pre-analyzed fields (i. e. TokenStreams) to a
Document without having to use an analyzer. You could use a TokenStream
implementation that emits only one Token. That would be a very simple
class. Another benefit is that whenever we add more functionality to
Token, we would not have to provide yet another Field constructor. Do
you think this makes sense? I haven't looked at the LUCENE-580 code, and
it probably needs to be updated since it is some months old, but I like
the idea.
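A sketch of that single-token stream; the Token and TokenStream classes below are simplified stand-ins that only mimic the shape of the Lucene analysis API (in real code you would use the classes in org.apache.lucene.analysis):

```java
// Stand-in for org.apache.lucene.analysis.Token, reduced to the parts
// relevant here: a term text and the proposed byte[] payload.
class Token {
    final String termText;
    byte[] payload;

    Token(String termText) { this.termText = termText; }

    void setPayload(byte[] payload) { this.payload = payload; }
}

// Stand-in for the TokenStream contract: next() returns tokens until
// the stream is exhausted, then null.
abstract class TokenStream {
    abstract Token next();
}

// The "very simple class" suggested above: emits exactly one Token
// carrying the per-document payload, e.g. for a facets field.
class SingleTokenStream extends TokenStream {
    private Token token;

    SingleTokenStream(String termText, byte[] payload) {
        token = new Token(termText);
        token.setPayload(payload);
    }

    Token next() {
        Token t = token;
        token = null; // exhausted after the first call
        return t;
    }
}
```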
Re: Payloads
Posted by Nadav Har'El <ny...@math.technion.ac.il>.
On Mon, Jan 08, 2007, Nicolas Lalevée wrote about "Re: Payloads":
> I have looked closer at how Lucene indexes, and I realized that the
> kind of payload handling in Michael's patch is not designed for the
> facet feature. In this patch, the payloads are in the postings, i.e. in
> the .tis, .frq, and .prx files. A payload at the document level, which
> would be accessed in a scorer, would be better placed in the term
> vector files, which are ordered by doc and not by term.
Well, it's sort of the same thing... Michael's patch allows putting
payloads at each position in a posting list; if you create a posting
list which has just one position per doc, you have basically created a
per-doc payload, ordered by doc (like all posting lists).
And creating this posting list is easy: just pick an arbitrary field
name F and an arbitrary word W, and index the term (F, W) with the
payload you want for each document (basically, the list of categories
that the document belongs to).
I'm not saying this is the best way to do it, and certainly not the cleanest,
but it's just one of the things that payloads enable you to do.
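Purely as an illustration of that workaround, the per-document payload could pack the document's numeric category ids as fixed-width shorts (the FacetPayload class and its layout are hypothetical, not taken from the patch):

```java
import java.nio.ByteBuffer;

// Hypothetical facet payload: the single term (F, W) indexed once per
// document carries this byte[], packing each category id into 2 bytes.
public class FacetPayload {

    public static byte[] encode(short[] categoryIds) {
        ByteBuffer buf = ByteBuffer.allocate(2 * categoryIds.length);
        for (short id : categoryIds) {
            buf.putShort(id);
        }
        return buf.array();
    }

    public static short[] decode(byte[] payload) {
        ByteBuffer buf = ByteBuffer.wrap(payload);
        short[] ids = new short[payload.length / 2];
        for (int i = 0; i < ids.length; i++) {
            ids[i] = buf.getShort();
        }
        return ids;
    }
}
```

At search time, a scorer would read this payload through TermPositions for the one posting of (F, W) in each matching document.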
--
Nadav Har'El | Wednesday, Jan 10 2007, 20 Tevet 5767
IBM Haifa Research Lab |-----------------------------------------
|Lumber Cartel member #2224.
http://nadav.harel.org.il |http://lumbercartel.freeyellow.com/
Re: Payloads
Posted by Nicolas Lalevée <ni...@anyware-tech.com>.
On Wednesday, January 3, 2007 at 14:46, Nadav Har'El wrote:
> On Wed, Dec 20, 2006, Michael Busch wrote about "Payloads":
> >..
> > Some weeks ago I started working on an improved design which I would
> > like to propose now. The new design simplifies the API extensions (the
> > Field API remains unchanged) and uses less disk space in most use cases.
> > Now there are only two classes that get new methods:
> > - Token.setPayload()
> > Use this method to add arbitrary metadata to a Token in the form of a
> > byte[] array.
> >...
>
> Hi Michael,
>
> For some uses (e.g., faceted search), one wants to add a payload to each
> document, not per position for some text field. In the faceted search
> example, we could use payloads to encode the list of facets that each
> document belongs to. For this, with the old API, you could have added a
> fixed term to an untokenized field, and add a payload to that entire
> untokenized field.
>
> With the new API, it seems doing this is much more difficult and requires
> writing some sort of new Analyzer - one that will do the regular analysis
> that I want for the regular fields, and add the payload to the one specific
> field that lists the facets.
> Am I understanding correctly? Or am I missing a better way to do this?
I have looked closer at how Lucene indexes, and I realized that the kind
of payload handling in Michael's patch is not designed for the facet
feature. In this patch, the payloads are in the postings, i.e. in the
.tis, .frq, and .prx files. A payload at the document level, which would
be accessed in a scorer, would be better placed in the term vector
files, which are ordered by doc and not by term.
--
Nicolas LALEVÉE
Solutions & Technologies
ANYWARE TECHNOLOGIES
Tel : +33 (0)5 61 00 52 90
Fax : +33 (0)5 61 00 51 46
http://www.anyware-tech.com
Re: Payloads
Posted by Nadav Har'El <ny...@math.technion.ac.il>.
On Wed, Dec 20, 2006, Michael Busch wrote about "Payloads":
>..
> Some weeks ago I started working on an improved design which I would
> like to propose now. The new design simplifies the API extensions (the
> Field API remains unchanged) and uses less disk space in most use cases.
> Now there are only two classes that get new methods:
> - Token.setPayload()
> Use this method to add arbitrary metadata to a Token in the form of a
> byte[] array.
>...
Hi Michael,
For some uses (e.g., faceted search), one wants to add a payload to each
document, not per position for some text field. In the faceted search example,
we could use payloads to encode the list of facets that each document
belongs to. For this, with the old API, you could have added a fixed term
to an untokenized field, and add a payload to that entire untokenized field.
With the new API, it seems doing this is much more difficult and requires
writing some sort of new Analyzer - one that will do the regular analysis
that I want for the regular fields, and add the payload to the one specific
field that lists the facets.
Am I understanding correctly? Or am I missing a better way to do this?
Thanks,
Nadav.
--
Nadav Har'El | Wednesday, Jan 3 2007, 13 Tevet 5767
IBM Haifa Research Lab |-----------------------------------------
|If you lost your left arm, your right arm
http://nadav.harel.org.il |would be left.
Re: Payloads
Posted by Michael Busch <bu...@gmail.com>.
Nicolas Lalevée wrote:
> On Wednesday, December 20, 2006 at 15:31, Grant Ingersoll wrote:
>
>> Hi Michael,
>>
>> Have a look at https://issues.apache.org/jira/browse/LUCENE-662
>>
>> I am planning on starting on this soon (I know, I have been saying
>> that for a while, but I really am.) At any rate, another set of eyes
>> would be good and I would be interested in hearing how your version
>> compares/works with this patch from Nicolas.
>>
>
> In fact the work I have done is more about the storing part of Lucene
> than the indexing part. But I think that the mechanism of defining an
> "IndexFormat" in Java, which I introduced in my patch, will be useful
> in defining how the payload should be read and written.
>
> About my patch, it needs to be synchronized with the current trunk. I
> will update it soon; it just needs some cleanup.
>
> Nicolas
>
>
That's right, Nicolas' patch makes the Lucene *store* more flexible,
whereas my payloads patch extends the *index* data structures.
Nicolas, I'm aware of your patch but haven't looked at it completely
yet. I think it would be great if our patches worked together, and with
Doug's suggestions (see his response) we would be on the right track to
the flexible indexing format! I would love to work with you to achieve
this goal. I will look at your patch more closely in the next few days.
- Michael
Re: Payloads
Posted by Nicolas Lalevée <ni...@anyware-tech.com>.
On Wednesday, December 20, 2006 at 15:31, Grant Ingersoll wrote:
> Hi Michael,
>
> Have a look at https://issues.apache.org/jira/browse/LUCENE-662
>
> I am planning on starting on this soon (I know, I have been saying
> that for a while, but I really am.) At any rate, another set of eyes
> would be good and I would be interested in hearing how your version
> compares/works with this patch from Nicolas.
In fact the work I have done is more about the storing part of Lucene
than the indexing part. But I think that the mechanism of defining an
"IndexFormat" in Java, which I introduced in my patch, will be useful in
defining how the payload should be read and written.
About my patch, it needs to be synchronized with the current trunk. I
will update it soon; it just needs some cleanup.
Nicolas
>
> -Grant
>
> On Dec 20, 2006, at 9:19 AM, Michael Busch wrote:
> > Hi all,
> >
> > currently it is not possible to add generic payloads to a posting
> > list. However, this feature would be useful for various use cases.
> > Some examples:
> > - XML search
> > to index XML documents and allow structured search (e.g. XPath), it
> > is necessary to store the depth of the terms
> > - part-of-speech
> > payloads can be used to store the part of speech of a term occurrence
> > - term boost
> > for terms that occur e.g. in bold font a payload containing a
> > boost value can be stored
> > - ...
> >
> > The payloads feature has been requested and discussed several
> > times, e.g. in
> > - http://www.gossamer-threads.com/lists/lucene/java-dev/29465
> > - http://www.gossamer-threads.com/lists/lucene/java-dev/37409
> >
> > In the latter thread I proposed a design a couple of months ago
> > that makes it possible for Lucene to store variable-length
> > payloads inline in the posting list of a term. However, this design
> > had some drawbacks: the already complex field API was extended, and
> > the payload encoding was not optimal in terms of disk space.
> > Furthermore, the overall Lucene runtime performance suffered due to
> > the growth of the .prx file. In the meantime the patch LUCENE-687
> > (Lazy skipping on proximity file) was committed, which reduces the
> > number of reads and seeks on the .prx file. This minimizes the
> > performance degradation of a bigger .prx file. Also, LUCENE-695
> > (Improve BufferedIndexInput.readBytes() performance) was committed,
> > which speeds up reading mid-size chunks of bytes and is
> > beneficial for payloads that are bigger than just a few bytes.
> >
> > Some weeks ago I started working on an improved design which I
> > would like to propose now. The new design simplifies the API
> > extensions (the Field API remains unchanged) and uses less disk
> > space in most use cases. Now there are only two classes that get
> > new methods:
> > - Token.setPayload()
> > Use this method to add arbitrary metadata to a Token in the form
> > of a byte[] array.
> > - TermPositions.getPayload()
> > Use this method to retrieve the payload of a term occurrence.
> > The implementation is very flexible: the user does not have to
> > enable payloads explicitly for a field and can add payloads to all,
> > some, or no Tokens. Due to the improved encoding, those use cases are
> > handled efficiently in terms of disk space.
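A toy illustration of the data model behind these two methods, where each position in a posting list optionally carries a byte[], might look like the following sketch (self-contained stand-ins, not Lucene code; all class names are invented for illustration):

```java
import java.util.*;

// Toy model of a posting list where each position may carry a payload,
// mirroring the Token.setPayload() / TermPositions.getPayload() pairing
// described above. All names here are invented.
class Posting {
  final int position;
  final byte[] payload;           // null when the token carried no payload
  Posting(int position, byte[] payload) {
    this.position = position;
    this.payload = payload;
  }
}

class PostingList {
  private final List<Posting> postings = new ArrayList<>();

  // Indexing side: what Token.setPayload() feeds into the posting list.
  void add(int position, byte[] payload) {
    postings.add(new Posting(position, payload));
  }

  // Search side: what TermPositions.getPayload() would return at a position.
  byte[] payloadAt(int position) {
    for (Posting p : postings)
      if (p.position == position) return p.payload;
    return null;
  }
}

public class PayloadModelDemo {
  public static void main(String[] args) {
    PostingList list = new PostingList();
    list.add(0, null);                         // token without payload
    list.add(1, new byte[] {5});               // e.g. a boost or POS tag
    System.out.println(list.payloadAt(1)[0]);  // 5
  }
}
```

A real implementation would encode these payloads inline in the .prx file; the toy keeps them in memory only to show the shape of the API.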
> >
> > Another thing I would like to point out is that this feature is
> > backwards compatible, meaning that the file format only changes if
> > the user explicitly adds payloads to the index. If no payloads are
> > used, all data structures remain unchanged.
> >
> > I'm going to open a new JIRA issue soon containing the patch and
> > details about implementation and file format changes.
> >
> > One more comment: It is a rather big patch and this is the initial
> > version, so I'm sure there will be a lot of discussions. I would
> > like to encourage people who consider this feature as useful to try
> > it out and give me some feedback about possible improvements.
> >
> > Best regards,
> > - Michael
> >
> >
--
Nicolas LALEVÉE
Solutions & Technologies
ANYWARE TECHNOLOGIES
Tel : +33 (0)5 61 00 52 90
Fax : +33 (0)5 61 00 51 46
http://www.anyware-tech.com
Re: Payloads
Posted by Grant Ingersoll <gr...@gmail.com>.
Hi Michael,
Have a look at https://issues.apache.org/jira/browse/LUCENE-662
I am planning on starting on this soon (I know, I have been saying
that for a while, but I really am.) At any rate, another set of eyes
would be good and I would be interested in hearing how your version
compares/works with this patch from Nicolas.
-Grant
On Dec 20, 2006, at 9:19 AM, Michael Busch wrote:
> [...]
------------------------------------------------------
Grant Ingersoll
http://www.grantingersoll.com/
Re: Payloads
Posted by Michael Busch <bu...@gmail.com>.
Doug,
sorry for the late response. I was on vacation after New Year's... oh
btw. Happy New Year to everyone! :-)
Doug Cutting wrote:
> Michael Busch wrote:
>> Yes I could introduce a new class called e.g. PayloadToken that
>> extends Token (good that it is not final anymore). Not sure if I
>> understand your mixin interface idea... could you elaborate, please?
>
> I'm not entirely sure I understand it either!
>
> If Payload is an interface that tokens might implement, then some
> posting implementations would treat tokens that implement Payload
> specially. And there might be other interfaces, say, PartOfSpeech, or
> Emphasis, that tokens might implement, and that might also be handled
> by some posting implementations. A particular analyzer could emit
> tokens that implement several of these interfaces, e.g., both
> PartOfSpeech and Emphasis. So these interfaces would be mixins. But,
> of course, they'd also have to each be implemented by the Token
> subclass, since Java doesn't support multiple inheritance of implementation.
>
> I'm not sure this is the best approach: it's just the first one that
> comes to my mind. Perhaps instead Tokens should have a list of
> aspects, each of which implement a TokenAspect interface, or somesuch.
>
> It would be best to have an idea of how we'd like to be able to
> flexibly add token features like text-emphasis and part-of-speech that
> are handled specially by posting implementations before we add the
> Payload feature. So if the "mixin" approach is not a good idea, then
> we should try to think of a better one. If we can't think of a good
> approach, then we can always punt, add Payloads now, and deal with the
> consequences later. But it's worth trying first. Working through a
> few examples in pseudo code is perhaps a worthwhile task.
>
> Doug
Having a list of aspects for each Token really seems tempting. Something
like:
public interface TokenAspect {
  String getAspectName();
}
Token gets new methods:
  public void addTokenAspect(TokenAspect aspect);
  public TokenAspect getTokenAspect(String name);
  public List getTokenAspects();
Then Payload would implement TokenAspect and DocumentWriter (and maybe
PostingWriter in the future) can check if a Token has that aspect.
And Ning pointed out that this approach is also nice for chaining of
Analyzers or Filters. Different analyzers can simply add different
aspects to a Token. The only concern that I have is performance: with
this approach we would have to initialize a Map for every Token that has
one or more aspects. Can we afford this, or would indexing speed suffer?
A solution with different mixin interfaces would not have this
performance overhead. However, chaining of Analyzers is not easily
possible. E.g., if an Analyzer emits a Token subclass which implements
Payload and a TokenFilter wants to add another mixin interface, let's say
PartOfSpeech, then the Filter would have to instantiate another Token
subclass that implements both Payload and PartOfSpeech and either copy the
data from the first Token subclass or decorate it. The latter would
result in rather long and not very nice looking code for Token subclasses.
So besides the performance overhead I like the aspect approach. But
maybe there are other solutions we didn't think about yet, or I got you
wrong Doug and you had something different in mind? Thoughts?
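The aspect approach described above could be sketched roughly as follows (simplified stand-ins, not Lucene's real Token; generics are used for clarity even though they postdate the codebase of the time, and the Map is created lazily so tokens without aspects pay almost nothing):

```java
import java.util.*;

// Sketch of the "list of aspects" approach. All classes are simplified
// stand-ins for illustration only.
interface TokenAspect {
  String getAspectName();
}

class PayloadAspect implements TokenAspect {
  final byte[] data;
  PayloadAspect(byte[] data) { this.data = data; }
  public String getAspectName() { return "payload"; }
}

class Token {
  private Map<String, TokenAspect> aspects;  // null until first aspect added

  public void addTokenAspect(TokenAspect aspect) {
    if (aspects == null) aspects = new HashMap<>();
    aspects.put(aspect.getAspectName(), aspect);
  }

  public TokenAspect getTokenAspect(String name) {
    return aspects == null ? null : aspects.get(name);
  }

  public List<TokenAspect> getTokenAspects() {
    return aspects == null ? Collections.<TokenAspect>emptyList()
                           : new ArrayList<TokenAspect>(aspects.values());
  }
}

public class AspectDemo {
  public static void main(String[] args) {
    Token t = new Token();
    t.addTokenAspect(new PayloadAspect(new byte[] {42}));
    // A DocumentWriter could now check whether the token carries a payload:
    System.out.println(t.getTokenAspect("payload") != null); // true
  }
}
```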
Re: Payloads
Posted by Ning Li <ni...@gmail.com>.
On 12/22/06, Doug Cutting <cu...@apache.org> wrote:
> Ning Li wrote:
> > The draft proposal seems to suggest the following (roughly):
> > A dictionary entry is <Term, FilePointer>.
>
> Perhaps this ought to be <Term, TermInfo>, where TermInfo contains a
> FilePointer and perhaps other information (e.g., frequency data).
Yes. Another example is skip data.
> So the ideal solution would permit both different formats to either
> share a file, or to use their own file(s).
Agree.
> Is it worth the complexity
> this would add to the API? Or should we jettison the notion of multiple
> posting files per segment?
+1 for a single posting file per segment. I was wondering if we wanted
to provide all the flexibility possible. Things will be much simpler
with a single posting file per segment... :-)
Ning
Re: Payloads
Posted by Marvin Humphrey <ma...@rectangular.com>.
On Dec 22, 2006, at 10:36 AM, Doug Cutting wrote:
> The easiest way to do this would be to have separate files in each
> segment for each PostingFormat. It would be better if different
> posting formats could share files, but that's harder to coordinate.
The approach I'm taking in KinoSearch 0.20 is for each field to get
its own postings file, named _XXX.pYYY, where "_XXX" is the segment
name and "YYY" is the field number. That allows a single decoder to
be pointed at each file. _XXX.frq and _XXX.prx have been eliminated.
One file per format would also work.
> Alternately we could force all postings into a single file per
> segment. That would simplify the APIs, but prohibit certain file
> formats, like the one Lucene uses currently.
In theory, we could also have one file per property: doc num, freq,
positions, boost, payload. The base Posting object would have only
document number, and each subclass would add a new property, and a
new file.
I'm not sure that's better, as it precludes optimizations such as the
even/odd trick currently used in _XXX.frq, but it merits mention as
the conceptual opposite of having one file per format.
Matchers would be happy with that scheme no matter what.
> So the ideal solution would permit both different formats to either
> share a file, or to use their own file(s). Is it worth the
> complexity this would add to the API? Or should we jettison the
> notion of multiple posting files per segment?
Does punting on this issue have any drawbacks other than an unknown
performance impact? Can we design the API so that we leave open the
option of allowing the user to spec multiple files if that proves
advantageous later?
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
Re: Payloads
Posted by Doug Cutting <cu...@apache.org>.
Ning Li wrote:
> The draft proposal seems to suggest the following (roughly):
> A dictionary entry is <Term, FilePointer>.
Perhaps this ought to be <Term, TermInfo>, where TermInfo contains a
FilePointer and perhaps other information (e.g., frequency data).
> A posting entry for a term in a document is <Doc, PostingContent>.
> Classes which implement PostingFormat decide the format of PostingContent.
Yes.
> Is it a good idea to allow PostingFormat to decide whether and how to
> store posting content in multiple files?
Ideally, yes. The easiest way to do this would be to have separate
files in each segment for each PostingFormat. It would be better if
different posting formats could share files, but that's harder to
coordinate.
Alternately we could force all postings into a single file per segment.
That would simplify the APIs, but prohibit certain file formats, like
the one Lucene uses currently.
So the ideal solution would permit both different formats to either
share a file, or to use their own file(s). Is it worth the complexity
this would add to the API? Or should we jettison the notion of multiple
posting files per segment?
Doug
Re: Payloads
Posted by Marvin Humphrey <ma...@rectangular.com>.
On Dec 22, 2006, at 9:17 AM, Ning Li wrote:
> The question is, should the number of
> files used to store postings be customizable?
I think it ought to remain an implementation detail for now. Using
multiple files is an optimization of unknown advantage.
Optimizations have to work very hard to justify being put into public
APIs because they constrain later refactoring and may in fact prevent
better optimizations from being implemented later.
Re: Payloads
Posted by Doug Cutting <cu...@apache.org>.
Ning Li wrote:
> I'm aware of this design. Boolean and phrase queries are an example.
> The point is, there are different queries whose processing will
> (continue to) require different information of terms, especially when
> flexible posting is allowed. The question is, should the number of
> files used to store postings be customizable?
If one needs to search the same data with both unranked boolean
operators and with ranked proximity, one could use different fields. If
that's an acceptable answer, then we might get away with a single
posting file per segment. Back-compatibility will be a pain, but we
probably shouldn't let that drive the design.
Doug
Re: Payloads
Posted by Ning Li <ni...@gmail.com>.
On 12/22/06, Marvin Humphrey <ma...@rectangular.com> wrote:
> Precision would be enhanced if boolean scoring took position into
> account, and could be further enhanced if each position were assigned
> a boost. For that purpose, having everything in one file is an
> advantage, as it cuts down disk seeks. Turn off freqs, positions,
> and boosts, and you have only doc_nums, which is ideal for matching
> rather than scoring, yielding a performance gain.
I'm aware of this design. Boolean and phrase queries are an example.
The point is, there are different queries whose processing will
(continue to) require different information of terms, especially when
flexible posting is allowed. The question is, should the number of
files used to store postings be customizable?
Cheers,
Ning
Re: Payloads
Posted by Marvin Humphrey <ma...@rectangular.com>.
On Dec 21, 2006, at 1:58 PM, Ning Li wrote:
> Storing all the posting content, e.g. frequencies and positions, in a
> single file greatly simplifies things. However, this could cause some
> performance penalty. For example, boolean query 'Apache AND Lucene'
> would have to paw through positions. But position indexing for Apache
> and Lucene is necessary to support phrase query '"Apache Lucene"'.
Precision would be enhanced if boolean scoring took position into
account, and could be further enhanced if each position were assigned
a boost. For that purpose, having everything in one file is an
advantage, as it cuts down disk seeks. Turn off freqs, positions,
and boosts, and you have only doc_nums, which is ideal for matching
rather than scoring, yielding a performance gain.
What's being considered doesn't really speak to the motivation of
improving existing core functionality, though. It's more about
expanding the API to allow new applications.
Re: Payloads
Posted by Ning Li <ni...@gmail.com>.
> 1. Make the index format extensible by adding user-implementable reader
> and writer interfaces for postings.
> ...
> Here's a very rough, sketchy, first draft of a type (1) proposal.
Nice!
In approach 1, what is the best abstraction of a flexible index format
for Lucene?
The draft proposal seems to suggest the following (roughly):
A dictionary entry is <Term, FilePointer>.
A posting entry for a term in a document is <Doc, PostingContent>.
Classes which implement PostingFormat decide the format of PostingContent.
Storing all the posting content, e.g. frequencies and positions, in a
single file greatly simplifies things. However, this could cause some
performance penalty. For example, boolean query 'Apache AND Lucene'
would have to paw through positions. But position indexing for Apache
and Lucene is necessary to support phrase query '"Apache Lucene"'.
Is it a good idea to allow PostingFormat to decide whether and how to
store posting content in multiple files?
A dictionary entry is <Term, <FilePointer>+>.
A posting entry for a term in a document is <Doc, <PostingContent>+>.
Each PostingContent is stored in a separate file.
Or is a two-file abstraction good enough? It supports all formats in
approaches 2 and 3.
A dictionary entry is <Term, FreqPointer, ProxPointer>.
A posting entry for a term in a document is <Doc,
PerDocPostingContent, <Position, PerPositionPostingContent>+>.
Doc and PerDocPostingContent are stored in a .frq file.
Position and PerPositionPostingContent are stored in a .prx file.
What Michael called Payload can be viewed as PerPositionPostingContent here.
> I'm not sure this is the best approach: it's just the first one that
> comes to my mind. Perhaps instead Tokens should have a list of aspects,
> each of which implement a TokenAspect interface, or somesuch.
Making Token have a list of aspects would work. A particular analyzer
would add certain types of aspects to the tokens it emits. For
example, one analyzer adds a TextEmphasis aspect to a token. Another
analyzer adds a PartOfSpeech aspect to the same token. A particular
posting implementation would expect certain types of aspects. For
example, one may require a TextEmphasis aspect and a PartOfSpeech
aspect. The posting implementation generates posting content (payload)
by encoding the values of both aspects.
Ning
Re: Payloads
Posted by Doug Cutting <cu...@apache.org>.
Michael Busch wrote:
> Yes I could introduce a new class called e.g. PayloadToken that extends
> Token (good that it is not final anymore). Not sure if I understand your
> mixin interface idea... could you elaborate, please?
I'm not entirely sure I understand it either!
If Payload is an interface that tokens might implement, then some
posting implementations would treat tokens that implement Payload
specially. And there might be other interfaces, say, PartOfSpeech, or
Emphasis, that tokens might implement, and that might also be handled by
some posting implementations. A particular analyzer could emit tokens
that implement several of these interfaces, e.g., both PartOfSpeech and
Emphasis. So these interfaces would be mixins. But, of course, they'd
also have to each be implemented by the Token subclass, since Java
doesn't support multiple inheritance of implementation.
I'm not sure this is the best approach: it's just the first one that
comes to my mind. Perhaps instead Tokens should have a list of aspects,
each of which implement a TokenAspect interface, or somesuch.
It would be best to have an idea of how we'd like to be able to flexibly
add token features like text-emphasis and part-of-speech that are
handled specially by posting implementations before we add the Payload
feature. So if the "mixin" approach is not a good idea, then we should
try to think of a better one. If we can't think of a good approach,
then we can always punt, add Payloads now, and deal with the
consequences later. But it's worth trying first. Working through a few
examples in pseudo code is perhaps a worthwhile task.
Doug
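As one such pseudo-code example, the mixin approach sketched above might look roughly like this (all names are illustrative, not Lucene API):

```java
// Sketch of the mixin idea: aspect-like interfaces implemented directly by
// a Token subclass. All names are illustrative.
interface Payload { byte[] getPayload(); }
interface PartOfSpeech { String getPartOfSpeech(); }

class Token { /* base token: term text, position, ... */ }

// Each combination of mixins needs its own subclass, which is the drawback
// for analyzer chaining discussed in this thread.
class RichToken extends Token implements Payload, PartOfSpeech {
  private final byte[] payload;
  private final String pos;
  RichToken(byte[] payload, String pos) { this.payload = payload; this.pos = pos; }
  public byte[] getPayload() { return payload; }
  public String getPartOfSpeech() { return pos; }
}

public class MixinDemo {
  // A posting implementation would test for the interfaces it handles:
  static boolean carriesPayload(Token t) { return t instanceof Payload; }

  public static void main(String[] args) {
    Token t = new RichToken(new byte[] {1}, "NN");
    System.out.println(carriesPayload(t)); // true
  }
}
```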
Re: Payloads
Posted by Michael Busch <bu...@gmail.com>.
Doug Cutting wrote:
>
> A reason not to commit something like this now would be if it
> complicates the effort to make the format extensible. Each index
> feature we add now will require back-compatibility in the future, and
> we should be hesitant to add features that might be difficult to
> support in the future.
Yes, I agree.
I had the idea of defining Payload as an interface:
public interface Payload {
  void serialize(IndexOutput out) throws IOException;
  int serializedLength();
  void deserialize(IndexInput in, int length) throws IOException;
}
and to have a default implementation ByteArrayPayload that works like my
current patch. Then people could write their own implementation of
Payload and define how to serialize the content.
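A sketch of what such a ByteArrayPayload could look like, with java.io.DataOutput/DataInput standing in for Lucene's IndexOutput/IndexInput so the example stays self-contained (the real implementation would use the Lucene classes):

```java
import java.io.*;

// Hypothetical sketch of the default ByteArrayPayload implementation.
// java.io.DataOutput/DataInput stand in for IndexOutput/IndexInput here.
interface Payload {
  void serialize(DataOutput out) throws IOException;
  int serializedLength();
  void deserialize(DataInput in, int length) throws IOException;
}

class ByteArrayPayload implements Payload {
  private byte[] data;

  ByteArrayPayload() {}
  ByteArrayPayload(byte[] data) { this.data = data; }

  public void serialize(DataOutput out) throws IOException {
    out.write(data);              // raw bytes; the length is stored by the caller
  }

  public int serializedLength() { return data.length; }

  public void deserialize(DataInput in, int length) throws IOException {
    data = new byte[length];      // the posting list records the length
    in.readFully(data);
  }

  byte[] getData() { return data; }
}

public class PayloadDemo {
  public static void main(String[] args) throws IOException {
    ByteArrayPayload p = new ByteArrayPayload(new byte[] {1, 2, 3});
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    p.serialize(new DataOutputStream(buf));

    ByteArrayPayload q = new ByteArrayPayload();
    q.deserialize(new DataInputStream(new ByteArrayInputStream(buf.toByteArray())),
                  p.serializedLength());
    System.out.println(q.getData().length); // 3
  }
}
```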
>
> For example, this modifies the Token API. If, long-term, we think
> that Token should be extensible, then perhaps we should make it
> extensible now, and add this through a subclass of Token (perhaps a
> mixin interface that Tokens can implement).
>
Yes I could introduce a new class called e.g. PayloadToken that extends
Token (good that it is not final anymore). Not sure if I understand your
mixin interface idea... could you elaborate, please?
> I like the Payload feature, and think it should probably be added. I
> just want to make sure that we've first thought a bit about its
> future-compatibility.
>
> Doug
>
Re: Payloads
Posted by Doug Cutting <cu...@apache.org>.
Michael Busch wrote:
> the other hand, if people would like to use the payloads soon I guess
> due to the backwards compatibility it would be low risk to add it to the
> current index format to provide this feature until we can finish the
> flexible format?
A reason not to commit something like this now would be if it
complicates the effort to make the format extensible. Each index
feature we add now will require back-compatibility in the future, and we
should be hesitant to add features that might be difficult to support in
the future.
For example, this modifies the Token API. If, long-term, we think that
Token should be extensible, then perhaps we should make it extensible
now, and add this through a subclass of Token (perhaps a mixin interface
that Tokens can implement).
I like the Payload feature, and think it should probably be added. I
just want to make sure that we've first thought a bit about its
future-compatibility.
Doug
Re: Payloads
Posted by Nicolas Lalevée <ni...@anyware-tech.com>.
Le Samedi 23 Décembre 2006 00:32, Michael Busch a écrit :
> Nicolas Lalevée wrote:
> > I have just looked at it. It looks great :)
>
> Thanks! :-)
>
> > But I still don't understand why a new entry in the fieldinfo is
> > needed.
>
> The entry is not really *needed*, but I use it for
> backwards-compatibility and as an optimization for fields that don't
> have any tokens with payloads. For fields with payloads the
> PositionDelta is shifted one bit, so for certain values this means that
> the VInt needs an extra byte. I have an index with about 500k web
> documents and measured that about 8% of all PositionDelta values would
> need one extra byte in case PositionDelta is shifted. For my index that
> means roughly 4% growth of the total index size. By using a fieldbit,
> payloads can be disabled for a field and therefore the shifting of
> PositionDelta can be avoided. Furthermore, if the payload-fieldbit is
> not enabled, then the index format does not change at all.
>
> > The same applies for TermVector. And code like this fails for no obvious
> > reason:
> >
> > Document doc = new Document();
> > doc.add(new Field("f1", "v1", Store.YES, Index.TOKENIZED,
> > TermVector.WITH_POSITIONS_OFFSETS));
> > doc.add(new Field("f1", "v2", Store.YES, Index.TOKENIZED,
> > TermVector.NO));
> >
> > RAMDirectory ram = new RAMDirectory();
> > IndexWriter writer = new IndexWriter(ram, new StandardAnalyzer(), true);
> > writer.addDocument(doc);
> > writer.close();
> >
> > Knowing a little bit about how Lucene works, I have an idea why this
> > fails, but can we avoid it?
> >
> > Nicolas
>
> In the payload case there is no problem like this one. There is no new
> Field option that can be used to set the fieldbit explicitly. The bit is
> set automatically for a field as soon as the first Token of that field
> that carries a payload is encountered.
Ok, thanks for the explanation. I looked closer at how indexing works, and in
fact the issue I was talking about was, I think, a bug. Filing a JIRA issue.
Re: Payloads
Posted by Michael Busch <bu...@gmail.com>.
Nicolas Lalevée wrote:
>
> I have just looked at it. It looks great :)
>
Thanks! :-)
> But I still don't understand why a new entry in the fieldinfo is needed.
>
The entry is not really *needed*, but I use it for
backwards-compatibility and as an optimization for fields that don't
have any tokens with payloads. For fields with payloads the
PositionDelta is shifted one bit, so for certain values this means that
the VInt needs an extra byte. I have an index with about 500k web
documents and measured that about 8% of all PositionDelta values would
need one extra byte in case PositionDelta is shifted. For my index that
means roughly 4% growth of the total index size. By using a fieldbit,
payloads can be disabled for a field and therefore the shifting of
PositionDelta can be avoided. Furthermore, if the payload-fieldbit is
not enabled, then the index format does not change at all.
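To make the byte-count argument concrete: Lucene's VInt format stores 7 data bits per byte, so shifting PositionDelta left by one bit pushes deltas in the range 64..127 from one byte to two. A small self-contained sketch (the helper mirrors the VInt length rule; it is not Lucene code):

```java
// Why shifting PositionDelta one bit can cost an extra byte: VInt uses
// 7 data bits per byte, so values 0..127 take one byte, 128..16383 two,
// and so on. Shifting left by one to make room for a "has payload" flag
// pushes deltas 64..127 into two bytes.
public class VIntCost {
  // Number of bytes a VInt encoding would use for a non-negative value.
  static int vIntLength(int value) {
    int len = 1;
    while ((value & ~0x7F) != 0) {
      value >>>= 7;
      len++;
    }
    return len;
  }

  public static void main(String[] args) {
    int delta = 100;                   // a position delta in 64..127
    int shifted = (delta << 1) | 1;    // low bit flags a payload
    System.out.println("plain delta bytes:   " + vIntLength(delta));   // 1
    System.out.println("shifted delta bytes: " + vIntLength(shifted)); // 2
  }
}
```

This is the effect behind the 8% / 4% numbers above.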
> The same applies for TermVector. And code like this fails for no obvious
> reason:
>
> Document doc = new Document();
> doc.add(new Field("f1", "v1", Store.YES, Index.TOKENIZED,
> TermVector.WITH_POSITIONS_OFFSETS));
> doc.add(new Field("f1", "v2", Store.YES, Index.TOKENIZED, TermVector.NO));
>
> RAMDirectory ram = new RAMDirectory();
> IndexWriter writer = new IndexWriter(ram, new StandardAnalyzer(), true);
> writer.addDocument(doc);
> writer.close();
>
> Knowing a little bit about how Lucene works, I have an idea why this fails, but
> can we avoid it?
>
> Nicolas
>
In the payload case there is no problem like this one. There is no new
Field option that can be used to set the fieldbit explicitly. The bit is
set automatically for a field as soon as the first Token of that field
that carries a payload is encountered.
Michael
Re: Payloads
Posted by Nicolas Lalevée <ni...@anyware-tech.com>.
Le Mercredi 20 Décembre 2006 20:42, Michael Busch a écrit :
> Doug Cutting wrote:
> > Michael,
> >
> > This sounds like very good work. The back-compatibility of this
> > approach is great. But we should also consider this in the broader
> > context of index-format flexibility.
> >
> > Three general approaches have been proposed. They are not exclusive.
> >
> > 1. Make the index format extensible by adding user-implementable
> > reader and writer interfaces for postings.
> >
> > 2. Add a richer set of standard index formats, including things like
> > compressed fields, no-positions, per-position weights, etc.
> >
> > 3. Provide hooks for including arbitrary binary data.
> >
> > Your proposal is of type (3). LUCENE-662 is a (1). Approaches of
> > type (2) are most friendly to non-Java implementations, since the
> > semantics of each variation are well-defined.
> >
> > I don't see a reason not to pursue all three, but in a coordinated
> > manner. In particular, we don't want to add a feature of type (3)
> > that would make it harder to add type (1) APIs. It would thus be best
> > if we had a rough specification of type (1) and type (2). A proposal
> > of type (2) is at:
> >
> > http://wiki.apache.org/jakarta-lucene/FlexibleIndexing
> >
> > But I'm not sure that we yet have any proposed designs for an
> > extensible posting API. (Is anyone aware of one?) This payload
> > proposal can probably be easily incorporated into such a design, but I
> > would have more confidence if we had one. I guess I should attempt one!
>
> Doug,
>
> thanks for your detailed response. I'm aware that the long-term goal is
> the flexible index format and I see the payloads patch only as a part of
> it. The patch focuses on extending the index data structures and on a
> possible payload encoding. It doesn't focus yet on a flexible API, it
> only offers the two mentioned low-level methods to add and retrieve byte
> arrays.
>
> I would love to work with you guys on the flexible index format and to
> combine my patch with your suggestions and the patch from Nicolas! I
> will look at your proposal and Nicolas' patch tomorrow (have to go now).
> I just attached my patch (LUCENE-755), so if you get a chance you could
> take a look at it.
I have just looked at it. It looks great :)
But I still don't understand why a new entry in the FieldInfo is needed.
The same applies to TermVector, and code like the following fails for no
obvious reason:
Document doc = new Document();
doc.add(new Field("f1", "v1", Store.YES, Index.TOKENIZED,
TermVector.WITH_POSITIONS_OFFSETS));
doc.add(new Field("f1", "v2", Store.YES, Index.TOKENIZED, TermVector.NO));
RAMDirectory ram = new RAMDirectory();
IndexWriter writer = new IndexWriter(ram, new StandardAnalyzer(), true);
writer.addDocument(doc);
writer.close();
Knowing a little about how Lucene works, I have an idea why this fails,
but can we avoid it?
Nicolas
--
Nicolas LALEVÉE
Solutions & Technologies
ANYWARE TECHNOLOGIES
Tel : +33 (0)5 61 00 52 90
Fax : +33 (0)5 61 00 51 46
http://www.anyware-tech.com
Re: Payloads
Posted by Michael Busch <bu...@gmail.com>.
Doug Cutting wrote:
> Michael,
>
> This sounds like very good work. The back-compatibility of this
> approach is great. But we should also consider this in the broader
> context of index-format flexibility.
>
> Three general approaches have been proposed. They are not exclusive.
>
> 1. Make the index format extensible by adding user-implementable
> reader and writer interfaces for postings.
>
> 2. Add a richer set of standard index formats, including things like
> compressed fields, no-positions, per-position weights, etc.
>
> 3. Provide hooks for including arbitrary binary data.
>
> Your proposal is of type (3). LUCENE-662 is a (1). Approaches of
> type (2) are most friendly to non-Java implementations, since the
> semantics of each variation are well-defined.
>
> I don't see a reason not to pursue all three, but in a coordinated
> manner. In particular, we don't want to add a feature of type (3)
> that would make it harder to add type (1) APIs. It would thus be best
> if we had a rough specification of type (1) and type (2). A proposal
> of type (2) is at:
>
> http://wiki.apache.org/jakarta-lucene/FlexibleIndexing
>
> But I'm not sure that we yet have any proposed designs for an
> extensible posting API. (Is anyone aware of one?) This payload
> proposal can probably be easily incorporated into such a design, but I
> would have more confidence if we had one. I guess I should attempt one!
>
Doug,
thanks for your detailed response. I'm aware that the long-term goal is
the flexible index format and I see the payloads patch only as a part of
it. The patch focuses on extending the index data structures and on a
possible payload encoding. It doesn't focus yet on a flexible API, it
only offers the two mentioned low-level methods to add and retrieve byte
arrays.
I would love to work with you guys on the flexible index format and to
combine my patch with your suggestions and the patch from Nicolas! I
will look at your proposal and Nicolas' patch tomorrow (have to go now).
I just attached my patch (LUCENE-755), so if you get a chance you could
take a look at it.
Maybe it would make sense now to follow the suggestion you made earlier
this year and start a new package to work on the new index format? On
the other hand, if people would like to use payloads soon, then thanks
to the backwards compatibility it should be low risk to add them to the
current index format until we can finish the flexible format?
- Michael
Re: Payloads
Posted by Doug Cutting <cu...@apache.org>.
Michael Busch wrote:
> Some weeks ago I started working on an improved design which I would
> like to propose now. The new design simplifies the API extensions (the
> Field API remains unchanged) and uses less disk space in most use cases.
> Now there are only two classes that get new methods:
> - Token.setPayload()
> Use this method to add arbitrary metadata to a Token in the form of a
> byte[] array.
>
> - TermPositions.getPayload()
> Use this method to retrieve the payload of a term occurrence.
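As a concrete, hypothetical example of what such a byte[] payload might contain, the term-boost use case from the original proposal could encode a per-occurrence boost in four bytes. The encodeBoost/decodeBoost names below are invented for illustration and are not part of the patch:

```java
// Hypothetical payload encoding: a per-occurrence boost packed into a
// 4-byte array via the float's IEEE 754 bit pattern, and decoded again.
public class BoostPayload {
    public static byte[] encodeBoost(float boost) {
        int bits = Float.floatToIntBits(boost);
        return new byte[] {
            (byte) (bits >>> 24), (byte) (bits >>> 16),
            (byte) (bits >>> 8),  (byte) bits
        };
    }

    public static float decodeBoost(byte[] payload) {
        int bits = ((payload[0] & 0xFF) << 24) | ((payload[1] & 0xFF) << 16)
                 | ((payload[2] & 0xFF) << 8)  |  (payload[3] & 0xFF);
        return Float.intBitsToFloat(bits);
    }

    public static void main(String[] args) {
        byte[] payload = encodeBoost(2.5f);       // e.g. for a bold term
        System.out.println(decodeBoost(payload)); // prints 2.5
    }
}
```

The analyzer would attach such an array via Token.setPayload() at indexing time, and a scorer would read it back through TermPositions.getPayload() at search time.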
Michael,
This sounds like very good work. The back-compatibility of this
approach is great. But we should also consider this in the broader
context of index-format flexibility.
Three general approaches have been proposed. They are not exclusive.
1. Make the index format extensible by adding user-implementable reader
and writer interfaces for postings.
2. Add a richer set of standard index formats, including things like
compressed fields, no-positions, per-position weights, etc.
3. Provide hooks for including arbitrary binary data.
Your proposal is of type (3). LUCENE-662 is a (1). Approaches of type
(2) are most friendly to non-Java implementations, since the semantics
of each variation are well-defined.
I don't see a reason not to pursue all three, but in a coordinated
manner. In particular, we don't want to add a feature of type (3) that
would make it harder to add type (1) APIs. It would thus be best if we
had a rough specification of type (1) and type (2). A proposal of type
(2) is at:
http://wiki.apache.org/jakarta-lucene/FlexibleIndexing
But I'm not sure that we yet have any proposed designs for an extensible
posting API. (Is anyone aware of one?) This payload proposal can
probably be easily incorporated into such a design, but I would have
more confidence if we had one. I guess I should attempt one!
Here's a very rough, sketchy, first draft of a type (1) proposal.
IndexWriter#setPostingFormat(PostingFormat)
IndexWriter#setDictionaryFormat(DictionaryFormat)
interface PostingFormat {
  PostingInverter getInverter(FieldInfo, Segment, Directory);
  PostingReader getReader(FieldInfo, Segment, Directory);
  PostingWriter getWriter(FieldInfo, Segment, Directory);
}
interface PostingPointer {} ???
interface DictionaryFormat {
  DictionaryWriter getWriter(FieldInfo, Segment, Directory);
  DictionaryReader getReader(FieldInfo, Segment, Directory);
}
IndexWriter#addDocument(Document doc)
  loop over doc.fields
    call PostingFormat#getPostingInverter(FieldInfo, Segment, Directory)
      to create a PostingInverter
    if field is analyzed
      call Analyzer#tokenStream() to get TokenStream
      loop over tokens
        PostingInverter#collectToken(Token, Field);
    else
      PostingInverter#collectToken(Field);
  call DictionaryFormat#getWriter(FieldInfo, Segment, Directory)
    to create a DictionaryWriter
  Iterator<Term> terms = PostingInverter#getTerms();
  loop over terms
    PostingPointer p = PostingInverter#getPointer();
    PostingInverter#write(term);
    DictionaryWriter#addTerm(term, p);
IndexMerger#mergePostings()
  call DictionaryFormat#getReader(FieldInfo, Segment, Directory)
    to create a DictionaryReader
  loop over fields
    call PostingFormat#getWriter(FieldInfo, Segment, Directory)
      to create a PostingWriter
    loop over segments
      call PostingFormat#getReader(FieldInfo, Segment, Directory)
        to create a PostingReader
      loop over dictionary.terms
        PostingPointer p = PostingWriter#getPointer();
        DictionaryWriter#addTerm(Term, p);
        loop over docs
          int doc = PostingReader#readPostings();
          PostingWriter#writePostings(doc);
So the question is, does something like this conflict with your
proposal? Should Term and/or Token be extensible? If so, what should
their interfaces look like?
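For context on what any PostingWriter/PostingReader pair would have to agree on: Lucene's existing posting files store doc ids as delta-encoded VInts (7 data bits per byte, with the high bit as a continuation flag). A minimal, self-contained sketch of that encoding, independent of the draft interfaces above:

```java
import java.io.ByteArrayOutputStream;

// Delta-encoded VInts as used by Lucene's posting files: each doc id is
// stored as the difference from the previous one, and each difference is
// written 7 bits per byte, low-order bits first, high bit set on all but
// the last byte.
public class VIntPostings {
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80); // continuation bit set
            value >>>= 7;
        }
        out.write(value);
    }

    static int[] readVInts(byte[] bytes, int count) {
        int[] result = new int[count];
        int pos = 0;
        for (int i = 0; i < count; i++) {
            int shift = 0, value = 0;
            byte b;
            do {
                b = bytes[pos++];
                value |= (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            result[i] = value;
        }
        return result;
    }

    public static void main(String[] args) {
        int[] docs = { 3, 7, 300 };
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int prev = 0;
        for (int doc : docs) {            // store deltas, not absolute ids
            writeVInt(out, doc - prev);
            prev = doc;
        }
        int[] deltas = readVInts(out.toByteArray(), docs.length);
        int docId = 0;
        for (int delta : deltas) {
            docId += delta;
            System.out.println(docId);    // prints 3, 7, 300
        }
    }
}
```

A custom PostingFormat could swap this encoding for anything else, which is exactly why the reader and writer come in matched pairs in the sketch above.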
Doug