Posted to dev@lucene.apache.org by Michael Busch <bu...@gmail.com> on 2006/06/29 23:22:42 UTC

Flexible index format / Payloads Cont'd

Hi everyone,

I'm working for IBM and recently started looking into Lucene.
I am very interested in the topic "flexible indexing / payloads",
which has been discussed several times over the last two months. I
searched the mailing list archives and found several threads about
this topic, but none of them really led to a conclusion. That's my
reason for starting this new thread: I hope to get an understanding
of:
   - Who is working on this feature?
   - Is there a concrete design?
   - Which functions/changes will the implementation include?
Furthermore, I would like to describe the work I did so far on
this feature.


To sum up the recent discussions, I'm going to list the different
threads about this topic:

--> There is a page in the Lucene Wiki to plan / track this topic:
    http://wiki.apache.org/jakarta-lucene/FlexibleIndexing

--> May 08, 2006 - May 10, 2006   
    
http://wiki.apache.org/jakarta-lucene/ConversationsBetweenDougMarvinAndGrant

    - Grant Ingersoll mentions that he is interested in working
      on this topic.
    - Doug suggests having docs, frequencies, positions, and
      norms in one postings file (freqs, positions, and norms
      optional). A suggested file format for such a postings file
      can be found on the mentioned Wiki page.

--> May 28, 2006 - May 31, 2006   
    
http://www.gossamer-threads.com/lists/lucene/java-dev/36039?search_string=lucene%20planning;#36039

    - Nadav Har'El suggests having arbitrary data associated with
      each posting, i.e., a variable-length payload stored with
      each position - an idea Nadav and I discussed earlier. Doug
      voted +1 for this idea.

--> May 31, 2006 - Jun  2, 2006   
    
http://www.gossamer-threads.com/lists/lucene/java-dev/36210?search_string=flexible%20indexing;#36210
   
    - Marvin Humphrey talks about a pluggable PostingsWriter/Reader
      to make the postings file customizable. Marvin goes a step
      further and suggests using plugins for the other index files
      as well.
     

I have the feeling that many people are interested in having a
flexible index format. There are already various use cases:
   - Efficient parametric search
   - XML search
   - Part Of Speech (POS) annotations with each position
   - Multi-faceted search
   - ...

But I also have the feeling that no clear course of action has been
defined yet. This issue is quite complex: it is not easy to
generalize the index data structures to satisfy all demands and use
cases while maintaining the straightforwardness of Lucene.


In the following I would like to describe the work I did so far
on this issue and propose a strategy on how to work on it in the
future to get the complexity under control.

I have made a prototype implementation of payloads. In my approach
I leave the frequency file as is and only change the positions file.
I can store a variable length payload (byte[]) with each position.
The payloads can be enabled/disabled on field level. The API changes
include:
  - a new Field constructor that takes a Payload as additional data
  - a Token stores a Payload, so an analyzer can produce tokens with
    arbitrary payloads
  - TermPositions got a getPayload() method

This prototype works very well, and we use it to play around with
multi-faceted search. But I think I should go a bit further and
merge the frequency and position files into a single postings file,
which seemed to be the consensus in the mailing list threads.


I would suggest splitting the work into smaller items with clearly
defined milestones. Thus I suggest the following steps:
1. Introduce postings file with the following format:
   <DocDelta, Payload>*
     DocDelta --> VInt
     DocDelta/2 is the difference between this document number and
     the previous document number.
 
     Payload --> Byte, if DocDelta is even
     Payload --> <Payload_Length, Payload_Data>, if DocDelta is odd
       Payload_Length --> VInt
       Payload_Data   --> Byte^Payload_Length

   Furthermore, it should be possible to enable/disable payloads
   on field level.

2. Add multilevel skipping (tree structure) for the postings file.
   One-level skipping, as currently used in Lucene, is probably
   not efficient enough for the new postings file, because it can
   get very big. Question: should we embed the skipping information
   directly in the postings file, or introduce a new file containing
   it? I think keeping the skip tree in a separate file should
   improve cache performance.

3. Optional: Add a type-system for the payloads to make it
   easier to develop PostingsWriter/Reader plugins.

4. Make the PostingsWriter/Reader pluggable and develop default
   PostingsWriter/Reader plugins that store frequencies, positions,
   and norms as payloads in the postings file. These should be
   configurable to enable the different options Doug suggested:
 
   a. <doc>+
   b. <doc, boost>+
   c. <doc, freq, <position>+ >+
   d. <doc, freq, <position, boost>+ >+

5. Develop new or extend existing PostingsWriter/Reader plugins for
   desired features like XML search, POS, multi-faceted search, ...
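Step 1's even/odd DocDelta trick can be sketched in code. The following is a minimal standalone sketch, not Lucene code: class and method names are hypothetical, and the VInt helper mirrors Lucene's variable-length integer convention (7 data bits per byte, high bit set on continuation bytes).

```java
import java.io.ByteArrayOutputStream;

/** Sketch of the proposed postings encoding: DocDelta's low bit signals
 *  whether the payload is a single byte or length-prefixed. */
class PostingsSketch {
    // Lucene-style VInt: 7 data bits per byte, high bit = continuation.
    static void writeVInt(ByteArrayOutputStream out, int v) {
        while ((v & ~0x7F) != 0) {
            out.write((v & 0x7F) | 0x80);
            v >>>= 7;
        }
        out.write(v);
    }

    /** Encode one posting relative to the previous document number. */
    static void writePosting(ByteArrayOutputStream out, int docDelta, byte[] payload) {
        if (payload.length == 1) {
            writeVInt(out, docDelta << 1);        // even: one payload byte follows
            out.write(payload[0]);
        } else {
            writeVInt(out, (docDelta << 1) | 1);  // odd: length-prefixed payload
            writeVInt(out, payload.length);
            out.write(payload, 0, payload.length);
        }
    }

    public static void main(String[] args) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writePosting(out, 5, new byte[] { 42 });          // single-byte payload
        writePosting(out, 3, new byte[] { 1, 2, 3, 4 });  // variable-length payload
        byte[] b = out.toByteArray();
        // 5<<1 = 10 (even), then 42; (3<<1)|1 = 7 (odd), length 4, then data
        System.out.println(b[0] + " " + b[1] + " " + b[2] + " " + b[3]);
    }
}
```

A reader would reverse this: read the VInt, take its low bit to decide the payload shape, and shift right by one to recover the document delta.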


Please let me know what you think about my suggestions. If people
like this approach, then I can add the information to the Wiki
planning page and start working on it.


Best Regards,
  Michael Busch


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: Flexible index format / Payloads Cont'd

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Jun 30, 2006, at 1:55 AM, Michael Busch wrote:

> So adding this payload feature to the Lucene core for a release 2.X
> is not a big risk in my opinion for the following reasons:
>   - API only extended
>   - Lucene 2.X will be able to read an index created with an earlier
>     version, because the Payload bit in FieldInfos will always be 0
>     then.
>   - Payloads are disabled by default. They will only be enabled by
>     using the new API.
>   - If Payloads are disabled, then Lucene 2.0 is able to read an index
>     created with Lucene 2.X, because the file formats don't change
>     at all in that case.
>
> So we could go ahead and add this to 2.X and keep working on the more
> fundamental changes for Lucene 3. Sounds like a plan?

Michael, FWIW, I wouldn't support this change, though I think it  
would make a good addition to a plugin regime, and we all benefit  
from seeing such a finely-crafted proposal.

First, I wouldn't add *any* more flags to Field.  IMO, it's gotten  
too big, and it's time to refactor by replacing conditionals with  
polymorphism.  Second, I think the payload mechanism should be  
introduced into Lucene as a private API.  If and when it gets exposed  
as public, it will have gone through some refinement and refactoring  
first.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/




Re: Flexible index format / Payloads Cont'd

Posted by Michael Busch <bu...@gmail.com>.
Marvin Humphrey wrote:
>
> Personally, I'm less interested in adding new features than I am in 
> solidifying and improving the core.
>
> The benefits I care about are:
>
>   * Decouple Lucene from its file format.
>     o Make back-compatibility easier.
>     o Make refactoring easier.
>     o All the other goodness that comes with loose coupling.
>   * Improve IR precision, by writing a Boolean Scorer that
>     takes position into account, a la Brin/Page '98.
>   * Decrease time to launch a Searcher from rest.
>   * Simplify Lucene, conceptually.
>     o Indexes would have three parts: Term dictionary,
>       Postings, and Storage.
>     o Each part could be pluggable, following this format:
>       <header><object>+
>       * The de-serialization for each object is determined by
>         a plugin spec'd in the header.
>       * It's probably better to have separate header and data
>         files. 

>> 3. Optional: Add a type-system for the payloads to make it
>>    easier to develop PostingsWriter/Reader plugins.
>
> IMO, this should wait.  It's going to be freakishly difficult to get 
> this stuff to work and maintain the commitments that Doug has laid out 
> for backwards compatibility.  There's also going to be trade-offs, and 
> so I'd anticipate contentious, interminable debate along the lines of 
> the recent Java 1.4/1.5 thread once there's real code and it becomes 
> clear who's lost a clock tick or two.
>
> Actually, I think pushing this forward is going to be so difficult, 
> that I'll be focusing my attentions on implementing it elsewhere.

I understand that backward compatibility is a big concern. Doug pointed
out that Y.X+1 versions should be backward compatible with Y.X. The
things we are talking about (fundamental changes to the index data
structures, plugins) will break compatibility, so they should be
targeted for Lucene 3.

To have payloads in an earlier release 2.X, we could go a simpler way
and use the implementation I've done so far, which I'll finish soon. In
the following I'm going to describe this implementation in detail.

* File changes
   - Field Infos
     I'm using the sixth lowest-order bit of FieldBits, which is
     currently unused, to store whether payloads are enabled for a
     certain field.
   - Positions file
     For fields with disabled payloads, the format of the positions
     file does not change at all. If payloads are enabled, then a
     variable-length payload is stored with each position:

     ProxFile (.prx) --> <TermPositions>^TermCount
     TermPositions   --> <Positions>^DocFreq
     Positions       --> <PositionDelta, Payload>^Freq
     PositionDelta   --> VInt
     Payload         --> Byte+   

     Encoding of the Payload:
     If the payload is one byte long:
        - if the value of the byte is <128, the byte is stored as is
        - if the value of the byte is >=128, a byte 10000001 (0x81)
          is stored, followed by the payload byte itself
     If the payload is longer than one byte but shorter than 127 bytes:
        - a byte (0x80 | length) is stored, followed by the payload bytes
     If the payload length is >=127:
        - payload_length-127 is stored as a VInt, followed by the
          payload bytes
     If the payload length is 0:
        - a single byte 0x80 is stored. This is done to distinguish a
          payload with length=0 from a payload with length=1 and value=0
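These encoding rules can be sketched as an encoder. This is a standalone sketch, not the actual patch: class and method names are illustrative, and the VInt helper follows Lucene's variable-length integer convention.

```java
import java.io.ByteArrayOutputStream;

/** Standalone sketch of the per-position payload encoding described above. */
class PayloadEncodingSketch {
    // Lucene-style VInt: 7 data bits per byte, high bit = continuation.
    static void writeVInt(ByteArrayOutputStream out, int v) {
        while ((v & ~0x7F) != 0) {
            out.write((v & 0x7F) | 0x80);
            v >>>= 7;
        }
        out.write(v);
    }

    static void writePayload(ByteArrayOutputStream out, byte[] p) {
        if (p.length == 0) {
            out.write(0x80);                // empty payload marker
        } else if (p.length == 1) {
            int v = p[0] & 0xFF;
            if (v < 128) {
                out.write(v);               // single byte < 128: stored as is
            } else {
                out.write(0x81);            // escape marker, then the byte itself
                out.write(v);
            }
        } else if (p.length < 127) {
            out.write(0x80 | p.length);     // marker byte carries the length
            out.write(p, 0, p.length);
        } else {
            writeVInt(out, p.length - 127); // long payload: length-127 as VInt
            out.write(p, 0, p.length);
        }
    }
}
```

A matching decoder for the short cases branches on the first byte: high bit clear means a one-byte payload, 0x80 an empty one, 0x81 an escaped byte, and 0x82-0xFE a marker byte carrying the length.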
       

* API changes
   - org.apache.lucene.index.Payload
     Added this class with the following constructor and getter method:
     * public Payload(byte[] value);
     * public byte[] getValue();

   - org.apache.lucene.analysis.Token
     Added two new constructors and getter/setter:
     * public Token(String text, int start, int end, Payload payload);
     * public Token(String text, int start, int end, String typ,
                    Payload payload);
     * public Payload getPayload();
     * public void setPayload(Payload payload);


   - org.apache.lucene.document.Field
     Added PayloadParameter.YES/.NO to indicate whether a Field stores
     payloads, and added new constructors to create a field with
     payloads enabled:
     * public Field(String name, String value, Store store, Index index,
                    TermVector termVector, PayloadParameter payloadParam);
     * public Field(String name, String value, Store store, Index index,
                    TermVector termVector, Payload payload);
     * public Field(String name, Reader reader, TermVector termVector,
                    PayloadParameter payloadParam);

     Furthermore:
     * public Payload getPayload();
     * public boolean isPayloadStored();

   - org.apache.lucene.index.TermPositions
     Added the new method:
     * public Payload getPayload() throws IOException;
     Remark: In contrast to nextPosition(), this method does not move the
             pointer in the prox file. Therefore it should always be
             called after nextPosition().


So adding this payload feature to the Lucene core for a release 2.X
is not a big risk in my opinion for the following reasons:
   - API only extended
   - Lucene 2.X will be able to read an index created with an earlier
     version, because the Payload bit in FieldInfos will always be 0 then.
   - Payloads are disabled by default. They will only be enabled by
     using the new API.
   - If Payloads are disabled, then Lucene 2.0 is able to read an index
     created with Lucene 2.X, because the file formats don't change at
     all in that case.

So we could go ahead and add this to 2.X and keep working on the more
fundamental changes for Lucene 3. Sounds like a plan?

>
>
>> 5. Develop new or extend existing PostingsWriter/Reader plugins for
>>   desired features like XML search, POS, multi-faceted search, ...
>
> People will definitely want to scratch their own itches, but I'd argue 
> that this stuff should start out private.  And maybe stay that way!

I agree with that. We should focus on improving the Lucene core and start
offering a flexible payload mechanism, so that people can start developing
their own stuff. Later, if people submit good solutions, those might be
good candidates for contrib.

>
> Marvin Humphrey
> Rectangular Research
> http://www.rectangular.com/
Regards,
  Michael Busch



Re: Flexible index format / Payloads Cont'd

Posted by Daniel John Debrunner <dj...@apache.org>.
Marvin Humphrey wrote:

> 
> On Jun 30, 2006, at 6:32 AM, Daniel John Debrunner wrote:
> 
>> Marvin Humphrey wrote:
>>
>>> IMO, this should wait.  It's going to be freakishly difficult to get
>>> this stuff to work and maintain the commitments that Doug has  laid  out
>>> for backwards compatibility.
>>
>>
>> For newcomers to the project is there a link to these commitments?
>> I looked around the Lucene site and searched the archives for the
>> java-dev list but nothing obvious came up.
> 
> 
> Here's the relevant java-dev post:
> 
> http://xrl.us/ntmw (Link to mail-archives.apache.org)

Thanks very much, I'll investigate and add a page to the wiki (unless
one already exists).

Dan.





Re: Flexible index format / Payloads Cont'd

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Jun 30, 2006, at 6:32 AM, Daniel John Debrunner wrote:

> Marvin Humphrey wrote:
>
>> IMO, this should wait.  It's going to be freakishly difficult to get
>> this stuff to work and maintain the commitments that Doug has  
>> laid  out
>> for backwards compatibility.
>
> For newcomers to the project is there a link to these commitments?
> I looked around the Lucene site and searched the archives for the
> java-dev list but nothing obvious came up.

Here's the relevant java-dev post:

http://xrl.us/ntmw (Link to mail-archives.apache.org)

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/




Re: Flexible index format / Payloads Cont'd

Posted by Daniel John Debrunner <dj...@apache.org>.
Marvin Humphrey wrote:

> IMO, this should wait.  It's going to be freakishly difficult to get 
> this stuff to work and maintain the commitments that Doug has laid  out
> for backwards compatibility.

For newcomers to the project is there a link to these commitments?
I looked around the Lucene site and searched the archives for the
java-dev list but nothing obvious came up.

Dan.



Proximity-enhanced boolean scoring (was: Re: Flexible index format / Payloads Cont'd)

Posted by Nadav Har'El <ny...@math.technion.ac.il>.
On Wed, Jul 05, 2006, Paul Elschot wrote about "Re: Flexible index format / Payloads Cont'd":
> > Ok, then, I thought to myself - the normal queries and scorers only work
> > on the document level and don't use positions - but SpanQueries have 
> positions
> > so I can create some sort of ProximityBooleanSpanQuery, right? Well,
> > unfortunately, I couldn't figure out how, yet. SpanScorer is a Scorer
>...
> 
> SpanQueries can be nested because they pass around
> Spans to higher levels for scoring at the top level of the proximity.

Ok, I've started writing a class I call ProximityBooleanQuery, which unlike
BooleanQuery will need to contain SpanQueries, not just any Query, as clauses.

My idea is that ProximityBooleanQuery's Scorer will sum the scores of
the individual clauses (just like in BooleanQuery), but further
increase the score depending on how many matches we find nearby in the
same document (to figure this out, I will use the Spans contained in
the sub-queries).

One of the peculiar things I noticed while experimenting with this approach
is that SpanTermQuery's scorer is different from the regular TermQuery's
scorer - its scores always appear multiplied by 1/sqrt(2) compared to
TermQuery's scores. Is this deliberate? If not, should it perhaps be fixed?

> So a minimum form of "ProximityBooleanSpanQuery" is already there
> in Lucene. It is implemented by using a SpanScorer as a subscorer
> of a BooleanScorer2, and by having this SpanScorer use the proximity
> information passed up from the bottom level SpanTermQueries, normally
> via some other SpanQuery like SpanNearQuery.

I'm not sure I understand what you mean. Perhaps you mean something like
the simple solution I described in a previous mail, where I added to a
normal BooleanQuery several additional SpanNearQueries, one for each pair
of terms in the query. This solution works quite well, but I thought it
was inefficient, which is why I was trying to come up with a more basic
solution.

> It might be possible to subclass Scorer to incorporate more position info,
> but SpanQueries have a slightly different take, they use Spans to pass 
> the position info around.
> This is also the reason why Lucene has some difficulty in weighting
> the subqueries of a SpanQuery: unlike a Scorer, a Spans does not have
> a score or weight value, and SpanScorer is used to provide the score, but
> only at the top level of the proximity structure.
> This could be changed adding a weight to Spans, or by adding some
> form of position info to (a subclass of) Scorer.

Yes, I think you described the situation well. At this stage, I'll continue
to try to develop this feature using Lucene's existing Spans/SpanQuery
framework. I hope this is possible, because the ideas you raised (adding
weight to Spans or spans to Scorer) will require significant changes to
many of Lucene's existing query types, or duplication of these query
types, something which I'd rather avoid if possible.

-- 
Nadav Har'El                        |     Thursday, Jul 6 2006, 10 Tammuz 5766
IBM Haifa Research Lab              |-----------------------------------------
                                    |Cement mixer collided with a prison van.
http://nadav.harel.org.il           |Look out for sixteen hardened criminals.



Re: Flexible index format / Payloads Cont'd

Posted by Paul Elschot <pa...@xs4all.nl>.
On Tuesday 04 July 2006 23:51, Nadav Har'El wrote:
...
> The problem is that Scorer, and its implementations - BooleanScorer2,
> DisjunctionSumScorer and ConjunctionScorer - only work on the document
> level. Scorer has next() and skipTo(), but no way to view positions
> inside the document. If you look at the lowest level Scorer, TermScorer,
> it uses TermDocs and not TermPositions.
> So I couldn't figure out a way to hack on BooleanScorer2 to change the
> score by positions.
> 
> Ok, then, I thought to myself - the normal queries and scorers only work
> on the document level and don't use positions - but SpanQueries have
> positions
> so I can create some sort of ProximityBooleanSpanQuery, right? Well,
> unfortunately, I couldn't figure out how, yet. SpanScorer is a Scorer
> as usual, and still doesn't have access to the positions. It does keep
> "spans", and gives a score according to their lengths, but I couldn't
> figure out how I could use this facility to do what we want.

SpanQueries can be nested because they pass around
Spans to higher levels for scoring at the top level of the proximity.
At the bottom level there is SpanTermQuery, which uses the positions
in the following way to create its Spans:

        public int doc() { return doc; }
        public int start() { return position; }
        public int end() { return position + 1; }

For the index format, the most interesting thing is what is not present
here: a weight per position.
Also, there is some redundancy in start() and end() here, but this is the
price of allowing nesting of SpanQueries.
All other SpanQueries combine these into other Spans, normally
with more distance between start() and end(). They also filter out
the Spans that do not match the query, for example SpanNearQuery.
At the top level of the proximity query, the Spans is scored by
SpanScorer to give a score value per document.
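The idea of combining term positions into wider spans, filtered by distance, can be illustrated without any Lucene types. The sketch below uses hypothetical names (it is not the SpanNearQuery implementation): given the sorted position lists of two terms in one document, it keeps the pairs within a slop and reports [start, end) spans like the doc()/start()/end() accessors quoted above.

```java
import java.util.ArrayList;
import java.util.List;

/** Standalone sketch of near-matching over two sorted position lists. */
class NearMatchSketch {
    /** Keep pairs of positions whose distance is at most slop. */
    static List<int[]> near(int[] aPositions, int[] bPositions, int slop) {
        List<int[]> spans = new ArrayList<>();
        int j = 0;
        for (int a : aPositions) {
            // skip b positions that are too far to the left of a
            while (j < bPositions.length && bPositions[j] < a - slop) j++;
            for (int k = j; k < bPositions.length && bPositions[k] <= a + slop; k++) {
                int start = Math.min(a, bPositions[k]);
                int end = Math.max(a, bPositions[k]) + 1; // end is exclusive, like end() above
                spans.add(new int[] { start, end });
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        // term A at positions 2 and 9, term B at 3 and 20, slop 7
        for (int[] s : near(new int[] { 2, 9 }, new int[] { 3, 20 }, 7))
            System.out.println(s[0] + ".." + s[1]);
    }
}
```

The resulting spans could themselves feed a higher-level combination, which is exactly the nesting property the Spans interface provides.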

So a minimum form of "ProximityBooleanSpanQuery" is already there
in Lucene. It is implemented by using a SpanScorer as a subscorer
of a BooleanScorer2, and by having this SpanScorer use the proximity
information passed up from the bottom level SpanTermQueries, normally
via some other SpanQuery like SpanNearQuery.

It might be possible to subclass Scorer to incorporate more position info,
but SpanQueries have a slightly different take, they use Spans to pass 
the position info around.
This is also the reason why Lucene has some difficulty in weighting
the subqueries of a SpanQuery: unlike a Scorer, a Spans does not have
a score or weight value, and SpanScorer is used to provide the score, but
only at the top level of the proximity structure.
This could be changed adding a weight to Spans, or by adding some
form of position info to (a subclass of) Scorer.

> 
> Lastly, I looked at what LoosePhraseScorer looks like, to understand how
> phrases do get to use positions. It appears that this scorer gets
> initialized with the TermPositions of each term, which includes the
> positions. This is great, but it means that a phrase can only contain
> terms (words) - LoosePhraseScorer could not handle more complex
> sub-queries, and their

PhraseScorers cannot be nested because they do not provide a Spans.
However, they might be extended to provide a Spans, and this would be
somewhat more efficient because the redundancy in start() and end() of
the Spans of the SpanTermQueries would be avoided.

> own Scorers. But it would have been nice if the proximity-enhanced boolean
> query could support not just term sub-queries.

How would you like the proximity information for nested proximity
queries to be passed around for scoring?
Using Spans is one way, but there are more, especially when a weight
per position becomes available.

Regards,
Paul Elschot



Re: Flexible index format / Payloads Cont'd

Posted by Nadav Har'El <ny...@math.technion.ac.il>.
On Fri, Jun 30, 2006, Marvin Humphrey wrote about "Re: Flexible index format / Payloads Cont'd":
> >On Thu, Jun 29, 2006, Marvin Humphrey wrote about "Re: Flexible  
> >index format / Payloads Cont'd":
> >>  * Improve IR precision, by writing a Boolean Scorer that
> >>    takes position into account, a la Brin/Page '98.
> >
> >Yes, I'd love to see that too (and it doesn't even require any new
> >payloads support, the positions that Lucene already has are enough).
> 
> True.  Any intrepid volunteers jonesing to hack on BooleanScorer2?   
> Yeeha!

I felt somewhat intrepid, so I decided to try and do this myself.
Unfortunately, it turns out to be much more complicated than I thought.

The problem is that Scorer, and its implementations - BooleanScorer2,
DisjunctionSumScorer and ConjunctionScorer - only work on the document
level. Scorer has next() and skipTo(), but no way to view positions
inside the document. If you look at the lowest level Scorer, TermScorer,
it uses TermDocs and not TermPositions.
So I couldn't figure out a way to hack on BooleanScorer2 to change the
score by positions.

Ok, then, I thought to myself - the normal queries and scorers only work
on the document level and don't use positions - but SpanQueries have positions
so I can create some sort of ProximityBooleanSpanQuery, right? Well,
unfortunately, I couldn't figure out how, yet. SpanScorer is a Scorer
as usual, and still doesn't have access to the positions. It does keep
"spans", and gives a score according to their lengths, but I couldn't
figure out how I could use this facility to do what we want.

Lastly, I looked at what LoosePhraseScorer looks like, to understand how
phrases do get to use positions. It appears that this scorer gets initialized
with the TermPositions of each term, which includes the positions. This
is great, but it means that a phrase can only contain terms (words) -
LoosePhraseScorer could not handle more complex sub-queries, and their
own Scorers. But it would have been nice if the proximity-enhanced boolean
query could support not just term sub-queries.

> Right now, the boolean scorers scan through freqs for all terms, but  
> positions for only some terms.  For common terms, which is where the  
> bulk of the cost lies in scoring, scanning through both freqs and
> positions involves a number of disk seeks, as .frq and .prx are  
> consumed in 1k chunks.  This is an area where OS caching is unlikely  
> to help too much, as we're talking about a lot of data.

I'm not sure I follow. You say the boolean scorers currently scan
positions for some terms, but I didn't see this happening. Or do you
mean the case where one of the clauses is, say, a phrase, in which case
the sub-scorer is the one that scans positions?

> A boolean scorer requiring that positions be read for *all* terms  
> will cost more.  However, by merging the freq and prox files, those  
> disk seeks are eliminated, as all the freq/prox data for a term can  
> be slurped up in one contiguous read.  That may serve to mitigate the  
> costs some.

You are absolutely right. 

> However, simple term queries, at least those against fields where  
> positions are stored, will cost more -- because it will be necessary  
> to scan past irrelevant positional data.  I think people who do a lot  
> of yes/no, unscored matches might be unhappy about that.

I think that just like we can say for a certain field that it shouldn't
have norms, it should be possible to say about a certain field that it
doesn't have positions. Consider a case where you know in advance that
a field's value will only be used to filter results, e.g., the field's
value is a list of categories the document belongs to. You never intend
to use this field in a phrase search or even to score matches, so you
simply don't need to store positions.
Such a simple posting list, containing just the list of documents without
any intra-document positions or payloads is sometimes known as "filter
posting list" or "binary posting list".

This per-field flag, "no positions", should be part of the general redesign
of the index structure that will also allow for per-document and per-position
payloads.

> One more note: Though payloads are not necessary for exploiting  
> positional data, associating a boost with each position opens the  
> door to an additional improvement in IR precision.  The Googs, for  
> instance, describe dedicating 4-8 bits per posting to text size, so  
> that e.g. text between <h1> tags gets weighted more heavily than text  
> between <p> tags.

Indeed.

If you want a "poor man's version" of their capability before
per-position payloads are added to Lucene, you can try this simple
trick: double every word inside the <h1>. This gives those words a
boost compared to the other words. Of course, it's easy to double
words, but you can't do this with fractions like 1.5 :-)
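This trick amounts to a token stream that emits each heading token twice so it contributes twice to term frequency. A minimal sketch (hypothetical names, not a Lucene TokenFilter):

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of the "poor man's boost": repeat each heading token so it
 *  counts double at scoring time. Integer "boosts" only, as noted above. */
class HeadingBoostSketch {
    static List<String> boostByDoubling(List<String> headingTokens) {
        List<String> out = new ArrayList<>();
        for (String t : headingTokens) {
            out.add(t); // original token
            out.add(t); // duplicate: doubles its term frequency
        }
        return out;
    }
}
```

For example, the heading tokens [lucene, payloads] come out as [lucene, lucene, payloads, payloads].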


-- 
Nadav Har'El                        |     Wednesday, Jul 5 2006, 9 Tammuz 5766
IBM Haifa Research Lab              |-----------------------------------------
                                    |Why do we drive on a parkway and park on
http://nadav.harel.org.il           |a driveway?



Re: Flexible index format / Payloads Cont'd

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Jun 30, 2006, at 6:07 AM, Nadav Har'El wrote:

> On Thu, Jun 29, 2006, Marvin Humphrey wrote about "Re: Flexible  
> index format / Payloads Cont'd":
>>   * Improve IR precision, by writing a Boolean Scorer that
>>     takes position into account, a la Brin/Page '98.
>
> Yes, I'd love to see that too (and it doesn't even require any new  
> payloads
> support, the positions that Lucene already has are enough).

True.  Any intrepid volunteers jonesing to hack on BooleanScorer2?   
Yeeha!

The reason I included this in my summary rather than separating it  
out into something we could do earlier was locality of reference.

Right now, the boolean scorers scan through freqs for all terms, but  
positions for only some terms.  For common terms, which is where the  
bulk of the cost lies in scoring, scanning through both freqs and  
positions involves a number of disk seeks, as .frq and .prx are  
consumed in 1k chunks.  This is an area where OS caching is unlikely  
to help too much, as we're talking about a lot of data.

A boolean scorer requiring that positions be read for *all* terms  
will cost more.  However, by merging the freq and prox files, those  
disk seeks are eliminated, as all the freq/prox data for a term can  
be slurped up in one contiguous read.  That may serve to mitigate the  
costs some.

However, simple term queries, at least those against fields where  
positions are stored, will cost more -- because it will be necessary  
to scan past irrelevant positional data.  I think people who do a lot  
of yes/no, unscored matches might be unhappy about that.

Generally, I'm concerned about anyone who has fine-tuned their system  
for search-time throughput.  Adding additional search-time costs may  
push some of these systems over the edge.  As a total package, I  
think the power of the changes easily justifies the price, and  
furthermore, IR precision cannot be bought with more hardware, while  
throughput can.  But I suspect there will be some interested parties  
who will disagree, and I'm sympathetic -- it would be a real bummer  
if costly "improvements" to BooleanScorer2 made your app unworkable.

BooleanScorer3 anyone?  Oi.

> I tried a small test using the Trec 8 corpus and query-relevance  
> judgements,
> and saw a noticable improvement in precision when I added a simplistic
> version of this feature: I "or"ed the original query words with
> SpanNearQuery's of each pair of words in the query, so the query of
> "hot dog bun" will be converted to something similar to:
>
> 	hot OR dog OR bun OR "hot dog"~7^0.25 "dog bun"~7^0.25 "hot  
> bun"~7^0.25

Nifty example!

One more note: Though payloads are not necessary for exploiting  
positional data, associating a boost with each position opens the  
door to an additional improvement in IR precision.  The Googs, for  
instance, describe dedicating 4-8 bits per posting to text size, so  
that e.g. text between <h1> tags gets weighted more heavily than text  
between <p> tags.
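For concreteness, a toy sketch of that idea: pack a 4-bit "text size" weight into a one-byte per-position payload and map it back to a scoring multiplier at search time. The tag weights and the boost range are made-up numbers, not anything Lucene or the Brin/Page paper specify.

```java
// Illustrative only: a 4-bit weight per position, a la Brin/Page '98,
// so text between <h1> tags scores higher than text between <p> tags.
// Tag weights and the boost range are placeholder values.
public class PositionBoostSketch {

    // Index time: choose a 4-bit weight (0-15) from the enclosing tag.
    public static byte encodeWeight(String tag) {
        switch (tag) {
            case "h1": return 15;  // strongest
            case "h2": return 10;
            case "b":  return 6;
            default:   return 3;   // plain <p> text
        }
    }

    // Search time: map the 4-bit weight back to a multiplier in [0.2, 1.0].
    public static float decodeBoost(byte payload) {
        return 0.2f + 0.8f * ((payload & 0x0F) / 15.0f);
    }

    public static void main(String[] args) {
        System.out.println(decodeBoost(encodeWeight("h1"))); // 1.0
        System.out.println(decodeBoost(encodeWeight("p")));
    }
}
```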

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: Flexible index format / Payloads Cont'd

Posted by Nadav Har'El <ny...@math.technion.ac.il>.
On Thu, Jun 29, 2006, Marvin Humphrey wrote about "Re: Flexible index format / Payloads Cont'd":
>   * Improve IR precision, by writing a Boolean Scorer that
>     takes position into account, a la Brin/Page '98.

Yes, I'd love to see that too (and it doesn't even require any new payloads
support, the positions that Lucene already has are enough).

I tried a small test using the TREC 8 corpus and query-relevance judgements,
and saw a noticeable improvement in precision when I added a simplistic
version of this feature: I "or"ed the original query words with
SpanNearQuerys of each pair of words in the query, so the query
"hot dog bun" is converted to something similar to:

	hot OR dog OR bun OR "hot dog"~7^0.25 "dog bun"~7^0.25 "hot bun"~7^0.25

But this "solution" is obviously not the best we can do: it is inefficient
(it goes through each posting list three times), and not tuned. A better
solution would be, as you said, to create a modified version of BooleanQuery's
scoring.
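The pairwise rewrite above can be sketched as plain query-string construction (in real Lucene code one would build a BooleanQuery of SpanNearQuerys instead). The helper below is hypothetical; the slop of 7 and boost of 0.25 are just the values from the test above, not tuned constants.

```java
// Hypothetical helper, not Lucene API: "or" the original terms with
// pairwise proximity clauses, each with a given slop and boost.
public class PairwiseExpander {
    public static String expand(String[] terms, int slop, String boost) {
        StringBuilder q = new StringBuilder();
        for (String t : terms) {
            if (q.length() > 0) q.append(" OR ");
            q.append(t);
        }
        // One proximity clause per unordered pair of query terms.
        for (int i = 0; i < terms.length; i++) {
            for (int j = i + 1; j < terms.length; j++) {
                q.append(" OR \"").append(terms[i]).append(' ')
                 .append(terms[j]).append("\"~").append(slop)
                 .append('^').append(boost);
            }
        }
        return q.toString();
    }

    public static void main(String[] args) {
        System.out.println(expand(new String[]{"hot", "dog", "bun"}, 7, "0.25"));
        // hot OR dog OR bun OR "hot dog"~7^0.25 OR "hot bun"~7^0.25 OR "dog bun"~7^0.25
    }
}
```

Note the quadratic growth in clauses: n terms yield n(n-1)/2 proximity clauses, one source of the inefficiency mentioned above.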

-- 
Nadav Har'El                        |       Friday, Jun 30 2006, 4 Tammuz 5766
IBM Haifa Research Lab              |-----------------------------------------
                                    |Give Yogi a rifle. Support your right to
http://nadav.harel.org.il           |arm bears!



Re: Flexible index format / Payloads Cont'd

Posted by Nicolas Lalevée <ni...@anyware-tech.com>.
Hi,

On Monday, 31 July 2006 at 17:28, robert engels wrote:
> Doing this breaks compatibility with non-Java Lucene implementations.

For me, such compatibility means file-format compatibility. Am I wrong?
In that case, I don't see any compatibility break, as the default
implementation of FieldsDataWriter is the actual one. And if I generate an
index with my custom writer, I will expect my index to be incompatible with
other implementations, even with other Java ones.

> Not sure it matters, but I thought I would point it out. I have
> always thought that Lucene should be compatible at an API level only,
> and MAYBE create a network access protocol for queries and updates.

I wasn't talking about network access... I don't see your point...

>
> On Jul 31, 2006, at 10:25 AM, Nicolas Lalevée wrote:
> > Le Vendredi 21 Juillet 2006 12:37, Marvin Humphrey a écrit :
> >> On Jul 21, 2006, at 1:23 AM, Nicolas Lalevée wrote:
> >>> In fact, that was my first implementaion. The problem with that is
> >>> you can
> >>> only store one value. But thinking a little more about it, storing
> >>> one or
> >>> more value is not an issue, because with the solution I proposed,
> >>> no space is
> >>> saved at all.
> >>> In fact, when I thought about this format of field metadata, I was
> >>> thinking
> >>> about a way to make the Lucene user specify how to store it in the
> >>> Lucene
> >>> index format. For instance, the simple one would specify that it's
> >>> a pointeur
> >>> on some metadata (as you proposed), another one would specify that
> >>> there are
> >>> two pointeurs (in my use case, one for type, the other one for the
> >>> language),
> >>> and another one whould specify that it will be store directly as
> >>> it is
> >>> actually an integer (so no need to make a pointer on integer. But
> >>> it was just
> >>> a thought, I don't know if it is possible. WDYT ?
> >>
> >> I'm thinking that there would be a codecs file, say with the
> >> extension .cdx and this format:
> >>
> >>    Codecs (.cdx)  --> CodecCount, <CodecClassName>CodecCount
> >>    CodecCount     --> Uint32
> >>    CodecClassName --> String
> >>
> >> That file would be read in its entirety when the index was
> >> initialized and expanded into an array of codec objects, one per
> >> CodecClassName.
> >>
> >> The .fdx file would add an additional int per doc...
> >>
> >>    FieldIndex (.fdx) -->  <FieldValuesPosition,
> >>                            FieldValuesCodecNumber>SegSize
> >>    FieldValuesPosition    --> Uint64
> >>    FieldValuesCodecNumber --> Uint32
> >>
> >> Now, before you read any data from the .fdt file, you know how to
> >> interpret it.  You seek the .fdt IndexInput to the right spot, then
> >> feed it to the appropriate codec object from the codecs array.  The
> >> codec does the rest.  In your case, you might write a codec that
> >> would read a few bytes and strings of metadata up front.  Or you
> >> might have several different codecs, the identity of which indicates
> >> fixed values for certain metadata fields: FrenchDocument,
> >> ArabicDocument, etc.
> >>
> >> Would that scheme meet your needs?
> >
> > That looks good, but there is one restriction : it have to be per
> > document.
> > Let's explain a lit bit more my needs.
> >
> > In fact my app have to index some data which is structured in a RDF
> > graph.
> > Each rdf resource have a title and a description, each title and
> > description
> > being in different languages. The model we choose is to map a rdf
> > resource on
> > a document. Then the field name is the URI of the rdf property, and
> > the field
> > value is the litteral or other resource.
> > for instance :
> > doc1 : URI:http://foo.com   title:[en]foo   title:[fr]truc
> > So, in a document I will have several fields with different
> > languages. For my
> > use case, in fact I need only one "codec". It is a codec that will
> > get 3
> > values, 2 of them being optionnal : a language, a type, and a value.
> >
> > In fact I was thinking about a more generic version that will allow
> > the format
> > compatibility, keeping .fdx as is :
> >
> > FieldData (.fdt) -->  <DocFieldData>SegSize
> > DocFieldData --> FieldCount, <FieldNum, RawData>FieldCount
> >
> > And a default FieldsDataWriter will be the actual one, it will read
> > the
> > RawData as Bits, Value, with Value -->  String | BinaryValue,....
> > Then, for my app, I will provide some custom FieldsDataWriter that
> > will do
> > exactly what I want.
> >
> > What I don't know yet is how it breaks that API... because if I
> > want to
> > provide my own FieldsDataWriter, I would also want to have my own
> > implementation of Fieldable...
> > If you think this is a good idea, I will try to implement it.
> >
> > cheers,
> > Nicolas
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-dev-help@lucene.apache.org
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org



Re: Flexible index format / Payloads Cont'd

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Jul 31, 2006, at 8:25 AM, Nicolas Lalevée wrote:
>
> That looks good, but there is one restriction : it have to be per  
> document.

Yes, what I laid out was per-document - for each document, the fdx  
file would keep a file pointer and an integer mapping to a codec.

> In fact I was thinking about a more generic version that will allow  
> the format
> compatibility, keeping .fdx as is :
>
> FieldData (.fdt) -->  <DocFieldData>SegSize
> DocFieldData --> FieldCount, <FieldNum, RawData>FieldCount
>
> And a default FieldsDataWriter will be the actual one, it will read  
> the
> RawData as Bits, Value, with Value -->  String | BinaryValue,....
> Then, for my app, I will provide some custom FieldsDataWriter that  
> will do
> exactly what I want.

OK, that's quite similar, but with the info specifying how to  
deserialize the document stored in fdt rather than fdx.  However, I  
don't think what you're describing makes the field storage in Lucene  
arbitrarily extensible, since you're just going to override  
FieldsWriter/FieldsReader rather than modify them so that they can  
use arbitrary codecs.

I think what I want to do is turn Lucene into an Object-Oriented  
Database, or at least have Lucene adopt some characteristics of an  
ODBMS.  However, I haven't used a real ODBMS and I'm not up on the  
theory, so I can't say for sure.  I've been doing a little reading  
here and there on object databases, but I've been extraordinarily  
busy the last few weeks and haven't been able to study it in depth.

The main point is this:

Lucene users have diverse needs for what gets stored in the document/field
storage.  We've been meeting those needs by assigning more and more bit
flags.  That can't continue ad infinitum.  However, we *can* meet
everyone's needs by applying a variant of the "Replace Conditionals With
Polymorphism" refactoring technique...

http://xrl.us/p3kn (Link to www.eli.sdsu.edu)

Think of those bit flags as an if-else chain.  Instead of all those
conditionals describing all the attributes of the Lucene Document you
want to store at that file pointer, we allow you to put whatever kind
of serialized object you desire there.  Maybe it's a Lucene Document.
Maybe it's a FrenchDocument.  Maybe it's a RussianDocument.  Maybe it's
a wrapped-up jpg.  You choose.

Instead of continually adding to the complexity of the deserialization
algorithm, we make that deserialization algorithm user-definable.
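A minimal sketch of what that refactoring could look like: each stored entry names a codec by number, and the reader dispatches to it polymorphically instead of interpreting bit flags in one big conditional. The interface and codec names are hypothetical, not Lucene API.

```java
import java.util.*;

// Hypothetical: a codec knows how to turn raw stored bytes into an object.
interface DocCodec {
    Object deserialize(byte[] raw);
}

// "Replace Conditionals With Polymorphism" applied to stored fields:
// the codec number stored with each doc indexes into a per-index menu.
public class CodecRegistry {
    private final List<DocCodec> codecs = new ArrayList<>();

    // Codec number = position in the per-index menu (the .cdx idea).
    public int register(DocCodec codec) {
        codecs.add(codec);
        return codecs.size() - 1;
    }

    public Object read(int codecNumber, byte[] raw) {
        return codecs.get(codecNumber).deserialize(raw);  // no if-else chain
    }

    public static void main(String[] args) {
        CodecRegistry registry = new CodecRegistry();
        int plain = registry.register(raw -> new String(raw));
        int upper = registry.register(raw -> new String(raw).toUpperCase());
        byte[] doc = "hello".getBytes();
        System.out.println(registry.read(plain, doc)); // hello
        System.out.println(registry.read(upper, doc)); // HELLO
    }
}
```

Adding a new document flavor then means registering a new codec, not growing the deserialization conditional.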

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/




Re: Flexible index format / Payloads Cont'd

Posted by robert engels <re...@ix.netcom.com>.
Doing this breaks compatibility with non-Java Lucene implementations.  
Not sure it matters, but I thought I would point it out. I have  
always thought that Lucene should be compatible at an API level only,  
and MAYBE create a network access protocol for queries and updates.

On Jul 31, 2006, at 10:25 AM, Nicolas Lalevée wrote:

> Le Vendredi 21 Juillet 2006 12:37, Marvin Humphrey a écrit :
>> On Jul 21, 2006, at 1:23 AM, Nicolas Lalevée wrote:
>>> In fact, that was my first implementaion. The problem with that is
>>> you can
>>> only store one value. But thinking a little more about it, storing
>>> one or
>>> more value is not an issue, because with the solution I proposed,
>>> no space is
>>> saved at all.
>>> In fact, when I thought about this format of field metadata, I was
>>> thinking
>>> about a way to make the Lucene user specify how to store it in the
>>> Lucene
>>> index format. For instance, the simple one would specify that it's
>>> a pointeur
>>> on some metadata (as you proposed), another one would specify that
>>> there are
>>> two pointeurs (in my use case, one for type, the other one for the
>>> language),
>>> and another one whould specify that it will be store directly as  
>>> it is
>>> actually an integer (so no need to make a pointer on integer. But
>>> it was just
>>> a thought, I don't know if it is possible. WDYT ?
>>
>> I'm thinking that there would be a codecs file, say with the
>> extension .cdx and this format:
>>
>>    Codecs (.cdx)  --> CodecCount, <CodecClassName>CodecCount
>>    CodecCount     --> Uint32
>>    CodecClassName --> String
>>
>> That file would be read in its entirety when the index was
>> initialized and expanded into an array of codec objects, one per
>> CodecClassName.
>>
>> The .fdx file would add an additional int per doc...
>>
>>    FieldIndex (.fdx) -->  <FieldValuesPosition,
>>                            FieldValuesCodecNumber>SegSize
>>    FieldValuesPosition    --> Uint64
>>    FieldValuesCodecNumber --> Uint32
>>
>> Now, before you read any data from the .fdt file, you know how to
>> interpret it.  You seek the .fdt IndexInput to the right spot, then
>> feed it to the appropriate codec object from the codecs array.  The
>> codec does the rest.  In your case, you might write a codec that
>> would read a few bytes and strings of metadata up front.  Or you
>> might have several different codecs, the identity of which indicates
>> fixed values for certain metadata fields: FrenchDocument,
>> ArabicDocument, etc.
>>
>> Would that scheme meet your needs?
>
> That looks good, but there is one restriction : it have to be per  
> document.
> Let's explain a lit bit more my needs.
>
> In fact my app have to index some data which is structured in a RDF  
> graph.
> Each rdf resource have a title and a description, each title and  
> description
> being in different languages. The model we choose is to map a rdf  
> resource on
> a document. Then the field name is the URI of the rdf property, and  
> the field
> value is the litteral or other resource.
> for instance :
> doc1 : URI:http://foo.com   title:[en]foo   title:[fr]truc
> So, in a document I will have several fields with different  
> languages. For my
> use case, in fact I need only one "codec". It is a codec that will  
> get 3
> values, 2 of them being optionnal : a language, a type, and a value.
>
> In fact I was thinking about a more generic version that will allow  
> the format
> compatibility, keeping .fdx as is :
>
> FieldData (.fdt) -->  <DocFieldData>SegSize
> DocFieldData --> FieldCount, <FieldNum, RawData>FieldCount
>
> And a default FieldsDataWriter will be the actual one, it will read  
> the
> RawData as Bits, Value, with Value -->  String | BinaryValue,....
> Then, for my app, I will provide some custom FieldsDataWriter that  
> will do
> exactly what I want.
>
> What I don't know yet is how it breaks that API... because if I  
> want to
> provide my own FieldsDataWriter, I would also want to have my own
> implementation of Fieldable...
> If you think this is a good idea, I will try to implement it.
>
> cheers,
> Nicolas
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>




Re: Flexible index format / Payloads Cont'd

Posted by Nicolas Lalevée <ni...@anyware-tech.com>.
On Friday, 21 July 2006 at 12:37, Marvin Humphrey wrote:
> On Jul 21, 2006, at 1:23 AM, Nicolas Lalevée wrote:
> > In fact, that was my first implementaion. The problem with that is
> > you can
> > only store one value. But thinking a little more about it, storing
> > one or
> > more value is not an issue, because with the solution I proposed,
> > no space is
> > saved at all.
> > In fact, when I thought about this format of field metadata, I was
> > thinking
> > about a way to make the Lucene user specify how to store it in the
> > Lucene
> > index format. For instance, the simple one would specify that it's
> > a pointeur
> > on some metadata (as you proposed), another one would specify that
> > there are
> > two pointeurs (in my use case, one for type, the other one for the
> > language),
> > and another one whould specify that it will be store directly as it is
> > actually an integer (so no need to make a pointer on integer. But
> > it was just
> > a thought, I don't know if it is possible. WDYT ?
>
> I'm thinking that there would be a codecs file, say with the
> extension .cdx and this format:
>
>    Codecs (.cdx)  --> CodecCount, <CodecClassName>CodecCount
>    CodecCount     --> Uint32
>    CodecClassName --> String
>
> That file would be read in its entirety when the index was
> initialized and expanded into an array of codec objects, one per
> CodecClassName.
>
> The .fdx file would add an additional int per doc...
>
>    FieldIndex (.fdx) -->  <FieldValuesPosition,
>                            FieldValuesCodecNumber>SegSize
>    FieldValuesPosition    --> Uint64
>    FieldValuesCodecNumber --> Uint32
>
> Now, before you read any data from the .fdt file, you know how to
> interpret it.  You seek the .fdt IndexInput to the right spot, then
> feed it to the appropriate codec object from the codecs array.  The
> codec does the rest.  In your case, you might write a codec that
> would read a few bytes and strings of metadata up front.  Or you
> might have several different codecs, the identity of which indicates
> fixed values for certain metadata fields: FrenchDocument,
> ArabicDocument, etc.
>
> Would that scheme meet your needs?

That looks good, but there is one restriction: it has to be per document.
Let me explain my needs a little more.

My app has to index some data which is structured as an RDF graph.
Each RDF resource has a title and a description, each title and description
being in different languages. The model we chose is to map an RDF resource
onto a document. The field name is then the URI of the RDF property, and the
field value is the literal or another resource.
For instance:
doc1 : URI:http://foo.com   title:[en]foo   title:[fr]truc
So, in a document I will have several fields in different languages. For my
use case, I actually need only one "codec": a codec that will carry three
values, two of them optional: a language, a type, and a value.

In fact I was thinking about a more generic version that preserves format
compatibility, keeping .fdx as is:

FieldData (.fdt) -->  <DocFieldData>SegSize
DocFieldData --> FieldCount, <FieldNum, RawData>FieldCount

The default FieldsDataWriter will be the actual one; it will read the
RawData as Bits, Value, with Value -->  String | BinaryValue, ....
Then, for my app, I will provide a custom FieldsDataWriter that does
exactly what I want.

What I don't know yet is how this breaks the API... because if I want to
provide my own FieldsDataWriter, I would also want to have my own
implementation of Fieldable...
If you think this is a good idea, I will try to implement it.

cheers,
Nicolas



Re: Flexible index format / Payloads Cont'd

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Jul 21, 2006, at 1:23 AM, Nicolas Lalevée wrote:
> In fact, that was my first implementaion. The problem with that is  
> you can
> only store one value. But thinking a little more about it, storing  
> one or
> more value is not an issue, because with the solution I proposed,  
> no space is
> saved at all.
> In fact, when I thought about this format of field metadata, I was  
> thinking
> about a way to make the Lucene user specify how to store it in the  
> Lucene
> index format. For instance, the simple one would specify that it's  
> a pointeur
> on some metadata (as you proposed), another one would specify that  
> there are
> two pointeurs (in my use case, one for type, the other one for the  
> language),
> and another one whould specify that it will be store directly as it is
> actually an integer (so no need to make a pointer on integer. But  
> it was just
> a thought, I don't know if it is possible. WDYT ?

I'm thinking that there would be a codecs file, say with the  
extension .cdx and this format:

   Codecs (.cdx)  --> CodecCount, <CodecClassName>CodecCount
   CodecCount     --> Uint32
   CodecClassName --> String

That file would be read in its entirety when the index was  
initialized and expanded into an array of codec objects, one per  
CodecClassName.

The .fdx file would add an additional int per doc...

   FieldIndex (.fdx) -->  <FieldValuesPosition,
                           FieldValuesCodecNumber>SegSize
   FieldValuesPosition    --> Uint64
   FieldValuesCodecNumber --> Uint32

Now, before you read any data from the .fdt file, you know how to  
interpret it.  You seek the .fdt IndexInput to the right spot, then  
feed it to the appropriate codec object from the codecs array.  The  
codec does the rest.  In your case, you might write a codec that  
would read a few bytes and strings of metadata up front.  Or you  
might have several different codecs, the identity of which indicates  
fixed values for certain metadata fields: FrenchDocument,  
ArabicDocument, etc.

Would that scheme meet your needs?
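For concreteness, a self-contained sketch of reading and writing that codecs file per the spec above (CodecCount, then CodecCount class-name strings). One caveat: Lucene's real String encoding is a VInt length prefix followed by the characters; writeUTF/readUTF are used here only to keep the sketch runnable without Lucene.

```java
import java.io.*;
import java.util.*;

// Sketch of the proposed .cdx codecs file: a 32-bit CodecCount followed
// by CodecCount class-name strings. writeUTF is a stand-in for Lucene's
// own String encoding; this is not Lucene code.
public class CdxFileSketch {
    public static byte[] write(List<String> codecClassNames) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeInt(codecClassNames.size());            // CodecCount
            for (String name : codecClassNames) out.writeUTF(name);
            return bytes.toByteArray();
        } catch (IOException e) { throw new UncheckedIOException(e); }
    }

    // Read the whole file up front and expand it into the codec-name menu.
    public static List<String> read(byte[] data) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            int count = in.readInt();
            List<String> names = new ArrayList<>();
            for (int i = 0; i < count; i++) names.add(in.readUTF());
            return names;
        } catch (IOException e) { throw new UncheckedIOException(e); }
    }

    public static void main(String[] args) {
        byte[] cdx = write(Arrays.asList("FrenchDocument", "ArabicDocument"));
        System.out.println(read(cdx)); // [FrenchDocument, ArabicDocument]
    }
}
```

At index-open time each name would be instantiated into a codec object, and the extra int per doc in .fdx would index into that array.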

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/




Re: Flexible index format / Payloads Cont'd

Posted by Nicolas Lalevée <ni...@anyware-tech.com>.
On Thursday, 20 July 2006 at 22:18, Marvin Humphrey wrote:
> On Jul 19, 2006, at 10:26 AM, Nicolas Lalevée wrote:
> > Then I looked deeper in the Lucene file format, and I manage to
> > introduce some
> > generic field metadata without breaking the file format
> > compatibility. I just
> > used another bit of the "Bits" to mark that there is or not some
> > metadata on
> > the field. And the metadata is stored next to it :
> > DocFieldData --> FieldCount, <FieldNum, Bits, FieldMetadata,
> > Value>^FieldCount
> > FieldMetadata --> ValueSize, <Byte>^ValueSize
>
> My thought is instead of providing an ever-lengthening fixed menu of
> field-types to choose from, that the menu should be per-index and the
> codec should be indicated by an integer pointing to a spot on that menu.

In fact, that was my first implementation. The problem with that is you can 
only store one value. But thinking a little more about it, storing one or 
more values is not an issue, because with the solution I proposed, no space 
is saved at all.
In fact, when I thought about this format of field metadata, I was thinking 
about a way to let the Lucene user specify how to store it in the Lucene 
index format. For instance, a simple one would specify that it's a pointer 
to some metadata (as you proposed), another would specify that there are two 
pointers (in my use case, one for type, the other for the language), and 
another would specify that the value is stored directly, as it is actually 
an integer (so no need for a pointer to an integer). But it was just a 
thought; I don't know if it is possible. WDYT?

> > Does this feature interest the Lucene commiters ? Should I provide
> > a patch in
> > Jira? If not, is there any common place where to provide some patch
> > for some
> > Lucene hackers (ie not necessaraily commiters) ?
> >
> > So, Marvin, could you provide your patch about payload ?
>
> I'm totally slammed this month because I got a talk accepted at OSCON
> late and so I'm taking an unexpected week off in the midst of a very
> busy time.

So, have a nice OSCON ! ;)

> There is not a patch per se, in any case. 

Oh yes, of course. In fact Michael has already done something; I mixed up 
the names, sorry.
So, Michael, could you provide your patch for payloads?

> > And is there a wiki page where there is a starting point about
> > defining the
> > future index format ?
>
> http://wiki.apache.org/jakarta-lucene/FlexibleIndexing

ok thank you.

Nicolas



Re: Flexible index format / Payloads Cont'd

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Jul 19, 2006, at 10:26 AM, Nicolas Lalevée wrote:

> Then I looked deeper in the Lucene file format, and I manage to  
> introduce some
> generic field metadata without breaking the file format  
> compatibility. I just
> used another bit of the "Bits" to mark that there is or not some  
> metadata on
> the field. And the metadata is stored next to it :
> DocFieldData --> FieldCount, <FieldNum, Bits, FieldMetadata,  
> Value>^FieldCount
> FieldMetadata --> ValueSize, <Byte>^ValueSize

My thought is instead of providing an ever-lengthening fixed menu of  
field-types to choose from, that the menu should be per-index and the  
codec should be indicated by an integer pointing to a spot on that menu.

> Does this feature interest the Lucene commiters ? Should I provide  
> a patch in
> Jira? If not, is there any common place where to provide some patch  
> for some
> Lucene hackers (ie not necessaraily commiters) ?
>
> So, Marvin, could you provide your patch about payload ?

I'm totally slammed this month because I got a talk accepted at OSCON  
late and so I'm taking an unexpected week off in the midst of a very  
busy time.  There is not a patch per se, in any case.

> And is there a wiki page where there is a starting point about  
> defining the
> future index format ?

http://wiki.apache.org/jakarta-lucene/FlexibleIndexing

Best,

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/




Re: Flexible index format / Payloads Cont'd

Posted by Nicolas Lalevée <ni...@anyware-tech.com>.
On Wednesday, 5 July 2006 at 13:23, Michael Busch wrote:
> Doug Cutting wrote:
> > Marvin Humphrey wrote:
> >> IMO, this should wait.  It's going to be freakishly difficult to get
> >> this stuff to work and maintain the commitments that Doug has laid
> >> out for backwards compatibility.
> >
> > Perhaps we can implement an all-new index format, in a new package.
> > An implementation of IndexReader can be provided to integrate with
> > existing search code.  And the ability to add an IndexReader to an
> > index can be provided to upgrade existing indexes to the new format.
> > So the new code would not need to be able to process an old index: the
> > old code can continue to do that.  Does that make sense?  Is that
> > "freakishly difficult"?  We'll need the ability to sniff a directory
> > and tell which version of index it contains, but that should not be
> > too hard.
> >
> > Doug
>
> +1. I agree that this approach would make it much easier to develop a
> new index format without the commitment of being backward-compatible. I
> would like to help working on a new index format. Who else is going to
> work on it?

I am also interested in improving Lucene. I took time to respond to this 
thread because I am quite new to Lucene, so I had to learn what you were 
talking about, in fact what a payload is. But here it is, I get it! :)

What I have to do is a web application which will do some faceted search. My 
current workaround is transforming each query into several queries, one per 
category. So I am interested in your current work.

I also had another issue with fields. Some fields can have a type (integer, 
date, string), and/or a language. It is typically some metadata on fields. 
The quick workaround I did was to put the info in the field between square 
brackets, so I had to write a SkipPrefixTokenizer... dirty, but quick to 
implement.
Then I looked deeper into the Lucene file format, and I managed to introduce 
some generic field metadata without breaking file format compatibility. I 
just used another bit of the "Bits" to mark whether or not there is metadata 
on the field. The metadata is stored next to it:
DocFieldData --> FieldCount, <FieldNum, Bits, FieldMetadata, Value>^FieldCount
FieldMetadata --> ValueSize, <Byte>^ValueSize
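The scheme above can be sketched as follows. The flag value (0x20) and the one-byte ValueSize are placeholders chosen for the sketch, not the actual "Bits" layout in Lucene's .fdt file.

```java
// Sketch of the backward-compatible extension: one spare bit in the
// per-field "Bits" byte flags whether a length-prefixed FieldMetadata
// block follows. Bit position and size encoding are placeholder choices.
public class FieldMetadataSketch {
    static final int HAS_METADATA = 0x20;  // hypothetical spare bit

    public static byte[] encodeField(byte bits, byte[] metadata) {
        boolean has = metadata != null && metadata.length > 0;
        int len = has ? metadata.length : 0;
        byte[] out = new byte[1 + (has ? 1 + len : 0)];
        out[0] = (byte) (has ? bits | HAS_METADATA : bits);
        if (has) {
            out[1] = (byte) len;  // ValueSize (one byte in this sketch)
            System.arraycopy(metadata, 0, out, 2, len);
        }
        return out;
    }

    // A reader checks the flag: old fields without the bit parse as before.
    public static byte[] decodeMetadata(byte[] field) {
        if ((field[0] & HAS_METADATA) == 0) return new byte[0];
        int len = field[1];
        byte[] metadata = new byte[len];
        System.arraycopy(field, 2, metadata, 0, len);
        return metadata;
    }

    public static void main(String[] args) {
        byte[] field = encodeField((byte) 0x01, "en".getBytes());
        System.out.println(new String(decodeMetadata(field))); // en
    }
}
```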

Does this feature interest the Lucene committers? Should I provide a patch 
in Jira? If not, is there any common place to provide patches for Lucene 
hackers (i.e. not necessarily committers)?

So, Marvin, could you provide your patch for payloads?
And is there a wiki page with a starting point for defining the future 
index format?

cheers,
Nicolas



Re: Flexible index format / Payloads Cont'd

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Jul 4, 2006, at 3:35 AM, Doug Cutting wrote:

> Marvin Humphrey wrote:
>> IMO, this should wait.  It's going to be freakishly difficult to  
>> get this stuff to work and maintain the commitments that Doug has  
>> laid out for backwards compatibility.
>
> Perhaps we can implement an all-new index format, in a new package.

/me whistles low and grins.

org.apache.lucene.invindex?  As in inverted index, InvIndexer, and  
IIReader?

org.apache.lucene.ix? As in IxWriter and IxReader?

> An implementation of IndexReader can be provided to integrate with  
> existing search code.  And the ability to add an IndexReader to an  
> index can be provided to upgrade existing indexes to the new  
> format.  So the new code would not need to be able to process an  
> old index: the old code can continue to do that.  Does that make  
> sense?  Is that "freakishly difficult"?

It's labor-intensive -- that's a lot of code, to write and to test!   
But it would be a lot of code regardless, and it probably introduces  
fewer bugs and complications putting everything in a new package than  
interweaving so much new stuff into the existing code base.

The difficulty of keeping two packages afloat simultaneously will  
depend on how loose the coupling is between org.apache.lucene.index  
and the rest of Lucene.

> We'll need the ability to sniff a directory and tell which version  
> of index it contains, but that should not be too hard.

As simple as touching a meaningless file, if need be.  But I'll be  
arguing for the introduction of a global field definition file, which  
would serve just fine for that purpose.  ;)

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/




Re: Flexible index format / Payloads Cont'd

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Jul 5, 2006, at 7:43 AM, Doug Cutting wrote:
> The folks working on Lucy are probably interested (Marvin & David).  
> Perhaps the first thing should be to specify the file format, then  
> implement it both in Java (for Lucene Java) and C (for Lucy).  
> Independent implementations will provide good compatibility  
> testing, and better validate the file format documentation.
>
> The specification could initially live in the wiki.

What about a formal electronic specification of the file format?  I  
hesitate to suggest XML, since there is no good reason XML makes  
sense as a general-purpose "language" (*wink* to Mr. Bray), but it  
is at least a common denominator among all languages.  A formal,  
machine-processable format specification would allow code generation  
of low-level I/O functions and, of course, rendering of the  
documentation itself in web-presentable form.

Having the current file format documentation structured in a  
computer-friendly, digestible way would be sweet.  Food for thought.

	Erik




Re: Flexible index format / Payloads Cont'd

Posted by Doug Cutting <cu...@apache.org>.
Michael Busch wrote:
> I would like to help working on a new index format.
> Who else is going to work on it?

The folks working on Lucy are probably interested (Marvin & David). 
Perhaps the first thing should be to specify the file format, then 
implement it both in Java (for Lucene Java) and C (for Lucy). 
Independent implementations will provide good compatibility testing, and 
better validate the file format documentation.

The specification could initially live in the wiki.

Doug



Re: Flexible index format / Payloads Cont'd

Posted by Michael Busch <bu...@gmail.com>.
Doug Cutting wrote:
> Marvin Humphrey wrote:
>> IMO, this should wait.  It's going to be freakishly difficult to get 
>> this stuff to work and maintain the commitments that Doug has laid 
>> out for backwards compatibility.
>
> Perhaps we can implement an all-new index format, in a new package.  
> An implementation of IndexReader can be provided to integrate with 
> existing search code.  And the ability to add an IndexReader to an 
> index can be provided to upgrade existing indexes to the new format.  
> So the new code would not need to be able to process an old index: the 
> old code can continue to do that.  Does that make sense?  Is that 
> "freakishly difficult"?  We'll need the ability to sniff a directory 
> and tell which version of index it contains, but that should not be 
> too hard.
>
> Doug
>
+1. I agree that this approach would make it much easier to develop a 
new index format without the commitment of being backward-compatible. I 
would like to help working on a new index format. Who else is going to 
work on it?



Re: Flexible index format / Payloads Cont'd

Posted by Doug Cutting <cu...@apache.org>.
Marvin Humphrey wrote:
> IMO, this should wait.  It's going to be freakishly difficult to get 
> this stuff to work and maintain the commitments that Doug has laid out 
> for backwards compatibility.

Perhaps we can implement an all-new index format, in a new package.  An 
implementation of IndexReader can be provided to integrate with existing 
search code.  And the ability to add an IndexReader to an index can be 
provided to upgrade existing indexes to the new format.  So the new code 
would not need to be able to process an old index: the old code can 
continue to do that.  Does that make sense?  Is that "freakishly 
difficult"?  We'll need the ability to sniff a directory and tell which 
version of index it contains, but that should not be too hard.

Doug



Re: Flexible index format / Payloads Cont'd

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Jun 29, 2006, at 2:22 PM, Michael Busch wrote:

>   - Is there a concrete design?

Not that I am aware of.

> I have the feeling that many people are interested in having a
> flexible index format. There are already various use cases:
>   - Efficient parametric search

This comes at the expense of a significant file size increase and  
performance hit.  Think a book index that not only lists page number  
but also category.

   axle => 3, 67, 89, 244

vs...

   axle => 3 cars, 67 cars, 89 trucks, 244 cars

Scanning through the latter is going to be more expensive.  It might  
be worth it in specific cases, but it's not the long-hoped-for  
panacea that would give Lucene all the features of an RDBMS without  
incurring any costs.  :)

>   - Part Of Speech (POS) annotations with each position

This is an example of where it might be worth it... to Grant, and  
Grant only.

Personally, I'm less interested in adding new features than I am in  
solidifying and improving the core.

The benefits I care about are:

   * Decouple Lucene from its file format.
     o Make back-compatibility easier.
     o Make refactoring easier.
     o All the other goodness that comes with loose coupling.
   * Improve IR precision, by writing a Boolean Scorer that
     takes position into account, a la Brin/Page '98.
   * Decrease time to launch a Searcher from rest.
   * Simplify Lucene, conceptually.
     o Indexes would have three parts: Term dictionary,
       Postings, and Storage.
     o Each part could be pluggable, following this format:
       <header><object>+
       * The de-serialization for each object is determined by
         a plugin spec'd in the header.
       * It's probably better to have separate header and data
         files.

> I would suggest splitting up the whole work into smaller work items
> with clearly defined milestones. Thus I suggest the
> following steps:
> 1. Introduce postings file with the following format:
>   <DocDelta, Payload>*
>     DocDelta --> VInt
>     DocDelta/2 is the difference between this document number and
>     the previous document number.
>     Payload --> Byte, if DocDelta is even
>     Payload --> <Payload_Length, Payload_Data>, if DocDelta is odd
>       Payload_Length --> VInt
>       Payload_Data   --> Byte^Payload_Length

Good stuff!  Now, if you put that whole thing in a plugin, you'll  
have the chance to refine it even after deployment if you think of a  
way to improve it -- by adding another plugin.  And, if it becomes  
too unwieldy and inflexible, you're not stuck with it.
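The even/odd DocDelta trick quoted above can be sketched in a few lines. This is an illustrative encoder only (class and method names are invented, and the VInt helper merely mimics Lucene's variable-length integer encoding): the low bit of DocDelta says whether the payload is a single byte or length-prefixed, and DocDelta/2 carries the document-number gap.

```java
import java.io.ByteArrayOutputStream;

// Sketch of the proposed <DocDelta, Payload> posting encoding.
// DocDelta/2 is the gap to the previous document number; the low bit
// signals single-byte (even) vs. length-prefixed (odd) payloads.
public class PayloadPostings {

    /** Write v as a variable-length int, 7 bits per byte, like Lucene's VInt. */
    static void writeVInt(ByteArrayOutputStream out, int v) {
        while ((v & ~0x7F) != 0) {
            out.write((v & 0x7F) | 0x80);
            v >>>= 7;
        }
        out.write(v);
    }

    /** Encode one posting: gap to the previous doc plus its payload bytes. */
    public static void writePosting(ByteArrayOutputStream out,
                                    int docGap, byte[] payload) {
        if (payload.length == 1) {
            writeVInt(out, docGap << 1);        // even DocDelta
            out.write(payload[0]);              // single payload byte
        } else {
            writeVInt(out, (docGap << 1) | 1);  // odd DocDelta
            writeVInt(out, payload.length);     // Payload_Length
            out.write(payload, 0, payload.length); // Payload_Data
        }
    }
}
```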

>   Furthermore, it should be possible to enable/disable payloads
>   on field level.

Maybe each field should get its own file, and its own  
encoding/decoding object.  Then you don't have to check each  
object/record to see which codec to use.

Or maybe there should be an array of codec objects, indexed by field  
number.

   fieldNum = input->readVint();
   decoders[fieldNum].read(input);
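Fleshing out the codec-array snippet above in Java might look like this. All names here are invented for illustration (this is not a real Lucene API), and readUnsignedByte stands in for a proper readVInt:

```java
import java.io.DataInput;
import java.io.IOException;

// Sketch of the per-field codec idea: one decoder per field, selected by
// field number, so individual records carry no per-record codec tag.
public class FieldCodecs {

    /** Hypothetical per-field payload decoder. */
    public interface PayloadDecoder {
        Object read(DataInput in) throws IOException;
    }

    private final PayloadDecoder[] decoders; // indexed by field number

    public FieldCodecs(PayloadDecoder[] decoders) {
        this.decoders = decoders;
    }

    /** Read the field number, then dispatch to that field's decoder. */
    public Object readNext(DataInput in) {
        try {
            int fieldNum = in.readUnsignedByte(); // stand-in for readVInt()
            return decoders[fieldNum].read(in);
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}
```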

> 2. Add multilevel skipping (tree structure) for the postings-file.
>   One-level skipping, as being used now in Lucene, is probably
>   not efficient enough for the new postings file, because it can
>   be very big. Question: Should we include skipping information
>   directly in the postings file, or should we introduce a new file
>   containing the skipping infos? I think it should improve cache
>   performance to have the skip tree in a different file.

Interesting.  I think I'd punt and leave it up to the plugin.  Maybe  
you'd have an extra large header if there was a lot of stuff to be  
cached.
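To make the multilevel-skipping idea concrete, here is a toy seek routine, assuming a fixed fanout where levels[i] holds every FANOUT^i-th doc id of the postings list (the structure and names are invented; real skip data would live in the postings file or a side file, as debated above):

```java
// Toy multilevel skip tree: each level stores every FANOUT-th entry of
// the level below, so a seek walks top-down and scans at most FANOUT
// entries per level instead of the whole bottom list.
public class MultiLevelSkip {
    static final int FANOUT = 4;

    /** Largest doc id <= target reachable via the skip tree, or -1. */
    public static int seek(int[][] levels, int target) {
        int pos = 0; // current scan position, per level
        for (int level = levels.length - 1; level >= 0; level--) {
            int[] skips = levels[level];
            // advance along this level while entries stay <= target
            while (pos < skips.length && skips[pos] <= target) pos++;
            if (--pos < 0) return -1;     // target precedes all entries
            if (level > 0) pos *= FANOUT; // descend: same doc, lower level
        }
        return levels[0][pos];
    }
}
```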

> 3. Optional: Add a type-system for the payloads to make it
>   easier to develop PostingsWriter/Reader plugins.

IMO, this should wait.  It's going to be freakishly difficult to get  
this stuff to work and maintain the commitments that Doug has laid  
out for backwards compatibility.  There are also going to be  
trade-offs, and so I'd anticipate contentious, interminable debate along  
the lines of the recent Java 1.4/1.5 thread once there's real code  
and it becomes clear who's lost a clock tick or two.

Actually, I think pushing this forward is going to be so difficult  
that I'll be focusing my attention on implementing it elsewhere.

> 4. Make the PostingsWriter/Reader pluggable and develop default
>   PostingsWriter/Reader plugins, that store frequencies, positions,
>   and norms as payloads in the postings file. Should be configurable,
>   to enable the different options Doug suggested:
>   a. <doc>+
>   b. <doc, boost>+
>   c. <doc, freq, <position>+ >+
>   d. <doc, freq, <position, boost>+ >+

Got any ideas as to how the Field constructors should look?
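One hypothetical answer to the Field-constructor question: pass an enum naming the four postings layouts Doug listed. Nothing below matches a real Lucene API; the class and enum names are purely illustrative.

```java
// Illustrative sketch of a Field constructor taking a postings-layout
// option, covering the four variants (a-d) quoted above.
public class FlexField {
    public enum Postings {
        DOCS,                        // a. <doc>+
        DOCS_BOOSTS,                 // b. <doc, boost>+
        DOCS_FREQS_POSITIONS,        // c. <doc, freq, <position>+ >+
        DOCS_FREQS_POSITIONS_BOOSTS  // d. <doc, freq, <position, boost>+ >+
    }

    private final String name;
    private final Postings postings;

    public FlexField(String name, Postings postings) {
        this.name = name;
        this.postings = postings;
    }

    /** True for the layouts that record term positions (c and d). */
    public boolean storesPositions() {
        return postings == Postings.DOCS_FREQS_POSITIONS
            || postings == Postings.DOCS_FREQS_POSITIONS_BOOSTS;
    }

    public String name() { return name; }
}
```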

> 5. Develop new or extend existing PostingsWriter/Reader plugins for
>   desired features like XML search, POS, multi-faceted search, ...

People will definitely want to scratch their own itches, but I'd  
argue that this stuff should start out private.  And maybe stay that  
way!

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

