Posted to dev@lucene.apache.org by Doug Cutting <cu...@apache.org> on 2006/07/04 12:35:52 UTC

Re: Flexible index format / Payloads Cont'd

Marvin Humphrey wrote:
> IMO, this should wait.  It's going to be freakishly difficult to get 
> this stuff to work and maintain the commitments that Doug has laid out 
> for backwards compatibility.

Perhaps we can implement an all-new index format, in a new package.  An 
implementation of IndexReader can be provided to integrate with existing 
search code.  And the ability to add an IndexReader to an index can be 
provided to upgrade existing indexes to the new format.  So the new code 
would not need to be able to process an old index: the old code can 
continue to do that.  Does that make sense?  Is that "freakishly 
difficult"?  We'll need the ability to sniff a directory and tell which 
version of index it contains, but that should not be too hard.

Doug

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Re: Flexible index format / Payloads Cont'd

Posted by Nicolas Lalevée <ni...@anyware-tech.com>.

Hi,

On Monday 31 July 2006 17:28, robert engels wrote:
> Doing this breaks compatibility with non-Java Lucene implementations.

To me, that kind of compatibility means file format compatibility. Am I wrong?
If that is what it means, I don't see any compatibility break, since the
default implementation of FieldsDataWriter is the current one. And if I
generate an index with my custom writer, I would expect it to be incompatible
with other implementations, even other Java ones.

> Not sure it matters, but I thought I would point it out. I have
> always thought that Lucene should be compatible at an API level only,
> and MAYBE create a network access protocol for queries and updates.

I didn't talk about network access... I don't see your point...

Re: Flexible index format / Payloads Cont'd

Posted by Marvin Humphrey <ma...@rectangular.com>.

On Jul 31, 2006, at 8:25 AM, Nicolas Lalevée wrote:
>
> That looks good, but there is one restriction: it has to be per
> document.

Yes, what I laid out was per-document - for each document, the fdx  
file would keep a file pointer and an integer mapping to a codec.

> In fact I was thinking about a more generic version that would
> preserve format compatibility, keeping .fdx as is:
>
> FieldData (.fdt) -->  <DocFieldData>SegSize
> DocFieldData --> FieldCount, <FieldNum, RawData>FieldCount
>
> The default FieldsDataWriter would be the current one: it would read
> the RawData as Bits, Value, with Value -->  String | BinaryValue, ...
> Then, for my app, I would provide a custom FieldsDataWriter that does
> exactly what I want.

OK, that's quite similar, but with the info specifying how to  
deserialize the document stored in fdt rather than fdx.  However, I  
don't think what you're describing makes the field storage in Lucene  
arbitrarily extensible, since you're just going to override  
FieldsWriter/FieldsReader rather than modify them so that they can  
use arbitrary codecs.

I think what I want to do is turn Lucene into an Object-Oriented  
Database, or at least have Lucene adopt some characteristics of an  
ODBMS.  However, I haven't used a real ODBMS and I'm not up on the  
theory, so I can't say for sure.  I've been doing a little reading  
here and there on object databases, but I've been extraordinarily  
busy the last few weeks and haven't been able to study it in depth.

The main point is this:

Lucene users have diverse needs for what gets stored in the document/ 
field storage.  We've been meeting those needs by assigning more and  
more bit flags.  That can't continue ad infinitum.  However, we
*can* meet everyone's needs by applying a variant of the "Replace  
Conditionals With Polymorphism" refactoring technique...

http://xrl.us/p3kn (Link to www.eli.sdsu.edu)

Think of those bit flags as an if-else chain.  Instead of all those  
conditionals describing all the attributes of the Lucene Document you  
want to store at that file pointer, we allow you to put whatever kind  
of serialized object you desire there.  Maybe it's a Lucene  
Document.  Maybe it's a FrenchDocument.  Maybe it's a
RussianDocument.  Maybe it's a wrapped-up jpg.  You choose.

Instead of continually adding to the complexity of the
deserialization algorithm, we make that deserialization algorithm
user-definable.
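
To illustrate the refactoring described above, here is a minimal Java sketch. It is not code from this thread or from Lucene: the flag values, StoredObjectCodec, PolymorphicDecoder, and the three codec classes are all hypothetical names chosen for the example.

```java
import java.nio.charset.StandardCharsets;

/** Flag-driven style: every new storage need adds a flag and a branch. */
final class FlagBasedDecoder {
    static final int IS_BINARY = 0x1;  // illustrative values, not Lucene's
    static final int IS_FRENCH = 0x2;

    private FlagBasedDecoder() {}

    static Object decode(int bits, byte[] raw) {
        if ((bits & IS_BINARY) != 0) {
            return raw.clone();
        } else if ((bits & IS_FRENCH) != 0) {
            return "[fr] " + new String(raw, StandardCharsets.UTF_8);
        } else {
            return new String(raw, StandardCharsets.UTF_8);
        }
    }
}

/** Polymorphic style: each kind of stored object knows how to read itself. */
interface StoredObjectCodec {
    Object decode(byte[] raw);
}

final class TextCodec implements StoredObjectCodec {
    public Object decode(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }
}

final class BinaryCodec implements StoredObjectCodec {
    public Object decode(byte[] raw) {
        return raw.clone();
    }
}

/** A user-supplied codec; adding it touches no existing class. */
final class FrenchDocumentCodec implements StoredObjectCodec {
    public Object decode(byte[] raw) {
        return "[fr] " + new String(raw, StandardCharsets.UTF_8);
    }
}

final class PolymorphicDecoder {
    private final StoredObjectCodec[] codecs;

    PolymorphicDecoder(StoredObjectCodec... codecs) {
        this.codecs = codecs.clone();
    }

    /** No conditionals: the codec number selects the behaviour. */
    Object decode(int codecNumber, byte[] raw) {
        return codecs[codecNumber].decode(raw);
    }
}
```

In the flag-based version a new document kind means editing decode(); in the polymorphic version it means registering one more codec object.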

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

Re: Flexible index format / Payloads Cont'd

Posted by robert engels <re...@ix.netcom.com>.

Doing this breaks compatibility with non-Java Lucene implementations.
Not sure it matters, but I thought I would point it out. I have  
always thought that Lucene should be compatible at an API level only,  
and MAYBE create a network access protocol for queries and updates.

Re: Flexible index format / Payloads Cont'd

Posted by Nicolas Lalevée <ni...@anyware-tech.com>.

On Friday 21 July 2006 12:37, Marvin Humphrey wrote:
> On Jul 21, 2006, at 1:23 AM, Nicolas Lalevée wrote:
> > In fact, that was my first implementation. The problem with that is
> > that you can only store one value. But thinking a little more about
> > it, storing one or more values is not an issue, because with the
> > solution I proposed, no space is saved at all.
> > In fact, when I thought about this format of field metadata, I was
> > thinking about a way to let the Lucene user specify how to store it
> > in the Lucene index format. For instance, the simplest one would
> > specify that it's a pointer to some metadata (as you proposed),
> > another one would specify that there are two pointers (in my use
> > case, one for the type, the other for the language), and another one
> > would specify that the value is stored directly because it is
> > actually an integer (so no need for a pointer to an integer). But it
> > was just a thought, I don't know if it is possible. WDYT?
>
> I'm thinking that there would be a codecs file, say with the
> extension .cdx and this format:
>
>    Codecs (.cdx)  --> CodecCount, <CodecClassName>CodecCount
>    CodecCount     --> Uint32
>    CodecClassName --> String
>
> That file would be read in its entirety when the index was
> initialized and expanded into an array of codec objects, one per
> CodecClassName.
>
> The .fdx file would add an additional int per doc...
>
>    FieldIndex (.fdx) -->  <FieldValuesPosition,
>                            FieldValuesCodecNumber>SegSize
>    FieldValuesPosition    --> Uint64
>    FieldValuesCodecNumber --> Uint32
>
> Now, before you read any data from the .fdt file, you know how to
> interpret it.  You seek the .fdt IndexInput to the right spot, then
> feed it to the appropriate codec object from the codecs array.  The
> codec does the rest.  In your case, you might write a codec that
> would read a few bytes and strings of metadata up front.  Or you
> might have several different codecs, the identity of which indicates
> fixed values for certain metadata fields: FrenchDocument,
> ArabicDocument, etc.
>
> Would that scheme meet your needs?

That looks good, but there is one restriction: it has to be per document.
Let me explain my needs a little more.

In fact, my app has to index some data that is structured as an RDF graph.
Each RDF resource has a title and a description, and each title and
description exists in different languages. The model we chose is to map an
RDF resource to a document. The field name is then the URI of the RDF
property, and the field value is the literal or another resource.
For instance:
doc1 : URI:http://foo.com   title:[en]foo   title:[fr]truc
So, in one document I will have several fields in different languages. For my
use case, I actually need only one "codec". It is a codec that takes 3
values, 2 of them optional: a language, a type, and a value.
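
As a rough sketch of what such a single codec could look like in Java (this is an illustration, not the code discussed in the thread): TypedValueCodec, the presence byte, and the use of DataOutputStream.writeUTF for strings are all assumptions made for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical codec for a (language, type, value) triple; language and type are optional. */
final class TypedValueCodec {
    private static final int HAS_LANGUAGE = 0x1;
    private static final int HAS_TYPE = 0x2;

    private TypedValueCodec() {}

    /** Pass null for an absent language or type; value must not be null. */
    static byte[] encode(String language, String type, String value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            int presence = (language != null ? HAS_LANGUAGE : 0)
                         | (type != null ? HAS_TYPE : 0);
            out.writeByte(presence);              // which optional parts follow
            if (language != null) out.writeUTF(language);
            if (type != null) out.writeUTF(type);
            out.writeUTF(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Returns {language, type, value}; absent entries are null. */
    static String[] decode(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            int presence = in.readUnsignedByte();
            String language = (presence & HAS_LANGUAGE) != 0 ? in.readUTF() : null;
            String type = (presence & HAS_TYPE) != 0 ? in.readUTF() : null;
            String value = in.readUTF();
            return new String[] {language, type, value};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With this layout, title:[fr]truc would be stored as encode("fr", null, "truc"), and a field with neither language nor type costs one extra byte.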

In fact I was thinking about a more generic version that would preserve format
compatibility, keeping .fdx as is:

FieldData (.fdt) -->  <DocFieldData>SegSize
DocFieldData --> FieldCount, <FieldNum, RawData>FieldCount

The default FieldsDataWriter would be the current one: it would read the
RawData as Bits, Value, with Value -->  String | BinaryValue, ...
Then, for my app, I would provide a custom FieldsDataWriter that does
exactly what I want.

What I don't know yet is how much this breaks the API... because if I want to
provide my own FieldsDataWriter, I would also want to have my own
implementation of Fieldable...
If you think this is a good idea, I will try to implement it.

cheers,
Nicolas

Re: Flexible index format / Payloads Cont'd

Posted by Marvin Humphrey <ma...@rectangular.com>.

On Jul 21, 2006, at 1:23 AM, Nicolas Lalevée wrote:
> In fact, that was my first implementation. The problem with that is
> that you can only store one value. But thinking a little more about
> it, storing one or more values is not an issue, because with the
> solution I proposed, no space is saved at all.
> In fact, when I thought about this format of field metadata, I was
> thinking about a way to let the Lucene user specify how to store it
> in the Lucene index format. For instance, the simplest one would
> specify that it's a pointer to some metadata (as you proposed),
> another one would specify that there are two pointers (in my use
> case, one for the type, the other for the language), and another one
> would specify that the value is stored directly because it is
> actually an integer (so no need for a pointer to an integer). But it
> was just a thought, I don't know if it is possible. WDYT?

I'm thinking that there would be a codecs file, say with the  
extension .cdx and this format:

   Codecs (.cdx)  --> CodecCount, <CodecClassName>CodecCount
   CodecCount     --> Uint32
   CodecClassName --> String

That file would be read in its entirety when the index was  
initialized and expanded into an array of codec objects, one per  
CodecClassName.
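
A minimal Java sketch of reading and writing such a codecs file, under stated assumptions: CodecsFile is a hypothetical class, Uint32 is mapped to Java's signed int (so counts above Integer.MAX_VALUE are rejected), and DataOutputStream.writeUTF stands in for whatever String encoding the real format would use.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical reader/writer for the proposed .cdx codecs file. */
final class CodecsFile {
    private CodecsFile() {}

    /** Serializes CodecCount followed by CodecCount class names. */
    static byte[] write(List<String> codecClassNames) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(codecClassNames.size());   // CodecCount
            for (String name : codecClassNames) {
                out.writeUTF(name);                 // CodecClassName
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Reads the whole file; list index i is codec number i. */
    static List<String> read(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            int count = in.readInt();
            if (count < 0) {
                throw new IOException("CodecCount too large for this sketch");
            }
            List<String> names = new ArrayList<>();
            for (int i = 0; i < count; i++) {
                names.add(in.readUTF());
            }
            return names;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Turning each class name into a codec object (for example by reflection) is left out here, since how codecs would be instantiated is not specified in the proposal.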

The .fdx file would add an additional int per doc...

   FieldIndex (.fdx) -->  <FieldValuesPosition,
                           FieldValuesCodecNumber>SegSize
   FieldValuesPosition    --> Uint64
   FieldValuesCodecNumber --> Uint32

Now, before you read any data from the .fdt file, you know how to  
interpret it.  You seek the .fdt IndexInput to the right spot, then  
feed it to the appropriate codec object from the codecs array.  The  
codec does the rest.  In your case, you might write a codec that  
would read a few bytes and strings of metadata up front.  Or you  
might have several different codecs, the identity of which indicates  
fixed values for certain metadata fields: FrenchDocument,  
ArabicDocument, etc.
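
The lookup described above could be sketched in Java roughly as follows. This is only an illustration: DocCodec and FieldIndexLookup are hypothetical names, the files are modelled as in-memory ByteBuffers rather than Lucene IndexInputs, and the unsigned integers are read as Java's signed long and int.

```java
import java.nio.ByteBuffer;
import java.util.List;

/** Hypothetical codec: decodes one document starting at the buffer's position. */
interface DocCodec {
    Object decode(ByteBuffer fdt);
}

/** Sketch of the proposed .fdx lookup followed by codec dispatch. */
final class FieldIndexLookup {
    /** FieldValuesPosition (8 bytes) + FieldValuesCodecNumber (4 bytes). */
    static final int ENTRY_SIZE = Long.BYTES + Integer.BYTES;

    private FieldIndexLookup() {}

    static Object readDoc(int docId, ByteBuffer fdx, ByteBuffer fdt, List<DocCodec> codecs) {
        // Entries are fixed-size, so the doc number gives the offset directly.
        int entryOffset = Math.multiplyExact(docId, ENTRY_SIZE);
        long position = fdx.getLong(entryOffset);
        int codecNumber = fdx.getInt(entryOffset + Long.BYTES);

        // "Seek" the .fdt data to the right spot without disturbing the caller's buffer.
        ByteBuffer view = fdt.duplicate();
        view.position(Math.toIntExact(position));

        // The codec does the rest.
        return codecs.get(codecNumber).decode(view);
    }
}
```

Because the codec number is known before any .fdt bytes are read, the reader itself needs no knowledge of what a document looks like on disk.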

Would that scheme meet your needs?

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

Re: Flexible index format / Payloads Cont'd

Posted by Nicolas Lalevée <ni...@anyware-tech.com>.

On Thursday 20 July 2006 22:18, Marvin Humphrey wrote:
> On Jul 19, 2006, at 10:26 AM, Nicolas Lalevée wrote:
> > Then I looked deeper into the Lucene file format, and I managed to
> > introduce some generic field metadata without breaking file format
> > compatibility. I just used another bit of the "Bits" to mark whether
> > or not there is some metadata on the field. The metadata is stored
> > next to it:
> > DocFieldData --> FieldCount, <FieldNum, Bits, FieldMetadata,
> > Value>^FieldCount
> > FieldMetadata --> ValueSize, <Byte>^ValueSize
>
> My thought is instead of providing an ever-lengthening fixed menu of
> field-types to choose from, that the menu should be per-index and the
> codec should be indicated by an integer pointing to a spot on that menu.

In fact, that was my first implementation. The problem with that is that you
can only store one value. But thinking a little more about it, storing one or
more values is not an issue, because with the solution I proposed, no space is
saved at all.
In fact, when I thought about this format of field metadata, I was thinking
about a way to let the Lucene user specify how to store it in the Lucene
index format. For instance, the simplest one would specify that it's a pointer
to some metadata (as you proposed), another one would specify that there are
two pointers (in my use case, one for the type, the other for the language),
and another one would specify that the value is stored directly because it is
actually an integer (so no need for a pointer to an integer). But it was just
a thought, I don't know if it is possible. WDYT?

> > Does this feature interest the Lucene committers? Should I provide
> > a patch in Jira? If not, is there a common place to share patches
> > with other Lucene hackers (i.e. not necessarily committers)?
> >
> > So, Marvin, could you provide your payload patch?
>
> I'm totally slammed this month because I got a talk accepted at OSCON
> late and so I'm taking an unexpected week off in the midst of a very
> busy time.

So, have a nice OSCON ! ;)

> There is not a patch per se, in any case. 

Oh yes, of course. In fact it is Michael who has already done something; I
mixed up the names, sorry.
So, Michael, could you provide your payload patch?

> > And is there a wiki page that serves as a starting point for
> > defining the future index format?
>
> http://wiki.apache.org/jakarta-lucene/FlexibleIndexing

ok thank you.

Nicolas

Re: Flexible index format / Payloads Cont'd

Posted by Marvin Humphrey <ma...@rectangular.com>.

On Jul 19, 2006, at 10:26 AM, Nicolas Lalevée wrote:

> Then I looked deeper into the Lucene file format, and I managed to
> introduce some generic field metadata without breaking file format
> compatibility. I just used another bit of the "Bits" to mark whether
> or not there is some metadata on the field. The metadata is stored
> next to it:
> DocFieldData --> FieldCount, <FieldNum, Bits, FieldMetadata,  
> Value>^FieldCount
> FieldMetadata --> ValueSize, <Byte>^ValueSize

My thought is instead of providing an ever-lengthening fixed menu of  
field-types to choose from, that the menu should be per-index and the  
codec should be indicated by an integer pointing to a spot on that menu.

> Does this feature interest the Lucene committers? Should I provide
> a patch in Jira? If not, is there a common place to share patches
> with other Lucene hackers (i.e. not necessarily committers)?
>
> So, Marvin, could you provide your payload patch?

I'm totally slammed this month because I got a talk accepted at OSCON  
late and so I'm taking an unexpected week off in the midst of a very  
busy time.  There is not a patch per se, in any case.

> And is there a wiki page that serves as a starting point for
> defining the future index format?

http://wiki.apache.org/jakarta-lucene/FlexibleIndexing

Best,

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


Re: Flexible index format / Payloads Cont'd

Posted by Nicolas Lalevée <ni...@anyware-tech.com>.

On Wednesday 05 July 2006 13:23, Michael Busch wrote:
> Doug Cutting wrote:
> > Marvin Humphrey wrote:
> >> IMO, this should wait.  It's going to be freakishly difficult to get
> >> this stuff to work and maintain the commitments that Doug has laid
> >> out for backwards compatibility.
> >
> > Perhaps we can implement an all-new index format, in a new package.
> > An implementation of IndexReader can be provided to integrate with
> > existing search code.  And the ability to add an IndexReader to an
> > index can be provided to upgrade existing indexes to the new format.
> > So the new code would not need to be able to process an old index: the
> > old code can continue to do that.  Does that make sense?  Is that
> > "freakishly difficult"?  We'll need the ability to sniff a directory
> > and tell which version of index it contains, but that should not be
> > too hard.
> >
> > Doug
>
> +1. I agree that this approach would make it much easier to develop a
> new index format without the commitment of being backward-compatible. I
> would like to help working on a new index format. Who else is going to
> work on it?

I am also interested in improving Lucene. It took me some time to respond to
this thread because I am quite new to Lucene, so I had to learn what you were
talking about, in particular what a payload is. But now I get it! :)

What I have to build is a web application that does faceted search. My
current workaround is to transform each query into several queries, one per
category. So I am interested in your current work.

I also had another issue with fields. Some fields can have a type (integer,
date, string) and/or a language. This is typically metadata about the field.
The quick workaround I used was to put the info in the field value between
square brackets. So I had to write a SkipPrefixTokenizer... dirty, but fairly
quick to implement.
Then I looked deeper into the Lucene file format, and I managed to introduce
some generic field metadata without breaking file format compatibility. I just
used another bit of the "Bits" to mark whether or not there is some metadata
on the field. The metadata is stored next to it:
DocFieldData --> FieldCount, <FieldNum, Bits, FieldMetadata, Value>^FieldCount
FieldMetadata --> ValueSize, <Byte>^ValueSize
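
For illustration, one field entry in this layout could be written and read in Java as sketched here. The names FieldEntry and FieldEntryCodec are hypothetical, 0x08 is only assumed to be a free bit in "Bits", and fixed-width ints plus writeUTF stand in for Lucene's actual integer and string encodings.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** One decoded <FieldNum, Bits, FieldMetadata, Value> entry. */
final class FieldEntry {
    final int fieldNum;
    final int bits;
    final byte[] metadata;   // null when the metadata bit is not set
    final String value;

    FieldEntry(int fieldNum, int bits, byte[] metadata, String value) {
        this.fieldNum = fieldNum;
        this.bits = bits;
        this.metadata = metadata;
        this.value = value;
    }
}

final class FieldEntryCodec {
    /** Assumption: a bit of "Bits" that existing flags do not use. */
    static final int HAS_METADATA = 0x08;

    private FieldEntryCodec() {}

    static byte[] write(int fieldNum, int bits, byte[] metadata, String value) {
        int outBits = metadata != null ? (bits | HAS_METADATA) : (bits & ~HAS_METADATA);
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(fieldNum);
            out.writeByte(outBits);
            if (metadata != null) {
                out.writeInt(metadata.length);   // ValueSize
                out.write(metadata);             // <Byte>^ValueSize
            }
            out.writeUTF(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    static FieldEntry read(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            int fieldNum = in.readInt();
            int bits = in.readUnsignedByte();
            byte[] metadata = null;
            if ((bits & HAS_METADATA) != 0) {    // entries without the bit read as before
                metadata = new byte[in.readInt()];
                in.readFully(metadata);
            }
            return new FieldEntry(fieldNum, bits, metadata, in.readUTF());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Entries written without the bit carry no extra bytes, which is why a reader that understands the bit can still read entries that never set it.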

Does this feature interest the Lucene committers? Should I provide a patch in
Jira? If not, is there a common place to share patches with other Lucene
hackers (i.e. not necessarily committers)?

So, Marvin, could you provide your payload patch?
And is there a wiki page that serves as a starting point for defining the
future index format?

cheers,
Nicolas

Re: Flexible index format / Payloads Cont'd

Posted by Marvin Humphrey <ma...@rectangular.com>.

On Jul 4, 2006, at 3:35 AM, Doug Cutting wrote:

> Marvin Humphrey wrote:
>> IMO, this should wait.  It's going to be freakishly difficult to  
>> get this stuff to work and maintain the commitments that Doug has  
>> laid out for backwards compatibility.
>
> Perhaps we can implement an all-new index format, in a new package.

/me whistles low and grins.

org.apache.lucene.invindex?  As in inverted index, InvIndexer, and  
IIReader?

org.apache.lucene.ix? As in IxWriter and IxReader?

> An implementation of IndexReader can be provided to integrate with  
> existing search code.  And the ability to add an IndexReader to an  
> index can be provided to upgrade existing indexes to the new  
> format.  So the new code would not need to be able to process an  
> old index: the old code can continue to do that.  Does that make  
> sense?  Is that "freakishly difficult"?

It's labor-intensive -- that's a lot of code, to write and to test!
But it would be a lot of code regardless, and putting everything in a
new package probably introduces fewer bugs and complications than
interweaving so much new stuff into the existing code base.

The difficulty of keeping two packages afloat simultaneously will  
depend on how loose the coupling is between org.apache.lucene.index  
and the rest of Lucene.

> We'll need the ability to sniff a directory and tell which version  
> of index it contains, but that should not be too hard.

As simple as touching a meaningless file, if need be.  But I'll be  
arguing for the introduction of a global field definition file, which  
would serve just fine for that purpose.  ;)
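
A sketch of the marker-file idea in Java, under stated assumptions: IndexFormatSniffer, its Format enum, and the file name "fields.def" are invented for this example, and the index is assumed to live in a plain filesystem directory rather than behind Lucene's Directory abstraction.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper that tells old-format and new-format indexes apart. */
final class IndexFormatSniffer {
    enum Format { LEGACY, FLEXIBLE }

    /** Assumed name for the global field definition (or marker) file. */
    static final String MARKER_FILE = "fields.def";

    private IndexFormatSniffer() {}

    /** An index directory is "new" exactly when the marker file is present. */
    static Format sniff(Path indexDir) {
        return Files.isRegularFile(indexDir.resolve(MARKER_FILE))
                ? Format.FLEXIBLE
                : Format.LEGACY;
    }

    /** The equivalent of "touching a meaningless file". */
    static void markFlexible(Path indexDir) {
        Path marker = indexDir.resolve(MARKER_FILE);
        try {
            if (Files.notExists(marker)) {
                Files.createFile(marker);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling code could then pick the old or the new reader implementation based on the result of sniff().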

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

Re: Flexible index format / Payloads Cont'd

Posted by Erik Hatcher <er...@ehatchersolutions.com>.

On Jul 5, 2006, at 7:43 AM, Doug Cutting wrote:
> The folks working on Lucy are probably interested (Marvin & David).  
> Perhaps the first thing should be to specify the file format, then  
> implement it both in Java (for Lucene Java) and C (for Lucy).  
> Independent implementations will provide good compatibility  
> testing, and better validate the file format documentation.
>
> The specification could initially live in the wiki.

What about a formal electronic specification of the file format?  I
hesitate to suggest XML because there is no good reason XML makes
sense as a general purpose "language" (*wink* to Mr. Bray), but that
is at least a common denominator among all languages.  A formal,
processable format specification would allow code generation of
low-level I/O functions and, of course, of the documentation itself in
web-presentable form.

Having the current file format documentation structured in a way that
computers can digest would be sweet.  Food for thought.
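
To make the idea concrete, here is a small Java sketch in which the I/O code is driven by a description of the format instead of being written by hand. It is an assumption-laden illustration: SpecType, SpecField, and SpecDrivenReader are hypothetical, the spec is a Java list rather than XML, and strings are assumed to be stored as a 4-byte length followed by UTF-8 bytes.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Primitive types a record description may use. */
enum SpecType { UINT32, UINT64, STRING }

/** One named element of a record, e.g. FieldValuesPosition --> Uint64. */
final class SpecField {
    final String name;
    final SpecType type;

    SpecField(String name, SpecType type) {
        this.name = name;
        this.type = type;
    }
}

/** Generic reader: the spec, not hand-written code, decides what is read. */
final class SpecDrivenReader {
    private SpecDrivenReader() {}

    static Map<String, Object> read(List<SpecField> spec, ByteBuffer in) {
        Map<String, Object> record = new LinkedHashMap<>();
        for (SpecField field : spec) {
            switch (field.type) {
                case UINT32: {
                    record.put(field.name, Long.valueOf(in.getInt() & 0xFFFFFFFFL));
                    break;
                }
                case UINT64: {
                    // Values above Long.MAX_VALUE would show up as negative here.
                    record.put(field.name, Long.valueOf(in.getLong()));
                    break;
                }
                case STRING: {
                    byte[] utf8 = new byte[in.getInt()];
                    in.get(utf8);
                    record.put(field.name, new String(utf8, StandardCharsets.UTF_8));
                    break;
                }
                default:
                    throw new IllegalArgumentException("unknown type " + field.type);
            }
        }
        return record;
    }
}
```

The same description could in principle also be rendered as documentation, or fed to a generator for another language, which is the appeal of a machine-readable specification.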

	Erik

Re: Flexible index format / Payloads Cont'd

Posted by Doug Cutting <cu...@apache.org>.

Michael Busch wrote:
> I would like to help working on a new index format.
> Who else is going to work on it?

The folks working on Lucy are probably interested (Marvin & David). 
Perhaps the first thing should be to specify the file format, then 
implement it both in Java (for Lucene Java) and C (for Lucy). 
Independent implementations will provide good compatibility testing, and 
better validate the file format documentation.

The specification could initially live in the wiki.

Doug

Re: Flexible index format / Payloads Cont'd

Posted by Michael Busch <bu...@gmail.com>.

Doug Cutting wrote:
> Marvin Humphrey wrote:
>> IMO, this should wait.  It's going to be freakishly difficult to get 
>> this stuff to work and maintain the commitments that Doug has laid 
>> out for backwards compatibility.
>
> Perhaps we can implement an all-new index format, in a new package.  
> An implementation of IndexReader can be provided to integrate with 
> existing search code.  And the ability to add an IndexReader to an 
> index can be provided to upgrade existing indexes to the new format.  
> So the new code would not need to be able to process an old index: the 
> old code can continue to do that.  Does that make sense?  Is that 
> "freakishly difficult"?  We'll need the ability to sniff a directory 
> and tell which version of index it contains, but that should not be 
> too hard.
>
> Doug
>
+1. I agree that this approach would make it much easier to develop a
new index format without the commitment of being backward-compatible. I
would like to help work on a new index format. Who else is going to
work on it?
