Posted to dev@lucene.apache.org by Doug Cutting <cu...@apache.org> on 2006/07/04 12:35:52 UTC
Re: Flexible index format / Payloads Cont'd
Marvin Humphrey wrote:
> IMO, this should wait. It's going to be freakishly difficult to get
> this stuff to work and maintain the commitments that Doug has laid out
> for backwards compatibility.
Perhaps we can implement an all-new index format, in a new package. An
implementation of IndexReader can be provided to integrate with existing
search code. And the ability to add an IndexReader to an index can be
provided to upgrade existing indexes to the new format. So the new code
would not need to be able to process an old index: the old code can
continue to do that. Does that make sense? Is that "freakishly
difficult"? We'll need the ability to sniff a directory and tell which
version of index it contains, but that should not be too hard.
Doug
---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org
Re: Flexible index format / Payloads Cont'd
Posted by Nicolas Lalevée <ni...@anyware-tech.com>.
Hi,
Le Lundi 31 Juillet 2006 17:28, robert engels a écrit :
> Doing this breaks compatibility with non-Java Lucene implementations.
For me, that kind of compatibility is file format compatibility. Am I wrong?
If so, I don't see any compatibility break, since the default
implementation of FieldsDataWriter is the current one. And if I generate an
index with my custom writer, I would expect my index to be incompatible with
other implementations, even other Java ones.
> Not sure it matters, but I thought I would point it out. I have
> always thought that Lucene should be compatible at an API level only,
> and MAYBE create a network access protocol for queries and updates.
I didn't talk about network access... I don't see your point...
Re: Flexible index format / Payloads Cont'd
Posted by Marvin Humphrey <ma...@rectangular.com>.
On Jul 31, 2006, at 8:25 AM, Nicolas Lalevée wrote:
>
> That looks good, but there is one restriction : it have to be per
> document.
Yes, what I laid out was per-document - for each document, the fdx
file would keep a file pointer and an integer mapping to a codec.
> In fact I was thinking about a more generic version that will allow
> the format
> compatibility, keeping .fdx as is :
>
> FieldData (.fdt) --> <DocFieldData>SegSize
> DocFieldData --> FieldCount, <FieldNum, RawData>FieldCount
>
> And a default FieldsDataWriter will be the actual one, it will read
> the
> RawData as Bits, Value, with Value --> String | BinaryValue,....
> Then, for my app, I will provide some custom FieldsDataWriter that
> will do
> exactly what I want.
OK, that's quite similar, but with the info specifying how to
deserialize the document stored in fdt rather than fdx. However, I
don't think what you're describing makes the field storage in Lucene
arbitrarily extensible, since you're just going to override
FieldsWriter/FieldsReader rather than modify them so that they can
use arbitrary codecs.
I think what I want to do is turn Lucene into an Object-Oriented
Database, or at least have Lucene adopt some characteristics of an
ODBMS. However, I haven't used a real ODBMS and I'm not up on the
theory, so I can't say for sure. I've been doing a little reading
here and there on object databases, but I've been extraordinarily
busy the last few weeks and haven't been able to study it in depth.
The main point is this:
Lucene users have diverse needs for what gets stored in the document/
field storage. We've been meeting those needs by assigning more and
more bit flags. That can't continue ad infinitum. However, we
*can* meet everyone's needs by applying a variant of the "Replace
Conditionals With Polymorphism" refactoring technique...
http://xrl.us/p3kn (Link to www.eli.sdsu.edu)
Think of those bit flags as an if-else chain. Instead of all those
conditionals describing all the attributes of the Lucene Document you
want to store at that file pointer, we allow you to put whatever kind
of serialized object you desire there. Maybe it's a Lucene
Document. Maybe it's a FrenchDocument. Maybe it's a
RussianDocument. Maybe it's a wrapped-up jpg. You choose.
Instead of continually adding to the complexity of the
deserialization algorithm, we make that deserialization algorithm
user-definable.
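A minimal Java sketch of the refactoring described above, not code from any
patch. The names StoredObjectCodec, TextCodec, BlobCodec and
PolymorphicStorageSketch are hypothetical and do not exist in Lucene. The
point is only that an integer selects a codec object and the codec, rather
than an if-else chain over bit flags, decides how to read the bytes.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical interface: one implementation per kind of stored object.
interface StoredObjectCodec {
    Object decode(byte[] raw);
}

// Decodes the bytes as UTF-8 text (a stand-in for a plain document).
class TextCodec implements StoredObjectCodec {
    public Object decode(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }
}

// Returns the bytes untouched (a stand-in for a wrapped-up jpg).
class BlobCodec implements StoredObjectCodec {
    public Object decode(byte[] raw) {
        return raw;
    }
}

class PolymorphicStorageSketch {
    // The "menu" of codecs; in the proposal this would be defined per index.
    static final List<StoredObjectCodec> CODECS =
            Arrays.asList(new TextCodec(), new BlobCodec());

    // No conditionals on bit flags: the selected codec does the deserialization.
    static Object read(int codecNumber, byte[] raw) {
        return CODECS.get(codecNumber).decode(raw);
    }

    public static void main(String[] args) {
        byte[] raw = "truc".getBytes(StandardCharsets.UTF_8);
        System.out.println(read(0, raw));                   // text codec
        System.out.println(((byte[]) read(1, raw)).length); // blob codec, byte count
    }
}
```

Adding a new kind of stored object then means adding a class to the list, not
another flag and another branch.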
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
Re: Flexible index format / Payloads Cont'd
Posted by robert engels <re...@ix.netcom.com>.
Doing this breaks compatibility with non-Java Lucene implementations.
Not sure it matters, but I thought I would point it out. I have
always thought that Lucene should be compatible at an API level only,
and MAYBE create a network access protocol for queries and updates.
On Jul 31, 2006, at 10:25 AM, Nicolas Lalevée wrote:
> Le Vendredi 21 Juillet 2006 12:37, Marvin Humphrey a écrit :
>> On Jul 21, 2006, at 1:23 AM, Nicolas Lalevée wrote:
>>> In fact, that was my first implementaion. The problem with that is
>>> you can
>>> only store one value. But thinking a little more about it, storing
>>> one or
>>> more value is not an issue, because with the solution I proposed,
>>> no space is
>>> saved at all.
>>> In fact, when I thought about this format of field metadata, I was
>>> thinking
>>> about a way to make the Lucene user specify how to store it in the
>>> Lucene
>>> index format. For instance, the simple one would specify that it's
>>> a pointeur
>>> on some metadata (as you proposed), another one would specify that
>>> there are
>>> two pointeurs (in my use case, one for type, the other one for the
>>> language),
>>> and another one whould specify that it will be store directly as
>>> it is
>>> actually an integer (so no need to make a pointer on integer. But
>>> it was just
>>> a thought, I don't know if it is possible. WDYT ?
>>
>> I'm thinking that there would be a codecs file, say with the
>> extension .cdx and this format:
>>
>> Codecs (.cdx) --> CodecCount, <CodecClassName>CodecCount
>> CodecCount --> Uint32
>> CodecClassName --> String
>>
>> That file would be read in its entirety when the index was
>> initialized and expanded into an array of codec objects, one per
>> CodecClassName.
>>
>> The .fdx file would add an additional int per doc...
>>
>> FieldIndex (.fdx) --> <FieldValuesPosition,
>> FieldValuesCodecNumber>SegSize
>> FieldValuesPosition --> Uint64
>> FieldValuesCodecNumber --> Uint32
>>
>> Now, before you read any data from the .fdt file, you know how to
>> interpret it. You seek the .fdt IndexInput to the right spot, then
>> feed it to the appropriate codec object from the codecs array. The
>> codec does the rest. In your case, you might write a codec that
>> would read a few bytes and strings of metadata up front. Or you
>> might have several different codecs, the identity of which indicates
>> fixed values for certain metadata fields: FrenchDocument,
>> ArabicDocument, etc.
>>
>> Would that scheme meet your needs?
>
> That looks good, but there is one restriction : it have to be per
> document.
> Let's explain a lit bit more my needs.
>
> In fact my app have to index some data which is structured in a RDF
> graph.
> Each rdf resource have a title and a description, each title and
> description
> being in different languages. The model we choose is to map a rdf
> resource on
> a document. Then the field name is the URI of the rdf property, and
> the field
> value is the litteral or other resource.
> for instance :
> doc1 : URI:http://foo.com title:[en]foo title:[fr]truc
> So, in a document I will have several fields with different
> languages. For my
> use case, in fact I need only one "codec". It is a codec that will
> get 3
> values, 2 of them being optionnal : a language, a type, and a value.
>
> In fact I was thinking about a more generic version that will allow
> the format
> compatibility, keeping .fdx as is :
>
> FieldData (.fdt) --> <DocFieldData>SegSize
> DocFieldData --> FieldCount, <FieldNum, RawData>FieldCount
>
> And a default FieldsDataWriter will be the actual one, it will read
> the
> RawData as Bits, Value, with Value --> String | BinaryValue,....
> Then, for my app, I will provide some custom FieldsDataWriter that
> will do
> exactly what I want.
>
> What I don't know yet is how it breaks that API... because if I
> want to
> provide my own FieldsDataWriter, I would also want to have my own
> implementation of Fieldable...
> If you think this is a good idea, I will try to implement it.
>
> cheers,
> Nicolas
Re: Flexible index format / Payloads Cont'd
Posted by Nicolas Lalevée <ni...@anyware-tech.com>.
Le Vendredi 21 Juillet 2006 12:37, Marvin Humphrey a écrit :
> On Jul 21, 2006, at 1:23 AM, Nicolas Lalevée wrote:
> > In fact, that was my first implementaion. The problem with that is
> > you can
> > only store one value. But thinking a little more about it, storing
> > one or
> > more value is not an issue, because with the solution I proposed,
> > no space is
> > saved at all.
> > In fact, when I thought about this format of field metadata, I was
> > thinking
> > about a way to make the Lucene user specify how to store it in the
> > Lucene
> > index format. For instance, the simple one would specify that it's
> > a pointeur
> > on some metadata (as you proposed), another one would specify that
> > there are
> > two pointeurs (in my use case, one for type, the other one for the
> > language),
> > and another one whould specify that it will be store directly as it is
> > actually an integer (so no need to make a pointer on integer. But
> > it was just
> > a thought, I don't know if it is possible. WDYT ?
>
> I'm thinking that there would be a codecs file, say with the
> extension .cdx and this format:
>
> Codecs (.cdx) --> CodecCount, <CodecClassName>CodecCount
> CodecCount --> Uint32
> CodecClassName --> String
>
> That file would be read in its entirety when the index was
> initialized and expanded into an array of codec objects, one per
> CodecClassName.
>
> The .fdx file would add an additional int per doc...
>
> FieldIndex (.fdx) --> <FieldValuesPosition,
> FieldValuesCodecNumber>SegSize
> FieldValuesPosition --> Uint64
> FieldValuesCodecNumber --> Uint32
>
> Now, before you read any data from the .fdt file, you know how to
> interpret it. You seek the .fdt IndexInput to the right spot, then
> feed it to the appropriate codec object from the codecs array. The
> codec does the rest. In your case, you might write a codec that
> would read a few bytes and strings of metadata up front. Or you
> might have several different codecs, the identity of which indicates
> fixed values for certain metadata fields: FrenchDocument,
> ArabicDocument, etc.
>
> Would that scheme meet your needs?
That looks good, but there is one restriction: it has to be per document.
Let me explain my needs a little more.
My app has to index data that is structured as an RDF graph.
Each RDF resource has a title and a description, and each title and description
can be in a different language. The model we chose is to map an RDF resource to
a document. The field name is then the URI of the RDF property, and the field
value is the literal or another resource.
For instance:
doc1 : URI:http://foo.com title:[en]foo title:[fr]truc
So, in one document I will have several fields in different languages. For my
use case, I actually need only one "codec": one that takes 3
values, 2 of them optional: a language, a type, and a value.
I was in fact thinking of a more generic version that would preserve format
compatibility, keeping .fdx as is:
FieldData (.fdt) --> <DocFieldData>SegSize
DocFieldData --> FieldCount, <FieldNum, RawData>FieldCount
The default FieldsDataWriter would be the current one: it would read the
RawData as Bits, Value, with Value --> String | BinaryValue,....
Then, for my app, I would provide a custom FieldsDataWriter that does
exactly what I want.
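To make the use case concrete, here is a rough Java sketch of what the bytes
inside RawData could look like for one field carrying an optional language, an
optional type, and a value. Everything here is an assumption made for
illustration: the classes RdfFieldValue and RdfFieldCodecSketch, the flag
bits, and the length-prefixed UTF-8 strings are not Lucene's API or encoding.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical value holder: language and type may be null (both are optional).
class RdfFieldValue {
    final String language;
    final String type;
    final String value;

    RdfFieldValue(String language, String type, String value) {
        this.language = language;
        this.type = type;
        this.value = value;
    }
}

class RdfFieldCodecSketch {
    static final int HAS_LANGUAGE = 0x01; // illustrative flag bits, not Lucene's
    static final int HAS_TYPE = 0x02;

    // Serializes one field into the bytes that would sit in RawData.
    static byte[] encode(RdfFieldValue f) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write((f.language != null ? HAS_LANGUAGE : 0) | (f.type != null ? HAS_TYPE : 0));
        if (f.language != null) putString(out, f.language);
        if (f.type != null) putString(out, f.type);
        putString(out, f.value);
        return out.toByteArray();
    }

    // Reads the flags first, then only the optional parts that are present.
    static RdfFieldValue decode(byte[] rawData) {
        ByteBuffer in = ByteBuffer.wrap(rawData);
        int flags = in.get();
        String language = (flags & HAS_LANGUAGE) != 0 ? getString(in) : null;
        String type = (flags & HAS_TYPE) != 0 ? getString(in) : null;
        return new RdfFieldValue(language, type, getString(in));
    }

    // Length-prefixed UTF-8; a stand-in for whatever string encoding the index uses.
    private static void putString(ByteArrayOutputStream out, String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        out.write(ByteBuffer.allocate(4).putInt(utf8.length).array(), 0, 4);
        out.write(utf8, 0, utf8.length);
    }

    private static String getString(ByteBuffer in) {
        byte[] utf8 = new byte[in.getInt()];
        in.get(utf8);
        return new String(utf8, StandardCharsets.UTF_8);
    }
}
```

With this layout, title:[fr]truc would be encoded as a language "fr", no type,
and the value "truc".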
What I don't know yet is how this would break the API... because if I want to
provide my own FieldsDataWriter, I would also want my own
implementation of Fieldable...
If you think this is a good idea, I will try to implement it.
cheers,
Nicolas
Re: Flexible index format / Payloads Cont'd
Posted by Marvin Humphrey <ma...@rectangular.com>.
On Jul 21, 2006, at 1:23 AM, Nicolas Lalevée wrote:
> In fact, that was my first implementaion. The problem with that is
> you can
> only store one value. But thinking a little more about it, storing
> one or
> more value is not an issue, because with the solution I proposed,
> no space is
> saved at all.
> In fact, when I thought about this format of field metadata, I was
> thinking
> about a way to make the Lucene user specify how to store it in the
> Lucene
> index format. For instance, the simple one would specify that it's
> a pointeur
> on some metadata (as you proposed), another one would specify that
> there are
> two pointeurs (in my use case, one for type, the other one for the
> language),
> and another one whould specify that it will be store directly as it is
> actually an integer (so no need to make a pointer on integer. But
> it was just
> a thought, I don't know if it is possible. WDYT ?
I'm thinking that there would be a codecs file, say with the
extension .cdx and this format:
Codecs (.cdx) --> CodecCount, <CodecClassName>CodecCount
CodecCount --> Uint32
CodecClassName --> String
That file would be read in its entirety when the index was
initialized and expanded into an array of codec objects, one per
CodecClassName.
The .fdx file would add an additional int per doc...
FieldIndex (.fdx) --> <FieldValuesPosition,
FieldValuesCodecNumber>SegSize
FieldValuesPosition --> Uint64
FieldValuesCodecNumber --> Uint32
Now, before you read any data from the .fdt file, you know how to
interpret it. You seek the .fdt IndexInput to the right spot, then
feed it to the appropriate codec object from the codecs array. The
codec does the rest. In your case, you might write a codec that
would read a few bytes and strings of metadata up front. Or you
might have several different codecs, the identity of which indicates
fixed values for certain metadata fields: FrenchDocument,
ArabicDocument, etc.
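A minimal Java sketch of the two layouts above, under stated assumptions: the
class CodecTableSketch and its method names are hypothetical, strings are
written as a 4-byte length plus UTF-8 bytes (a stand-in, not Lucene's String
encoding), and Java's signed int and long stand in for Uint32 and Uint64. It
only keeps codec class names; turning them into codec objects is left out.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class CodecTableSketch {
    // One .fdx entry: FieldValuesPosition (8 bytes) + FieldValuesCodecNumber (4 bytes).
    static final int FDX_ENTRY_SIZE = 12;

    // Codecs (.cdx) --> CodecCount, <CodecClassName>CodecCount
    static byte[] writeCdx(String... classNames) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        putInt(out, classNames.length);
        for (String name : classNames) {
            byte[] utf8 = name.getBytes(StandardCharsets.UTF_8);
            putInt(out, utf8.length);
            out.write(utf8, 0, utf8.length);
        }
        return out.toByteArray();
    }

    // Reads the whole file into an array, one entry per CodecClassName.
    static String[] readCdx(byte[] cdx) {
        ByteBuffer in = ByteBuffer.wrap(cdx);
        String[] names = new String[in.getInt()];
        for (int i = 0; i < names.length; i++) {
            byte[] utf8 = new byte[in.getInt()];
            in.get(utf8);
            names[i] = new String(utf8, StandardCharsets.UTF_8);
        }
        return names;
    }

    // FieldIndex (.fdx) --> <FieldValuesPosition, FieldValuesCodecNumber>SegSize
    static byte[] writeFdx(long[] positions, int[] codecNumbers) {
        ByteBuffer out = ByteBuffer.allocate(positions.length * FDX_ENTRY_SIZE);
        for (int doc = 0; doc < positions.length; doc++) {
            out.putLong(positions[doc]).putInt(codecNumbers[doc]);
        }
        return out.array();
    }

    // Fixed-size entries make the lookup for a document a simple offset computation.
    static long positionOf(byte[] fdx, int docNum) {
        return ByteBuffer.wrap(fdx).getLong(docNum * FDX_ENTRY_SIZE);
    }

    static int codecNumberOf(byte[] fdx, int docNum) {
        return ByteBuffer.wrap(fdx).getInt(docNum * FDX_ENTRY_SIZE + 8);
    }

    private static void putInt(ByteArrayOutputStream out, int v) {
        out.write(ByteBuffer.allocate(4).putInt(v).array(), 0, 4);
    }
}
```

A reader would call positionOf to seek the .fdt input, then hand it to the
codec at index codecNumberOf in the array built from the .cdx file.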
Would that scheme meet your needs?
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
Re: Flexible index format / Payloads Cont'd
Posted by Nicolas Lalevée <ni...@anyware-tech.com>.
Le Jeudi 20 Juillet 2006 22:18, Marvin Humphrey a écrit :
> On Jul 19, 2006, at 10:26 AM, Nicolas Lalevée wrote:
> > Then I looked deeper in the Lucene file format, and I manage to
> > introduce some
> > generic field metadata without breaking the file format
> > compatibility. I just
> > used another bit of the "Bits" to mark that there is or not some
> > metadata on
> > the field. And the metadata is stored next to it :
> > DocFieldData --> FieldCount, <FieldNum, Bits, FieldMetadata,
> > Value>^FieldCount
> > FieldMetadata --> ValueSize, <Byte>^ValueSize
>
> My thought is instead of providing an ever-lengthening fixed menu of
> field-types to choose from, that the menu should be per-index and the
> codec should be indicated by an integer pointing to a spot on that menu.
In fact, that was my first implementation. The problem with it is that you can
only store one value. But thinking a little more about it, storing one or
more values is not an issue, because with the solution I proposed, no space is
saved at all.
When I thought about this format for field metadata, I was thinking
about a way to let the Lucene user specify how to store it in the Lucene
index format. For instance, the simplest one would specify that it's a pointer
to some metadata (as you proposed), another would specify that there are
two pointers (in my use case, one for the type, the other for the language),
and another would specify that it is stored directly because it is
actually an integer (so no need for a pointer to an integer). But it was just
a thought; I don't know if it is possible. WDYT?
> > Does this feature interest the Lucene commiters ? Should I provide
> > a patch in
> > Jira? If not, is there any common place where to provide some patch
> > for some
> > Lucene hackers (ie not necessaraily commiters) ?
> >
> > So, Marvin, could you provide your patch about payload ?
>
> I'm totally slammed this month because I got a talk accepted at OSCON
> late and so I'm taking an unexpected week off in the midst of a very
> busy time.
So, have a nice OSCON ! ;)
> There is not a patch per se, in any case.
Oh yes, of course. In fact Michael has already done something; I mixed up
the names, sorry.
So, Michael, could you provide your payload patch?
> > And is there a wiki page where there is a starting point about
> > defining the
> > future index format ?
>
> http://wiki.apache.org/jakarta-lucene/FlexibleIndexing
ok thank you.
Nicolas
Re: Flexible index format / Payloads Cont'd
Posted by Marvin Humphrey <ma...@rectangular.com>.
On Jul 19, 2006, at 10:26 AM, Nicolas Lalevée wrote:
> Then I looked deeper in the Lucene file format, and I manage to
> introduce some
> generic field metadata without breaking the file format
> compatibility. I just
> used another bit of the "Bits" to mark that there is or not some
> metadata on
> the field. And the metadata is stored next to it :
> DocFieldData --> FieldCount, <FieldNum, Bits, FieldMetadata,
> Value>^FieldCount
> FieldMetadata --> ValueSize, <Byte>^ValueSize
My thought is that, instead of providing an ever-lengthening fixed menu of
field types to choose from, the menu should be per-index and the
codec should be indicated by an integer pointing to a spot on that menu.
> Does this feature interest the Lucene commiters ? Should I provide
> a patch in
> Jira? If not, is there any common place where to provide some patch
> for some
> Lucene hackers (ie not necessaraily commiters) ?
>
> So, Marvin, could you provide your patch about payload ?
I'm totally slammed this month because I got a talk accepted at OSCON
late and so I'm taking an unexpected week off in the midst of a very
busy time. There is not a patch per se, in any case.
> And is there a wiki page where there is a starting point about
> defining the
> future index format ?
http://wiki.apache.org/jakarta-lucene/FlexibleIndexing
Best,
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
Re: Flexible index format / Payloads Cont'd
Posted by Nicolas Lalevée <ni...@anyware-tech.com>.
Le Mercredi 05 Juillet 2006 13:23, Michael Busch a écrit :
> Doug Cutting wrote:
> > Marvin Humphrey wrote:
> >> IMO, this should wait. It's going to be freakishly difficult to get
> >> this stuff to work and maintain the commitments that Doug has laid
> >> out for backwards compatibility.
> >
> > Perhaps we can implement an all-new index format, in a new package.
> > An implementation of IndexReader can be provided to integrate with
> > existing search code. And the ability to add an IndexReader to an
> > index can be provided to upgrade existing indexes to the new format.
> > So the new code would not need to be able to process an old index: the
> > old code can continue to do that. Does that make sense? Is that
> > "freakishly difficult"? We'll need the ability to sniff a directory
> > and tell which version of index it contains, but that should not be
> > too hard.
> >
> > Doug
>
> +1. I agree that this approach would make it much easier to develop a
> new index format without the commitment of being backward-compatible. I
> would like to help working on a new index format. Who else is going to
> work on it?
I am interested in improving Lucene too. I took some time to respond to this
thread because I am quite new to Lucene, so I had to learn what you were talking
about, in particular what a payload is. But here I am, I get it! :)
What I have to build is a web application that does faceted search. My
current workaround is to transform each query into several queries, one per
category. So I am interested in your current work.
I also had another issue with fields. A field can have a type (integer,
date, string) and/or a language. This is typically metadata on the field.
The quick workaround I used is to put the info in the field between
square brackets. So I had to write a SkipPrefixTokenizer... dirty, but fairly
quick to implement.
Then I looked deeper into the Lucene file format, and I managed to introduce some
generic field metadata without breaking file format compatibility. I just
used another bit of the "Bits" to mark whether or not there is metadata on
the field. The metadata is stored next to it:
DocFieldData --> FieldCount, <FieldNum, Bits, FieldMetadata, Value>^FieldCount
FieldMetadata --> ValueSize, <Byte>^ValueSize
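A small Java sketch of one <FieldNum, Bits, FieldMetadata, Value> entry as laid
out above, not the actual patch. The class FieldMetadataSketch, the flag value
0x08 and the fixed-width integers are assumptions chosen for illustration; the
real format may use different bit assignments and variable-length integers.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class FieldMetadataSketch {
    static final int HAS_METADATA = 0x08; // hypothetical choice of the extra bit

    // Encodes one entry; pass metadata == null for a field without metadata.
    static byte[] writeField(int fieldNum, int bits, byte[] metadata, String value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        putInt(out, fieldNum);                                   // FieldNum
        out.write(metadata != null ? bits | HAS_METADATA : bits & ~HAS_METADATA); // Bits
        if (metadata != null) {
            putInt(out, metadata.length);                        // ValueSize
            out.write(metadata, 0, metadata.length);             // <Byte>^ValueSize
        }
        byte[] utf8 = value.getBytes(StandardCharsets.UTF_8);
        putInt(out, utf8.length);                                // Value (length-prefixed here)
        out.write(utf8, 0, utf8.length);
        return out.toByteArray();
    }

    // Returns the metadata bytes, or null when the flag bit is not set.
    static byte[] readMetadata(byte[] entry) {
        ByteBuffer in = ByteBuffer.wrap(entry);
        in.getInt();                                             // skip FieldNum
        int bits = in.get();
        if ((bits & HAS_METADATA) == 0) {
            return null;                                         // Value follows Bits directly
        }
        byte[] metadata = new byte[in.getInt()];
        in.get(metadata);
        return metadata;
    }

    private static void putInt(ByteArrayOutputStream out, int v) {
        out.write(ByteBuffer.allocate(4).putInt(v).array(), 0, 4);
    }
}
```

Entries written without metadata keep the layout <FieldNum, Bits, Value>, which
is what lets the flag bit mark the extension.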
Does this feature interest the Lucene committers? Should I provide a patch in
Jira? If not, is there a common place to share patches among
Lucene hackers (i.e. not necessarily committers)?
So, Marvin, could you provide your payload patch?
And is there a wiki page that offers a starting point for defining the
future index format?
cheers,
Nicolas
Re: Flexible index format / Payloads Cont'd
Posted by Marvin Humphrey <ma...@rectangular.com>.
On Jul 4, 2006, at 3:35 AM, Doug Cutting wrote:
> Marvin Humphrey wrote:
>> IMO, this should wait. It's going to be freakishly difficult to
>> get this stuff to work and maintain the commitments that Doug has
>> laid out for backwards compatibility.
>
> Perhaps we can implement an all-new index format, in a new package.
/me whistles low and grins.
org.apache.lucene.invindex? As in inverted index, InvIndexer, and
IIReader?
org.apache.lucene.ix? As in IxWriter and IxReader?
> An implementation of IndexReader can be provided to integrate with
> existing search code. And the ability to add an IndexReader to an
> index can be provided to upgrade existing indexes to the new
> format. So the new code would not need to be able to process an
> old index: the old code can continue to do that. Does that make
> sense? Is that "freakishly difficult"?
It's labor-intensive -- that's a lot of code, to write and to test!
But it would be a lot of code regardless, and it probably introduces
fewer bugs and complications putting everything in a new package than
interweaving so much new stuff into the existing code base.
The difficulty of keeping two packages afloat simultaneously will
depend on how loose the coupling is between org.apache.lucene.index
and the rest of Lucene.
> We'll need the ability to sniff a directory and tell which version
> of index it contains, but that should not be too hard.
As simple as touching a meaningless file, if need be. But I'll be
arguing for the introduction of a global field definition file, which
would serve just fine for that purpose. ;)
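A rough Java sketch of the "touch a file" approach to sniffing, under loud
assumptions: the class IndexFormatSniffer and the marker name
"newformat.marker" are hypothetical, and a file named "segments" is assumed to
be present in an old-format index directory. A global field definition file,
if introduced, could take the marker's place.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

class IndexFormatSniffer {
    enum Format { NEW, OLD, NONE }

    static final String NEW_FORMAT_MARKER = "newformat.marker"; // hypothetical file name
    static final String OLD_FORMAT_FILE = "segments";           // assumed present in old indexes

    // "Touches" the empty marker file; fails if the marker already exists.
    static void touchMarker(Path indexDir) {
        try {
            Files.createFile(indexDir.resolve(NEW_FORMAT_MARKER));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Checks the marker first, so a directory holding both files counts as NEW.
    static Format sniff(Path indexDir) {
        if (Files.exists(indexDir.resolve(NEW_FORMAT_MARKER))) {
            return Format.NEW;
        }
        if (Files.exists(indexDir.resolve(OLD_FORMAT_FILE))) {
            return Format.OLD;
        }
        return Format.NONE;
    }
}
```

The caller could then pick the old or the new reader package based on the
returned value.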
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
Re: Flexible index format / Payloads Cont'd
Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Jul 5, 2006, at 7:43 AM, Doug Cutting wrote:
> The folks working on Lucy are probably interested (Marvin & David).
> Perhaps the first thing should be to specify the file format, then
> implement it both in Java (for Lucene Java) and C (for Lucy).
> Independent implementations will provide good compatibility
> testing, and better validate the file format documentation.
>
> The specification could initially live in the wiki.
What about a formal electronic specification of the file format? I
hesitate to suggest XML because there is no good reason XML makes
sense as a general purpose "language" (*wink* to Mr. Bray), but that
is at least a common denominator among all languages. A formal,
processable format specification would allow code generation of
low-level I/O functions, and, of course, the documentation itself in
web-presentable form.
Having the current file format documentation structured in a way that a
computer can digest would be sweet. Food for thought.
Erik
Re: Flexible index format / Payloads Cont'd
Posted by Doug Cutting <cu...@apache.org>.
Michael Busch wrote:
> I would like to help working on a new index format.
> Who else is going to work on it?
The folks working on Lucy are probably interested (Marvin & David).
Perhaps the first thing should be to specify the file format, then
implement it both in Java (for Lucene Java) and C (for Lucy).
Independent implementations will provide good compatibility testing, and
better validate the file format documentation.
The specification could initially live in the wiki.
Doug
Re: Flexible index format / Payloads Cont'd
Posted by Michael Busch <bu...@gmail.com>.
Doug Cutting wrote:
> Marvin Humphrey wrote:
>> IMO, this should wait. It's going to be freakishly difficult to get
>> this stuff to work and maintain the commitments that Doug has laid
>> out for backwards compatibility.
>
> Perhaps we can implement an all-new index format, in a new package.
> An implementation of IndexReader can be provided to integrate with
> existing search code. And the ability to add an IndexReader to an
> index can be provided to upgrade existing indexes to the new format.
> So the new code would not need to be able to process an old index: the
> old code can continue to do that. Does that make sense? Is that
> "freakishly difficult"? We'll need the ability to sniff a directory
> and tell which version of index it contains, but that should not be
> too hard.
>
> Doug
>
+1. I agree that this approach would make it much easier to develop a
new index format without the commitment of being backward-compatible. I
would like to help work on a new index format. Who else is going to
work on it?