Posted to dev@lucene.apache.org by Michael Busch <bu...@gmail.com> on 2007/03/11 22:41:41 UTC

Re: Flexible indexing

Hi Grant,

I certainly agree that it would be great if we could make some progress
and commit the payloads patch soon. I think it is quite independent of
FI. FI will introduce different posting formats (see Wiki:
http://wiki.apache.org/lucene-java/FlexibleIndexing). Payloads will be
part of some of those formats, but not all (i.e. per-position payloads
only make sense if positions are stored).
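
The constraint on per-position payloads can be illustrated with a minimal Java sketch. This is not code from the payloads patch: the class names PayloadSketch, Posting and Occurrence are hypothetical and only model the idea that a frequency-only posting format has nowhere to attach a payload.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model, not Lucene code: postings for one term in one document.
class PayloadSketch {

    /** One occurrence of a term: its position plus an optional payload. */
    static final class Occurrence {
        final int position;
        final byte[] payload; // null when no payload is attached

        Occurrence(int position, byte[] payload) {
            this.position = position;
            this.payload = payload;
        }
    }

    /** A posting format that either records positions or only a frequency. */
    static final class Posting {
        final boolean storesPositions;
        final List<Occurrence> occurrences = new ArrayList<>();
        int freq;

        Posting(boolean storesPositions) {
            this.storesPositions = storesPositions;
        }

        void add(int position, byte[] payload) {
            if (!storesPositions && payload != null) {
                // A payload hangs off a position, so this format cannot keep it.
                throw new IllegalStateException("payload needs a stored position");
            }
            freq++;
            if (storesPositions) {
                occurrences.add(new Occurrence(position, payload));
            }
        }
    }

    public static void main(String[] args) {
        Posting withPositions = new Posting(true);
        withPositions.add(3, new byte[] {42});
        withPositions.add(7, null);
        System.out.println("positions stored: " + withPositions.occurrences.size());

        Posting freqOnly = new Posting(false);
        freqOnly.add(3, null);
        System.out.println("frequency only: " + freqOnly.freq);
    }
}
```

In this sketch, a format that stores positions accepts a payload per occurrence, while the frequency-only format rejects one.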

The only concern some people had was about the API the patch introduces.
It extends Token and TermPositions. Doug's argument was that if we
introduce new APIs now but want to change them with FI, then it will be
hard to support those APIs. I think that is a valid point, but at the
same time, having to plan ahead in too many directions slows down
progress. That's why I'd vote for marking the new APIs as experimental
so that people can try them out at their own risk.
If we could agree on that approach, then I'd go ahead and submit an
updated payloads patch in the next few days that applies cleanly to the
current trunk and contains the additional warnings in the javadocs.
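
As a rough illustration of what such a javadoc warning could look like, here is a small Java sketch. The class ExperimentalPayload and the wording of the warning are assumptions made for this example, not text taken from the patch.

```java
/**
 * Hypothetical payload holder, used only to show where a warning could go.
 *
 * <p><b>Warning:</b> this API is experimental and may change in incompatible
 * ways in a future release. Use it at your own risk.
 */
class ExperimentalPayload {
    private final byte[] data;

    ExperimentalPayload(byte[] data) {
        this.data = data.clone(); // defensive copy so callers cannot mutate it
    }

    /** Returns the number of payload bytes. Experimental, see the class comment. */
    int length() {
        return data.length;
    }
}
```

Plain bold text is used here because javadoc has no standard tag for experimental APIs.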

With regard to FI and 662, however, I really believe we should split it up
and plan ahead (in the way I already mentioned), so that we have more
isolated patches. It is really great that we have 662 already (Nicolas,
thank you so much for your hard work, I hope you'll keep working with us
on FI!!). We'll probably use some of that code, and it will definitely
be helpful.

Michael

Grant Ingersoll wrote:
> Hi Michael,
>
> This is very good.  I know 662 is different, I just wasn't sure if
> Nicolas' patch was meant to be applied after 662, b/c I know we had
> discussed this before.
>
> I do agree with you about planning this out, but I also know that 
> patches seem to motivate people the best and provide a certain 
> concreteness to it all.  I mostly started asking questions on these 
> two issues b/c I wanted to spur some more discussion and see if we can 
> get people motivated to move on it.
>
> I was hoping that I would be able to apply each patch to two different 
> checkouts so I could start seeing where the overlap is and how they 
> could fit together (I also admit I was procrastinating on my ApacheCon 
> talk...).  In the new, flexible world, the payloads implementation 
> could be a separate implementation of the indexing or it could be part 
> of the core/existing file format implementation.  Sometimes I just 
> need to get my hands on the code to get a real feel for what I feel is 
> the best way to do it.
>
> I agree about the XML storage for Index information.  We do that in 
> our in-house wrapper around Lucene, storing info about the language, 
> analyzer used, etc.  We may also want a binary index-level storage 
> capability.  I know most people usually just create a single document
> to store binary info about the index, but a binary storage might be
> good too.
>
> Part of me says to apply the Payloads patch now, as it provides a lot 
> of bang for the buck and I think the FI is going to take a lot longer 
> to hash out.  However, I know that it may pin us in or force us to 
> change things for FI.  Ultimately, I would love to see both these 
> features for the next release, but that isn't a requirement.  Also, on 
> FI, I would love to see two different implementations of whatever API 
> we choose before releasing it, as I always find two implementations of 
> an Interface really work out the API details.
>
> -Grant

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Re: Flexible indexing

Posted by Nicolas Lalevée <ni...@anyware-tech.com>.

On Sunday 11 March 2007 22:41, Michael Busch wrote:
> Hi Grant,
>
> I certainly agree that it would be great if we could make some progress
> and commit the payloads patch soon. I think it is quite independent of
> FI. FI will introduce different posting formats (see Wiki:
> http://wiki.apache.org/lucene-java/FlexibleIndexing). Payloads will be
> part of some of those formats, but not all (i.e. per-position payloads
> only make sense if positions are stored).
>
> The only concern some people had was about the API the patch introduces.
> It extends Token and TermPositions. Doug's argument was that if we
> introduce new APIs now but want to change them with FI, then it will be
> hard to support those APIs. I think that is a valid point, but at the
> same time, having to plan ahead in too many directions slows down
> progress. That's why I'd vote for marking the new APIs as experimental
> so that people can try them out at their own risk.
> If we could agree on that approach, then I'd go ahead and submit an
> updated payloads patch in the next few days that applies cleanly to the
> current trunk and contains the additional warnings in the javadocs.
>
>
> With regard to FI and 662, however, I really believe we should split it up
> and plan ahead (in the way I already mentioned), so that we have more
> isolated patches. It is really great that we have 662 already (Nicolas,
> thank you so much for your hard work, I hope you'll keep working with us
> on FI!!). We'll probably use some of that code, and it will definitely
> be helpful.

Thanks! :)

About the code split you are talking about, I definitely agree. Here is what
the three parts will contain:
1) index format concept:
- there is an interface defining it, which for now only handles the filename
extensions.
- modify the Directory abstract class and its implementations to be the
container of the index format.
- modify the SegmentInfos class to check the format of the opened index
against the index format defined in the Directory class.
- modify the writer to make it check for format conflicts while adding raw indexes
2) extensibility of the store reader/writer:
- add some new entry points to the previous interface: a FieldsReader and a
FieldsWriter.
- split the current FieldsReader and FieldsWriter in two parts: the part
which will still be handled by Lucene, and the extensible part, which will be
instantiated by a DefaultIndexFormat.
- split the implementation of Field in two parts: the Field and a FieldData,
so users will be able to define their own custom field-data Java object.
3) New: extensibility of the posting reader/writer
this is just a draft for now, but here is what was done:
- move Posting from an inner class to a public class
- make TermInfo handle a pool of "pointers": the default implementation has
two, the frq one and the prx one.
- extract the posting writing from DocumentWriter into a DefaultPostingWriter.
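
The "pool of pointers" idea in part 3 could look roughly like the following Java sketch. It is not the code from the 662 patch: TermInfoSketch, the FRQ/PRX constants and the accessor names are hypothetical, and the third pointer in the usage example stands for whatever extra stream a custom posting format might add.

```java
import java.util.Arrays;

// Hypothetical sketch, not the 662 patch: a TermInfo holding a pool of file pointers.
class TermInfoSketch {
    // Slots in the pointer pool used by the default format (assumed names).
    static final int FRQ = 0; // offset into the frequency stream
    static final int PRX = 1; // offset into the position stream

    static final class TermInfo {
        final int docFreq;
        private final long[] pointers; // one file offset per posting stream

        TermInfo(int docFreq, long... pointers) {
            this.docFreq = docFreq;
            this.pointers = pointers.clone();
        }

        int pointerCount() {
            return pointers.length;
        }

        long pointer(int stream) {
            return pointers[stream];
        }

        @Override
        public String toString() {
            return "docFreq=" + docFreq + " pointers=" + Arrays.toString(pointers);
        }
    }

    public static void main(String[] args) {
        // Default layout: two pointers, frq and prx.
        TermInfo standard = new TermInfo(12, 1024L, 4096L);
        System.out.println(standard);

        // A custom posting format could register a third stream.
        TermInfo custom = new TermInfo(12, 1024L, 4096L, 8192L);
        System.out.println(custom.pointerCount());
    }
}
```

Keeping the offsets in an array rather than in fixed fields is what would let a posting writer other than the default one decide how many streams it needs.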

I can provide a patch for the first step.
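
To give an idea of what that first step covers, here is a minimal Java sketch of the index format concept. All names in it (IndexFormat, SimpleFormat, FormatAwareDirectory, checkNoConflict) are hypothetical stand-ins rather than classes from the patch or from Lucene, and the file extensions are only sample values.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of the "index format" idea; none of these types are Lucene classes.
class IndexFormatSketch {

    /** For now the format only knows which file extensions it owns. */
    interface IndexFormat {
        String name();
        List<String> fileExtensions();
    }

    static final class SimpleFormat implements IndexFormat {
        private final String name;
        private final List<String> extensions;

        SimpleFormat(String name, String... extensions) {
            this.name = name;
            this.extensions = Collections.unmodifiableList(Arrays.asList(extensions.clone()));
        }

        public String name() {
            return name;
        }

        public List<String> fileExtensions() {
            return extensions;
        }
    }

    /** Stand-in for a directory that acts as the container of its index format. */
    static final class FormatAwareDirectory {
        final IndexFormat format;

        FormatAwareDirectory(IndexFormat format) {
            this.format = format;
        }
    }

    /** The kind of check a writer could run before adding raw indexes. */
    static void checkNoConflict(FormatAwareDirectory target, FormatAwareDirectory... sources) {
        for (FormatAwareDirectory source : sources) {
            if (!target.format.name().equals(source.format.name())) {
                throw new IllegalArgumentException("index format conflict: "
                        + target.format.name() + " vs " + source.format.name());
            }
        }
    }

    public static void main(String[] args) {
        IndexFormat standard = new SimpleFormat("default", "fdt", "fdx", "frq", "prx");
        IndexFormat custom = new SimpleFormat("custom", "fdt", "fdx", "xyz");
        FormatAwareDirectory target = new FormatAwareDirectory(standard);

        checkNoConflict(target, new FormatAwareDirectory(standard)); // same format, passes
        try {
            checkNoConflict(target, new FormatAwareDirectory(custom));
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The sketch compares formats by name only; a real check would likely need a stricter notion of compatibility.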

cheers,
Nicolas

-- 
Nicolas LALEVÉE
Solutions & Technologies
ANYWARE TECHNOLOGIES
Tel : +33 (0)5 61 00 52 90
Fax : +33 (0)5 61 00 51 46
http://www.anyware-tech.com

Re: Flexible indexing

Posted by Michael Busch <bu...@gmail.com>.

Grant Ingersoll wrote:
>
>> With regard to FI and 662, however, I really believe we should split it
>> up and plan ahead (in the way I already mentioned), so that we have
>> more isolated patches. It is really great that we have 662 already
>> (Nicolas, thank you so much for your hard work, I hope you'll keep
>> working with us on FI!!). We'll probably use some of that code, and
>> it will definitely be helpful.
>>
>
> +1  I think this makes a lot of sense.  We have been deliberating 
> these changes for some time, so no reason to hurry.  I don't think 
> they are urgent, yet they really will give us more flexibility and 
> more capabilities for more people, so it will be a good thing to have.
>

Right, we don't have to hurry. But still it would be cool to have some 
of the FI features in the next release and once we start (now!) we 
should try to keep the momentum going!

Re: Flexible indexing

Posted by Grant Ingersoll <gr...@gmail.com>.

On Mar 11, 2007, at 5:41 PM, Michael Busch wrote:

> Hi Grant,
>
> I certainly agree that it would be great if we could make some
> progress and commit the payloads patch soon. I think it is quite
> independent of FI. FI will introduce different posting formats
> (see Wiki: http://wiki.apache.org/lucene-java/FlexibleIndexing).
> Payloads will be part of some of those formats, but not all (i.e.
> per-position payloads only make sense if positions are stored).
>

Yep, I agree.

> The only concern some people had was about the API the patch
> introduces. It extends Token and TermPositions. Doug's argument
> was that if we introduce new APIs now but want to change them with
> FI, then it will be hard to support those APIs. I think that is a
> valid point, but at the same time, having to plan ahead in too many
> directions slows down progress. That's why I'd vote for marking
> the new APIs as experimental so that people can try them out at
> their own risk.
> If we could agree on that approach, then I'd go ahead and submit an
> updated payloads patch in the next few days that applies cleanly to
> the current trunk and contains the additional warnings in the
> javadocs.
>

+1.

>
> With regard to FI and 662, however, I really believe we should split it
> up and plan ahead (in the way I already mentioned), so that we have
> more isolated patches. It is really great that we have 662 already
> (Nicolas, thank you so much for your hard work, I hope you'll keep
> working with us on FI!!). We'll probably use some of that code, and
> it will definitely be helpful.
>

+1  I think this makes a lot of sense.  We have been deliberating  
these changes for some time, so no reason to hurry.  I don't think  
they are urgent, yet they really will give us more flexibility and  
more capabilities for more people, so it will be a good thing to have.


------------------------------------------------------
Grant Ingersoll
http://www.grantingersoll.com/
http://lucene.grantingersoll.com
http://www.paperoftheweek.com/


