Posted to dev@lucene.apache.org by karl wettin <ka...@gmail.com> on 2007/07/26 21:56:27 UTC

Last attempt

Some time ago I tried to introduce LUCENE-581, a new consumer top
layer, the core changes required by LUCENE-550, my InstantiatedIndex.
I would still like to see this become part of the core. It is completely
backwards compatible but contains a few small changes that seem to
be controversial, and I'm honestly not sure why:

* Complete definalization of Term, Document and IndexReader.
* IndexWriterInterface

In my eyes, the only thing the final classes and the missing writer
interface do is limit Lucene development to the file-centric Directory
store. There is nothing wrong with Directory, I just want to be able to
use the same code for any store design of my choice. I want unified
index handling, no matter the implementation: one line of code that
switches between Directory, BDB, MemoryIndex, InstantiatedIndex or
what not.
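
To make the "one line of code" idea concrete, here is a minimal sketch.
It is not taken from the LUCENE-550 patch and uses no Lucene classes:
every type name below (SketchIndex, SketchWriter, SketchReader,
InMemorySketchIndex, IndexingApp) is hypothetical, and documents are
reduced to plain strings. It only shows application code that depends
on an index abstraction rather than on a concrete store.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// All names here are hypothetical stand-ins, not Lucene or LUCENE-550 APIs.

/** Minimal writer contract, standing in for the proposed IndexWriterInterface. */
interface SketchWriter {
    void addDocument(String text);
}

/** Minimal reader contract, standing in for a non-final IndexReader. */
interface SketchReader {
    int numDocs();
}

/** A store-agnostic index: a factory for readers and writers. */
interface SketchIndex {
    SketchWriter writer();
    SketchReader reader();
}

/** One possible store: documents held as plain objects in memory. */
final class InMemorySketchIndex implements SketchIndex {
    private final List<String> docs = new ArrayList<>();

    public SketchWriter writer() { return text -> docs.add(text); }
    public SketchReader reader() { return () -> docs.size(); }
}

final class IndexingApp {
    /** Depends only on SketchIndex, never on a concrete store. */
    static int indexAll(SketchIndex index, List<String> texts) {
        SketchWriter writer = index.writer();
        for (String text : texts) {
            writer.addDocument(text);
        }
        return index.reader().numDocs();
    }

    public static void main(String[] args) {
        // The single line that selects the store implementation.
        SketchIndex index = new InMemorySketchIndex();
        System.out.println(indexAll(index, Arrays.asList("a b", "c d")));
    }
}
```

Swapping in another store would mean changing only the line that
constructs the SketchIndex, provided that store implements the same
three interfaces.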

This post is about InstantiatedIndex and the things I built upon it.
As time passed I just gave up on keeping them up to date. It is in
use at this one place where it is just spinning on with no need to
update, stuck on Lucene 2.0 or so. We are now getting close to Lucene
3.0 and I would hate to see this code get lost in time.

It has so many neat features. Being really, really fast on small
corpora is just one.

In essence the design is similar to contrib/MemoryIndex, but it can
hold multiple documents.

The definalization and interface also allow for index
insert/delete/optimization notifications.

These two features combined yielded an active cache (not really
used in any project, just a proof of concept I experimented with on a
site where a lot of users place the exact same query) that updates
cached results only when they are affected by new data. This could be
done with MemoryIndex too, but not as fast as InstantiatedIndex can
handle batches of documents.
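
As a rough illustration of the active cache idea (not the code from the
patch), the sketch below combines insert notifications with a per-query
result cache. All names (InsertListener, NotifyingSketchIndex,
ActiveResultCache) are hypothetical, documents are plain strings, and
"matching" is reduced to a substring test; the real proof of concept
would presumably evaluate Lucene queries against the new documents
instead.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical listener; not an API from Lucene or the LUCENE-550 patch. */
interface InsertListener {
    void documentAdded(String text);
}

/** Toy index: a list of documents plus insert notifications. */
final class NotifyingSketchIndex {
    private final List<String> docs = new ArrayList<>();
    private final List<InsertListener> listeners = new ArrayList<>();

    void addListener(InsertListener listener) { listeners.add(listener); }

    void addDocument(String text) {
        docs.add(text);
        for (InsertListener listener : listeners) {
            listener.documentAdded(text);
        }
    }

    /** Naive search: every document containing the term as a substring. */
    List<String> search(String term) {
        List<String> hits = new ArrayList<>();
        for (String doc : docs) {
            if (doc.contains(term)) hits.add(doc);
        }
        return hits;
    }
}

/** Caches results per term and updates only entries a new document matches. */
final class ActiveResultCache implements InsertListener {
    private final NotifyingSketchIndex index;
    private final Map<String, List<String>> cache = new HashMap<>();
    int searchesRun = 0; // exposed so the effect of caching can be observed

    ActiveResultCache(NotifyingSketchIndex index) {
        this.index = index;
        index.addListener(this);
    }

    List<String> query(String term) {
        List<String> cached = cache.get(term);
        if (cached == null) {
            searchesRun++;
            cached = index.search(term);
            cache.put(term, cached);
        }
        return new ArrayList<>(cached); // defensive copy for the caller
    }

    @Override
    public void documentAdded(String text) {
        // Cached queries the new document does not match are left untouched.
        for (Map.Entry<String, List<String>> entry : cache.entrySet()) {
            if (text.contains(entry.getKey())) {
                entry.getValue().add(text);
            }
        }
    }
}
```

The point of the design is that a popular query is executed against the
full index once; afterwards each new document is tested against the
cached queries, so unaffected results never have to be recomputed.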

One can, however, do a lot of other things with it.

In LUCENE-626 I also use InstantiatedIndex, getting some 10-20 times  
faster response times from my contrib/spellcheck augmentation than  
when using a RAMDirectory.

There are more features and potentially cool things one might want to  
consider in the 550-patch/UML diagram.


Would the core changes that InstantiatedIndex requires ever be
committed? If so, I could sit down and bring these patches up to date.
Otherwise I'll just let them become some deprecated artifact I use
for a couple of things such as spellchecking, rather than a neat
augmentation of Lucene I could use for any future development.


-- 
karl


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Re: Last attempt

Posted by karl wettin <ka...@gmail.com>.

On 27 Jul 2007, at 02:18, Grant Ingersoll wrote:

> or maybe there is a way to separate out the interface changes from  
> the InstantiatedIndex stuff?

One thing I came to think of is to use the IndexReader/IndexWriter  
"pipe" available in InstantiatedIndex. I.e. create a Directory, add  
documents via IndexWriter, and then pass a new IndexReader to  
InstantiatedIndex for merge. Even though this will slow things down  
when adding things to an index, I think this is an acceptable solution.

That would only remove the IndexWriterInterface though. It would  
still require the definalization of Document, Term and IndexReader.

And by removing IndexWriterInterface one somewhat cripples the  
NotifiableIndex.

All in all I don't think it would count as gaining something.


-- 
karl


Re: Last attempt

Posted by karl wettin <ka...@gmail.com>.

On 27 Jul 2007, at 02:18, Grant Ingersoll wrote:

> I think one thing I wonder about is if there is a way it could be a  
> standalone contrib package or maybe there is a way to separate out  
> the interface changes from the InstantiatedIndex stuff? That way  
> you could lobby for InstIndex as a contrib, and then a separate  
> patch for the API changes.

That is sort of what I tried with 581. I want to point out that one
should no longer be looking at that patch. Since it was marked as
no-fix and merged back into 550, I have replaced the generalization
with aggregation (i.e. Directory does not extend Index; there is a new
class, DirectoryIndex, that points at an instance of Directory) and
changed some other things in order to minimize the impact on core. It
is visible in the UML diagram.
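
The difference between generalization and aggregation can be sketched
as follows. This is an illustration only, not code from the 550 patch:
StoreDirectory, AbstractSketchIndex and DirectoryBackedIndex are
hypothetical stand-ins for Directory, Index and DirectoryIndex, and the
describe() method exists purely to make the sketch observable.

```java
/** Stand-in for the file-centric store; hypothetical, not Lucene's Directory. */
final class StoreDirectory {
    private final String path;

    StoreDirectory(String path) { this.path = path; }

    String path() { return path; }
}

/** Stand-in for the top-level index abstraction. */
abstract class AbstractSketchIndex {
    abstract String describe();
}

/**
 * Aggregation: the index holds a reference to a directory, so the
 * directory class itself does not have to extend the index type.
 */
final class DirectoryBackedIndex extends AbstractSketchIndex {
    private final StoreDirectory directory;

    DirectoryBackedIndex(StoreDirectory directory) {
        this.directory = directory;
    }

    @Override
    String describe() {
        return "index over directory " + directory.path();
    }
}
```

With this shape the existing store class stays untouched, which is one
way to read "minimize the impact on core".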

I'd be more than happy to spend the week required to bring 550 up to
speed with the trunk, clean it up and split it into multiple
patches, but only if I knew that the core changes would be accepted.

Something like this:

* Core changes (complete definalization of Term, Document and  
IndexReader + IndexWriterInterface)
* Index (factory class for reader and writer, for unified index handling)
* InstantiatedIndex (extends Index)
* NotifiableIndex (decoration layer on top of Index)
* Active results cache and the other stuff that is just ideas on top
of the two prior items.


> By the looks of the issue, you had a lot of comments and good
> input, do you feel the issues have all been addressed?  Just
> asking...

I do. At least I did the last time I was looking at the code, 6  
months ago or so.

>
> Also, do Mike M's changes affect how you would do these things?

Not sure what you are referring to. 550 is however fairly isolated
from the rest of Lucene.



-- 
karl

Re: Last attempt

Posted by Grant Ingersoll <gs...@apache.org>.

Hi Karl,

I have seen this and have always thought I should spend some time on  
it, but then didn't get to it.  That isn't to say it isn't useful.  I  
think one thing I wonder about is if there is a way it could be a  
standalone contrib package or maybe there is a way to separate out  
the interface changes from the InstantiatedIndex stuff? That way you  
could lobby for InstIndex as a contrib, and then a separate patch for  
the API changes.  And please feel free to tell me they can't, I am  
just wondering out loud here trying to find a path to take so it  
isn't lost.

I think there are some reasons Document is final, although I am not
sure they can't be handled through a buyer-beware approach.  If you
search the archives for Document and final I think you will see the
arguments.  There is also an issue in JIRA related to it
(https://issues.apache.org/jira/browse/LUCENE-778), so you are not the
only one asking for it (I see you commented on that one).

By the looks of the issue, you had a lot of comments and good input,
do you feel the issues have all been addressed?  Just asking...

Also, do Mike M's changes affect how you would do these things?

Mostly just me trying to figure out this patch.  I, too, would hate
to see it wither, but I can't make any promises on time, either.  By
the way, the Flexible Indexing stuff from Nicolas et al. is in this
same boat in my mind.  Would love to have 'em in Lucene, but don't
have the cycles to do it.  Sigh.

-Grant

--------------------------
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org/tech/lucene.asp

Read the Lucene Java FAQ at http://wiki.apache.org/lucene-java/LuceneFAQ
