Posted to java-user@lucene.apache.org by David Causse <dc...@spotter.com> on 2009/10/06 18:54:06 UTC

Forwarded: InstantiatedIndex questions

Hi,

Karl prefers to answer on the mailing list, so here is the information
he asked for about how we use InstantiatedIndex.

----- Forwarded message from David Causse <dc...@spotter.com> -----

Date: Tue, 6 Oct 2009 15:45:57 +0200
From: David Causse <dc...@spotter.com>
To: Karl Wettin <ka...@apache.org>
Subject: Re: InstantiatedIndex questions

Hi,

Sorry for the delay.

We upgraded to the new Attribute API, which InstantiatedIndex did not
support; with 2.9, InstantiatedIndex works great with this API.
We use it because we build volatile indexes on small document sets (1 to
200 docs) and run a large number of complex queries against them. This
happens in a document flow over a messaging architecture; it acts as a
sort of routing table.
We use only IR and IW. We have built our own query system, which is a bit
similar to SpanQuery because we make intensive use of term positions.
We do proximity searches based on standard term positions, but also on
payload information such as phrase id and/or paragraph id, plus
some generic data that lets us add relationships inside the index.

So what is important for us is fast indexing and very fast queries.
Querying Lucene, for us, means:
	- termEnum iteration to do query rewriting/optimization
	- termDocs/termPositions iteration

At indexing time InstantiatedIndex is slower than RAMDirectory, but the time
gained on queries makes up for it (from what I see, it can be 2 times
faster).

InstantiatedIndex will be our default volatile mini index store for our
next production release.

We no longer need serialization: we prefer to re-index pre-analyzed
token streams and keep control of the bits with Externalizable.
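
A minimal sketch of what controlling the bits with Externalizable can look
like. The PreAnalyzedToken class, its fields and the toBytes/fromBytes
helpers are hypothetical names chosen for illustration; they are not the
actual format used here and not part of Lucene.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectInputStream;
import java.io.ObjectOutput;
import java.io.ObjectOutputStream;

/** Hypothetical pre-analyzed token: term text, position increment, payload bytes. */
class PreAnalyzedToken implements Externalizable {
    private String term;
    private int positionIncrement;
    private byte[] payload;

    /** Externalizable requires a public no-argument constructor. */
    public PreAnalyzedToken() {
    }

    public PreAnalyzedToken(String term, int positionIncrement, byte[] payload) {
        this.term = term;
        this.positionIncrement = positionIncrement;
        this.payload = payload;
    }

    public String getTerm() { return term; }
    public int getPositionIncrement() { return positionIncrement; }
    public byte[] getPayload() { return payload; }

    /** Writes exactly the fields we choose, in the order we choose. */
    public void writeExternal(ObjectOutput out) throws IOException {
        out.writeUTF(term);
        out.writeInt(positionIncrement);
        out.writeInt(payload.length);
        out.write(payload);
    }

    /** Reads the fields back in the same order they were written. */
    public void readExternal(ObjectInput in) throws IOException {
        term = in.readUTF();
        positionIncrement = in.readInt();
        payload = new byte[in.readInt()];
        in.readFully(payload);
    }

    /** Serializes a token to a byte array (in-memory, so I/O errors are unexpected). */
    static byte[] toBytes(PreAnalyzedToken token) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            ObjectOutputStream out = new ObjectOutputStream(bytes);
            out.writeObject(token);
            out.close();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Rebuilds a token from bytes produced by toBytes. */
    static PreAnalyzedToken fromBytes(byte[] data) {
        try {
            ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data));
            return (PreAnalyzedToken) in.readObject();
        } catch (IOException e) {
            throw new IllegalStateException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

The receiving side would then feed the restored tokens back into a
TokenStream and re-index them, instead of shipping a serialized index.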

We would have other uses for this index, but the lack of addIndexes
support makes it impossible for us to use it in other situations. So we
continue to use RAMDirectory in those situations.

Do you think we could reach RAMDirectory indexing time by tweaking the initial
capacity of the java.util collections you use?

Many thanks for your excellent work.

PS. I posted some (not really useful) debug output to the lucene-users
mailing list.

On Fri, Dec 12, 2008 at 04:15:56PM +0100, Karl Wettin wrote:
> Hi David,
>
> fixes for the problems you reported are now committed to trunk. As
> InstantiatedIndex is a new module it would be very interesting to hear
> how InstantiatedIndex works for you and perhaps a little bit about how
> you use it.
>
>
>     karl
>
> On 19 Nov 2008, at 15:02, David Causse wrote:
>
> > Hi Karl,
> >
> > The reset() problem is not a big issue; I can adapt our
> > TokenStreams.
> > As for serialization: since we need to share very small indexes (200
> > docs max) in a cluster, we need to serialize something.
> > I was planning to use Java serialization, maybe with some
> > compression of the resulting byte[], and as InstantiatedIndex is
> > Serializable I was hoping to benefit from the performance gain of your
> > implementation in our context.
> > I will fix my working copy as you suggested.
> >
> > Thank you.
> >
> > David.
> >
> > karl wettin wrote:
> >> Hi David,
> >>
> >> thanks for the report! I suppose you speak of IndexWriter vs
> >> InstantiatedIndexWriter? These are definitely considered discrepancy
> >> problems. I've created a new issue in the tracker:
> >> http://issues.apache.org/jira/browse/LUCENE-1462
> >>
> >> For what reason do you try to serialize the InstantiatedIndex? Could
> >> you perhaps use FSDirectory and IndexWriter instead, and then each
> >> time you update that index you replace your InstantiatedIndex with a
> >> new one constructed using the InstantiatedIndex constructor that
> >> takes an IndexReader?
> >>
> >> I'm afraid that I'm rather busy at the moment but I'll try to fix it
> >> ASAP. It should however be rather easy to fix if you just want to
> >> solve the specific problem: reset all pre-tokenized streams before
> >> they are tokenized in InstantiatedIndexWriter#addDocument and make
> >> TermVectorOffsetInfo implement Serializable.
> >>
> >>
> >>     karl
> >>
> >> On Wed, Nov 19, 2008 at 11:00 AM, David Causse <dc...@spotter.com> 
> >> wrote:
> >>
> >>> Hi,
> >>>
> >>> Here are some differences I noticed between InstantiatedIndex and
> >>> RAMDirectory:
> >>>
> >>> - RAMDirectory seems to call reset() on TokenStreams before first use,
> >>> which allows some objects to be initialised before streaming starts;
> >>> InstantiatedIndex does not.
> >>> - I can serialize a RAMDirectory but not an InstantiatedIndex,
> >>> because of: java.io.NotSerializableException:
> >>> org.apache.lucene.index.TermVectorOffsetInfo
> >>>
> >>> Do you consider these problems or normal features?
> >>>
> >>> Thank you.
> >>>
> >>> David.
> >>>
> >>> ---------------------------------------------------------------------
> >>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >>> For additional commands, e-mail: java-user-help@lucene.apache.org
> >>>
> >>>
> >>>
> >>
> >
> >
> >
>
>

-- 
David Causse
Spotter
http://www.spotter.com/

----- End forwarded message -----

Re: InstantiatedIndex questions

Posted by David Causse <dc...@spotter.com>.

On Tue, Oct 06, 2009 at 07:51:44PM +0200, Karl Wettin wrote:
>
> On 6 Oct 2009, at 18:54, David Causse wrote:
>
> David, your timing couldn't be better. Just the other day I proposed
> that we deprecate InstantiatedIndexWriter. The sum of the reasons for
> this is that I'm a bit lazy. Your mail makes me reconsider.
>
> https://issues.apache.org/jira/browse/LUCENE-1948

Well, so you intended to make it impossible to build an InstantiatedIndex
from scratch. What is very nice about the current implementation is that it
conforms to "normal" Lucene usage: index with IW and query with IR.
That makes it very easy to adopt your implementation in legacy
applications.
If it is immutable it's like String: I have to deal with
StringBuffer/StringBuilder. That's nice in some ways but it has some
drawbacks; maybe that is why all the efficient internal analysis APIs in
Lucene permit the use of char[].

So in the Lucene world with II, RAMDirectory will be my StringBuilder,
and I'll have to wrap InstantiatedIndex to use a RAMDirectory as a
buffer. Like this:

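import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriter.MaxFieldLength;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.store.instantiated.InstantiatedIndex;
import org.apache.lucene.store.instantiated.InstantiatedIndexReader;

public class IIWrapper {
	// RAMDirectory-backed writer used as the mutable buffer
	private final IndexWriter buffer;
	// false whenever documents were added since the last reader was built
	private boolean readerIsValid = false;
	private InstantiatedIndexReader reader;

	public IIWrapper(Analyzer a) throws IOException {
		buffer = new IndexWriter(new RAMDirectory(), a, MaxFieldLength.UNLIMITED);
	}

	public void /* writeLock */ addDocument(Document doc) throws IOException {
		buffer.addDocument(doc);
		readerIsValid = false;
	}

	public IndexReader /* readLock */ getReader() throws IOException {
		if (!readerIsValid) {
			/* writeLock */
			// rebuild the immutable InstantiatedIndex from the buffer's current state
			reader = new InstantiatedIndex(buffer.getReader()).indexReaderFactory();
			readerIsValid = true;
		}
		return reader;
	}
}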

So I'll get the best indexing time, but the first query will pay for the
IIR creation. That may be better than using IIW. But IMHO (from a user's
point of view) it's very convenient to have a choice of multiple stores in
Lucene, and it's pretty cool to use them all in the same way. If you take a
look at MemoryIndex, the description is very attractive but it's too far
(again IMHO) from the Lucene API.
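
The writeLock/readLock comments in the wrapper above could be realized
roughly as follows. This is only a sketch of the locking pattern, with no
Lucene dependency: SnapshotBuffer is a hypothetical stand-in in which a
plain list plays the role of the RAMDirectory buffer and an unmodifiable
copy plays the role of the InstantiatedIndex reader.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical stand-in: a mutable buffer plus a lazily rebuilt immutable snapshot. */
class SnapshotBuffer<T> {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final List<T> buffer = new ArrayList<T>();  // role of the buffering writer
    private List<T> snapshot = Collections.emptyList(); // role of the immutable reader
    private boolean snapshotIsValid = true;
    private int rebuilds = 0;

    /** Adds an item under the write lock and invalidates the snapshot. */
    public void add(T item) {
        lock.writeLock().lock();
        try {
            buffer.add(item);
            snapshotIsValid = false;
        } finally {
            lock.writeLock().unlock();
        }
    }

    /** Returns the current snapshot, rebuilding it only if items were added. */
    public List<T> getSnapshot() {
        lock.readLock().lock();
        try {
            if (snapshotIsValid) {
                return snapshot;
            }
        } finally {
            lock.readLock().unlock();
        }
        // A read lock cannot be upgraded, so it is released first and the
        // flag is checked again once the write lock is held.
        lock.writeLock().lock();
        try {
            if (!snapshotIsValid) {
                snapshot = Collections.unmodifiableList(new ArrayList<T>(buffer));
                snapshotIsValid = true;
                rebuilds++;
            }
            return snapshot;
        } finally {
            lock.writeLock().unlock();
        }
    }

    /** Number of times the snapshot has been rebuilt so far. */
    public int rebuildCount() {
        lock.readLock().lock();
        try {
            return rebuilds;
        } finally {
            lock.readLock().unlock();
        }
    }
}
```

With this shape, queries that arrive between two additions share one
snapshot, and only the first query after an addition pays for the rebuild.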

>
> > At indexing time InstantiatedIndex is slower than RAMDirectory, but the
> > time
>
> Would you mind benchmarking some for me using your corpora? The issue  
> suggests that people use the InstantiatedIndex(IndexReader) constructor 
> to create the index rather than using InstantiatedIndexWriter. Is it way 
> slower for you to produce the index using RAMDirectory/IndexWriter and 
> pass an IndexReader to InstantiatedIndex?
>
>
> This is what the package-level javadocs say about
> InstantiatedIndexWriter:
>
> "Hardly any effort has been put in to optimizing the  
> InstantiatedIndexWriter, only minimizing the amount of time needed to  
> write-lock the index has been considered."
>
> I'm sure there are ways to speed it up, I just never managed to find the
> time to look into it. I never really used IIW.

We've processed a huge number of documents and didn't see any problems.
But we don't use all the possibilities Lucene stores have to offer.

>
>
> It might be worth mentioning that when InstantiatedIndex#commit returns
> it has yielded an optimized "single segment" index. This is not quite how
> a Directory/IndexWriter acts.

When I have some time I'll try to benchmark our system without IIW.

>
> > gained on queries makes up for it (from what I see, it can be 2 times
> > faster).
> >
> > InstantiatedIndex will be our default volatile mini index store for  
> > our
> > next production release.
>
> Very cool!!
>
> > We would have other uses for this index, but the lack of addIndexes
> > support makes it impossible for us to use it in other situations. So we
> > continue to use RAMDirectory in those situations.
>
> Have you considered using multiple InstantiatedIndexes and a MultiReader?
> That would pretty much be the same thing, just that the store wouldn't be
> quite as optimized. It would definitely use more RAM than if it were the
> same index. You could of course also pass this MultiReader to a new
> InstantiatedIndex. I have no real clue about the difference in speed and
> RAM consumption between these solutions so you should benchmark all
> solutions.

Well, in fact with 2.9 there is an awesome number of solutions for us. We
might try the new getReader() on IW; we haven't used MultiReader, but it
also seems to be a very convenient solution. I have to take time to think
about it.

Just a side note for other users who may be thinking about using
InstantiatedIndex: we've seen that our process that uses II can be, as I
said, 2 times faster. Keep in mind that this is exactly as if I had said,
"Hey, I switched from MySQL to PostgreSQL and my app is 2 times faster".

>
> > Do you think we could reach RAMDirectory indexing time by tweaking the
> > initial capacity of the java.util collections you use?
>
>
> Maybe. But I think it would be a relatively small gain. But don't take
> my word for it, benchmark it.
>
> Using the InstantiatedIndex(IndexReader) constructor will create
> collections of rather optimal size.
>
> As for InstantiatedIndexWriter I think it's pretty much only the
> transient collections in #commit that will help you, my guess is that
> you should experiment with the dirtyTerms and termsByText attributes.
> Count the number of terms in your complete index and see how much it
> speeds things up by creating the collections with this size from the
> start.

Well, I had a quick look at IIW and your collection sizes are already near
our average.

Re: InstantiatedIndex questions

Posted by Karl Wettin <ka...@gmail.com>.

On 6 Oct 2009, at 18:54, David Causse wrote:

David, your timing couldn't be better. Just the other day I proposed
that we deprecate InstantiatedIndexWriter. The sum of the reasons for
this is that I'm a bit lazy. Your mail makes me reconsider.

https://issues.apache.org/jira/browse/LUCENE-1948

> At indexing time InstantiatedIndex is slower than RAMDirectory, but the
> time

Would you mind benchmarking some for me using your corpora? The issue  
suggests that people use the InstantiatedIndex(IndexReader)  
constructor to create the index rather than using  
InstantiatedIndexWriter. Is it way slower for you to produce the index  
using RAMDirectory/IndexWriter and pass an IndexReader to  
InstantiatedIndex?


This is what the package-level javadocs say about
InstantiatedIndexWriter:

"Hardly any effort has been put in to optimizing the  
InstantiatedIndexWriter, only minimizing the amount of time needed to  
write-lock the index has been considered."

I'm sure there are ways to speed it up, I just never managed to find
the time to look into it. I never really used IIW.


It might be worth mentioning that when InstantiatedIndex#commit
returns it has yielded an optimized "single segment" index. This is
not quite how a Directory/IndexWriter acts.

> gained on queries makes up for it (from what I see, it can be 2 times
> faster).
>
> InstantiatedIndex will be our default volatile mini index store for  
> our
> next production release.

Very cool!!

> We would have other uses for this index, but the lack of addIndexes
> support makes it impossible for us to use it in other situations. So we
> continue to use RAMDirectory in those situations.

Have you considered using multiple InstantiatedIndexes and a
MultiReader? That would pretty much be the same thing, just that the
store wouldn't be quite as optimized. It would definitely use more RAM
than if it were the same index. You could of course also pass this
MultiReader to a new InstantiatedIndex. I have no real clue about the
difference in speed and RAM consumption between these solutions so you
should benchmark all solutions.

> Do you think we could reach RAMDirectory indexing time by tweaking the
> initial capacity of the java.util collections you use?


Maybe. But I think it would be a relatively small gain. But don't take
my word for it, benchmark it.

Using the InstantiatedIndex(IndexReader) constructor will create
collections of rather optimal size.

As for InstantiatedIndexWriter I think it's pretty much only the
transient collections in #commit that will help you, my guess is that
you should experiment with the dirtyTerms and termsByText attributes.
Count the number of terms in your complete index and see how much it
speeds things up by creating the collections with this size from the
start.
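
The presizing idea above can be sketched as follows. This is an
illustration only: it assumes a plain java.util.HashMap keyed by term text
as a stand-in, the actual types behind dirtyTerms and termsByText may
differ, and PresizedTermMap, capacityFor and newTermMap are hypothetical
names.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: create a term map sized for a known number of terms. */
class PresizedTermMap {

    /**
     * Initial capacity large enough to hold expectedTerms entries without
     * a resize at HashMap's default load factor of 0.75.
     */
    static int capacityFor(int expectedTerms) {
        return (int) (expectedTerms / 0.75f) + 1;
    }

    /** Creates an empty map that should not need to grow for expectedTerms entries. */
    static Map<String, Integer> newTermMap(int expectedTerms) {
        return new HashMap<String, Integer>(capacityFor(expectedTerms));
    }

    public static void main(String[] args) {
        int expectedTerms = 50000; // assumption: term count measured on a complete index

        // Fill a default-sized map and a presized map with the same keys and
        // print the elapsed time of each; a real benchmark needs warm-up and
        // several runs before the numbers mean anything.
        long start = System.nanoTime();
        Map<String, Integer> grown = new HashMap<String, Integer>();
        for (int i = 0; i < expectedTerms; i++) {
            grown.put("term" + i, i);
        }
        long grownNanos = System.nanoTime() - start;

        start = System.nanoTime();
        Map<String, Integer> presized = newTermMap(expectedTerms);
        for (int i = 0; i < expectedTerms; i++) {
            presized.put("term" + i, i);
        }
        long presizedNanos = System.nanoTime() - start;

        System.out.println("default capacity: " + grownNanos + " ns");
        System.out.println("presized:         " + presizedNanos + " ns");
    }
}
```

Whether the difference is measurable on indexes of 1 to 200 documents is
exactly what the benchmark has to show.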



