Posted to dev@lucene.apache.org by Wolfgang Hoschek <wh...@lbl.gov> on 2005/05/02 04:20:47 UTC

Re: [Performance] Streaming main memory indexing of single strings

I've uploaded code that now runs against the current SVN, plus junit 
test cases, plus some minor internal updates to the functionality 
itself. For details see 
http://issues.apache.org/bugzilla/show_bug.cgi?id=34585

Be prepared for the testcases to take some minutes to complete - don't 
hit CTRL-C :-)
Erik, if nobody objects, can you please put this into a contrib area, 
e.g. module "memory" in org.apache.lucene.index.memory, or similar?
Thanks,
Wolfgang.

On Apr 27, 2005, at 10:30 AM, Erik Hatcher wrote:

> On Apr 27, 2005, at 12:22 PM, Doug Cutting wrote:
>> Erik Hatcher wrote:
>>> I'm not quite sure  where to put MemoryIndex - maybe it deserves to 
>>> stand on its own in a  new contrib area?
>>
>> That sounds good to me.
>
> Ok... once Wolfgang gives me one last round of updates (JUnit tests 
> instead of main() and an upgrade to work with trunk) I'll do that.  I 
> had put it in miscellaneous but will create its own sub-contrib area 
> instead.
>
>>
>>> Or does it make sense to put this into misc (still  in 
>>> sandbox/misc)?  Or where?
>>
>> Isn't the goal for sandbox/ to go away, replaced with contrib/?
>
> Yes.  In fact, I moved the last relevant piece 
> (sandbox/contributions/miscellaneous) to contrib last night.   I think 
> both the parsers and XML-Indexing-Demo found in the sandbox are not 
> worth preserving.  Anyone feel that these pieces left in the sandbox 
> should be preserved?
>
> 	Erik
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: [Performance] Streaming main memory indexing of single strings

Posted by Wolfgang Hoschek <wh...@lbl.gov>.
>>
>> The version I sent returns in O(1), if performance was your concern. Or
>> did you mean something else?
>
> Since 0 is the only document number in the index, a
>
> return target == 0;
>
> might be nice for skipTo(). It doesn't really help performance, though,
> and the next() works just as well.
>
> Regards,
> Paul Elschot.
>


It's not just "return target == 0". Internally next() switches a 
hasNext flag to false, and that makes it a safer operation...
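
To make the difference concrete, here is a minimal sketch of a
single-document iterator. The class and method names are hypothetical
stand-ins, not the actual MemoryIndex code; the only thing assumed from
the thread is that next() flips a hasNext flag and that skipTo()
delegates to it.

```java
// Hypothetical stand-in for the single-document iterator discussed above;
// this is NOT the MemoryIndex source, only an illustration of the flag.
final class SingleDocIterator {
    private boolean hasNext = true; // true until document 0 has been handed out
    private int doc = -1;           // -1 means "not positioned yet"

    /** Positions on document 0 the first time, then reports exhaustion. */
    boolean next() {
        if (!hasNext) {
            return false;
        }
        hasNext = false;
        doc = 0;
        return true;
    }

    /** Delegates to next(); the target is ignored because only doc 0 exists. */
    boolean skipTo(int target) {
        return next();
    }

    /** Stateless variant, shown for contrast only: it never becomes exhausted. */
    boolean statelessSkipTo(int target) {
        return target == 0;
    }

    int doc() {
        return doc;
    }
}

class SingleDocIteratorDemo {
    public static void main(String[] args) {
        SingleDocIterator it = new SingleDocIterator();
        System.out.println(it.skipTo(0));          // first call positions on doc 0
        System.out.println(it.skipTo(0));          // second call: iterator is exhausted
        System.out.println(it.statelessSkipTo(0)); // stateless check still reports a match
    }
}
```

With the flag, a caller that keeps advancing in a loop terminates after
the single document; the stateless check would keep answering true for
target 0.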

BTW, did you give the unit tests a shot? Or even better, run it against 
some of your own queries/test data? That might help to shake out other 
bugs that might potentially be lurking in remote corners...

Cheers,
Wolfgang.




Re: [Performance] Streaming main memory indexing of single strings

Posted by Paul Elschot <pa...@xs4all.nl>.
On Monday 02 May 2005 23:38, Wolfgang Hoschek wrote:
> > Yes, the svn trunk uses skipTo more often than 1.4.3.
> >
> > However, your implementation of skipTo() needs some improvement.
> > See the javadoc of skipTo of class Scorer:
> >
> > http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Scorer.html#skipTo(int)
> 
> What's wrong with the version I sent? Remember that there can be at most  
> one document in a MemoryIndex. So the "target" parameter can safely be  
> ignored, as far as I can see.

Correct, I did not realize that there is only a single doc in the index.

> 
> >
> > In case the underlying scorers provide skipTo() it's even better to  
> > use that.
> >
> 
> The version I sent returns in O(1), if performance was your concern. Or  
> did you mean something else?

Since 0 is the only document number in the index, a

return target == 0;

might be nice for skipTo(). It doesn't really help performance, though,
and the next() works just as well.

Regards,
Paul Elschot.




Re: [Performance] Streaming main memory indexing of single strings

Posted by Wolfgang Hoschek <wh...@lbl.gov>.
> Yes, the svn trunk uses skipTo more often than 1.4.3.
>
> However, your implementation of skipTo() needs some improvement.
> See the javadoc of skipTo of class Scorer:
>
> http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Scorer.html#skipTo(int)

What's wrong with the version I sent? Remember that there can be at most  
one document in a MemoryIndex. So the "target" parameter can safely be  
ignored, as far as I can see.

>
> In case the underlying scorers provide skipTo() it's even better to  
> use that.
>

The version I sent returns in O(1), if performance was your concern. Or  
did you mean something else?

Wolfgang.




Re: [Performance] Streaming main memory indexing of single strings

Posted by Paul Elschot <pa...@xs4all.nl>.
Wolfgang,

On Monday 02 May 2005 23:21, Wolfgang Hoschek wrote:
> Finally found and fixed the bug!
> The fix is simply to replace MemoryIndex.MemoryIndexReader skipTo()  
> with the following:
> 
> 				public boolean skipTo(int target) {
> 					if (DEBUG) System.err.println(".skipTo: " + target);
> 					return next();
> 				}
> 
> Apparently lucene-1.4.3 didn't use skipTo() in a way that triggered the  
> bug, while SVN does.

Yes, the svn trunk uses skipTo more often than 1.4.3.

However, your implementation of skipTo() needs some improvement.
See the javadoc of skipTo of class Scorer:

http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Scorer.html#skipTo(int)

In case the underlying scorers provide skipTo() it's even better to use that.

Regards,
Paul Elschot




Re: [Performance] Streaming main memory indexing of single strings

Posted by Wolfgang Hoschek <wh...@lbl.gov>.
This is what I have as scoring calculation, and it seems to do exactly  
what lucene-1.4.3 does because the tests pass.

		public byte[] norms(String fieldName) {
			if (DEBUG) System.err.println("MemoryIndexReader.norms: " + fieldName);
			Info info = getInfo(fieldName);
			int numTokens = info != null ? info.numTokens : 0;
			byte norm = Similarity.encodeNorm(getSimilarity().lengthNorm(fieldName, numTokens));
			return new byte[] {norm};
		}
	
		public void norms(String fieldName, byte[] bytes, int offset) {
			if (DEBUG) System.err.println("MemoryIndexReader.norms: " + fieldName + "*");
			byte[] norms = norms(fieldName);
			System.arraycopy(norms, 0, bytes, offset, norms.length);
		}

		private Similarity getSimilarity() {
			return searcher.getSimilarity(); // this is the normal lucene IndexSearcher
		}
		

Can anyone see what's wrong with it against the current Lucene SVN? Should my
calculation now be done differently? If so, how?
Thanks for any pointers in the right direction.
Wolfgang.
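
For readers without the MemoryIndex source at hand, the shape of the two
norms() overloads can be sketched in isolation. This is an illustration
under stated assumptions, not the real code: the length norm is assumed
to be 1/sqrt(numTokens), and encode() below is a made-up placeholder
quantization that does NOT reproduce Similarity.encodeNorm. All class
and method names are hypothetical.

```java
// Illustration only: a one-document "index" exposes exactly one norm byte.
final class SingleDocNorms {
    private final int numTokens; // number of tokens in the single document's field

    SingleDocNorms(int numTokens) {
        this.numTokens = numTokens;
    }

    /** Assumed length norm: shorter fields get a larger value. */
    static float lengthNorm(int numTokens) {
        return (float) (1.0 / Math.sqrt(numTokens));
    }

    /** Placeholder one-byte quantization; NOT Lucene's encodeNorm. */
    static byte encode(float norm) {
        return (byte) Math.round(Math.min(1.0f, norm) * 127);
    }

    /** Returns a one-element array because there is only one document. */
    byte[] norms() {
        return new byte[] { encode(lengthNorm(numTokens)) };
    }

    /** Copies the single norm into the caller's buffer at the given offset. */
    void norms(byte[] dest, int offset) {
        byte[] n = norms();
        System.arraycopy(n, 0, dest, offset, n.length);
    }
}

class SingleDocNormsDemo {
    public static void main(String[] args) {
        byte[] buffer = new byte[3];
        new SingleDocNorms(4).norms(buffer, 1);
        System.out.println(java.util.Arrays.toString(buffer));
    }
}
```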

On May 2, 2005, at 9:05 AM, Wolfgang Hoschek wrote:

> I'm looking at it right now. The tests pass fine when you put  
> lucene-1.4.3.jar instead of the current lucene onto the classpath  
> which is what I've been doing so far. Something seems to have changed  
> in the scoring calculation. No idea what that might be. I'll see if I  
> can find out.
>
> Wolfgang.
>
>> The test case is failing (type "ant test" at the contrib/memory  
>> working directory) with this:
>>
>>     [junit] Testcase: testMany(org.apache.lucene.index.memory.MemoryIndexTest): Caused an ERROR
>>     [junit] BUG DETECTED:69 at query=term AND NOT phrase term, file=src/java/org/apache/lucene/index/memory/MemoryIndex.java, anal=org.apache.lucene.analysis.SimpleAnalyzer@127b52
>>     [junit] java.lang.IllegalStateException: BUG DETECTED:69 at query=term AND NOT phrase term, file=src/java/org/apache/lucene/index/memory/MemoryIndex.java, anal=org.apache.lucene.analysis.SimpleAnalyzer@127b52
>>     [junit]     at org.apache.lucene.index.memory.MemoryIndexTest.run(MemoryIndexTest.java:305)
>>     [junit]     at org.apache.lucene.index.memory.MemoryIndexTest.testMany(MemoryIndexTest.java:228)
>>
>
>


Re: [Performance] Streaming main memory indexing of single strings

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
Applied!!

     Erik

On May 3, 2005, at 1:31 PM, Wolfgang Hoschek wrote:

> Here's a performance patch for MemoryIndex.MemoryIndexReader that  
> caches the norms for a given field, avoiding repeated recomputation  
> of the norms. Recall that, depending on the query, norms() can be  
> called over and over again with mostly the same parameters. Thus,  
> replace public byte[] norms(String fieldName) with the following code:
>
>         /** performance hack: cache norms to avoid repeated expensive calculations */
>         private byte[] cachedNorms;
>         private String cachedFieldName;
>         private Similarity cachedSimilarity;
>
>         public byte[] norms(String fieldName) {
>             byte[] norms = cachedNorms;
>             Similarity sim = getSimilarity();
>             if (fieldName != cachedFieldName || sim != cachedSimilarity) { // not cached?
>                 Info info = getInfo(fieldName);
>                 int numTokens = info != null ? info.numTokens : 0;
>                 float n = sim.lengthNorm(fieldName, numTokens);
>                 byte norm = Similarity.encodeNorm(n);
>                 norms = new byte[] {norm};
>
>                 cachedNorms = norms;
>                 cachedFieldName = fieldName;
>                 cachedSimilarity = sim;
>                 if (DEBUG) System.err.println("MemoryIndexReader.norms: " + fieldName + ":" + n + ":" + norm + ":" + numTokens);
>             }
>             return norms;
>         }
>
>
> The effect can be substantial when measured with the profiler, so  
> it's worth it.
> Wolfgang.
>
>


Re: [Performance] Streaming main memory indexing of single strings

Posted by Wolfgang Hoschek <wh...@lbl.gov>.
Here's a performance patch for MemoryIndex.MemoryIndexReader that 
caches the norms for a given field, avoiding repeated recomputation of 
the norms. Recall that, depending on the query, norms() can be called 
over and over again with mostly the same parameters. Thus, replace 
public byte[] norms(String fieldName) with the following code:

		/** performance hack: cache norms to avoid repeated expensive calculations */
		private byte[] cachedNorms;
		private String cachedFieldName;
		private Similarity cachedSimilarity;
		
		public byte[] norms(String fieldName) {
			byte[] norms = cachedNorms;
			Similarity sim = getSimilarity();
			if (fieldName != cachedFieldName || sim != cachedSimilarity) { // not cached?
				Info info = getInfo(fieldName);
				int numTokens = info != null ? info.numTokens : 0;
				float n = sim.lengthNorm(fieldName, numTokens);
				byte norm = Similarity.encodeNorm(n);
				norms = new byte[] {norm};
				
				cachedNorms = norms;
				cachedFieldName = fieldName;
				cachedSimilarity = sim;
				if (DEBUG) System.err.println("MemoryIndexReader.norms: " + fieldName + ":" + n + ":" + norm + ":" + numTokens);
			}
			return norms;
		}


The effect can be substantial when measured with the profiler, so it's 
worth it.
Wolfgang.
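
The caching idea in the patch above is a single-entry memo keyed by
object identity. A stripped-down sketch, with hypothetical names and a
placeholder computation standing in for the real norm calculation, might
look like this. Note the assumption that callers pass the same String
instance on repeated calls; an equal but distinct instance simply causes
a recomputation.

```java
// Hypothetical single-entry cache; not the MemoryIndex code itself.
final class SingleEntryCache {
    private String cachedKey;   // last key seen, compared by reference (==)
    private byte[] cachedValue; // value computed for cachedKey
    private int misses;         // recomputation counter, for illustration only

    /** Returns the cached value when the very same key instance is passed again. */
    byte[] get(String key) {
        if (cachedValue == null || key != cachedKey) { // not cached?
            cachedValue = expensiveCompute(key);
            cachedKey = key;
            misses++;
        }
        return cachedValue;
    }

    /** Placeholder for the costly work; keys are assumed to be non-null. */
    private static byte[] expensiveCompute(String key) {
        return new byte[] { (byte) key.length() };
    }

    int misses() {
        return misses;
    }
}

class SingleEntryCacheDemo {
    public static void main(String[] args) {
        SingleEntryCache cache = new SingleEntryCache();
        String field = "content";
        cache.get(field);
        cache.get(field);                 // same instance: served from the cache
        cache.get(new String("content")); // equal but distinct instance: recomputed
        System.out.println("misses: " + cache.misses());
    }
}
```

A reference comparison is cheaper than equals(), and a false miss only
costs one extra computation, which is why this kind of check can be
acceptable for a hot path.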




Re: [Performance] Streaming main memory indexing of single strings

Posted by Wolfgang Hoschek <wh...@lbl.gov>.
Thanks!
Wolfgang.

> I've committed this change after it successfully worked for me.
>
> Thanks!
>
>     Erik
>




Re: [Performance] Streaming main memory indexing of single strings

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On May 2, 2005, at 5:21 PM, Wolfgang Hoschek wrote:

> Finally found and fixed the bug!
> The fix is simply to replace MemoryIndex.MemoryIndexReader skipTo()  
> with the following:
>
>                 public boolean skipTo(int target) {
>                     if (DEBUG) System.err.println(".skipTo: " + target);
>                     return next();
>                 }
>
> Apparently lucene-1.4.3 didn't use skipTo() in a way that triggered  
> the bug, while SVN does.

I've committed this change after it successfully worked for me.

Thanks!

     Erik




Re: [Performance] Streaming main memory indexing of single strings

Posted by Wolfgang Hoschek <wh...@lbl.gov>.
Finally found and fixed the bug!
The fix is simply to replace MemoryIndex.MemoryIndexReader skipTo()  
with the following:

				public boolean skipTo(int target) {
					if (DEBUG) System.err.println(".skipTo: " + target);
					return next();
				}

Apparently lucene-1.4.3 didn't use skipTo() in a way that triggered the  
bug, while SVN does.

I now ran the tests over a much larger set of documents and all tests  
pass. Give it a shot :-)
Wolfgang.


On May 2, 2005, at 9:05 AM, Wolfgang Hoschek wrote:

> I'm looking at it right now. The tests pass fine when you put  
> lucene-1.4.3.jar instead of the current lucene onto the classpath  
> which is what I've been doing so far. Something seems to have changed  
> in the scoring calculation. No idea what that might be. I'll see if I  
> can find out.
>
> Wolfgang.
>
>> The test case is failing (type "ant test" at the contrib/memory  
>> working directory) with this:
>>
>>     [junit] Testcase: testMany(org.apache.lucene.index.memory.MemoryIndexTest): Caused an ERROR
>>     [junit] BUG DETECTED:69 at query=term AND NOT phrase term, file=src/java/org/apache/lucene/index/memory/MemoryIndex.java, anal=org.apache.lucene.analysis.SimpleAnalyzer@127b52
>>     [junit] java.lang.IllegalStateException: BUG DETECTED:69 at query=term AND NOT phrase term, file=src/java/org/apache/lucene/index/memory/MemoryIndex.java, anal=org.apache.lucene.analysis.SimpleAnalyzer@127b52
>>     [junit]     at org.apache.lucene.index.memory.MemoryIndexTest.run(MemoryIndexTest.java:305)
>>     [junit]     at org.apache.lucene.index.memory.MemoryIndexTest.testMany(MemoryIndexTest.java:228)
>>
>
>


Re: [Performance] Streaming main memory indexing of single strings

Posted by Wolfgang Hoschek <wh...@lbl.gov>.
I'm looking at it right now. The tests pass fine when you put  
lucene-1.4.3.jar on the classpath instead of the current Lucene, which  
is what I've been doing so far. Something seems to have changed in the  
scoring calculation. No idea what that might be. I'll see if I can find  
out.

Wolfgang.

> The test case is failing (type "ant test" at the contrib/memory  
> working directory) with this:
>
>     [junit] Testcase: testMany(org.apache.lucene.index.memory.MemoryIndexTest): Caused an ERROR
>     [junit] BUG DETECTED:69 at query=term AND NOT phrase term, file=src/java/org/apache/lucene/index/memory/MemoryIndex.java, anal=org.apache.lucene.analysis.SimpleAnalyzer@127b52
>     [junit] java.lang.IllegalStateException: BUG DETECTED:69 at query=term AND NOT phrase term, file=src/java/org/apache/lucene/index/memory/MemoryIndex.java, anal=org.apache.lucene.analysis.SimpleAnalyzer@127b52
>     [junit]     at org.apache.lucene.index.memory.MemoryIndexTest.run(MemoryIndexTest.java:305)
>     [junit]     at org.apache.lucene.index.memory.MemoryIndexTest.testMany(MemoryIndexTest.java:228)
>




Re: [Performance] Streaming main memory indexing of single strings

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On May 1, 2005, at 10:20 PM, Wolfgang Hoschek wrote:

> I've uploaded code that now runs against the current SVN, plus junit  
> test cases, plus some minor internal updates to the functionality  
> itself. For details see  
> http://issues.apache.org/bugzilla/show_bug.cgi?id=34585
>
> Be prepared for the testcases to take some minutes to complete - don't  
> hit CTRL-C :-)
> Erik, if nobody objects, can you please put this into a contrib area,  
> e.g. module "memory" in org.apache.lucene.index.memory, or similar?

I have committed it into contrib/memory.  I made a few minor tweaks  
such as 2005 for year in license header, putting package statement  
above license, and adjusting the paths in the test case to match our  
standard src/test and src/java structure.

The test case is failing (type "ant test" at the contrib/memory working  
directory) with this:

     [junit] Testcase: testMany(org.apache.lucene.index.memory.MemoryIndexTest): Caused an ERROR
     [junit] BUG DETECTED:69 at query=term AND NOT phrase term, file=src/java/org/apache/lucene/index/memory/MemoryIndex.java, anal=org.apache.lucene.analysis.SimpleAnalyzer@127b52
     [junit] java.lang.IllegalStateException: BUG DETECTED:69 at query=term AND NOT phrase term, file=src/java/org/apache/lucene/index/memory/MemoryIndex.java, anal=org.apache.lucene.analysis.SimpleAnalyzer@127b52
     [junit]     at org.apache.lucene.index.memory.MemoryIndexTest.run(MemoryIndexTest.java:305)
     [junit]     at org.apache.lucene.index.memory.MemoryIndexTest.testMany(MemoryIndexTest.java:228)

Your conversion to a JUnit test case was not quite what I had in mind  
:)  You simply wrapped your main() into a testMany method.  But it is  
fine for now as it is easily converted into more granular testXXX  
methods that use the JUnit assert* methods.  The paths to test files  
will likely need to be parameterized and passed in from Ant's <junit>  
task via system properties in order to run correctly regardless of  
working directory.  These things are easily tweaked though and not  
worth holding back the initial commit.
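
One possible way to parameterize the test file paths, as suggested
above, is a small helper that reads a base directory from a system
property. This is a sketch, not the committed test code: the helper
class and the property name "test.data.dir" are hypothetical, and the
Ant build would have to set the same key (for example with a nested
<sysproperty> element inside <junit>).

```java
import java.io.File;

// Hypothetical helper; not part of the committed MemoryIndexTest.
final class TestPaths {
    /** Assumed property name; the build file must use the same key. */
    static final String BASE_DIR_PROPERTY = "test.data.dir";

    /**
     * Resolves a test file against the configured base directory,
     * falling back to the current working directory when the property is unset.
     */
    static File resolve(String relativePath) {
        String base = System.getProperty(BASE_DIR_PROPERTY, ".");
        return new File(base, relativePath);
    }
}

class TestPathsDemo {
    public static void main(String[] args) {
        // Prints the resolved location; the file itself need not exist for this demo.
        System.out.println(
            TestPaths.resolve("src/java/org/apache/lucene/index/memory/MemoryIndex.java"));
    }
}
```

Each granular testXXX method could then obtain its input through
TestPaths.resolve(...) and check results with the JUnit assert* methods,
so the tests no longer depend on the working directory.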

Again, I'm impressed with your level of javadocs and thoroughness in  
the code.  Good stuff!

	Erik

