Posted to dev@lucene.apache.org by Wolfgang Hoschek <wh...@lbl.gov> on 2005/05/02 04:20:47 UTC
Re: [Performance] Streaming main memory indexing of single strings
I've uploaded code that now runs against the current SVN, plus junit
test cases, plus some minor internal updates to the functionality
itself. For details see
http://issues.apache.org/bugzilla/show_bug.cgi?id=34585
Be prepared for the testcases to take some minutes to complete - don't
hit CTRL-C :-)
Erik, if nobody objects, can you please put this into a contrib area,
e.g. module "memory" in org.apache.lucene.index.memory, or similar?
Thanks,
Wolfgang.
On Apr 27, 2005, at 10:30 AM, Erik Hatcher wrote:
> On Apr 27, 2005, at 12:22 PM, Doug Cutting wrote:
>> Erik Hatcher wrote:
>>> I'm not quite sure where to put MemoryIndex - maybe it deserves to
>>> stand on its own in a new contrib area?
>>
>> That sounds good to me.
>
> Ok... once Wolfgang gives me one last round of updates (JUnit tests
> instead of main() and an upgrade to work with trunk) I'll do that. I
> had put it in miscellaneous but will create its own sub-contrib area
> instead.
>
>>
>>> Or does it make sense to put this into misc (still in
>>> sandbox/misc)? Or where?
>>
>> Isn't the goal for sandbox/ to go away, replaced with contrib/?
>
> Yes. In fact, I moved the last relevant piece
> (sandbox/contributions/miscellaneous) to contrib last night. I think
> both the parsers and XML-Indexing-Demo found in the sandbox are not
> worth preserving. Anyone feel that these pieces left in the sandbox
> should be preserved?
>
> Erik
---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org
Re: [Performance] Streaming main memory indexing of single strings
Posted by Wolfgang Hoschek <wh...@lbl.gov>.
>>
>> The version I sent returns in O(1), if performance was your concern.
>> Or
>> did you mean something else?
>
> Since 0 is the only document number in the index, a
>
> return target == 0;
>
> might be nice for skipTo(). It doesn't really help performance, though,
> and the next() works just as well.
>
> Regards,
> Paul Elschot.
>
It's not just "return target == 0". Internally next() switches a
hasNext flag to false, and that makes it a safer operation...
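A minimal sketch of the difference discussed here, assuming a cursor over an index that holds exactly one document (number 0). SingleDocCursor and statelessSkipTo are hypothetical names used for illustration, not the actual MemoryIndexReader code; only the skipTo() that delegates to next() mirrors the fix posted in this thread.

```java
public class SingleDocCursorDemo {

    /** Hypothetical cursor over an index that contains only document 0. */
    static final class SingleDocCursor {
        private boolean hasNext = true;

        /** Advances to the only document; succeeds exactly once. */
        boolean next() {
            boolean advanced = hasNext;
            hasNext = false; // the single document has now been consumed
            return advanced;
        }

        /** Variant from the thread: ignore the target and delegate to next(). */
        boolean skipTo(int target) {
            return next();
        }

        /** Stateless alternative: never marks the cursor as exhausted. */
        boolean statelessSkipTo(int target) {
            return target == 0;
        }
    }

    public static void main(String[] args) {
        SingleDocCursor a = new SingleDocCursor();
        System.out.println(a.skipTo(0)); // first call positions on doc 0
        System.out.println(a.skipTo(0)); // second call reports exhaustion

        SingleDocCursor b = new SingleDocCursor();
        System.out.println(b.statelessSkipTo(0));
        System.out.println(b.statelessSkipTo(0)); // still true; no state changed
    }
}
```

With the delegating version, a caller that keeps calling skipTo(0) until it returns false stops after one document; with the stateless version it would never see false for target 0.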
BTW, did you give the unit tests a shot? Or even better, run it against
some of your own queries/test data? That might help to shake out other
bugs that might potentially be lurking in remote corners...
Cheers,
Wolfgang.
Re: [Performance] Streaming main memory indexing of single strings
Posted by Paul Elschot <pa...@xs4all.nl>.
On Monday 02 May 2005 23:38, Wolfgang Hoschek wrote:
> > Yes, the svn trunk uses skipTo more often than 1.4.3.
> >
> > However, your implementation of skipTo() needs some improvement.
> > See the javadoc of skipTo of class Scorer:
> >
> > http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Scorer.html#skipTo(int)
>
> What's wrong with the version I sent? Remember that there can be at most
> one document in a MemoryIndex. So the "target" parameter can safely be
> ignored, as far as I can see.
Correct, I did not realize that there is only a single doc in the index.
>
> >
> > In case the underlying scorers provide skipTo() it's even better to
> > use that.
> >
>
> The version I sent returns in O(1), if performance was your concern. Or
> did you mean something else?
Since 0 is the only document number in the index, a
return target == 0;
might be nice for skipTo(). It doesn't really help performance, though,
and the next() works just as well.
Regards,
Paul Elschot.
Re: [Performance] Streaming main memory indexing of single strings
Posted by Wolfgang Hoschek <wh...@lbl.gov>.
> Yes, the svn trunk uses skipTo more often than 1.4.3.
>
> However, your implementation of skipTo() needs some improvement.
> See the javadoc of skipTo of class Scorer:
>
> http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Scorer.html#skipTo(int)
What's wrong with the version I sent? Remember that there can be at most
one document in a MemoryIndex. So the "target" parameter can safely be
ignored, as far as I can see.
>
> In case the underlying scorers provide skipTo() it's even better to
> use that.
>
The version I sent returns in O(1), if performance was your concern. Or
did you mean something else?
Wolfgang.
Re: [Performance] Streaming main memory indexing of single strings
Posted by Paul Elschot <pa...@xs4all.nl>.
Wolfgang,
On Monday 02 May 2005 23:21, Wolfgang Hoschek wrote:
> Finally found and fixed the bug!
> The fix is simply to replace MemoryIndex.MemoryIndexReader skipTo()
> with the following:
>
>     public boolean skipTo(int target) {
>         if (DEBUG) System.err.println(".skipTo: " + target);
>         return next();
>     }
>
> Apparently lucene-1.4.3 didn't use skipTo() in a way that triggered the
> bug, while SVN does.
Yes, the svn trunk uses skipTo more often than 1.4.3.
However, your implementation of skipTo() needs some improvement.
See the javadoc of skipTo of class Scorer:
http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Scorer.html#skipTo(int)
In case the underlying scorers provide skipTo() it's even better to use that.
Regards,
Paul Elschot
Re: [Performance] Streaming main memory indexing of single strings
Posted by Wolfgang Hoschek <wh...@lbl.gov>.
This is what I have as scoring calculation, and it seems to do exactly
what lucene-1.4.3 does because the tests pass.

    public byte[] norms(String fieldName) {
        if (DEBUG) System.err.println("MemoryIndexReader.norms: " + fieldName);
        Info info = getInfo(fieldName);
        int numTokens = info != null ? info.numTokens : 0;
        byte norm = Similarity.encodeNorm(
            getSimilarity().lengthNorm(fieldName, numTokens));
        return new byte[] {norm};
    }

    public void norms(String fieldName, byte[] bytes, int offset) {
        if (DEBUG) System.err.println("MemoryIndexReader.norms: " + fieldName + "*");
        byte[] norms = norms(fieldName);
        System.arraycopy(norms, 0, bytes, offset, norms.length);
    }

    private Similarity getSimilarity() {
        return searcher.getSimilarity(); // this is the normal lucene IndexSearcher
    }

Can anyone see what's wrong with it for lucene current SVN? Should my
calculation now be done differently? If so, how?
Thanks for any clues into the right direction.
Wolfgang.
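
For readers unfamiliar with the norm computed above, here is a rough sketch, assuming the default similarity's length norm of 1/sqrt(numTokens). defaultLengthNorm is a hypothetical helper, and the lossy byte encoding performed by Similarity.encodeNorm() is deliberately left out.

```java
public class LengthNormSketch {

    /** Hypothetical helper; assumes the default formula 1/sqrt(numTokens). */
    static float defaultLengthNorm(int numTokens) {
        return (float) (1.0 / Math.sqrt(numTokens));
    }

    public static void main(String[] args) {
        // Longer fields get smaller norms, so a match in a short field weighs more.
        for (int numTokens : new int[] {1, 4, 100}) {
            System.out.println(numTokens + " tokens -> " + defaultLengthNorm(numTokens));
        }
    }
}
```

Because a MemoryIndex holds a single document, the norms array for a field has exactly one entry, which is why the code above returns a one-element byte array.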
On May 2, 2005, at 9:05 AM, Wolfgang Hoschek wrote:
> I'm looking at it right now. The tests pass fine when you put
> lucene-1.4.3.jar instead of the current lucene onto the classpath
> which is what I've been doing so far. Something seems to have changed
> in the scoring calculation. No idea what that might be. I'll see if I
> can find out.
>
> Wolfgang.
>
>> The test case is failing (type "ant test" at the contrib/memory
>> working directory) with this:
>>
>> [junit] Testcase: testMany(org.apache.lucene.index.memory.MemoryIndexTest): Caused an ERROR
>> [junit] BUG DETECTED:69 at query=term AND NOT phrase term, file=src/java/org/apache/lucene/index/memory/MemoryIndex.java, anal=org.apache.lucene.analysis.SimpleAnalyzer@127b52
>> [junit] java.lang.IllegalStateException: BUG DETECTED:69 at query=term AND NOT phrase term, file=src/java/org/apache/lucene/index/memory/MemoryIndex.java, anal=org.apache.lucene.analysis.SimpleAnalyzer@127b52
>> [junit]     at org.apache.lucene.index.memory.MemoryIndexTest.run(MemoryIndexTest.java:305)
>> [junit]     at org.apache.lucene.index.memory.MemoryIndexTest.testMany(MemoryIndexTest.java:228)
>>
Re: [Performance] Streaming main memory indexing of single strings
Posted by Erik Hatcher <er...@ehatchersolutions.com>.
Applied!!
Erik
On May 3, 2005, at 1:31 PM, Wolfgang Hoschek wrote:
> Here's a performance patch for MemoryIndex.MemoryIndexReader that
> caches the norms for a given field, avoiding repeated recomputation
> of the norms. Recall that, depending on the query, norms() can be
> called over and over again with mostly the same parameters. Thus,
> replace public byte[] norms(String fieldName) with the following code:
>
>     /** performance hack: cache norms to avoid repeated expensive calculations */
>     private byte[] cachedNorms;
>     private String cachedFieldName;
>     private Similarity cachedSimilarity;
>
>     public byte[] norms(String fieldName) {
>         byte[] norms = cachedNorms;
>         Similarity sim = getSimilarity();
>         if (fieldName != cachedFieldName || sim != cachedSimilarity) { // not cached?
>             Info info = getInfo(fieldName);
>             int numTokens = info != null ? info.numTokens : 0;
>             float n = sim.lengthNorm(fieldName, numTokens);
>             byte norm = Similarity.encodeNorm(n);
>             norms = new byte[] {norm};
>
>             cachedNorms = norms;
>             cachedFieldName = fieldName;
>             cachedSimilarity = sim;
>             if (DEBUG) System.err.println("MemoryIndexReader.norms: "
>                 + fieldName + ":" + n + ":" + norm + ":" + numTokens);
>         }
>         return norms;
>     }
>
>
> The effect can be substantial when measured with the profiler, so
> it's worth it.
> Wolfgang.
Re: [Performance] Streaming main memory indexing of single strings
Posted by Wolfgang Hoschek <wh...@lbl.gov>.
Here's a performance patch for MemoryIndex.MemoryIndexReader that
caches the norms for a given field, avoiding repeated recomputation of
the norms. Recall that, depending on the query, norms() can be called
over and over again with mostly the same parameters. Thus, replace
public byte[] norms(String fieldName) with the following code:

    /** performance hack: cache norms to avoid repeated expensive calculations */
    private byte[] cachedNorms;
    private String cachedFieldName;
    private Similarity cachedSimilarity;

    public byte[] norms(String fieldName) {
        byte[] norms = cachedNorms;
        Similarity sim = getSimilarity();
        if (fieldName != cachedFieldName || sim != cachedSimilarity) { // not cached?
            Info info = getInfo(fieldName);
            int numTokens = info != null ? info.numTokens : 0;
            float n = sim.lengthNorm(fieldName, numTokens);
            byte norm = Similarity.encodeNorm(n);
            norms = new byte[] {norm};

            cachedNorms = norms;
            cachedFieldName = fieldName;
            cachedSimilarity = sim;
            if (DEBUG) System.err.println("MemoryIndexReader.norms: "
                + fieldName + ":" + n + ":" + norm + ":" + numTokens);
        }
        return norms;
    }

The effect can be substantial when measured with the profiler, so it's
worth it.
Wolfgang.
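
The caching pattern in this patch can be illustrated in isolation. The following is a simplified sketch under stated assumptions: NormsCache, compute() and the computations counter are hypothetical, the similarity key from the patch is omitted, and the norm value is a placeholder rather than a real score. It shows that comparing field names by reference makes the check cheap, and that an equal but distinct String only costs a recomputation rather than a wrong result.

```java
public class NormsCacheSketch {

    /** Hypothetical single-entry cache, keyed on the identity of the field name. */
    static final class NormsCache {
        private byte[] cachedNorms;
        private String cachedFieldName;
        int computations = 0; // counts cache misses, for demonstration only

        byte[] norms(String fieldName) {
            // Identity comparison is cheap; an equal but distinct String is a miss.
            if (cachedNorms == null || fieldName != cachedFieldName) {
                cachedNorms = compute(fieldName);
                cachedFieldName = fieldName;
            }
            return cachedNorms;
        }

        /** Stand-in for the real norm calculation. */
        private byte[] compute(String fieldName) {
            computations++;
            return new byte[] {(byte) fieldName.length()};
        }
    }

    public static void main(String[] args) {
        NormsCache cache = new NormsCache();
        String field = "content";
        cache.norms(field);
        cache.norms(field);                 // same reference: served from cache
        cache.norms(new String("content")); // equal text, different object: recomputed
        System.out.println("computations = " + cache.computations);
    }
}
```

A single-entry cache like this pays off only when consecutive calls mostly repeat the same field, which is the access pattern described in the message above.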
Re: [Performance] Streaming main memory indexing of single strings
Posted by Wolfgang Hoschek <wh...@lbl.gov>.
Thanks!
Wolfgang.
> I've committed this change after it successfully worked for me.
>
> Thanks!
>
> Erik
>
Re: [Performance] Streaming main memory indexing of single strings
Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On May 2, 2005, at 5:21 PM, Wolfgang Hoschek wrote:
> Finally found and fixed the bug!
> The fix is simply to replace MemoryIndex.MemoryIndexReader skipTo()
> with the following:
>
>     public boolean skipTo(int target) {
>         if (DEBUG) System.err.println(".skipTo: " + target);
>         return next();
>     }
>
> Apparently lucene-1.4.3 didn't use skipTo() in a way that triggered
> the bug, while SVN does.
I've committed this change after it successfully worked for me.
Thanks!
Erik
Re: [Performance] Streaming main memory indexing of single strings
Posted by Wolfgang Hoschek <wh...@lbl.gov>.
Finally found and fixed the bug!
The fix is simply to replace MemoryIndex.MemoryIndexReader skipTo()
with the following:

    public boolean skipTo(int target) {
        if (DEBUG) System.err.println(".skipTo: " + target);
        return next();
    }

Apparently lucene-1.4.3 didn't use skipTo() in a way that triggered the
bug, while SVN does.
I now ran the tests over a much larger set of documents and all tests
pass. Give it a shot :-)
Wolfgang.
On May 2, 2005, at 9:05 AM, Wolfgang Hoschek wrote:
> I'm looking at it right now. The tests pass fine when you put
> lucene-1.4.3.jar instead of the current lucene onto the classpath
> which is what I've been doing so far. Something seems to have changed
> in the scoring calculation. No idea what that might be. I'll see if I
> can find out.
>
> Wolfgang.
>
>> The test case is failing (type "ant test" at the contrib/memory
>> working directory) with this:
>>
>> [junit] Testcase: testMany(org.apache.lucene.index.memory.MemoryIndexTest): Caused an ERROR
>> [junit] BUG DETECTED:69 at query=term AND NOT phrase term, file=src/java/org/apache/lucene/index/memory/MemoryIndex.java, anal=org.apache.lucene.analysis.SimpleAnalyzer@127b52
>> [junit] java.lang.IllegalStateException: BUG DETECTED:69 at query=term AND NOT phrase term, file=src/java/org/apache/lucene/index/memory/MemoryIndex.java, anal=org.apache.lucene.analysis.SimpleAnalyzer@127b52
>> [junit]     at org.apache.lucene.index.memory.MemoryIndexTest.run(MemoryIndexTest.java:305)
>> [junit]     at org.apache.lucene.index.memory.MemoryIndexTest.testMany(MemoryIndexTest.java:228)
>>
Re: [Performance] Streaming main memory indexing of single strings
Posted by Wolfgang Hoschek <wh...@lbl.gov>.
I'm looking at it right now. The tests pass fine when you put
lucene-1.4.3.jar instead of the current lucene onto the classpath which
is what I've been doing so far. Something seems to have changed in the
scoring calculation. No idea what that might be. I'll see if I can find
out.
Wolfgang.
> The test case is failing (type "ant test" at the contrib/memory
> working directory) with this:
>
> [junit] Testcase: testMany(org.apache.lucene.index.memory.MemoryIndexTest): Caused an ERROR
> [junit] BUG DETECTED:69 at query=term AND NOT phrase term, file=src/java/org/apache/lucene/index/memory/MemoryIndex.java, anal=org.apache.lucene.analysis.SimpleAnalyzer@127b52
> [junit] java.lang.IllegalStateException: BUG DETECTED:69 at query=term AND NOT phrase term, file=src/java/org/apache/lucene/index/memory/MemoryIndex.java, anal=org.apache.lucene.analysis.SimpleAnalyzer@127b52
> [junit]     at org.apache.lucene.index.memory.MemoryIndexTest.run(MemoryIndexTest.java:305)
> [junit]     at org.apache.lucene.index.memory.MemoryIndexTest.testMany(MemoryIndexTest.java:228)
>
Re: [Performance] Streaming main memory indexing of single strings
Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On May 1, 2005, at 10:20 PM, Wolfgang Hoschek wrote:
> I've uploaded code that now runs against the current SVN, plus junit
> test cases, plus some minor internal updates to the functionality
> itself. For details see
> http://issues.apache.org/bugzilla/show_bug.cgi?id=34585
>
> Be prepared for the testcases to take some minutes to complete - don't
> hit CTRL-C :-)
> Erik, if nobody objects, can you please put this into a contrib area,
> e.g. module "memory" in org.apache.lucene.index.memory, or similar?
I have committed it into contrib/memory. I made a few minor tweaks
such as 2005 for year in license header, putting package statement
above license, and adjusting the paths in the test case to match our
standard src/test and src/java structure.
The test case is failing (type "ant test" at the contrib/memory working
directory) with this:

    [junit] Testcase: testMany(org.apache.lucene.index.memory.MemoryIndexTest): Caused an ERROR
    [junit] BUG DETECTED:69 at query=term AND NOT phrase term, file=src/java/org/apache/lucene/index/memory/MemoryIndex.java, anal=org.apache.lucene.analysis.SimpleAnalyzer@127b52
    [junit] java.lang.IllegalStateException: BUG DETECTED:69 at query=term AND NOT phrase term, file=src/java/org/apache/lucene/index/memory/MemoryIndex.java, anal=org.apache.lucene.analysis.SimpleAnalyzer@127b52
    [junit]     at org.apache.lucene.index.memory.MemoryIndexTest.run(MemoryIndexTest.java:305)
    [junit]     at org.apache.lucene.index.memory.MemoryIndexTest.testMany(MemoryIndexTest.java:228)

Your conversion to a JUnit test case was not quite what I had in mind
:) You simply wrapped your main() into a testMany method. But it is
fine for now as it is easily converted into more granular testXXX
methods that use the JUnit assert* methods. The paths to test files
will likely need to be parameterized and passed in from Ant's <junit>
task via system properties in order to run correctly regardless of
working directory. These things are easily tweaked though and not
worth holding back the initial commit.
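
As a rough illustration of the path parameterization suggested above: a test could resolve its input files against a directory supplied as a system property, for example through a nested <sysproperty> element of Ant's <junit> task. The property name memory.test.data.dir and the TestPathSketch class are assumptions made for this sketch, not part of the committed code.

```java
import java.io.File;

public class TestPathSketch {

    /** Hypothetical property name; a build script would pass it as a system property. */
    static final String DATA_DIR_PROPERTY = "memory.test.data.dir";

    /** Resolves a test file against the configured directory, defaulting to the working directory. */
    static File resolve(String relativePath) {
        String baseDir = System.getProperty(DATA_DIR_PROPERTY, ".");
        return new File(baseDir, relativePath);
    }

    public static void main(String[] args) {
        System.out.println(resolve("src/java/org/apache/lucene/index/memory/MemoryIndex.java"));
    }
}
```

Falling back to "." keeps the test runnable from the module directory when the property is not set, while a build can point it elsewhere.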
Again, I'm impressed with your level of javadocs and thoroughness in
the code. Good stuff!
Erik