You are viewing a plain text version of this content. The canonical link for it is here.
Posted to java-user@lucene.apache.org by Karl Øie <ka...@gan.no> on 2002/01/28 18:23:22 UTC

RE: strange search problems(SOLVED!)

....er, after i sent the previous mail i grep'ed trough the source for
"10000" and found this:


jakarta-lucene/src/java/org/apache/lucene/index/IndexWriter.java


  /** The maximum number of terms that will be indexed for a single field in
a
    document.  This limits the amount of memory required for indexing, so
that
    collections with very large files will not crash the indexing process by
    running out of memory.

    <p>By default, no more than 10,000 terms will be indexed for a field. */
  public int maxFieldLength = 10000;


gentlemen, you may throw your tomatoes now!... sorry to bother you!


mvh karl øie



-----Original Message-----
From: Karl Øie [mailto:karl@gan.no]
Sent: 28. januar 2002 18:16
To: lucene-user@jakarta.apache.org
Subject: strange search problems(cannot query for more than the first
10000 words!?!)


I have created a testclass for working with Analyzers and ran into a strange
problem; I cannot search for text in fields with more than 10000 words!?!?

I have tested for various bugs in my test class, but I cannot find anything
there (please have a look, files are attached).


the class "AnalayzerTest" can be used like this:

"java -cp lucene-1.2-rc3-dev.jar org.apache.lucene.analysis.AnalyzerTest
voc.txt voc_out.txt"

where the "voc.txt" and "voc_out.txt" also are included in the zip file.


The approach is simple: voc.txt contains 20628 Norwegian words, to test the
Analyzer I try to do this:

- create a string containing all the 20628 words separated with " ".
- create a lucene document and index this string as a text field.
- add this one document to an index
- loop trough the words again and query the index for each of the same words
in the list.
- if everything works every word should yield a hit in the single document
that exist in the index.


To be sure nothing is filtered I have used the WhitespaceAnalyzer analyzer
(or NullAnalyzer...).


But here comes the problems:
----------------------------

If I try to run all the 20628 words, the last 10628 words can not be found
by the IndexSearcher. If I flip the words around(reverse alpha-order). I
cannot find the 10628 first words!!.

If I limit the wordlist to 10000, I get a perfect match for either the first
or last 10000 words. If I set the limit to 10005 I will get 5 words not
found at the beginning or end of the list according to order.


Does anyone know what's going on here?? I would be very happy if someone
could point to a place in my code where I have done something really stupid,
because I have tried to track this for a hole day.


mvh karl øie/gan media


--
To unsubscribe, e-mail:   <ma...@jakarta.apache.org>
For additional commands, e-mail: <ma...@jakarta.apache.org>