Posted to dev@lucene.apache.org by Wolfgang Hoschek <wh...@lbl.gov> on 2005/04/13 22:11:12 UTC
[Performance] Streaming main memory indexing of single strings
Hi,
I'm wondering if anyone could let me know how to improve Lucene
performance for "streaming main memory indexing of single strings".
This would help to effectively integrate Lucene with the Nux XQuery
engine.
Below is a small microbenchmark simulating STREAMING XQuery fulltext
search as typical for XML network routers, message queuing systems, P2P
networks, etc. In this on-the-fly main memory indexing scenario, each
individual string is immediately matched as soon as it becomes
available without any persistence involved. This usage scenario and
corresponding performance profile is quite different in comparison to
fulltext search over persistent (read-mostly) indexes.
The benchmark runs at some 3000 Lucene queries/sec (lucene-1.4.3), which
is unfortunate news considering the XQuery engine can easily walk
hundreds of thousands of XML nodes per second. Ideally I'd like to run
at some 100000 queries/sec. Running this through the JDK 1.5 profiler
it seems that most time is spent in and below the following calls:
writer = new IndexWriter(dir, analyzer, true);
writer.addDocument(...);
writer.close();
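(For context on how a queries/sec figure like the one above is derived: it is plain wall-clock timing over N iterations. A pure-Java stand-in for that measurement loop — hypothetical helper, no Lucene involved; the Runnable stands in for the open-add-close-search cycle — would be:)

```java
// Generic per-operation throughput harness, mirroring the benchmark's
// queries/sec computation. Illustrative only; not part of the benchmark.
final class Throughput {
    static double opsPerSec(Runnable op, int n) {
        long start = System.nanoTime();
        for (int i = 0; i < n; i++) op.run();               // the measured work
        long elapsed = Math.max(1, System.nanoTime() - start); // guard against /0
        return n / (elapsed / 1e9);                          // ops per second
    }
}
```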
I tried quite a few variants of the benchmark with various options,
unfortunately with little or no effect.
Lucene just does not seem to be designed to do this sort of "transient
single string index" thing. All code paths related to opening, closing,
reading, writing, querying and object creation seem to be designed for
large persistent indexes.
Any advice on what I'm missing or what could be done about it would be
greatly appreciated.
Wolfgang.
P.S. the benchmark code is attached as a file below:
Re: [Performance] Streaming main memory indexing of single strings
Posted by Wolfgang Hoschek <wh...@lbl.gov>.
Thanks for pointing this out!
The overhead wasn't substantial, it turns out. On closer
inspection/profiling there's more substantial (hidden) baggage when
subclassing IndexWriter. Would not subclassing IndexWriter have
similarly unexpected negative consequences?
Thanks,
Wolfgang.
On Apr 15, 2005, at 4:13 PM, Robert Engels wrote:
> You cannot do this as easily as it sounds. As I've pointed out on this
> list
> before, there are a multitude of places in the Query handling that
> assume an
> IndexReader is available. The Searcher interface doesn't really buy
> you much
> because the only available implementation is IndexSearcher, and that
> assumes an IndexReader implementation is available.
>
> To provide a pure implementation of Searcher you would need to
> reimplement
> all of Lucene.
>
> Trust me, stick with IndexReader subclassing (ideally, IndexReader
> would be
> an interface, and the core would be moved into AbstractIndexReader so
> projects like this would be much easier).
>
> Robert
>
> -----Original Message-----
> From: Wolfgang Hoschek [mailto:whoschek@lbl.gov]
> Sent: Friday, April 15, 2005 5:58 PM
> To: java-dev@lucene.apache.org
> Subject: Re: [Performance] Streaming main memory indexing of single
> strings
>
>
> A primary reason for the tight API proposed below is that it allows for
> maximum efficiency (which is the point of the exercise in the first
> place):
>
> - One can extend Searcher rather than IndexReader: There's no need to
> subclass IndexReader and carry the baggage of the superclass doing
> locking and all sort of unnecessary stuff with its internal
> RAMDirectory.
> - Even more extreme: Don't extend Searcher but implement the
> functionality directly using low-level APIs. This avoids unnecessary
> baggage for collecting hits, etc.
>
> Wolfgang.
>
> On Apr 15, 2005, at 3:15 PM, Wolfgang Hoschek wrote:
>
>> Cool! For my use case it would need to be able to handle arbitrary
>> queries (previously parsed from a general lucene query string).
>> Something like:
>>
>> float match(String text, Query query)
>>
>> it's fine with me if it also works for
>>
>> float[] match(String[] texts, Query query) or
>> float match(Document doc, Query query)
>>
>> but that isn't required by the use case.
>>
>> Wolfgang.
>>
>>> I am intrigued by this and decided to mock a quick and dirty example
>>> of such an IndexReader. After a little trial-and-error I got it
>>> working at least for TermQuery and WildcardQuery. I've pasted my
>>> code below as an example, but there is much room for improvement,
>>> especially in terms of performance and also in keeping track of term
>>> frequency, and also it would be nicer if it handled the analysis
>>> internally.
>>>
>>> I think something like this would make a handy addition to our
>>> contrib area at least. I'd be happy to receive improvements to this
>>> and then add it to a contrib subproject.
>>>
>>> Perhaps this would be a handy way to handle situations where users
>>> have queries saved in a system and need to be alerted whenever a new
>>> document arrives matching the saved queries?
>>>
>>> Erik
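(A minimal sketch of the alerting pattern Erik describes — hypothetical API and naive single-term matching standing in for real parsed Lucene queries: each arriving document is checked against all saved queries, and the ids of the matching queries are reported.)

```java
import java.util.*;

// Illustrative only: saved "queries" are single lowercase terms keyed by id.
final class SavedQueryAlerter {
    private final Map<String, String> queriesById = new LinkedHashMap<>();

    void save(String id, String term) {
        queriesById.put(id, term.toLowerCase());
    }

    // Called once per arriving document; returns ids of matching queries.
    List<String> onArrival(String document) {
        Set<String> tokens = new HashSet<>(
                Arrays.asList(document.toLowerCase().split("\\s+")));
        List<String> alerts = new ArrayList<>();
        for (Map.Entry<String, String> e : queriesById.entrySet()) {
            if (tokens.contains(e.getValue())) alerts.add(e.getKey());
        }
        return alerts;
    }
}
```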
>>>
>>>
>>>
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Wolfgang Hoschek [mailto:whoschek@lbl.gov]
>>>> Sent: Thursday, April 14, 2005 4:04 PM
>>>> To: java-dev@lucene.apache.org
>>>> Subject: Re: [Performance] Streaming main memory indexing of single
>>>> strings
>>>>
>>>>
>>>> This seems to be a promising avenue worth exploring. My gut feeling
>>>> is
>>>> that this could easily be 10-100 times faster.
>>>>
>>>> The drawback is that it requires a fair amount of understanding of
>>>> intricate Lucene internals, pulling those pieces together and
>>>> adapting
>>>> them as required for the seemingly simple "float match(String text,
>>>> Query query)".
>>>>
>>>> I might give it a shot but I'm not sure I'll be able to pull this
>>>> off!
>>>> Is there any similar code I could look at as a starting point?
>>>>
>>>> Wolfgang.
>>>>
>>>> On Apr 14, 2005, at 1:13 PM, Robert Engels wrote:
>>>>
>>>>> I think you are not approaching this the correct way.
>>>>>
>>>>> Pseudo code:
>>>>>
>>>>> Subclass IndexReader.
>>>>>
>>>>> Get tokens from String 'document' using Lucene analyzers.
>>>>>
>>>>> Build simple hash-map based data structures using tokens for terms,
>>>>> and term
>>>>> positions.
>>>>>
>>>>> reimplement termDocs() and termPositions() to use the structures
>>>>> from
>>>>> above.
>>>>>
>>>>> run searches.
>>>>>
>>>>> start again with next document.
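(The per-document data structures in Robert's recipe can be rendered as a self-contained sketch — plain Java collections, illustrative names only, a whitespace tokenizer standing in for a Lucene analyzer, and no actual IndexReader subclass:)

```java
import java.util.*;

// One transient "document": term -> list of token positions. A real
// implementation would back a custom IndexReader's termDocs() and
// termPositions() with exactly these maps, then discard them.
final class OneDocTermIndex {
    final Map<String, List<Integer>> positions = new HashMap<>();

    OneDocTermIndex(String text) {
        // Stand-in for analysis: lowercase, split on whitespace.
        String[] tokens = text.toLowerCase().split("\\s+");
        for (int pos = 0; pos < tokens.length; pos++) {
            positions.computeIfAbsent(tokens[pos],
                    t -> new ArrayList<>()).add(pos);
        }
    }

    int freq(String term) {
        List<Integer> p = positions.get(term);
        return p == null ? 0 : p.size();
    }
}
```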
>>>>>
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: Wolfgang Hoschek [mailto:whoschek@lbl.gov]
>>>>> Sent: Thursday, April 14, 2005 2:56 PM
>>>>> To: java-dev@lucene.apache.org
>>>>> Subject: Re: [Performance] Streaming main memory indexing of single
>>>>> strings
>>>>>
>>>>>
>>>>> Otis, this might be a misunderstanding.
>>>>>
>>>>> - I'm not calling optimize(). That piece is commented out if you
>>>>> look
>>>>> again at the code.
>>>>> - The *streaming* use case requires that for each query I add one
>>>>> (and
>>>>> only one) document (aka string) to an empty index:
>>>>>
>>>>> repeat N times (where N is millions or billions):
>>>>> add a single string (aka document) to an empty index
>>>>> query the index
>>>>> drop index (or delete its document)
>>>>>
>>>>> with the following API being called N times: float match(String
>>>>> text,
>>>>> Query query)
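(As an illustration only — hypothetical names, no Lucene classes — the proposed per-call contract, reduced to a single-term query, shows how transient each call is: a fresh token set is built, consulted once, and discarded.)

```java
import java.util.*;

// Naive stand-in for "float match(String text, Query query)": score 1.0
// if the term occurs in the text, else 0.0. Nothing persists between calls.
final class StreamingMatcher {
    static float match(String text, String term) {
        Set<String> tokens = new HashSet<>(
                Arrays.asList(text.toLowerCase().split("\\s+")));
        return tokens.contains(term.toLowerCase()) ? 1.0f : 0.0f;
    }
}
```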
>>>>>
>>>>> So there's no possibility of adding many documents and thereafter
>>>>> running the query. This in turn seems to mean that the IndexWriter
>>>>> can't be kept open - unless I manually delete each document after
>>>>> each
>>>>> query to repeatedly reuse the RAMDirectory, which I've also tried
>>>>> before without any significant performance gain - deletion seems to
>>>>> have substantial overhead in itself. Perhaps it would be better if
>>>>> there were a Directory.deleteAllDocuments() or similar. Did you
>>>>> have
>>>>> some other approach in mind?
>>>>>
>>>>> As I said, Lucene's design doesn't seem to fit this streaming use
>>>>> case
>>>>> pattern well. In *this* scenario one could easily do without any
>>>>> locking, and without byte level organization in RAMDirectory and
>>>>> RAMFile, etc because a single small string isn't a large persistent
>>>>> multi-document index.
>>>>>
>>>>> For some background, here's a small example for the kind of XQuery
>>>>> functionality Nux/Lucene integration enables:
>>>>>
>>>>> (: An XQuery that finds all books authored by James that have
>>>>> something
>>>>> to do with "fish", sorted by relevance :)
>>>>> declare namespace lucene = "java:nux.xom.xquery.XQueryUtil";
>>>>> declare variable $query := "fish*~";
>>>>>
>>>>> for $book in /books/book[author="James" and lucene:match(string(.),
>>>>> $query) > 0.0]
>>>>> let $score := lucene:match(string($book), $query)
>>>>> order by $score descending
>>>>> return (<score>{$score}</score>, $book)
>>>>>
>>>>> More interestingly one can use this for classifying and routing XML
>>>>> messages based on rules (i.e. queries) inspecting their content...
>>>>>
>>>>> Any other clues about potential improvements would be greatly
>>>>> appreciated.
>>>>>
>>>>> Wolfgang.
>>>>>
>>>>> On Apr 13, 2005, at 10:09 PM, Otis Gospodnetic wrote:
>>>>>
>>>>>> It looks like you are calling that IndexWriter code in some loops,
>>>>>> opening it and closing it in every iteration of the loop and also
>>>>>> calling optimize. All of those things could be improved.
>>>>>> Keep your IndexWriter open, don't close it, and optimize the index
>>>>>> only
>>>>>> once you are done adding documents to it.
>>>>>>
>>>>>> See the highlights and the snippets in the first hit:
>>>>>> http://www.lucenebook.com/search?query=when+to+optimize
>>>>>>
>>>>>> Otis
>>>>>>
>>>>>>
>>>>>> --- Wolfgang Hoschek <wh...@lbl.gov> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I'm wondering if anyone could let me know how to improve Lucene
>>>>>>> performance for "streaming main memory indexing of single
>>>>>>> strings". This would help to effectively integrate Lucene with
>>>>>>> the Nux XQuery engine.
>>>>>>>
>>>>>>> Below is a small microbenchmark simulating STREAMING XQuery
>>>>>>> fulltext search as typical for XML network routers, message
>>>>>>> queuing systems, P2P networks, etc. In this on-the-fly main
>>>>>>> memory indexing scenario, each individual string is immediately
>>>>>>> matched as soon as it becomes available without any persistence
>>>>>>> involved. This usage scenario and corresponding performance
>>>>>>> profile is quite different in comparison to fulltext search over
>>>>>>> persistent (read-mostly) indexes.
>>>>>>>
>>>>>>> The benchmark runs at some 3000 Lucene queries/sec
>>>>>>> (lucene-1.4.3), which is unfortunate news considering the XQuery
>>>>>>> engine can easily walk hundreds of thousands of XML nodes per
>>>>>>> second. Ideally I'd like to run at some 100000 queries/sec.
>>>>>>> Running this through the JDK 1.5 profiler it seems that most
>>>>>>> time is spent in and below the following calls:
>>>>>>>
>>>>>>> writer = new IndexWriter(dir, analyzer, true);
>>>>>>> writer.addDocument(...);
>>>>>>> writer.close();
>>>>>>>
>>>>>>> I tried quite a few variants of the benchmark with various
>>>>>>> options, unfortunately with little or no effect.
>>>>>>> Lucene just does not seem to be designed to do this sort of
>>>>>>> "transient single string index" thing. All code paths related to
>>>>>>> opening, closing, reading, writing, querying and object creation
>>>>>>> seem to be designed for large persistent indexes.
>>>>>>>
>>>>>>> Any advice on what I'm missing or what could be done about it
>>>>>>> would be greatly appreciated.
>>>>>>>
>>>>>>> Wolfgang.
>>>>>>>
>>>>>>> P.S. the benchmark code is attached as a file below:
>>>>>>>
>>>>>>> package nux.xom.pool;
>>>>>>>
>>>>>>> import java.io.IOException;
>>>>>>> //import java.io.Reader;
>>>>>>>
>>>>>>> import org.apache.lucene.analysis.Analyzer;
>>>>>>> //import org.apache.lucene.analysis.LowerCaseTokenizer;
>>>>>>> //import org.apache.lucene.analysis.PorterStemFilter;
>>>>>>> //import org.apache.lucene.analysis.SimpleAnalyzer;
>>>>>>> //import org.apache.lucene.analysis.TokenStream;
>>>>>>> import org.apache.lucene.analysis.standard.StandardAnalyzer;
>>>>>>> import org.apache.lucene.document.Document;
>>>>>>> import org.apache.lucene.document.Field;
>>>>>>> //import org.apache.lucene.index.IndexReader;
>>>>>>> import org.apache.lucene.index.IndexWriter;
>>>>>>> import org.apache.lucene.queryParser.ParseException;
>>>>>>> import org.apache.lucene.queryParser.QueryParser;
>>>>>>> import org.apache.lucene.search.Hits;
>>>>>>> import org.apache.lucene.search.IndexSearcher;
>>>>>>> import org.apache.lucene.search.Query;
>>>>>>> import org.apache.lucene.search.Searcher;
>>>>>>> import org.apache.lucene.store.Directory;
>>>>>>> import org.apache.lucene.store.RAMDirectory;
>>>>>>>
>>>>>>> public final class LuceneMatcher { // TODO: make non-public
>>>>>>>
>>>>>>>     private final Analyzer analyzer;
>>>>>>>     // private final Directory dir = new RAMDirectory();
>>>>>>>
>>>>>>>     public LuceneMatcher() {
>>>>>>>         this(new StandardAnalyzer());
>>>>>>>         // this(new SimpleAnalyzer());
>>>>>>>         // this(new StopAnalyzer());
>>>>>>>         // this(new Analyzer() {
>>>>>>>         //     public final TokenStream tokenStream(String fieldName, Reader reader) {
>>>>>>>         //         return new PorterStemFilter(new LowerCaseTokenizer(reader));
>>>>>>>         //     }
>>>>>>>         // });
>>>>>>>     }
>>>>>>>
>>>>>>>     public LuceneMatcher(Analyzer analyzer) {
>>>>>>>         if (analyzer == null)
>>>>>>>             throw new IllegalArgumentException("analyzer must not be null");
>>>>>>>         this.analyzer = analyzer;
>>>>>>>     }
>>>>>>>
>>>>>>>     public Query parseQuery(String expression) throws ParseException {
>>>>>>>         QueryParser parser = new QueryParser("content", analyzer);
>>>>>>>         // parser.setPhraseSlop(0);
>>>>>>>         return parser.parse(expression);
>>>>>>>     }
>>>>>>>
>>>>>>>     /**
>>>>>>>      * Returns the relevance score by matching the given index against
>>>>>>>      * the given Lucene query expression. The index must not contain
>>>>>>>      * more than one Lucene "document" (aka string to be searched).
>>>>>>>      */
>>>>>>>     public float match(Directory index, Query query) {
>>>>>>>         Searcher searcher = null;
>>>>>>>         try {
>>>>>>>             searcher = new IndexSearcher(index);
>>>>>>>             Hits hits = searcher.search(query);
>>>>>>>             float score = hits.length() > 0 ? hits.score(0) : 0.0f;
>>>>>>>             return score;
>>>>>>>         } catch (IOException e) { // should never happen (RAMDirectory)
>>>>>>>             throw new RuntimeException(e);
>>>>>>>         } finally {
>>>>>>>             try {
>>>>>>>                 if (searcher != null) searcher.close();
>>>>>>>             } catch (IOException e) { // should never happen (RAMDirectory)
>>>>>>>                 throw new RuntimeException(e);
>>>>>>>             }
>>>>>>>         }
>>>>>>>     }
>>>>>>>
>>>>>>>     // public float match(String text, Query query) {
>>>>>>>     //     return match(createIndex(text), query);
>>>>>>>     // }
>>>>>>>
>>>>>>>     public Directory createIndex(String text) {
>>>>>>>         Directory dir = new RAMDirectory();
>>>>>>>         IndexWriter writer = null;
>>>>>>>         try {
>>>>>>>             writer = new IndexWriter(dir, analyzer, true);
>>>>>>>             // writer.setUseCompoundFile(false);
>>>>>>>             // writer.mergeFactor = 2;
>>>>>>>             // writer.minMergeDocs = 1;
>>>>>>>             // writer.maxMergeDocs = 1;
>>>>>>>
>>>>>>>             writer.addDocument(createDocument(text));
>>>>>>>             // writer.optimize();
>>>>>>>             return dir;
>>>>>>>         } catch (IOException e) { // should never happen (RAMDirectory)
>>>>>>>             throw new RuntimeException(e);
>>>>>>>         } finally {
>>>>>>>             try {
>>>>>>>                 if (writer != null) writer.close();
>>>>>>>             } catch (IOException e) { // should never happen (RAMDirectory)
>>>>>>>                 throw new RuntimeException(e);
>>>>>>>             }
>>>>>>>         }
>>>>>>>     }
>>>>>>>
>>>>>>>     private Document createDocument(String content) {
>>>>>>>         Document doc = new Document();
>>>>>>>         doc.add(Field.UnStored("content", content));
>>>>>>>         // doc.add(Field.Text("x", content));
>>>>>>>         return doc;
>>>>>>>     }
>>>>>>>
>>>>>>>     /**
>>>>>>>      * Lucene microbenchmark simulating STREAMING XQuery fulltext search
>>>>>>>      * as typical for XML network routers, message queuing systems, P2P
>>>>>>>      * networks, etc. In this on-the-fly main memory indexing scenario,
>>>>>>>      * each individual string is immediately matched as soon as it
>>>>>>>      * becomes available without any persistence involved. This usage
>>>>>>>      * scenario and corresponding performance profile is quite different
>>>>>>>      * in comparison to fulltext search over persistent (read-mostly)
>>>>>>>      * indexes.
>>>>>>>      *
>>>>>>>      * Example XPath: count(/table/row[lucene:match(string(./firstname),
>>>>>>>      * "James") > 0.0])
>>>>>>>      */
>>>>>>>     public static void main(String[] args) throws Exception {
>>>>>>>         int k = -1;
>>>>>>>         int runs = 5;
>>>>>>>         if (args.length > ++k) runs = Integer.parseInt(args[k]);
>>>>>>>
>>>>>>>         int nodes = 10000;
>>>>>>>         if (args.length > ++k) nodes = Integer.parseInt(args[k]);
>>>>>>>
>>>>>>>         String content = "James is out in the woods";
>>>>>>>         if (args.length > ++k) content = args[k];
>>>>>>>
>>>>>>>         String expression = "James";
>>>>>>>         if (args.length > ++k) expression = args[k];
>>>>>>>
>>>>>>>         LuceneMatcher matcher = new LuceneMatcher();
>>>>>>>         Query query = matcher.parseQuery(expression); // to be reused N times
>>>>>>>
>>>>>>>         for (int r = 0; r < runs; r++) {
>>>>>>>             long start = System.currentTimeMillis();
>>>>>>>             int matches = 0;
>>>>>>>
>>>>>>>             for (int i = 0; i < nodes; i++) {
>>>>>>>                 // if (LuceneUtil.match(content + i, expression) > 0.0f) {
>>>>>>>                 if (matcher.match(matcher.createIndex(content + i), query) > 0.0f) {
>>>>>>>                     matches++;
>>>>>>>                 }
>>>>>>>             }
>>>>>>>
>>>>>>>             long end = System.currentTimeMillis();
>>>>>>>             System.out.println("matches=" + matches);
>>>>>>>             System.out.println("secs=" + ((end-start) / 1000.0f));
>>>>>>>             System.out.println("queries/sec=" + (nodes / ((end-start) / 1000.0f)));
>>>>>>>             System.out.println();
>>>>>>>         }
>>>>>>>     }
>>>>>>> }
>>>
>>> import java.io.IOException;
>>> import java.util.Arrays;
>>> import java.util.Collection;
>>> import java.util.Collections;
>>> import java.util.Iterator;
>>> import java.util.List;
>>>
>>> import org.apache.lucene.document.Document;
>>> import org.apache.lucene.document.Field;
>>> import org.apache.lucene.index.IndexReader;
>>> import org.apache.lucene.index.Term;
>>> import org.apache.lucene.index.TermDocs;
>>> import org.apache.lucene.index.TermEnum;
>>> import org.apache.lucene.index.TermFreqVector;
>>> import org.apache.lucene.index.TermPositions;
>>> import org.apache.lucene.search.Hits;
>>> import org.apache.lucene.search.IndexSearcher;
>>> import org.apache.lucene.search.WildcardQuery;
>>>
>>> public class StringIndexReader extends IndexReader {
>>> private List terms;
>>> public StringIndexReader(String strings[]) {
>>> super(null);
>>> terms = Arrays.asList(strings);
>>> Collections.sort(terms);
>>> }
>>>
>>> public TermFreqVector[] getTermFreqVectors(int docNumber) throws
>>> IOException {
>>> System.out.println("StringIndexReader.getTermFreqVectors");
>>> return new TermFreqVector[0];
>>> }
>>>
>>> public TermFreqVector getTermFreqVector(int docNumber, String
>>> field) throws IOException {
>>> System.out.println("StringIndexReader.getTermFreqVector");
>>> return null;
>>> }
>>>
>>> public int numDocs() {
>>> System.out.println("StringIndexReader.numDocs");
>>> return 1;
>>> }
>>>
>>> public int maxDoc() {
>>> System.out.println("StringIndexReader.maxDoc");
>>> return 1;
>>> }
>>>
>>> public Document document(int n) throws IOException {
>>> System.out.println("StringIndexReader.document");
>>> return null;
>>> }
>>>
>>> public boolean isDeleted(int n) {
>>> System.out.println("StringIndexReader.isDeleted");
>>> return false;
>>> }
>>>
>>> public boolean hasDeletions() {
>>> System.out.println("StringIndexReader.hasDeletions");
>>> return false;
>>> }
>>>
>>> public byte[] norms(String field) throws IOException {
>>> // TODO: what value to use for this?
>>> System.out.println("StringIndexReader.norms: " + field);
>>> return new byte[] { 1 };
>>> }
>>>
>>> public void norms(String field, byte[] bytes, int offset) throws
>>> IOException {
>>> System.out.println("StringIndexReader.norms: " + field + "*");
>>> }
>>>
>>> protected void doSetNorm(int doc, String field, byte value) throws
>>> IOException {
>>> System.out.println("StringIndexReader.doSetNorm");
>>>
>>> }
>>>
>>> public TermEnum terms() throws IOException {
>>> System.out.println("StringIndexReader.terms");
>>> return terms(null);
>>> }
>>>
>>> public TermEnum terms(final Term term) throws IOException {
>>> System.out.println("StringIndexReader.terms: " + term);
>>>
>>> TermEnum termEnum = new TermEnum() {
>>> private String currentTerm;
>>> private Iterator iter;
>>>
>>> public boolean next() {
>>> System.out.println("TermEnum.next");
>>> // standard iterator semantics: advance, then report success,
>>> // so the last term is not silently dropped
>>> if (!iter.hasNext())
>>> return false;
>>> currentTerm = (String) iter.next();
>>> return true;
>>> }
>>>
>>> public Term term() {
>>> if (iter == null) {
>>> iter = terms.iterator();
>>> while (next()) {
>>> if (currentTerm.startsWith(term.text()))
>>> break;
>>> }
>>> }
>>> System.out.println("TermEnum.term: " + currentTerm);
>>> return new Term(term.field(), currentTerm);
>>> }
>>>
>>> public int docFreq() {
>>> System.out.println("TermEnum.docFreq");
>>> return 1;
>>> }
>>>
>>> public void close() {
>>> System.out.println("TermEnum.close");
>>> }
>>> };
>>> return termEnum;
>>> }
>>>
>>> public int docFreq(Term term) throws IOException {
>>> System.out.println("StringIndexReader.docFreq: " + term);
>>> return terms.contains(term.text()) ? 1 : 0;
>>> }
>>>
>>> public TermDocs termDocs() throws IOException {
>>> System.out.println("StringIndexReader.termDocs");
>>>
>>> TermDocs td = new TermDocs() {
>>> private boolean done = false;
>>> String currentTerm;
>>>
>>> public void seek(Term term) {
>>> System.out.println(".seek: " + term);
>>> currentTerm = term.text();
>>> done = false;
>>> }
>>>
>>> public void seek(TermEnum termEnum) {
>>> seek(termEnum.term());
>>> }
>>>
>>> public int doc() {
>>> System.out.println(".doc");
>>> return 0;
>>> }
>>>
>>> public int freq() {
>>> System.out.println(".freq");
>>> return 1;
>>> }
>>>
>>> public boolean next() {
>>> System.out.println(".next");
>>> return false;
>>> }
>>>
>>> public int read(int[] docs, int[] freqs) {
>>> System.out.println(".read: " + docs.length);
>>>
>>> if (done) return 0;
>>>
>>> done = true;
>>> docs[0] = 0;
>>> freqs[0] = freq();
>>> return 1;
>>> }
>>>
>>> public boolean skipTo(int target) {
>>> System.out.println(".skipTo");
>>> return false;
>>> }
>>>
>>> public void close() {
>>> System.out.println(".close");
>>>
>>> }
>>> };
>>> return td;
>>> }
>>>
>>> public TermPositions termPositions() throws IOException {
>>> System.out.println("StringIndexReader.termPositions");
>>> return null;
>>> }
>>>
>>> protected void doDelete(int docNum) throws IOException {
>>> System.out.println("StringIndexReader.doDelete");
>>>
>>> }
>>>
>>> protected void doUndeleteAll() throws IOException {
>>> System.out.println("StringIndexReader.doUndeleteAll");
>>>
>>> }
>>>
>>> protected void doCommit() throws IOException {
>>> System.out.println("StringIndexReader.doCommit");
>>>
>>> }
>>>
>>> protected void doClose() throws IOException {
>>> System.out.println("StringIndexReader.doClose");
>>>
>>> }
>>>
>>> public Collection getFieldNames() throws IOException {
>>> System.out.println("StringIndexReader.getFieldNames");
>>> return null;
>>> }
>>>
>>> public Collection getFieldNames(boolean indexed) throws IOException
>>> {
>>> System.out.println("StringIndexReader.getFieldNames");
>>> return null;
>>> }
>>>
>>> public Collection getIndexedFieldNames(Field.TermVector tvSpec) {
>>> System.out.println("StringIndexReader.getIndexedFieldNames");
>>> return null;
>>> }
>>>
>>> public Collection getFieldNames(FieldOption fldOption) {
>>> System.out.println("StringIndexReader.getFieldNames");
>>> return null;
>>> }
>>>
>>> public static void main(String[] args) {
>>> IndexReader reader = new StringIndexReader(new String[] {"foo",
>>> "bar", "baz"});
>>> IndexSearcher searcher = new IndexSearcher(reader);
>>>
>>> Hits hits = null;
>>> try {
>>> hits = searcher.search(new WildcardQuery(new
>>> Term("field","ba*")));
>>> } catch (IOException e) {
>>> e.printStackTrace();
>>> }
>>> System.out.println("found " + hits.length());
>>> }
>>> }
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-dev-help@lucene.apache.org
RE: [Performance] Streaming main memory indexing of single strings
Posted by Robert Engels <re...@ix.netcom.com>.
You cannot do this as easily as it sounds. As I've pointed out on this list
before, there are a multitude of places in the Query handling that assume an
IndexReader is available. The Searcher interface doesn't really buy you much
because the only available implementations are IndexSearcher, and that
assumes an IndexReader implementation is available.
To provide a pure implementation of Searcher you would need to reimplement
all of Lucene.
Trust me, stick with IndexReader subclassing (ideally, IndexReader would be
an interface, and the core would be moved into AbstractIndexReader so
projects like this would be much easier).
Robert
-----Original Message-----
From: Wolfgang Hoschek [mailto:whoschek@lbl.gov]
Sent: Friday, April 15, 2005 5:58 PM
To: java-dev@lucene.apache.org
Subject: Re: [Performance] Streaming main memory indexing of single
strings
A primary reason for the tight API proposed below is that it allows for
maximum efficiency (which is the point of the exercise in the first
place):
- One can extend Searcher rather than IndexReader: There's no need to
subclass IndexReader and carry the baggage of the superclass doing
locking and all sort of unnecessary stuff with its internal
RAMDirectory.
- Even more extreme: Don't extend Searcher but implement the
functionality directly using low-level APIs. This avoids unnecessary
baggage for collecting hits, etc.
Wolfgang.
On Apr 15, 2005, at 3:15 PM, Wolfgang Hoschek wrote:
> Cool! For my use case it would need to be able to handle arbitrary
> queries (previously parsed from a general lucene query string).
> Something like:
>
> float match(String Text, Query query)
>
> it's fine with me if it also works for
>
> float[] match(String[] texts, Query query) or
> float(Document doc, Query query)
>
> but that isn't required by the use case.
>
> Wolfgang.
>
>> I am intrigued by this and decided to mock a quick and dirty example
>> of such an IndexReader. After a little trial-and-error I got it
>> working at least for TermQuery and WildcardQuery. I've pasted my
>> code below as an example, but there is much room for improvement,
>> especially in terms of performance and also in keeping track of term
>> frequency, and also it would be nicer if it handled the analysis
>> internally.
>>
>> I think something like this would make a handy addition to our
>> contrib area at least. I'd be happy to receive improvements to this
>> and then add it to a contrib subproject.
>>
>> Perhaps this would be a handy way to handle situations where users
>> have queries saved in a system and need to be alerted whenever a new
>> document arrives matching the saved queries?
>>
>> Erik
>>
>>
>>
>>>
>>>
>>> -----Original Message-----
>>> From: Wolfgang Hoschek [mailto:whoschek@lbl.gov]
>>> Sent: Thursday, April 14, 2005 4:04 PM
>>> To: java-dev@lucene.apache.org
>>> Subject: Re: [Performance] Streaming main memory indexing of single
>>> strings
>>>
>>>
>>> This seems to be a promising avenue worth exploring. My gutfeeling is
>>> that this could easily be 10-100 times faster.
>>>
>>> The drawback is that it requires a fair amount of understanding of
>>> intricate Lucene internals, pulling those pieces together and
>>> adapting
>>> them as required for the seemingly simple "float match(String text,
>>> Query query)".
>>>
>>> I might give it a shot but I'm not sure I'll be able to pull this
>>> off!
>>> Is there any similar code I could look at as a starting point?
>>>
>>> Wolfgang.
>>>
>>> On Apr 14, 2005, at 1:13 PM, Robert Engels wrote:
>>>
>>>> I think you are not approaching this the correct way.
>>>>
>>>> Pseudo code:
>>>>
>>>> Subclass IndexReader.
>>>>
>>>> Get tokens from String 'document' using Lucene analyzers.
>>>>
>>>> Build simple hash-map based data structures using tokens for terms,
>>>> and term
>>>> positions.
>>>>
>>>> reimplement termDocs() and termPositions() to use the structures
>>>> from
>>>> above.
>>>>
>>>> run searches.
>>>>
>>>> start again with next document.
>>>>
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Wolfgang Hoschek [mailto:whoschek@lbl.gov]
>>>> Sent: Thursday, April 14, 2005 2:56 PM
>>>> To: java-dev@lucene.apache.org
>>>> Subject: Re: [Performance] Streaming main memory indexing of single
>>>> strings
>>>>
>>>>
>>>> Otis, this might be a misunderstanding.
>>>>
>>>> - I'm not calling optimize(). That piece is commented out you if
>>>> look
>>>> again at the code.
>>>> - The *streaming* use case requires that for each query I add one
>>>> (and
>>>> only one) document (aka string) to an empty index:
>>>>
>>>> repeat N times (where N is millions or billions):
>>>> add a single string (aka document) to an empty index
>>>> query the index
>>>> drop index (or delete it's document)
>>>>
>>>> with the following API being called N times: float match(String
>>>> text,
>>>> Query query)
>>>>
>>>> So there's no possibility of adding many documents and thereafter
>>>> running the query. This in turn seems to mean that the IndexWriter
>>>> can't be kept open - unless I manually delete each document after
>>>> each
>>>> query to repeatedly reuse the RAMDirectory, which I've also tried
>>>> before without any significant performance gain - deletion seems to
>>>> have substantial overhead in itself. Perhaps it would be better if
>>>> there were a Directory.deleteAllDocuments() or similar. Did you have
>>>> some other approach in mind?
>>>>
>>>> As I said, Lucene's design doesn't seem to fit this streaming use
>>>> case
>>>> pattern well. In *this* scenario one could easily do without any
>>>> locking, and without byte level organization in RAMDirectory and
>>>> RAMFile, etc because a single small string isn't a large persistent
>>>> multi-document index.
>>>>
>>>> For some background, here's a small example for the kind of XQuery
>>>> functionality Nux/Lucene integration enables:
>>>>
>>>> (: An XQuery that finds all books authored by James that have
>>>> something
>>>> to do with "fish", sorted by relevance :)
>>>> declare namespace lucene = "java:nux.xom.xquery.XQueryUtil";
>>>> declare variable $query := "fish*~";
>>>>
>>>> for $book in /books/book[author="James" and lucene:match(string(.),
>>>> $query) > 0.0]
>>>> let $score := lucene:match(string($book), $query)
>>>> order by $score descending
>>>> return (<score>{$score}</score>, $book)
>>>>
>>>> More interestingly one can use this for classifying and routing XML
>>>> messages based on rules (i.e. queries) inspecting their content...
>>>>
>>>> Any other clues about potential improvements would be greatly
>>>> appreciated.
>>>>
>>>> Wolfgang.
>>>>
>>>> On Apr 13, 2005, at 10:09 PM, Otis Gospodnetic wrote:
>>>>
>>>>> It looks like you are calling that IndexWriter code in some loops,
>>>>> opening it and closing it in every iteration of the loop and also
>>>>> calling optimize. All of those things could be improved.
>>>>> Keep your IndexWriter open, don't close it, and optimize the index
>>>>> only
>>>>> once you are done adding documents to it.
>>>>>
>>>>> See the highlights and the snipets in the first hit:
>>>>> http://www.lucenebook.com/search?query=when+to+optimize
>>>>>
>>>>> Otis
>>>>>
>>>>>
>>>>> --- Wolfgang Hoschek <wh...@lbl.gov> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I'm wondering if anyone could let me know how to improve Lucene
>>>>>> performance for "streaming main memory indexing of single
>>>>>> strings".
>>>>>> This would help to effectively integrate Lucene with the Nux
>>>>>> XQuery
>>>>>> engine.
>>>>>>
>>>>>> Below is a small microbenchmark simulating STREAMING XQuery
>>>>>> fulltext
>>>>>> search as typical for XML network routers, message queuing systems,
>>>>>> P2P
>>>>>> networks, etc. In this on-the-fly main memory indexing scenario,
>>>>>> each
>>>>>>
>>>>>> individual string is immediately matched as soon as it becomes
>>>>>> available without any persistence involved. This usage scenario
>>>>>> and
>>>>>> corresponding performance profile is quite different in
>>>>>> comparison to
>>>>>>
>>>>>> fulltext search over persistent (read-mostly) indexes.
>>>>>>
>>>>>> The benchmark runs at some 3000 lucene queries/sec (lucene-1.4.3)
>>>>>> which
>>>>>> is unfortunate news considering the XQuery engine can easily walk
>>>>>> hundreds of thousands of XML nodes per second. Ideally I'd like to
>>>>>> run
>>>>>> at some 100000 queries/sec. Running this through the JDK 1.5
>>>>>> profiler
>>>>>> it seems that most time is spent in and below the following calls:
>>>>>>
>>>>>> writer = new IndexWriter(dir, analyzer, true);
>>>>>> writer.addDocument(...);
>>>>>> writer.close();
>>>>>>
>>>>>> I tried quite a few variants of the benchmark with various
>>>>>> options,
>>>>>> unfortunately with little or no effect.
>>>>>> Lucene just does not seem to be designed to do this sort of
>>>>>> "transient
>>>>>> single string index" thing. All code paths related to opening,
>>>>>> closing,
>>>>>> reading, writing, querying and object creation seem to be designed
>>>>>> for
>>>>>> large persistent indexes.
>>>>>>
>>>>>> Any advice on what I'm missing or what could be done about it
>>>>>> would
>>>>>> be
>>>>>> greatly appreciated.
>>>>>>
>>>>>> Wolfgang.
>>>>>>
>>>>>> P.S. the benchmark code is attached as a file below:
>>>>>>
>>>>>>> package nux.xom.pool;
>>>>>>
>>>>>> import java.io.IOException;
>>>>>> //import java.io.Reader;
>>>>>>
>>>>>> import org.apache.lucene.analysis.Analyzer;
>>>>>> //import org.apache.lucene.analysis.LowerCaseTokenizer;
>>>>>> //import org.apache.lucene.analysis.PorterStemFilter;
>>>>>> //import org.apache.lucene.analysis.SimpleAnalyzer;
>>>>>> //import org.apache.lucene.analysis.TokenStream;
>>>>>> import org.apache.lucene.analysis.standard.StandardAnalyzer;
>>>>>> import org.apache.lucene.document.Document;
>>>>>> import org.apache.lucene.document.Field;
>>>>>> //import org.apache.lucene.index.IndexReader;
>>>>>> import org.apache.lucene.index.IndexWriter;
>>>>>> import org.apache.lucene.queryParser.ParseException;
>>>>>> import org.apache.lucene.queryParser.QueryParser;
>>>>>> import org.apache.lucene.search.Hits;
>>>>>> import org.apache.lucene.search.IndexSearcher;
>>>>>> import org.apache.lucene.search.Query;
>>>>>> import org.apache.lucene.search.Searcher;
>>>>>> import org.apache.lucene.store.Directory;
>>>>>> import org.apache.lucene.store.RAMDirectory;
>>>>>>
>>>>>> public final class LuceneMatcher { // TODO: make non-public
>>>>>>
>>>>>> private final Analyzer analyzer;
>>>>>> // private final Directory dir = new RAMDirectory();
>>>>>>
>>>>>> public LuceneMatcher() {
>>>>>> this(new StandardAnalyzer());
>>>>>> // this(new SimpleAnalyzer());
>>>>>> // this(new StopAnalyzer());
>>>>>> // this(new Analyzer() {
>>>>>> // public final TokenStream tokenStream(String fieldName, Reader
>>>>>> reader) {
>>>>>> // return new PorterStemFilter(new LowerCaseTokenizer(reader));
>>>>>> // }
>>>>>> // });
>>>>>> }
>>>>>>
>>>>>> public LuceneMatcher(Analyzer analyzer) {
>>>>>> if (analyzer == null)
>>>>>> throw new IllegalArgumentException("analyzer must not be
>>>>>> null");
>>>>>> this.analyzer = analyzer;
>>>>>> }
>>>>>>
>>>>>> public Query parseQuery(String expression) throws ParseException
>>>>>> {
>>>>>> QueryParser parser = new QueryParser("content", analyzer);
>>>>>> // parser.setPhraseSlop(0);
>>>>>> return parser.parse(expression);
>>>>>> }
>>>>>>
>>>>>> /**
>>>>>> * Returns the relevance score by matching the given index
>>>>>> against
>>>>>> the given
>>>>>> * Lucene query expression. The index must not contain more than
>>>>>> one
>>>>>> Lucene
>>>>>> * "document" (aka string to be searched).
>>>>>> */
>>>>>> public float match(Directory index, Query query) {
>>>>>> Searcher searcher = null;
>>>>>> try {
>>>>>> searcher = new IndexSearcher(index);
>>>>>> Hits hits = searcher.search(query);
>>>>>> float score = hits.length() > 0 ? hits.score(0) : 0.0f;
>>>>>> return score;
>>>>>> } catch (IOException e) { // should never happen (RAMDirectory)
>>>>>> throw new RuntimeException(e);
>>>>>> } finally {
>>>>>> try {
>>>>>> if (searcher != null) searcher.close();
>>>>>> } catch (IOException e) { // should never happen (RAMDirectory)
>>>>>> throw new RuntimeException(e);
>>>>>> }
>>>>>> }
>>>>>> }
>>>>>>
>>>>>> // public float match(String text, Query query) {
>>>>>> // return match(createIndex(text), query);
>>>>>> // }
>>>>>>
>>>>>> public Directory createIndex(String text) {
>>>>>> Directory dir = new RAMDirectory();
>>>>>> IndexWriter writer = null;
>>>>>> try {
>>>>>> writer = new IndexWriter(dir, analyzer, true);
>>>>>> // writer.setUseCompoundFile(false);
>>>>>> // writer.mergeFactor = 2;
>>>>>> // writer.minMergeDocs = 1;
>>>>>> // writer.maxMergeDocs = 1;
>>>>>>
>>>>>> writer.addDocument(createDocument(text));
>>>>>> // writer.optimize();
>>>>>> return dir;
>>>>>> } catch (IOException e) { // should never happen (RAMDirectory)
>>>>>> throw new RuntimeException(e);
>>>>>> } finally {
>>>>>> try {
>>>>>> if (writer != null) writer.close();
>>>>>> } catch (IOException e) { // should never happen (RAMDirectory)
>>>>>> throw new RuntimeException(e);
>>>>>> }
>>>>>> }
>>>>>> }
>>>>>>
>>>>>> private Document createDocument(String content) {
>>>>>> Document doc = new Document();
>>>>>> doc.add(Field.UnStored("content", content));
>>>>>> // doc.add(Field.Text("x", content));
>>>>>> return doc;
>>>>>> }
>>>>>>
>>>>>> /**
>>>>>> * Lucene microbenchmark simulating STREAMING XQuery fulltext
>>>>>> search
>>>>>> as
>>>>>> * typical for XML network routers, message queuing systems, P2P
>>>>>> networks,
>>>>>> * etc. In this on-the-fly main memory indexing scenario, each
>>>>>> individual
>>>>>> * string is immediately matched as soon as it becomes available
>>>>>> without any
>>>>>> * persistence involved. This usage scenario and corresponding
>>>>>> performance
>>>>>> * profile is quite different in comparison to fulltext search
>>>>>> over
>>>>>> * persistent (read-mostly) indexes.
>>>>>> *
>>>>>> * Example XPath:
>>>>>> count(/table/row[lucene:match(string(./firstname),
>>>>>> * "James") > 0.0])
>>>>>> */
>>>>>> public static void main(String[] args) throws Exception {
>>>>>> int k = -1;
>>>>>> int runs = 5;
>>>>>> if (args.length > ++k) runs = Integer.parseInt(args[k]);
>>>>>>
>>>>>> int nodes = 10000;
>>>>>> if (args.length > ++k) nodes = Integer.parseInt(args[k]);
>>>>>>
>>>>>> String content = "James is out in the woods";
>>>>>> if (args.length > ++k) content = args[k];
>>>>>>
>>>>>> String expression = "James";
>>>>>> if (args.length > ++k) expression = args[k];
>>>>>>
>>>>>> LuceneMatcher matcher = new LuceneMatcher();
>>>>>> Query query = matcher.parseQuery(expression); // to be reused N
>>>>>> times
>>>>>>
>>>>>> for (int r = 0; r < runs; r++) {
>>>>>> long start = System.currentTimeMillis();
>>>>>> int matches = 0;
>>>>>>
>>>>>> for (int i = 0; i < nodes; i++) {
>>>>>> // if (LuceneUtil.match(content + i, expression) > 0.0f) {
>>>>>> if (matcher.match(matcher.createIndex(content + i), query) >
>>>>>> 0.0f) {
>>>>>> matches++;
>>>>>> }
>>>>>> }
>>>>>>
>>>>>> long end = System.currentTimeMillis();
>>>>>> System.out.println("matches=" + matches);
>>>>>> System.out.println("secs=" + ((end-start) / 1000.0f));
>>>>>> System.out.println("queries/sec=" + (nodes / ((end-start) /
>>>>>> 1000.0f)));
>>>>>> System.out.println();
>>>>>> }
>>>>>> }
>>>>>> }
>>
>> public class StringIndexReader extends IndexReader {
>> private List terms;
>> public StringIndexReader(String strings[]) {
>> super(null);
>> terms = Arrays.asList(strings);
>> Collections.sort(terms);
>> }
>>
>> public TermFreqVector[] getTermFreqVectors(int docNumber) throws
>> IOException {
>> System.out.println("StringIndexReader.getTermFreqVectors");
>> return new TermFreqVector[0];
>> }
>>
>> public TermFreqVector getTermFreqVector(int docNumber, String
>> field) throws IOException {
>> System.out.println("StringIndexReader.getTermFreqVector");
>> return null;
>> }
>>
>> public int numDocs() {
>> System.out.println("StringIndexReader.numDocs");
>> return 1;
>> }
>>
>> public int maxDoc() {
>> System.out.println("StringIndexReader.maxDoc");
>> return 1;
>> }
>>
>> public Document document(int n) throws IOException {
>> System.out.println("StringIndexReader.document");
>> return null;
>> }
>>
>> public boolean isDeleted(int n) {
>> System.out.println("StringIndexReader.isDeleted");
>> return false;
>> }
>>
>> public boolean hasDeletions() {
>> System.out.println("StringIndexReader.hasDeletions");
>> return false;
>> }
>>
>> public byte[] norms(String field) throws IOException {
>> // TODO: what value to use for this?
>> System.out.println("StringIndexReader.norms: " + field);
>> return new byte[] { 1 };
>> }
>>
>> public void norms(String field, byte[] bytes, int offset) throws
>> IOException {
>> System.out.println("StringIndexReader.norms: " + field + "*");
>> }
>>
>> protected void doSetNorm(int doc, String field, byte value) throws
>> IOException {
>> System.out.println("StringIndexReader.doSetNorm");
>>
>> }
>>
>> public TermEnum terms() throws IOException {
>> System.out.println("StringIndexReader.terms");
>> return terms(null);
>> }
>>
>> public TermEnum terms(final Term term) throws IOException {
>> System.out.println("StringIndexReader.terms: " + term);
>>
>> TermEnum termEnum = new TermEnum() {
>> private String currentTerm;
>> private Iterator iter;
>>
>> public boolean next() {
>> System.out.println("TermEnum.next");
>> // advance first, then report; returning iter.hasNext() after
>> // advancing would silently drop the last term
>> if (!iter.hasNext())
>> return false;
>> currentTerm = (String) iter.next();
>> return true;
>> }
>>
>> public Term term() {
>> if (iter == null) {
>> iter = terms.iterator();
>> while (next()) {
>> if (currentTerm.startsWith(term.text()))
>> break;
>> }
>> }
>> System.out.println("TermEnum.term: " + currentTerm);
>> return new Term(term.field(), currentTerm);
>> }
>>
>> public int docFreq() {
>> System.out.println("TermEnum.docFreq");
>> return 1;
>> }
>>
>> public void close() {
>> System.out.println("TermEnum.close");
>> }
>> };
>> return termEnum;
>> }
>>
>> public int docFreq(Term term) throws IOException {
>> System.out.println("StringIndexReader.docFreq: " + term);
>> return terms.contains(term.text()) ? 1 : 0;
>> }
>>
>> public TermDocs termDocs() throws IOException {
>> System.out.println("StringIndexReader.termDocs");
>>
>> TermDocs td = new TermDocs() {
>> private boolean done = false;
>> String currentTerm;
>>
>> public void seek(Term term) {
>> System.out.println(".seek: " + term);
>> currentTerm = term.text();
>> done = false;
>> }
>>
>> public void seek(TermEnum termEnum) {
>> seek(termEnum.term());
>> }
>>
>> public int doc() {
>> System.out.println(".doc");
>> return 0;
>> }
>>
>> public int freq() {
>> System.out.println(".freq");
>> return 1;
>> }
>>
>> public boolean next() {
>> System.out.println(".next");
>> return false;
>> }
>>
>> public int read(int[] docs, int[] freqs) {
>> System.out.println(".read: " + docs.length);
>>
>> if (done) return 0;
>>
>> done = true;
>> docs[0] = 0;
>> freqs[0] = freq();
>> return 1;
>> }
>>
>> public boolean skipTo(int target) {
>> System.out.println(".skipTo");
>> return false;
>> }
>>
>> public void close() {
>> System.out.println(".close");
>>
>> }
>> };
>> return td;
>> }
>>
>> public TermPositions termPositions() throws IOException {
>> System.out.println("StringIndexReader.termPositions");
>> return null;
>> }
>>
>> protected void doDelete(int docNum) throws IOException {
>> System.out.println("StringIndexReader.doDelete");
>>
>> }
>>
>> protected void doUndeleteAll() throws IOException {
>> System.out.println("StringIndexReader.doUndeleteAll");
>>
>> }
>>
>> protected void doCommit() throws IOException {
>> System.out.println("StringIndexReader.doCommit");
>>
>> }
>>
>> protected void doClose() throws IOException {
>> System.out.println("StringIndexReader.doClose");
>>
>> }
>>
>> public Collection getFieldNames() throws IOException {
>> System.out.println("StringIndexReader.getFieldNames");
>> return null;
>> }
>>
>> public Collection getFieldNames(boolean indexed) throws IOException
>> {
>> System.out.println("StringIndexReader.getFieldNames");
>> return null;
>> }
>>
>> public Collection getIndexedFieldNames(Field.TermVector tvSpec) {
>> System.out.println("StringIndexReader.getIndexedFieldNames");
>> return null;
>> }
>>
>> public Collection getFieldNames(FieldOption fldOption) {
>> System.out.println("StringIndexReader.getFieldNames");
>> return null;
>> }
>>
>> public static void main(String[] args) {
>> IndexReader reader = new StringIndexReader(new String[] {"foo",
>> "bar", "baz"});
>> IndexSearcher searcher = new IndexSearcher(reader);
>>
>> Hits hits = null;
>> try {
>> hits = searcher.search(new WildcardQuery(new
>> Term("field","ba*")));
>> } catch (IOException e) {
>> e.printStackTrace();
>> }
>> System.out.println("found " + hits.length());
>> }
>> }
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-dev-help@lucene.apache.org
Re: [Performance] Streaming main memory indexing of single strings
Posted by Wolfgang Hoschek <wh...@lbl.gov>.
A primary reason for the tight API proposed below is that it allows for
maximum efficiency (which is the point of the exercise in the first
place):
- One can extend Searcher rather than IndexReader: There's no need to
subclass IndexReader and carry the baggage of the superclass doing
locking and all sort of unnecessary stuff with its internal
RAMDirectory.
- Even more extreme: Don't extend Searcher but implement the
functionality directly using low-level APIs. This avoids unnecessary
baggage for collecting hits, etc.
Wolfgang.
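[Editorial note: purely for illustration, here is a minimal, self-contained sketch of the "float match(String text, Query query)" API shape discussed above. Everything in it is hypothetical: a trivial term-overlap scorer stands in for Lucene's analyzers and Similarity, and a plain String query stands in for a parsed Query, so that the sketch compiles without any Lucene jars.]

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: shows the proposed single-string match API shape,
// not Lucene's actual scoring. All names here are made up.
public final class SingleStringMatcher {

    // Lowercase and split on non-letters (a stand-in for a real Analyzer).
    private static Map<String, Integer> termFreqs(String text) {
        Map<String, Integer> freqs = new HashMap<String, Integer>();
        for (String token : text.toLowerCase().split("[^a-z]+")) {
            if (token.length() == 0) continue;
            Integer old = freqs.get(token);
            freqs.put(token, old == null ? 1 : old + 1);
        }
        return freqs;
    }

    // Score = fraction of query terms present in the text (0.0 .. 1.0).
    public static float match(String text, String query) {
        Map<String, Integer> textTerms = termFreqs(text);
        Map<String, Integer> queryTerms = termFreqs(query);
        if (queryTerms.isEmpty()) return 0.0f;
        int hits = 0;
        for (String term : queryTerms.keySet()) {
            if (textTerms.containsKey(term)) hits++;
        }
        return hits / (float) queryTerms.size();
    }

    public static void main(String[] args) {
        System.out.println(match("James is out in the woods", "James"));
    }
}
```

The point of the sketch is the call pattern: a single stateless call per string, with no Directory, IndexWriter, or locking in the loop - which is exactly the overhead the streaming use case needs to avoid.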
On Apr 15, 2005, at 3:15 PM, Wolfgang Hoschek wrote:
> Cool! For my use case it would need to be able to handle arbitrary
> queries (previously parsed from a general lucene query string).
> Something like:
>
> float match(String text, Query query)
>
> it's fine with me if it also works for
>
> float[] match(String[] texts, Query query) or
> float match(Document doc, Query query)
>
> but that isn't required by the use case.
>
> Wolfgang.
>
>> I am intrigued by this and decided to mock a quick and dirty example
>> of such an IndexReader. After a little trial-and-error I got it
>> working at least for TermQuery and WildcardQuery. I've pasted my
>> code below as an example, but there is much room for improvement,
>> especially in terms of performance and in keeping track of term
>> frequency; it would also be nicer if it handled the analysis
>> internally.
>>
>> I think something like this would make a handy addition to our
>> contrib area at least. I'd be happy to receive improvements to this
>> and then add it to a contrib subproject.
>>
>> Perhaps this would be a handy way to handle situations where users
>> have queries saved in a system and need to be alerted whenever a new
>> document arrives matching the saved queries?
>>
>> Erik
>>
>>
>>
>>>
>>>
>>> -----Original Message-----
>>> From: Wolfgang Hoschek [mailto:whoschek@lbl.gov]
>>> Sent: Thursday, April 14, 2005 4:04 PM
>>> To: java-dev@lucene.apache.org
>>> Subject: Re: [Performance] Streaming main memory indexing of single
>>> strings
>>>
>>>
>>> This seems to be a promising avenue worth exploring. My gut feeling is
>>> that this could easily be 10-100 times faster.
>>>
>>> The drawback is that it requires a fair amount of understanding of
>>> intricate Lucene internals, pulling those pieces together and
>>> adapting
>>> them as required for the seemingly simple "float match(String text,
>>> Query query)".
>>>
>>> I might give it a shot but I'm not sure I'll be able to pull this
>>> off!
>>> Is there any similar code I could look at as a starting point?
>>>
>>> Wolfgang.
>>>
>>> On Apr 14, 2005, at 1:13 PM, Robert Engels wrote:
>>>
>>>> I think you are not approaching this the correct way.
>>>>
>>>> Pseudo code:
>>>>
>>>> Subclass IndexReader.
>>>>
>>>> Get tokens from String 'document' using Lucene analyzers.
>>>>
>>>> Build simple hash-map based data structures using tokens for terms,
>>>> and term
>>>> positions.
>>>>
>>>> reimplement termDocs() and termPositions() to use the structures
>>>> from
>>>> above.
>>>>
>>>> run searches.
>>>>
>>>> start again with next document.
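[Editorial note: the hash-map term/position structures in the pseudo code above might look roughly like the following sketch. It is illustrative only: a plain whitespace tokenizer stands in for a real Lucene Analyzer, and the class name is made up.]

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the per-document structures that termDocs()/termPositions()
// could be reimplemented against: term -> token positions in the string.
public final class TermPositionIndex {

    public static Map<String, List<Integer>> index(String document) {
        Map<String, List<Integer>> positions =
                new HashMap<String, List<Integer>>();
        String[] tokens = document.toLowerCase().split("\\s+");
        for (int pos = 0; pos < tokens.length; pos++) {
            List<Integer> list = positions.get(tokens[pos]);
            if (list == null) {
                list = new ArrayList<Integer>();
                positions.put(tokens[pos], list);
            }
            list.add(pos);
        }
        return positions;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> idx = index("to be or not to be");
        System.out.println(idx.get("be")); // positions of "be" in the string
    }
}
```

Building such a map is O(number of tokens) per string, with no file or byte-level index format involved, which is why this approach should be far cheaper than a fresh RAMDirectory per document.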
>>>>
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Wolfgang Hoschek [mailto:whoschek@lbl.gov]
>>>> Sent: Thursday, April 14, 2005 2:56 PM
>>>> To: java-dev@lucene.apache.org
>>>> Subject: Re: [Performance] Streaming main memory indexing of single
>>>> strings
>>>>
>>>>
>>>> Otis, this might be a misunderstanding.
>>>>
>>>> - I'm not calling optimize(). That piece is commented out if you look
>>>> again at the code.
>>>> - The *streaming* use case requires that for each query I add one
>>>> (and
>>>> only one) document (aka string) to an empty index:
>>>>
>>>> repeat N times (where N is millions or billions):
>>>> add a single string (aka document) to an empty index
>>>> query the index
>>>> drop index (or delete its document)
>>>>
>>>> with the following API being called N times: float match(String
>>>> text,
>>>> Query query)
>>>>
>>>> So there's no possibility of adding many documents and thereafter
>>>> running the query. This in turn seems to mean that the IndexWriter
>>>> can't be kept open - unless I manually delete each document after
>>>> each
>>>> query to repeatedly reuse the RAMDirectory, which I've also tried
>>>> before without any significant performance gain - deletion seems to
>>>> have substantial overhead in itself. Perhaps it would be better if
>>>> there were a Directory.deleteAllDocuments() or similar. Did you have
>>>> some other approach in mind?
>>>>
>>>> As I said, Lucene's design doesn't seem to fit this streaming use
>>>> case
>>>> pattern well. In *this* scenario one could easily do without any
>>>> locking, and without byte level organization in RAMDirectory and
>>>> RAMFile, etc because a single small string isn't a large persistent
>>>> multi-document index.
>>>>
>>>> For some background, here's a small example for the kind of XQuery
>>>> functionality Nux/Lucene integration enables:
>>>>
>>>> (: An XQuery that finds all books authored by James that have
>>>> something
>>>> to do with "fish", sorted by relevance :)
>>>> declare namespace lucene = "java:nux.xom.xquery.XQueryUtil";
>>>> declare variable $query := "fish*~";
>>>>
>>>> for $book in /books/book[author="James" and lucene:match(string(.),
>>>> $query) > 0.0]
>>>> let $score := lucene:match(string($book), $query)
>>>> order by $score descending
>>>> return (<score>{$score}</score>, $book)
>>>>
>>>> More interestingly one can use this for classifying and routing XML
>>>> messages based on rules (i.e. queries) inspecting their content...
>>>>
>>>> Any other clues about potential improvements would be greatly
>>>> appreciated.
>>>>
>>>> Wolfgang.
>>>>
>>>> On Apr 13, 2005, at 10:09 PM, Otis Gospodnetic wrote:
>>>>
>>>>> It looks like you are calling that IndexWriter code in some loops,
>>>>> opening it and closing it in every iteration of the loop and also
>>>>> calling optimize. All of those things could be improved.
>>>>> Keep your IndexWriter open, don't close it, and optimize the index
>>>>> only
>>>>> once you are done adding documents to it.
>>>>>
>>>>> See the highlights and the snippets in the first hit:
>>>>> http://www.lucenebook.com/search?query=when+to+optimize
>>>>>
>>>>> Otis
>>>>>
>>>>>
>>>>> --- Wolfgang Hoschek <wh...@lbl.gov> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I'm wondering if anyone could let me know how to improve Lucene
>>>>>> performance for "streaming main memory indexing of single
>>>>>> strings".
>>>>>> This would help to effectively integrate Lucene with the Nux
>>>>>> XQuery
>>>>>> engine.
>>>>>>
>>>>>> Below is a small microbenchmark simulating STREAMING XQuery
>>>>>> fulltext
>>>>>> search as typical for XML network routers, message queuing systems,
>>>>>> P2P
>>>>>> networks, etc. In this on-the-fly main memory indexing scenario,
>>>>>> each
>>>>>>
>>>>>> individual string is immediately matched as soon as it becomes
>>>>>> available without any persistence involved. This usage scenario
>>>>>> and
>>>>>> corresponding performance profile is quite different in
>>>>>> comparison to
>>>>>>
>>>>>> fulltext search over persistent (read-mostly) indexes.
>>>>>>
>>>>>> The benchmark runs at some 3000 lucene queries/sec (lucene-1.4.3)
>>>>>> which
>>>>>> is unfortunate news considering the XQuery engine can easily walk
>>>>>> hundreds of thousands of XML nodes per second. Ideally I'd like to
>>>>>> run
>>>>>> at some 100000 queries/sec. Running this through the JDK 1.5
>>>>>> profiler
>>>>>> it seems that most time is spent in and below the following calls:
>>>>>>
>>>>>> writer = new IndexWriter(dir, analyzer, true);
>>>>>> writer.addDocument(...);
>>>>>> writer.close();
>>>>>>
>>>>>> I tried quite a few variants of the benchmark with various
>>>>>> options,
>>>>>> unfortunately with little or no effect.
>>>>>> Lucene just does not seem to be designed to do this sort of
>>>>>> "transient
>>>>>> single string index" thing. All code paths related to opening,
>>>>>> closing,
>>>>>> reading, writing, querying and object creation seem to be designed
>>>>>> for
>>>>>> large persistent indexes.
>>>>>>
>>>>>> Any advice on what I'm missing or what could be done about it
>>>>>> would
>>>>>> be
>>>>>> greatly appreciated.
>>>>>>
>>>>>> Wolfgang.
>>>>>>
>>>>>> P.S. the benchmark code is attached as a file below:
>>>>>>
>>>>>>> package nux.xom.pool;
>>>>>>
>>>>>> import java.io.IOException;
>>>>>> //import java.io.Reader;
>>>>>>
>>>>>> import org.apache.lucene.analysis.Analyzer;
>>>>>> //import org.apache.lucene.analysis.LowerCaseTokenizer;
>>>>>> //import org.apache.lucene.analysis.PorterStemFilter;
>>>>>> //import org.apache.lucene.analysis.SimpleAnalyzer;
>>>>>> //import org.apache.lucene.analysis.TokenStream;
>>>>>> import org.apache.lucene.analysis.standard.StandardAnalyzer;
>>>>>> import org.apache.lucene.document.Document;
>>>>>> import org.apache.lucene.document.Field;
>>>>>> //import org.apache.lucene.index.IndexReader;
>>>>>> import org.apache.lucene.index.IndexWriter;
>>>>>> import org.apache.lucene.queryParser.ParseException;
>>>>>> import org.apache.lucene.queryParser.QueryParser;
>>>>>> import org.apache.lucene.search.Hits;
>>>>>> import org.apache.lucene.search.IndexSearcher;
>>>>>> import org.apache.lucene.search.Query;
>>>>>> import org.apache.lucene.search.Searcher;
>>>>>> import org.apache.lucene.store.Directory;
>>>>>> import org.apache.lucene.store.RAMDirectory;
>>>>>>
>>>>>> public final class LuceneMatcher { // TODO: make non-public
>>>>>>
>>>>>> private final Analyzer analyzer;
>>>>>> // private final Directory dir = new RAMDirectory();
>>>>>>
>>>>>> public LuceneMatcher() {
>>>>>> this(new StandardAnalyzer());
>>>>>> // this(new SimpleAnalyzer());
>>>>>> // this(new StopAnalyzer());
>>>>>> // this(new Analyzer() {
>>>>>> // public final TokenStream tokenStream(String fieldName, Reader
>>>>>> reader) {
>>>>>> // return new PorterStemFilter(new LowerCaseTokenizer(reader));
>>>>>> // }
>>>>>> // });
>>>>>> }
>>>>>>
>>>>>> public LuceneMatcher(Analyzer analyzer) {
>>>>>> if (analyzer == null)
>>>>>> throw new IllegalArgumentException("analyzer must not be
>>>>>> null");
>>>>>> this.analyzer = analyzer;
>>>>>> }
>>>>>>
>>>>>> public Query parseQuery(String expression) throws ParseException
>>>>>> {
>>>>>> QueryParser parser = new QueryParser("content", analyzer);
>>>>>> // parser.setPhraseSlop(0);
>>>>>> return parser.parse(expression);
>>>>>> }
>>>>>>
>>>>>> /**
>>>>>> * Returns the relevance score by matching the given index
>>>>>> against
>>>>>> the given
>>>>>> * Lucene query expression. The index must not contain more than
>>>>>> one
>>>>>> Lucene
>>>>>> * "document" (aka string to be searched).
>>>>>> */
>>>>>> public float match(Directory index, Query query) {
>>>>>> Searcher searcher = null;
>>>>>> try {
>>>>>> searcher = new IndexSearcher(index);
>>>>>> Hits hits = searcher.search(query);
>>>>>> float score = hits.length() > 0 ? hits.score(0) : 0.0f;
>>>>>> return score;
>>>>>> } catch (IOException e) { // should never happen (RAMDirectory)
>>>>>> throw new RuntimeException(e);
>>>>>> } finally {
>>>>>> try {
>>>>>> if (searcher != null) searcher.close();
>>>>>> } catch (IOException e) { // should never happen (RAMDirectory)
>>>>>> throw new RuntimeException(e);
>>>>>> }
>>>>>> }
>>>>>> }
>>>>>>
>>>>>> // public float match(String text, Query query) {
>>>>>> // return match(createIndex(text), query);
>>>>>> // }
>>>>>>
>>>>>> public Directory createIndex(String text) {
>>>>>> Directory dir = new RAMDirectory();
>>>>>> IndexWriter writer = null;
>>>>>> try {
>>>>>> writer = new IndexWriter(dir, analyzer, true);
>>>>>> // writer.setUseCompoundFile(false);
>>>>>> // writer.mergeFactor = 2;
>>>>>> // writer.minMergeDocs = 1;
>>>>>> // writer.maxMergeDocs = 1;
>>>>>>
>>>>>> writer.addDocument(createDocument(text));
>>>>>> // writer.optimize();
>>>>>> return dir;
>>>>>> } catch (IOException e) { // should never happen (RAMDirectory)
>>>>>> throw new RuntimeException(e);
>>>>>> } finally {
>>>>>> try {
>>>>>> if (writer != null) writer.close();
>>>>>> } catch (IOException e) { // should never happen (RAMDirectory)
>>>>>> throw new RuntimeException(e);
>>>>>> }
>>>>>> }
>>>>>> }
>>>>>>
>>>>>> private Document createDocument(String content) {
>>>>>> Document doc = new Document();
>>>>>> doc.add(Field.UnStored("content", content));
>>>>>> // doc.add(Field.Text("x", content));
>>>>>> return doc;
>>>>>> }
>>>>>>
>>>>>> /**
>>>>>> * Lucene microbenchmark simulating STREAMING XQuery fulltext
>>>>>> search
>>>>>> as
>>>>>> * typical for XML network routers, message queuing systems, P2P
>>>>>> networks,
>>>>>> * etc. In this on-the-fly main memory indexing scenario, each
>>>>>> individual
>>>>>> * string is immediately matched as soon as it becomes available
>>>>>> without any
>>>>>> * persistence involved. This usage scenario and corresponding
>>>>>> performance
>>>>>> * profile is quite different in comparison to fulltext search
>>>>>> over
>>>>>> * persistent (read-mostly) indexes.
>>>>>> *
>>>>>> * Example XPath:
>>>>>> count(/table/row[lucene:match(string(./firstname),
>>>>>> * "James") > 0.0])
>>>>>> */
>>>>>> public static void main(String[] args) throws Exception {
>>>>>> int k = -1;
>>>>>> int runs = 5;
>>>>>> if (args.length > ++k) runs = Integer.parseInt(args[k]);
>>>>>>
>>>>>> int nodes = 10000;
>>>>>> if (args.length > ++k) nodes = Integer.parseInt(args[k]);
>>>>>>
>>>>>> String content = "James is out in the woods";
>>>>>> if (args.length > ++k) content = args[k];
>>>>>>
>>>>>> String expression = "James";
>>>>>> if (args.length > ++k) expression = args[k];
>>>>>>
>>>>>> LuceneMatcher matcher = new LuceneMatcher();
>>>>>> Query query = matcher.parseQuery(expression); // to be reused N
>>>>>> times
>>>>>>
>>>>>> for (int r = 0; r < runs; r++) {
>>>>>> long start = System.currentTimeMillis();
>>>>>> int matches = 0;
>>>>>>
>>>>>> for (int i = 0; i < nodes; i++) {
>>>>>> // if (LuceneUtil.match(content + i, expression) > 0.0f) {
>>>>>> if (matcher.match(matcher.createIndex(content + i), query) >
>>>>>> 0.0f) {
>>>>>> matches++;
>>>>>> }
>>>>>> }
>>>>>>
>>>>>> long end = System.currentTimeMillis();
>>>>>> System.out.println("matches=" + matches);
>>>>>> System.out.println("secs=" + ((end-start) / 1000.0f));
>>>>>> System.out.println("queries/sec=" + (nodes / ((end-start) /
>>>>>> 1000.0f)));
>>>>>> System.out.println();
>>>>>> }
>>>>>> }
>>>>>> }
>>
>> public class StringIndexReader extends IndexReader {
>> private List terms;
>> public StringIndexReader(String strings[]) {
>> super(null);
>> terms = Arrays.asList(strings);
>> Collections.sort(terms);
>> }
>>
>> public TermFreqVector[] getTermFreqVectors(int docNumber) throws
>> IOException {
>> System.out.println("StringIndexReader.getTermFreqVectors");
>> return new TermFreqVector[0];
>> }
>>
>> public TermFreqVector getTermFreqVector(int docNumber, String
>> field) throws IOException {
>> System.out.println("StringIndexReader.getTermFreqVector");
>> return null;
>> }
>>
>> public int numDocs() {
>> System.out.println("StringIndexReader.numDocs");
>> return 1;
>> }
>>
>> public int maxDoc() {
>> System.out.println("StringIndexReader.maxDoc");
>> return 1;
>> }
>>
>> public Document document(int n) throws IOException {
>> System.out.println("StringIndexReader.document");
>> return null;
>> }
>>
>> public boolean isDeleted(int n) {
>> System.out.println("StringIndexReader.isDeleted");
>> return false;
>> }
>>
>> public boolean hasDeletions() {
>> System.out.println("StringIndexReader.hasDeletions");
>> return false;
>> }
>>
>> public byte[] norms(String field) throws IOException {
>> // TODO: what value to use for this?
>> System.out.println("StringIndexReader.norms: " + field);
>> return new byte[] { 1 };
>> }
>>
>> public void norms(String field, byte[] bytes, int offset) throws
>> IOException {
>> System.out.println("StringIndexReader.norms: " + field + "*");
>> }
>>
>> protected void doSetNorm(int doc, String field, byte value) throws
>> IOException {
>> System.out.println("StringIndexReader.doSetNorm");
>>
>> }
>>
>> public TermEnum terms() throws IOException {
>> System.out.println("StringIndexReader.terms");
>> return terms(null);
>> }
>>
>> public TermEnum terms(final Term term) throws IOException {
>> System.out.println("StringIndexReader.terms: " + term);
>>
>> TermEnum termEnum = new TermEnum() {
>> private String currentTerm;
>> private Iterator iter;
>>
>> public boolean next() {
>> System.out.println("TermEnum.next");
>> // advance first, then report; returning iter.hasNext() after
>> // advancing would silently drop the last term
>> if (!iter.hasNext())
>> return false;
>> currentTerm = (String) iter.next();
>> return true;
>> }
>>
>> public Term term() {
>> if (iter == null) {
>> iter = terms.iterator();
>> while (next()) {
>> if (currentTerm.startsWith(term.text()))
>> break;
>> }
>> }
>> System.out.println("TermEnum.term: " + currentTerm);
>> return new Term(term.field(), currentTerm);
>> }
>>
>> public int docFreq() {
>> System.out.println("TermEnum.docFreq");
>> return 1;
>> }
>>
>> public void close() {
>> System.out.println("TermEnum.close");
>> }
>> };
>> return termEnum;
>> }
>>
>> public int docFreq(Term term) throws IOException {
>> System.out.println("StringIndexReader.docFreq: " + term);
>> return terms.contains(term.text()) ? 1 : 0;
>> }
>>
>> public TermDocs termDocs() throws IOException {
>> System.out.println("StringIndexReader.termDocs");
>>
>> TermDocs td = new TermDocs() {
>> private boolean done = false;
>> String currentTerm;
>>
>> public void seek(Term term) {
>> System.out.println(".seek: " + term);
>> currentTerm = term.text();
>> done = false;
>> }
>>
>> public void seek(TermEnum termEnum) {
>> seek(termEnum.term());
>> }
>>
>> public int doc() {
>> System.out.println(".doc");
>> return 0;
>> }
>>
>> public int freq() {
>> System.out.println(".freq");
>> return 1;
>> }
>>
>> public boolean next() {
>> System.out.println(".next");
>> return false;
>> }
>>
>> public int read(int[] docs, int[] freqs) {
>> System.out.println(".read: " + docs.length);
>>
>> if (done) return 0;
>>
>> done = true;
>> docs[0] = 0;
>> freqs[0] = freq();
>> return 1;
>> }
>>
>> public boolean skipTo(int target) {
>> System.out.println(".skipTo");
>> return false;
>> }
>>
>> public void close() {
>> System.out.println(".close");
>>
>> }
>> };
>> return td;
>> }
>>
>> public TermPositions termPositions() throws IOException {
>> System.out.println("StringIndexReader.termPositions");
>> return null;
>> }
>>
>> protected void doDelete(int docNum) throws IOException {
>> System.out.println("StringIndexReader.doDelete");
>>
>> }
>>
>> protected void doUndeleteAll() throws IOException {
>> System.out.println("StringIndexReader.doUndeleteAll");
>>
>> }
>>
>> protected void doCommit() throws IOException {
>> System.out.println("StringIndexReader.doCommit");
>>
>> }
>>
>> protected void doClose() throws IOException {
>> System.out.println("StringIndexReader.doClose");
>>
>> }
>>
>> public Collection getFieldNames() throws IOException {
>> System.out.println("StringIndexReader.getFieldNames");
>> return null;
>> }
>>
>> public Collection getFieldNames(boolean indexed) throws IOException
>> {
>> System.out.println("StringIndexReader.getFieldNames");
>> return null;
>> }
>>
>> public Collection getIndexedFieldNames(Field.TermVector tvSpec) {
>> System.out.println("StringIndexReader.getIndexedFieldNames");
>> return null;
>> }
>>
>> public Collection getFieldNames(FieldOption fldOption) {
>> System.out.println("StringIndexReader.getFieldNames");
>> return null;
>> }
>>
>> public static void main(String[] args) {
>> IndexReader reader = new StringIndexReader(new String[] {"foo",
>> "bar", "baz"});
>> IndexSearcher searcher = new IndexSearcher(reader);
>>
>> Hits hits = null;
>> try {
>> hits = searcher.search(new WildcardQuery(new
>> Term("field","ba*")));
>> } catch (IOException e) {
>> e.printStackTrace();
>> return;
>> }
>> System.out.println("found " + hits.length());
>> }
>> }
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-dev-help@lucene.apache.org
Re: [Performance] Streaming main memory indexing of single strings
Posted by Wolfgang Hoschek <wh...@lbl.gov>.
>
> On Apr 16, 2005, at 1:17 PM, Wolfgang Hoschek wrote:
>>> Note that "fish*~" is not a valid query expression :)
>>
>> Perhaps the Lucene QueryParser should throw an exception then.
>> Currently 1.4.3 accepts the expression as is without grumbling...
>
> Several minor QueryParser weirdnesses like this have turned up
> recently. Sure enough, that is an odd one. It parses into a
> PrefixQuery for "fish*" and the ~ is dropped. I consider this a bug
> as this should really be a parse exception. I've just filed this as a
> bug:
>
> http://issues.apache.org/bugzilla/show_bug.cgi?id=34486
Thanks, Erik.
>
>
>> If you're looking for an XML DB for managing and querying large
>> persistent data volumes, Nux/Saxon will disappoint you.
>
> I want to store at least several hundred MB up to gigabytes and have
> this queryable with XQuery.
With some luck it might comfortably fit into a (compressed) main memory
cache.
> We previously used Tamino with XPath, but our XML is not well enough
> normalized to make this very feasible to query. eXist, last I toyed
> with it, only scaled to 50MB.
>
> Ok, so Nux/Saxon is out for our uses. Any recommendations though?
I can't recommend any product in particular. There's a comprehensive
list of impls at http://www.w3.org/XML/Query
Could you briefly summarize the key use case and an example query? If
history is any indicator, then within 2-3 years the big relational DBMS
vendors will have caught up with XQuery/XML extensions and eat the
little special-purpose XML DB vendors for lunch - just like they did
with OODBMS.
Wolfgang.
Re: [Performance] Streaming main memory indexing of single strings
Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Apr 16, 2005, at 1:17 PM, Wolfgang Hoschek wrote:
>> Note that "fish*~" is not a valid query expression :)
>
> Perhaps the Lucene QueryParser should throw an exception then.
> Currently 1.4.3 accepts the expression as is without grumbling...
Several minor QueryParser weirdnesses like this have turned up
recently. Sure enough, that is an odd one. It parses into a
PrefixQuery for "fish*" and the ~ is dropped. I consider this a bug as
this should really be a parse exception. I've just filed this as a
bug:
http://issues.apache.org/bugzilla/show_bug.cgi?id=34486
> If you're looking for an XML DB for managing and querying large
> persistent data volumes, Nux/Saxon will disappoint you.
I want to store at least several hundred MB up to gigabytes and have
this queryable with XQuery. We previously used Tamino with XPath, but
our XML is not well enough normalized to make this very feasible to
query. eXist, last I toyed with it, only scaled to 50MB.
Ok, so Nux/Saxon is out for our uses. Any recommendations though?
>> Could you avoid calling match() twice here?
>
> That's no problem for two reasons:
> 1) The XQuery optimizer rewrites the query into an optimized
> expression tree eliminating redundancies, etc. If for some reason this
> isn't feasible or legal then
> 2) There's a smart cache between the XQuery engine and the lucene
> invocation that returns results in O(1) for Lucene queries that have
> already been seen/processed before. It caches (queryString,result),
> plus parsed Lucene queries, plus the Lucene index data structure for
> any given string text (which currently is a simple RAMDirectory but
> could be whatever datastructure we come up with as part of the
> exercise - class StringIndex or some such). This works so well that I
> have to disable the cache to avoid getting astronomically good figures
> on artificial benchmarks.
Cool.
> BTW, I have some small performance patches for FastCharStream and in
> various other places, but I'll hold off proposing those until our
> exercise is done and the real merits/drawbacks of those patches can be
> better assessed.
Excellent... we're always interested in performance improvements!
Erik
Re: [Performance] Streaming main memory indexing of single strings
Posted by Wolfgang Hoschek <wh...@lbl.gov>.
On Apr 16, 2005, at 2:58 AM, Erik Hatcher wrote:
>
> On Apr 15, 2005, at 9:50 PM, Wolfgang Hoschek wrote:
>>> So, all the text analyzed is in a given field... that means that
>>> anything in the Query not associated with that field has no bearing
>>> on whether the text matches or not, correct?
>>
>> Right, it has no bearing. A query wouldn't specify any fields, it
>> just uses the implicit default field name.
>
> Cool. My questions regarding how to deal with field names are
> obviously more an implementation detail under the covers of the
> match() method than how you want to use it. In a general sense,
> though, it's necessary to deal with the default field name, queries
> that have non-default-field terms, and the analysis process.
Right, I'd just like to first assess rough overall efficiency before
tying up some loose ends.
>
>> (: An XQuery that finds all books authored by James that have
>> something to do with "fish", sorted by relevance :)
>> declare namespace lucene = "java:nux.xom.xquery.XQueryUtil";
>> declare variable $query := "fish*~"; (: any arbitrary fuzzy lucene
>> query goes here :)
>
> Note that "fish*~" is not a valid query expression :)
Perhaps the Lucene QueryParser should throw an exception then.
Currently 1.4.3 accepts the expression as is without grumbling...
> (I love how XQuery uses smiley emoticons for comments) BTW, I have a
> strong vested interest in seeing a fast and scalable XQuery engine in
> the open source world. I've toyed with eXist some - it was not stable
> or scalable enough for my needs. Lots of Wolfgangs in the XQuery
> world :)
If you're looking for an XML DB for managing and querying large
persistent data volumes, Nux/Saxon will disappoint you. If, on the
other hand, you're looking for a very fast XQuery engine inserted into
a processing pipeline working with many small to medium sized XML
documents (such as messages in a scalable message queue or network
router) then you might be pleased.
>
>> for $book in /books/book[author="James" and lucene:match(string(.),
>> $query) > 0.0]
>> let $score := lucene:match(string($book), $query)
>> order by $score descending
>> return (<score>{$score}</score>, $book)
>
> Could you avoid calling match() twice here?
That's no problem for two reasons:
1) The XQuery optimizer rewrites the query into an optimized expression
tree eliminating redundancies, etc. If for some reason this isn't
feasible or legal then
2) There's a smart cache between the XQuery engine and the lucene
invocation that returns results in O(1) for Lucene queries that have
already been seen/processed before. It caches (queryString,result),
plus parsed Lucene queries, plus the Lucene index data structure for
any given string text (which currently is a simple RAMDirectory but
could be whatever datastructure we come up with as part of the exercise
- class StringIndex or some such). This works so well that I have to
disable the cache to avoid getting astronomically good figures on
artificial benchmarks.
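For what it's worth, the bounded (queryString, result) part of such a cache can be sketched in plain Java with a LinkedHashMap in access order. The class and method names below are made up for illustration; the real cache also holds parsed Lucene queries and index structures, as described above:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class QueryResultCache {
    private final Map<String, Float> cache;

    public QueryResultCache(final int maxEntries) {
        // accessOrder=true makes iteration order least-recently-used first,
        // so evicting the eldest entry gives LRU semantics.
        this.cache = new LinkedHashMap<String, Float>(16, 0.75f, true) {
            protected boolean removeEldestEntry(Map.Entry<String, Float> eldest) {
                return size() > maxEntries;
            }
        };
    }

    /** Returns the cached score, or null if not seen before. */
    public Float get(String queryString) { return cache.get(queryString); }

    public void put(String queryString, float score) { cache.put(queryString, score); }

    public static void main(String[] args) {
        QueryResultCache c = new QueryResultCache(2);
        c.put("fish*", 0.8f);
        c.put("author:James", 0.5f);
        c.get("fish*");          // touch -> most recently used
        c.put("xquery", 0.9f);   // evicts the LRU entry "author:James"
        System.out.println(c.get("author:James")); // null (evicted)
        System.out.println(c.get("fish*"));        // 0.8
    }
}
```

Lookups and inserts are O(1), which is what makes the cache pay off on repeated queries.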
>
>> some skeleton:
>>
>> private static final String FIELD_NAME = "content"; // or whatever -
>> it doesn't matter
>>
>> public Query parseQuery(String expression) throws ParseException {
>> QueryParser parser = new QueryParser(FIELD_NAME, analyzer);
>> return parser.parse(expression);
>> }
>>
>> private Document createDocument(String content) {
>> Document doc = new Document();
>> doc.add(Field.UnStored(FIELD_NAME, content));
>> return doc;
>> }
>
> This skeleton code doesn't really apply to the custom IndexReader
> implementation. There is a method to return a document from
> IndexReader, which I did not implement yet in my sample - it'd be
> trivial though. I don't think you'd need to get a Lucene Document
> object back in your use case, but for completeness I will add that to
> my implementation.
Right, it was just to outline that the value of FIELD_NAME doesn't
really matter.
>
>>> There is still some missing trickery in my StringIndexReader - it
>>> does not currently handle phrase queries as an implementation of
>>> termPositions() is needed.
>>>
>>> Wolfgang - will you take what I've done the extra mile and implement
>>> what's left (frequency and term position)? I might not revisit this
>>> very soon.
>>
>> I'm not sure I'll be able to pull it off, but I'll see what I can do.
>> If someone more competent would like to help out, let me know...
>> Thanks for all the help anyway, Erik and co, it is greatly
>> appreciated!
>
> If you can build an XQuery engine, you can hack in some basic Java
> data structures that keep track of word positions and frequency :)
There's a learning curve ahead of me, not having worked before at that
low a level with Lucene :-)
Mark Harwood sent me some good but somewhat unfinished code he wrote
previously for similar scenarios. I'll look into merging his pieces and
your skeleton.
By now I'm quite confident this can be done reasonably efficiently. BTW,
I have some small performance patches for FastCharStream and in various
other places, but I'll hold off proposing those until our exercise is
done and the real merits/drawbacks of those patches can be better
assessed.
>
> I'll tinker with it some more for fun in the near future, but anyone
> else is welcome to flesh out the missing pieces.
Thanks again for kindly helping out!
Wolfgang.
-----------------------------------------------------------------------
Wolfgang Hoschek | email: whoschek@lbl.gov
Distributed Systems Department | phone: (415)-533-7610
Berkeley Laboratory | http://dsd.lbl.gov/~hoschek/
-----------------------------------------------------------------------
Re: [Performance] Streaming main memory indexing of single strings
Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Apr 15, 2005, at 9:50 PM, Wolfgang Hoschek wrote:
>> So, all the text analyzed is in a given field... that means that
>> anything in the Query not associated with that field has no bearing
>> on whether the text matches or not, correct?
>
> Right, it has no bearing. A query wouldn't specify any fields, it just
> uses the implicit default field name.
Cool. My questions regarding how to deal with field names are obviously
more an implementation detail under the covers of the match() method
than how you want to use it. In a general sense, though, it's necessary
to deal with the default field name, queries that have non-default-field
terms, and the analysis process.
> (: An XQuery that finds all books authored by James that have
> something to do with "fish", sorted by relevance :)
> declare namespace lucene = "java:nux.xom.xquery.XQueryUtil";
> declare variable $query := "fish*~"; (: any arbitrary fuzzy lucene
> query goes here :)
Note that "fish*~" is not a valid query expression :) (I love how
XQuery uses smiley emoticons for comments) BTW, I have a strong vested
interest in seeing a fast and scalable XQuery engine in the open source
world. I've toyed with eXist some - it was not stable or scalable
enough for my needs. Lots of Wolfgangs in the XQuery world :)
> for $book in /books/book[author="James" and lucene:match(string(.),
> $query) > 0.0]
> let $score := lucene:match(string($book), $query)
> order by $score descending
> return (<score>{$score}</score>, $book)
Could you avoid calling match() twice here?
> some skeleton:
>
> private static final String FIELD_NAME = "content"; // or whatever -
> it doesn't matter
>
> public Query parseQuery(String expression) throws ParseException {
> QueryParser parser = new QueryParser(FIELD_NAME, analyzer);
> return parser.parse(expression);
> }
>
> private Document createDocument(String content) {
> Document doc = new Document();
> doc.add(Field.UnStored(FIELD_NAME, content));
> return doc;
> }
This skeleton code doesn't really apply to the custom IndexReader
implementation. There is a method to return a document from
IndexReader, which I did not implement yet in my sample - it'd be
trivial though. I don't think you'd need to get a Lucene Document
object back in your use case, but for completeness I will add that to
my implementation.
>> There is still some missing trickery in my StringIndexReader - it
>> does not currently handle phrase queries as an implementation of
>> termPositions() is needed.
>>
>> Wolfgang - will you take what I've done the extra mile and implement
>> what's left (frequency and term position)? I might not revisit this
>> very soon.
>
> I'm not sure I'll be able to pull it off, but I'll see what I can do.
> If someone more competent would like to help out, let me know...
> Thanks for all the help anyway, Erik and co, it is greatly
> appreciated!
If you can build an XQuery engine, you can hack in some basic Java data
structures that keep track of word positions and frequency :)
I'll tinker with it some more for fun in the near future, but anyone
else is welcome to flesh out the missing pieces.
Erik
Re: [Performance] Streaming main memory indexing of single strings
Posted by Wolfgang Hoschek <wh...@lbl.gov>.
On Apr 15, 2005, at 5:55 PM, Erik Hatcher wrote:
> On Apr 15, 2005, at 8:18 PM, Wolfgang Hoschek wrote:
>> The main issue is to enable handling arbitrary queries (anything
>> derived from o.a.l.search.Query). Yes, there'd be an additional
>> method Analyzer parameter (support any analyzer). The use case does
>> not require field names. One could internally use "content" or
>> anything else for the default field name, which is the one to
>> implicitly be queried...
>
> To be general-purpose, though, passing in the default field name would
> be needed. All of Lucene's built-in analyzers ignore it, but that is
> not the case everywhere. For example, NutchAnalyzer does different
> things for different fields.
>
> So, all the text analyzed is in a given field... that means that
> anything in the Query not associated with that field has no bearing on
> whether the text matches or not, correct?
Right, it has no bearing. A query wouldn't specify any fields, it just
uses the implicit default field name. Here's an example
(: An XQuery that finds all books authored by James that have something
to do with "fish", sorted by relevance :)
declare namespace lucene = "java:nux.xom.xquery.XQueryUtil";
declare variable $query := "fish*~"; (: any arbitrary fuzzy lucene
query goes here :)
for $book in /books/book[author="James" and lucene:match(string(.),
$query) > 0.0]
let $score := lucene:match(string($book), $query)
order by $score descending
return (<score>{$score}</score>, $book)
some skeleton:
private static final String FIELD_NAME = "content"; // or whatever -
it doesn't matter
public Query parseQuery(String expression) throws ParseException {
QueryParser parser = new QueryParser(FIELD_NAME, analyzer);
return parser.parse(expression);
}
private Document createDocument(String content) {
Document doc = new Document();
doc.add(Field.UnStored(FIELD_NAME, content));
return doc;
}
>
> There is still some missing trickery in my StringIndexReader - it does
> not currently handle phrase queries as an implementation of
> termPositions() is needed.
>
> Wolfgang - will you take what I've done the extra mile and implement
> what's left (frequency and term position)? I might not revisit this
> very soon.
I'm not sure I'll be able to pull it off, but I'll see what I can do.
If someone more competent would like to help out, let me know... Thanks
for all the help anyway, Erik and co, it is greatly appreciated!
Wolfgang.
>
> Erik
>
>
>
>>
>> Wolfgang.
>>
>> On Apr 15, 2005, at 5:08 PM, Erik Hatcher wrote:
>>
>>> On Apr 15, 2005, at 6:15 PM, Wolfgang Hoschek wrote:
>>>> Cool! For my use case it would need to be able to handle arbitrary
>>>> queries (previously parsed from a general lucene query string).
>>>> Something like:
>>>>
>>>> float match(String Text, Query query)
>>>>
>>>> it's fine with me if it also works for
>>>>
>>>> float[] match(String[] texts, Query query) or
>>>> float(Document doc, Query query)
>>>>
>>>> but that isn't required by the use case.
>>>
>>> My implementation is nearly that. The score is available as
>>> hits.score(0). You would also need an analyzer, I presume, passed
>>> to your proposed match() method if you want the text broken into
>>> terms. My current implementation is passed a String[] where each
>>> item is considered a term for the document. match() would also need
>>> a field name to be fully accurate - since the analyzer needs a field
>>> name and terms used for searching need a field name. The Query may
>>> contain terms for any number of fields - how should that be handled?
>>> Should only a single field name be passed in and any terms request
>>> for other fields be ignored? Or should this utility morph to assume
>>> any words in the text is in any field being asked of it?
>>>
>>> As for Doug's devil advocate questions - I really don't know what
>>> I'd use it for personally (other than the "match this single string
>>> against a bunch of queries"), I just thought it was clever that it
>>> could be done. Clever regexes could come close, but it'd be a lot
>>> more effort than reusing good ol' QueryParser and this simplistic
>>> IndexReader, along with an Analyzer.
>>>
>>> Erik
>>>
>>>>
>>>> Wolfgang.
>>>>
>>>>> I am intrigued by this and decided to mock a quick and dirty
>>>>> example of such an IndexReader. After a little trial-and-error I
>>>>> got it working at least for TermQuery and WildcardQuery. I've
>>>>> pasted my code below as an example, but there is much room for
>>>>> improvement, especially in terms of performance and also in
>>>>> keeping track of term frequency, and also it would be nicer if it
>>>>> handled the analysis internally.
>>>>>
>>>>> I think something like this would make a handy addition to our
>>>>> contrib area at least. I'd be happy to receive improvements to
>>>>> this and then add it to a contrib subproject.
>>>>>
>>>>> Perhaps this would be a handy way to handle situations where users
>>>>> have queries saved in a system and need to be alerted whenever a
>>>>> new document arrives matching the saved queries?
>>>>>
>>>>> Erik
>>>>>
>>>>>
>>>>>
>>>>>>
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Wolfgang Hoschek [mailto:whoschek@lbl.gov]
>>>>>> Sent: Thursday, April 14, 2005 4:04 PM
>>>>>> To: java-dev@lucene.apache.org
>>>>>> Subject: Re: [Performance] Streaming main memory indexing of
>>>>>> single
>>>>>> strings
>>>>>>
>>>>>>
>>>>>> This seems to be a promising avenue worth exploring. My
>>>>>> gut feeling is
>>>>>> that this could easily be 10-100 times faster.
>>>>>>
>>>>>> The drawback is that it requires a fair amount of understanding of
>>>>>> intricate Lucene internals, pulling those pieces together and
>>>>>> adapting
>>>>>> them as required for the seemingly simple "float match(String
>>>>>> text,
>>>>>> Query query)".
>>>>>>
>>>>>> I might give it a shot but I'm not sure I'll be able to pull this
>>>>>> off!
>>>>>> Is there any similar code I could look at as a starting point?
>>>>>>
>>>>>> Wolfgang.
>>>>>>
>>>>>> On Apr 14, 2005, at 1:13 PM, Robert Engels wrote:
>>>>>>
>>>>>>> I think you are not approaching this the correct way.
>>>>>>>
>>>>>>> Pseudo code:
>>>>>>>
>>>>>>> Subclass IndexReader.
>>>>>>>
>>>>>>> Get tokens from String 'document' using Lucene analyzers.
>>>>>>>
>>>>>>> Build simple hash-map based data structures using tokens for
>>>>>>> terms,
>>>>>>> and term
>>>>>>> positions.
>>>>>>>
>>>>>>> reimplement termDocs() and termPositions() to use the structures
>>>>>>> from
>>>>>>> above.
>>>>>>>
>>>>>>> run searches.
>>>>>>>
>>>>>>> start again with next document.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Wolfgang Hoschek [mailto:whoschek@lbl.gov]
>>>>>>> Sent: Thursday, April 14, 2005 2:56 PM
>>>>>>> To: java-dev@lucene.apache.org
>>>>>>> Subject: Re: [Performance] Streaming main memory indexing of
>>>>>>> single
>>>>>>> strings
>>>>>>>
>>>>>>>
>>>>>>> Otis, this might be a misunderstanding.
>>>>>>>
>>>>>>> - I'm not calling optimize(). That piece is commented out if you
>>>>>>> look
>>>>>>> again at the code.
>>>>>>> - The *streaming* use case requires that for each query I add
>>>>>>> one (and
>>>>>>> only one) document (aka string) to an empty index:
>>>>>>>
>>>>>>> repeat N times (where N is millions or billions):
>>>>>>> add a single string (aka document) to an empty index
>>>>>>> query the index
>>>>>>> drop index (or delete its document)
>>>>>>>
>>>>>>> with the following API being called N times: float match(String
>>>>>>> text,
>>>>>>> Query query)
>>>>>>>
>>>>>>> So there's no possibility of adding many documents and thereafter
>>>>>>> running the query. This in turn seems to mean that the
>>>>>>> IndexWriter
>>>>>>> can't be kept open - unless I manually delete each document
>>>>>>> after each
>>>>>>> query to repeatedly reuse the RAMDirectory, which I've also tried
>>>>>>> before without any significant performance gain - deletion seems
>>>>>>> to
>>>>>>> have substantial overhead in itself. Perhaps it would be better
>>>>>>> if
>>>>>>> there were a Directory.deleteAllDocuments() or similar. Did you
>>>>>>> have
>>>>>>> some other approach in mind?
>>>>>>>
>>>>>>> As I said, Lucene's design doesn't seem to fit this streaming
>>>>>>> use case
>>>>>>> pattern well. In *this* scenario one could easily do without any
>>>>>>> locking, and without byte level organization in RAMDirectory and
>>>>>>> RAMFile, etc because a single small string isn't a large
>>>>>>> persistent
>>>>>>> multi-document index.
>>>>>>>
>>>>>>> For some background, here's a small example for the kind of
>>>>>>> XQuery
>>>>>>> functionality Nux/Lucene integration enables:
>>>>>>>
>>>>>>> (: An XQuery that finds all books authored by James that have
>>>>>>> something
>>>>>>> to do with "fish", sorted by relevance :)
>>>>>>> declare namespace lucene = "java:nux.xom.xquery.XQueryUtil";
>>>>>>> declare variable $query := "fish*~";
>>>>>>>
>>>>>>> for $book in /books/book[author="James" and
>>>>>>> lucene:match(string(.),
>>>>>>> $query) > 0.0]
>>>>>>> let $score := lucene:match(string($book), $query)
>>>>>>> order by $score descending
>>>>>>> return (<score>{$score}</score>, $book)
>>>>>>>
>>>>>>> More interestingly one can use this for classifying and routing
>>>>>>> XML
>>>>>>> messages based on rules (i.e. queries) inspecting their
>>>>>>> content...
>>>>>>>
>>>>>>> Any other clues about potential improvements would be greatly
>>>>>>> appreciated.
>>>>>>>
>>>>>>> Wolfgang.
>>>>>>>
>>>>>>> On Apr 13, 2005, at 10:09 PM, Otis Gospodnetic wrote:
>>>>>>>
>>>>>>>> It looks like you are calling that IndexWriter code in some
>>>>>>>> loops,
>>>>>>>> opening it and closing it in every iteration of the loop and
>>>>>>>> also
>>>>>>>> calling optimize. All of those things could be improved.
>>>>>>>> Keep your IndexWriter open, don't close it, and optimize the
>>>>>>>> index
>>>>>>>> only
>>>>>>>> once you are done adding documents to it.
>>>>>>>>
>>>>>>>> See the highlights and the snipets in the first hit:
>>>>>>>> http://www.lucenebook.com/search?query=when+to+optimize
>>>>>>>>
>>>>>>>> Otis
>>>>>>>>
>>>>>>>>
>>>>>>>> --- Wolfgang Hoschek <wh...@lbl.gov> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I'm wondering if anyone could let me know how to improve Lucene
>>>>>>>>> performance for "streaming main memory indexing of single
>>>>>>>>> strings".
>>>>>>>>> This would help to effectively integrate Lucene with the Nux
>>>>>>>>> XQuery
>>>>>>>>> engine.
>>>>>>>>>
>>>>>>>>> Below is a small microbenchmark simulating STREAMING XQuery
>>>>>>>>> fulltext
>>>>>>>>> search as typical for XML network routers, message queuing
>>>>>>>>> system,
>>>>>>>>> P2P
>>>>>>>>> networks, etc. In this on-the-fly main memory indexing
>>>>>>>>> scenario, each
>>>>>>>>>
>>>>>>>>> individual string is immediately matched as soon as it becomes
>>>>>>>>> available without any persistence involved. This usage
>>>>>>>>> scenario and
>>>>>>>>> corresponding performance profile are quite different in
>>>>>>>>> comparison to
>>>>>>>>>
>>>>>>>>> fulltext search over persistent (read-mostly) indexes.
>>>>>>>>>
>>>>>>>>> The benchmark runs at some 3000 lucene queries/sec
>>>>>>>>> (lucene-1.4.3)
>>>>>>>>> which
>>>>>>>>> is unfortunate news considering the XQuery engine can easily
>>>>>>>>> walk
>>>>>>>>> hundreds of thousands of XML nodes per second. Ideally I'd
>>>>>>>>> like to
>>>>>>>>> run
>>>>>>>>> at some 100000 queries/sec. Running this through the JDK 1.5
>>>>>>>>> profiler
>>>>>>>>> it seems that most time is spent in and below the following
>>>>>>>>> calls:
>>>>>>>>>
>>>>>>>>> writer = new IndexWriter(dir, analyzer, true);
>>>>>>>>> writer.addDocument(...);
>>>>>>>>> writer.close();
>>>>>>>>>
>>>>>>>>> I tried quite a few variants of the benchmark with various
>>>>>>>>> options,
>>>>>>>>> unfortunately with little or no effect.
>>>>>>>>> Lucene just does not seem to be designed to do this sort of
>>>>>>>>> "transient
>>>>>>>>> single string index" thing. All code paths related to opening,
>>>>>>>>> closing,
>>>>>>>>> reading, writing, querying and object creation seem to be
>>>>>>>>> designed
>>>>>>>>> for
>>>>>>>>> large persistent indexes.
>>>>>>>>>
>>>>>>>>> Any advice on what I'm missing or what could be done about it
>>>>>>>>> would
>>>>>>>>> be
>>>>>>>>> greatly appreciated.
>>>>>>>>>
>>>>>>>>> Wolfgang.
>>>>>>>>>
>>>>>>>>> P.S. the benchmark code is attached as a file below:
>>>>>>>>>
>>>>>>>>>> package nux.xom.pool;
>>>>>>>>>
>>>>>>>>> import java.io.IOException;
>>>>>>>>> //import java.io.Reader;
>>>>>>>>>
>>>>>>>>> import org.apache.lucene.analysis.Analyzer;
>>>>>>>>> //import org.apache.lucene.analysis.LowerCaseTokenizer;
>>>>>>>>> //import org.apache.lucene.analysis.PorterStemFilter;
>>>>>>>>> //import org.apache.lucene.analysis.SimpleAnalyzer;
>>>>>>>>> //import org.apache.lucene.analysis.TokenStream;
>>>>>>>>> import org.apache.lucene.analysis.standard.StandardAnalyzer;
>>>>>>>>> import org.apache.lucene.document.Document;
>>>>>>>>> import org.apache.lucene.document.Field;
>>>>>>>>> //import org.apache.lucene.index.IndexReader;
>>>>>>>>> import org.apache.lucene.index.IndexWriter;
>>>>>>>>> import org.apache.lucene.queryParser.ParseException;
>>>>>>>>> import org.apache.lucene.queryParser.QueryParser;
>>>>>>>>> import org.apache.lucene.search.Hits;
>>>>>>>>> import org.apache.lucene.search.IndexSearcher;
>>>>>>>>> import org.apache.lucene.search.Query;
>>>>>>>>> import org.apache.lucene.search.Searcher;
>>>>>>>>> import org.apache.lucene.store.Directory;
>>>>>>>>> import org.apache.lucene.store.RAMDirectory;
>>>>>>>>>
>>>>>>>>> public final class LuceneMatcher { // TODO: make non-public
>>>>>>>>>
>>>>>>>>> private final Analyzer analyzer;
>>>>>>>>> // private final Directory dir = new RAMDirectory();
>>>>>>>>>
>>>>>>>>> public LuceneMatcher() {
>>>>>>>>> this(new StandardAnalyzer());
>>>>>>>>> // this(new SimpleAnalyzer());
>>>>>>>>> // this(new StopAnalyzer());
>>>>>>>>> // this(new Analyzer() {
>>>>>>>>> // public final TokenStream tokenStream(String fieldName,
>>>>>>>>> Reader
>>>>>>>>> reader) {
>>>>>>>>> // return new PorterStemFilter(new
>>>>>>>>> LowerCaseTokenizer(reader));
>>>>>>>>> // }
>>>>>>>>> // });
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> public LuceneMatcher(Analyzer analyzer) {
>>>>>>>>> if (analyzer == null)
>>>>>>>>> throw new IllegalArgumentException("analyzer must not be
>>>>>>>>> null");
>>>>>>>>> this.analyzer = analyzer;
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> public Query parseQuery(String expression) throws
>>>>>>>>> ParseException {
>>>>>>>>> QueryParser parser = new QueryParser("content", analyzer);
>>>>>>>>> // parser.setPhraseSlop(0);
>>>>>>>>> return parser.parse(expression);
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> /**
>>>>>>>>> * Returns the relevance score by matching the given index
>>>>>>>>> against
>>>>>>>>> the given
>>>>>>>>> * Lucene query expression. The index must not contain more
>>>>>>>>> than one
>>>>>>>>> Lucene
>>>>>>>>> * "document" (aka string to be searched).
>>>>>>>>> */
>>>>>>>>> public float match(Directory index, Query query) {
>>>>>>>>> Searcher searcher = null;
>>>>>>>>> try {
>>>>>>>>> searcher = new IndexSearcher(index);
>>>>>>>>> Hits hits = searcher.search(query);
>>>>>>>>> float score = hits.length() > 0 ? hits.score(0) : 0.0f;
>>>>>>>>> return score;
>>>>>>>>> } catch (IOException e) { // should never happen (RAMDirectory)
>>>>>>>>> throw new RuntimeException(e);
>>>>>>>>> } finally {
>>>>>>>>> try {
>>>>>>>>> if (searcher != null) searcher.close();
>>>>>>>>> } catch (IOException e) { // should never happen (RAMDirectory)
>>>>>>>>> throw new RuntimeException(e);
>>>>>>>>> }
>>>>>>>>> }
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> // public float match(String text, Query query) {
>>>>>>>>> // return match(createIndex(text), query);
>>>>>>>>> // }
>>>>>>>>>
>>>>>>>>> public Directory createIndex(String text) {
>>>>>>>>> Directory dir = new RAMDirectory();
>>>>>>>>> IndexWriter writer = null;
>>>>>>>>> try {
>>>>>>>>> writer = new IndexWriter(dir, analyzer, true);
>>>>>>>>> // writer.setUseCompoundFile(false);
>>>>>>>>> // writer.mergeFactor = 2;
>>>>>>>>> // writer.minMergeDocs = 1;
>>>>>>>>> // writer.maxMergeDocs = 1;
>>>>>>>>>
>>>>>>>>> writer.addDocument(createDocument(text));
>>>>>>>>> // writer.optimize();
>>>>>>>>> return dir;
>>>>>>>>> } catch (IOException e) { // should never happen (RAMDirectory)
>>>>>>>>> throw new RuntimeException(e);
>>>>>>>>> } finally {
>>>>>>>>> try {
>>>>>>>>> if (writer != null) writer.close();
>>>>>>>>> } catch (IOException e) { // should never happen (RAMDirectory)
>>>>>>>>> throw new RuntimeException(e);
>>>>>>>>> }
>>>>>>>>> }
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> private Document createDocument(String content) {
>>>>>>>>> Document doc = new Document();
>>>>>>>>> doc.add(Field.UnStored("content", content));
>>>>>>>>> // doc.add(Field.Text("x", content));
>>>>>>>>> return doc;
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> /**
>>>>>>>>> * Lucene microbenchmark simulating STREAMING XQuery fulltext search as
>>>>>>>>> * typical for XML network routers, message queuing systems, P2P
>>>>>>>>> * networks, etc. In this on-the-fly main memory indexing scenario, each
>>>>>>>>> * individual string is immediately matched as soon as it becomes
>>>>>>>>> * available without any persistence involved. This usage scenario and
>>>>>>>>> * corresponding performance profile is quite different in comparison to
>>>>>>>>> * fulltext search over persistent (read-mostly) indexes.
>>>>>>>>> *
>>>>>>>>> * Example XPath: count(/table/row[lucene:match(string(./firstname),
>>>>>>>>> * "James") > 0.0])
>>>>>>>>> */
>>>>>>>>> public static void main(String[] args) throws Exception {
>>>>>>>>> int k = -1;
>>>>>>>>> int runs = 5;
>>>>>>>>> if (args.length > ++k) runs = Integer.parseInt(args[k]);
>>>>>>>>>
>>>>>>>>> int nodes = 10000;
>>>>>>>>> if (args.length > ++k) nodes = Integer.parseInt(args[k]);
>>>>>>>>>
>>>>>>>>> String content = "James is out in the woods";
>>>>>>>>> if (args.length > ++k) content = args[k];
>>>>>>>>>
>>>>>>>>> String expression = "James";
>>>>>>>>> if (args.length > ++k) expression = args[k];
>>>>>>>>>
>>>>>>>>> LuceneMatcher matcher = new LuceneMatcher();
>>>>>>>>> Query query = matcher.parseQuery(expression); // to be reused N times
>>>>>>>>>
>>>>>>>>> for (int r = 0; r < runs; r++) {
>>>>>>>>> long start = System.currentTimeMillis();
>>>>>>>>> int matches = 0;
>>>>>>>>>
>>>>>>>>> for (int i = 0; i < nodes; i++) {
>>>>>>>>> // if (LuceneUtil.match(content + i, expression) > 0.0f) {
>>>>>>>>> if (matcher.match(matcher.createIndex(content + i), query) > 0.0f) {
>>>>>>>>> matches++;
>>>>>>>>> }
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> long end = System.currentTimeMillis();
>>>>>>>>> System.out.println("matches=" + matches);
>>>>>>>>> System.out.println("secs=" + ((end-start) / 1000.0f));
>>>>>>>>> System.out.println("queries/sec=" + (nodes / ((end-start) / 1000.0f)));
>>>>>>>>> System.out.println();
>>>>>>>>> }
>>>>>>>>> }
>>>>>>>>> }
>>>>>
>>>>> public class StringIndexReader extends IndexReader {
>>>>> private List terms;
>>>>> public StringIndexReader(String strings[]) {
>>>>> super(null);
>>>>> terms = Arrays.asList(strings);
>>>>> Collections.sort(terms);
>>>>> }
>>>>>
>>>>> public TermFreqVector[] getTermFreqVectors(int docNumber) throws IOException {
>>>>> System.out.println("StringIndexReader.getTermFreqVectors");
>>>>> return new TermFreqVector[0];
>>>>> }
>>>>>
>>>>> public TermFreqVector getTermFreqVector(int docNumber, String field) throws IOException {
>>>>> System.out.println("StringIndexReader.getTermFreqVector");
>>>>> return null;
>>>>> }
>>>>>
>>>>> public int numDocs() {
>>>>> System.out.println("StringIndexReader.numDocs");
>>>>> return 1;
>>>>> }
>>>>>
>>>>> public int maxDoc() {
>>>>> System.out.println("StringIndexReader.maxDoc");
>>>>> return 1;
>>>>> }
>>>>>
>>>>> public Document document(int n) throws IOException {
>>>>> System.out.println("StringIndexReader.document");
>>>>> return null;
>>>>> }
>>>>>
>>>>> public boolean isDeleted(int n) {
>>>>> System.out.println("StringIndexReader.isDeleted");
>>>>> return false;
>>>>> }
>>>>>
>>>>> public boolean hasDeletions() {
>>>>> System.out.println("StringIndexReader.hasDeletions");
>>>>> return false;
>>>>> }
>>>>>
>>>>> public byte[] norms(String field) throws IOException {
>>>>> // TODO: what value to use for this?
>>>>> System.out.println("StringIndexReader.norms: " + field);
>>>>> return new byte[] { 1 };
>>>>> }
>>>>>
>>>>> public void norms(String field, byte[] bytes, int offset) throws IOException {
>>>>> System.out.println("StringIndexReader.norms: " + field + "*");
>>>>> }
>>>>>
>>>>> protected void doSetNorm(int doc, String field, byte value) throws IOException {
>>>>> System.out.println("StringIndexReader.doSetNorm");
>>>>>
>>>>> }
>>>>>
>>>>> public TermEnum terms() throws IOException {
>>>>> System.out.println("StringIndexReader.terms");
>>>>> return terms(null);
>>>>> }
>>>>>
>>>>> public TermEnum terms(final Term term) throws IOException {
>>>>> System.out.println("StringIndexReader.terms: " + term);
>>>>>
>>>>> TermEnum termEnum = new TermEnum() {
>>>>> private String currentTerm;
>>>>> private Iterator iter;
>>>>>
>>>>> public boolean next() {
>>>>> System.out.println("TermEnum.next");
>>>>> if (!iter.hasNext()) return false;
>>>>> currentTerm = (String) iter.next(); // advance first, then report success
>>>>> return true;
>>>>> }
>>>>>
>>>>> public Term term() {
>>>>> if (iter == null) {
>>>>> iter = terms.iterator();
>>>>> while (next()) { // guard: terms() seeds this enum with a null term
>>>>> if (term == null || currentTerm.startsWith(term.text()))
>>>>> break;
>>>>> }
>>>>> }
>>>>> System.out.println("TermEnum.term: " + currentTerm);
>>>>> return new Term(term == null ? "" : term.field(), currentTerm);
>>>>> }
>>>>>
>>>>> public int docFreq() {
>>>>> System.out.println("TermEnum.docFreq");
>>>>> return 1;
>>>>> }
>>>>>
>>>>> public void close() {
>>>>> System.out.println("TermEnum.close");
>>>>> }
>>>>> };
>>>>> return termEnum;
>>>>> }
>>>>>
>>>>> public int docFreq(Term term) throws IOException {
>>>>> System.out.println("StringIndexReader.docFreq: " + term);
>>>>> return terms.contains(term.text()) ? 1 : 0;
>>>>> }
>>>>>
>>>>> public TermDocs termDocs() throws IOException {
>>>>> System.out.println("StringIndexReader.termDocs");
>>>>>
>>>>> TermDocs td = new TermDocs() {
>>>>> private boolean done = false;
>>>>> String currentTerm;
>>>>>
>>>>> public void seek(Term term) {
>>>>> System.out.println(".seek: " + term);
>>>>> currentTerm = term.text();
>>>>> done = false;
>>>>> }
>>>>>
>>>>> public void seek(TermEnum termEnum) {
>>>>> seek(termEnum.term());
>>>>> }
>>>>>
>>>>> public int doc() {
>>>>> System.out.println(".doc");
>>>>> return 0;
>>>>> }
>>>>>
>>>>> public int freq() {
>>>>> System.out.println(".freq");
>>>>> return 1;
>>>>> }
>>>>>
>>>>> public boolean next() {
>>>>> System.out.println(".next");
>>>>> return false;
>>>>> }
>>>>>
>>>>> public int read(int[] docs, int[] freqs) {
>>>>> System.out.println(".read: " + docs.length);
>>>>>
>>>>> if (done) return 0;
>>>>>
>>>>> done = true;
>>>>> docs[0] = 0;
>>>>> freqs[0] = freq();
>>>>> return 1;
>>>>> }
>>>>>
>>>>> public boolean skipTo(int target) {
>>>>> System.out.println(".skipTo");
>>>>> return false;
>>>>> }
>>>>>
>>>>> public void close() {
>>>>> System.out.println(".close");
>>>>>
>>>>> }
>>>>> };
>>>>> return td;
>>>>> }
>>>>>
>>>>> public TermPositions termPositions() throws IOException {
>>>>> System.out.println("StringIndexReader.termPositions");
>>>>> return null;
>>>>> }
>>>>>
>>>>> protected void doDelete(int docNum) throws IOException {
>>>>> System.out.println("StringIndexReader.doDelete");
>>>>>
>>>>> }
>>>>>
>>>>> protected void doUndeleteAll() throws IOException {
>>>>> System.out.println("StringIndexReader.doUndeleteAll");
>>>>>
>>>>> }
>>>>>
>>>>> protected void doCommit() throws IOException {
>>>>> System.out.println("StringIndexReader.doCommit");
>>>>>
>>>>> }
>>>>>
>>>>> protected void doClose() throws IOException {
>>>>> System.out.println("StringIndexReader.doClose");
>>>>>
>>>>> }
>>>>>
>>>>> public Collection getFieldNames() throws IOException {
>>>>> System.out.println("StringIndexReader.getFieldNames");
>>>>> return null;
>>>>> }
>>>>>
>>>>> public Collection getFieldNames(boolean indexed) throws IOException {
>>>>> System.out.println("StringIndexReader.getFieldNames");
>>>>> return null;
>>>>> }
>>>>>
>>>>> public Collection getIndexedFieldNames(Field.TermVector tvSpec) {
>>>>> System.out.println("StringIndexReader.getIndexedFieldNames");
>>>>> return null;
>>>>> }
>>>>>
>>>>> public Collection getFieldNames(FieldOption fldOption) {
>>>>> System.out.println("StringIndexReader.getFieldNames");
>>>>> return null;
>>>>> }
>>>>>
>>>>> public static void main(String[] args) {
>>>>> IndexReader reader = new StringIndexReader(new String[] {"foo", "bar", "baz"});
>>>>> IndexSearcher searcher = new IndexSearcher(reader);
>>>>>
>>>>> Hits hits = null;
>>>>> try {
>>>>> hits = searcher.search(new WildcardQuery(new Term("field","ba*")));
>>>>> } catch (IOException e) {
>>>>> e.printStackTrace();
>>>>> }
>>>>> System.out.println("found " + hits.length());
>>>>> }
>>>>> }
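The reason the class above gets away with only terms() and termDocs() for WildcardQuery is that the query expands its pattern by scanning the sorted term list. A pure-Java sketch of that prefix scan follows; `PrefixScan` is a hypothetical helper name, not Lucene code, and a plain prefix stands in for the full wildcard pattern:

```java
import java.util.ArrayList;
import java.util.List;

final class PrefixScan {
    // Returns all terms starting with the given prefix, scanning a sorted
    // term list the way WildcardQuery walks a TermEnum seeked to the prefix.
    static List<String> expand(List<String> sortedTerms, String prefix) {
        List<String> out = new ArrayList<String>();
        for (String t : sortedTerms) {
            if (t.compareTo(prefix) < 0) continue; // still before the seek point
            if (!t.startsWith(prefix)) break;      // sorted: past the prefix range
            out.add(t);
        }
        return out;
    }

    public static void main(String[] args) {
        // Mirrors the main() above: "ba*" against {"foo", "bar", "baz"}.
        System.out.println(expand(java.util.Arrays.asList("bar", "baz", "foo"), "ba"));
    }
}
```

The early break is what makes a seeked TermEnum cheap: sorted order guarantees no later term can match the prefix.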
>>>>>
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>>>>> For additional commands, e-mail: java-dev-help@lucene.apache.org
Re: [Performance] Streaming main memory indexing of single strings
Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Apr 15, 2005, at 8:18 PM, Wolfgang Hoschek wrote:
> The main issue is to enable handling arbitrary queries (anything
> derived from o.a.l.search.Query). Yes, there'd be an additional
> Analyzer method parameter (to support any analyzer). The use case does not
> require field names. One could internally use "content" or anything
> else for the default field name, which is the one to implicitly be
> queried...
To be general-purpose, though, passing in the default field name would
be needed. All of Lucene's built-in analyzers ignore it, but that is
not the case everywhere. For example, NutchAnalyzer does different
things for different fields.
So, all the text analyzed is in a given field... that means that
anything in the Query not associated with that field has no bearing on
whether the text matches or not, correct?
There is still some missing trickery in my StringIndexReader - it does
not currently handle phrase queries, as an implementation of
termPositions() is needed.
Wolfgang - will you take what I've done the extra mile and implement
what's left (frequency and term position)? I might not revisit this
very soon.
Erik
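For anyone picking this up: the position bookkeeping a termPositions() implementation would need can be sketched in plain Java. The `PhraseCheck` class and the term-to-positions map below are hypothetical illustrations, not Lucene API; the rule is simply that a phrase matches when each successive term occurs exactly one position after the previous one:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class PhraseCheck {
    // 'index' maps each term to the positions at which it occurs in the text.
    // A phrase matches if, for some start position of the first term, every
    // following term phrase[i] occurs at position start + i.
    static boolean phraseMatches(Map<String, List<Integer>> index, String[] phrase) {
        if (phrase.length == 0) return false;
        List<Integer> starts = index.get(phrase[0]);
        if (starts == null) return false;
        for (int start : starts) {
            boolean ok = true;
            for (int i = 1; i < phrase.length; i++) {
                List<Integer> positions = index.get(phrase[i]);
                if (positions == null || !positions.contains(start + i)) {
                    ok = false;
                    break;
                }
            }
            if (ok) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        // Positions as they would occur in "James is out in the woods".
        Map<String, List<Integer>> idx = new HashMap<String, List<Integer>>();
        idx.put("out", Arrays.asList(2));
        idx.put("in", Arrays.asList(3));
        idx.put("the", Arrays.asList(4));
        System.out.println(phraseMatches(idx, new String[] {"out", "in", "the"}));
    }
}
```

PhraseQuery does essentially this walk through TermPositions, plus slop handling, which is omitted here.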
>
> Wolfgang.
>
> On Apr 15, 2005, at 5:08 PM, Erik Hatcher wrote:
>
>> On Apr 15, 2005, at 6:15 PM, Wolfgang Hoschek wrote:
>>> Cool! For my use case it would need to be able to handle arbitrary
>>> queries (previously parsed from a general lucene query string).
>>> Something like:
>>>
>>> float match(String text, Query query)
>>>
>>> it's fine with me if it also works for
>>>
>>> float[] match(String[] texts, Query query) or
>>> float(Document doc, Query query)
>>>
>>> but that isn't required by the use case.
>>
>> My implementation is nearly that. The score is available as
>> hits.score(0). You would also need an analyzer, I presume, passed to
>> your proposed match() method if you want the text broken into terms.
>> My current implementation is passed a String[] where each item is
>> considered a term for the document. match() would also need a field
>> name to be fully accurate - since the analyzer needs a field name and
>> terms used for searching need a field name. The Query may contain
>> terms for any number of fields - how should that be handled? Should
>> only a single field name be passed in and any terms requested for other
>> fields be ignored? Or should this utility morph to assume any words
>> in the text is in any field being asked of it?
>>
>> As for Doug's devil's advocate questions - I really don't know what I'd
>> use it for personally (other than the "match this single string
>> against a bunch of queries"), I just thought it was clever that it
>> could be done. Clever regex's could come close, but it'd be a lot
>> more effort than reusing good ol' QueryParser and this simplistic
>> IndexReader, along with an Analyzer.
>>
>> Erik
>>
>>>
>>> Wolfgang.
>>>
>>>> I am intrigued by this and decided to mock up a quick and dirty
>>>> example of such an IndexReader. After a little trial-and-error I
>>>> got it working at least for TermQuery and WildcardQuery. I've
>>>> pasted my code below as an example, but there is much room for
>>>> improvement, especially in terms of performance and also in keeping
>>>> track of term frequency, and also it would be nicer if it handled
>>>> the analysis internally.
>>>>
>>>> I think something like this would make a handy addition to our
>>>> contrib area at least. I'd be happy to receive improvements to
>>>> this and then add it to a contrib subproject.
>>>>
>>>> Perhaps this would be a handy way to handle situations where users
>>>> have queries saved in a system and need to be alerted whenever a
>>>> new document arrives matching the saved queries?
>>>>
>>>> Erik
>>>>
>>>>
>>>>
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: Wolfgang Hoschek [mailto:whoschek@lbl.gov]
>>>>> Sent: Thursday, April 14, 2005 4:04 PM
>>>>> To: java-dev@lucene.apache.org
>>>>> Subject: Re: [Performance] Streaming main memory indexing of single
>>>>> strings
>>>>>
>>>>>
>>>>> This seems to be a promising avenue worth exploring. My gut feeling
>>>>> is that this could easily be 10-100 times faster.
>>>>>
>>>>> The drawback is that it requires a fair amount of understanding of
>>>>> intricate Lucene internals, pulling those pieces together and adapting
>>>>> them as required for the seemingly simple "float match(String text,
>>>>> Query query)".
>>>>>
>>>>> I might give it a shot but I'm not sure I'll be able to pull this off!
>>>>> Is there any similar code I could look at as a starting point?
>>>>>
>>>>> Wolfgang.
>>>>>
>>>>> On Apr 14, 2005, at 1:13 PM, Robert Engels wrote:
>>>>>
>>>>>> I think you are not approaching this the correct way.
>>>>>>
>>>>>> Pseudo code:
>>>>>>
>>>>>> Subclass IndexReader.
>>>>>>
>>>>>> Get tokens from String 'document' using Lucene analyzers.
>>>>>>
>>>>>> Build simple hash-map based data structures using tokens for terms,
>>>>>> and term positions.
>>>>>>
>>>>>> reimplement termDocs() and termPositions() to use the structures from
>>>>>> above.
>>>>>>
>>>>>> run searches.
>>>>>>
>>>>>> start again with next document.
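The hash-map structures from the steps above can be sketched in plain Java. The crude `tokenize` below is a stand-in for running an actual Lucene analyzer chain, and the class name is invented for illustration:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class TermPositionMap {
    // Crude lowercase tokenizer standing in for a Lucene analyzer chain.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        for (String t : text.toLowerCase().split("\\W+")) {
            if (t.length() > 0) tokens.add(t);
        }
        return tokens;
    }

    // Builds the term -> positions map that reimplemented termDocs() /
    // termPositions() methods would read from.
    static Map<String, List<Integer>> index(String text) {
        Map<String, List<Integer>> map = new HashMap<String, List<Integer>>();
        int pos = 0;
        for (String term : tokenize(text)) {
            List<Integer> positions = map.get(term);
            if (positions == null) {
                positions = new ArrayList<Integer>();
                map.put(term, positions);
            }
            positions.add(pos++);
        }
        return map;
    }
}
```

Rebuilding this small map per string is cheap compared to writing a RAMDirectory segment, which is the point of the approach.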
>>>>>>
>>>>>>
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Wolfgang Hoschek [mailto:whoschek@lbl.gov]
>>>>>> Sent: Thursday, April 14, 2005 2:56 PM
>>>>>> To: java-dev@lucene.apache.org
>>>>>> Subject: Re: [Performance] Streaming main memory indexing of
>>>>>> single
>>>>>> strings
>>>>>>
>>>>>>
>>>>>> Otis, this might be a misunderstanding.
>>>>>>
>>>>>> - I'm not calling optimize(). That piece is commented out you if
>>>>>> look
>>>>>> again at the code.
>>>>>> - The *streaming* use case requires that for each query I add one
>>>>>> (and only one) document (aka string) to an empty index:
>>>>>>
>>>>>> repeat N times (where N is millions or billions):
>>>>>> add a single string (aka document) to an empty index
>>>>>> query the index
>>>>>> drop index (or delete its document)
>>>>>>
>>>>>> with the following API being called N times:
>>>>>> float match(String text, Query query)
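To make the desired call pattern concrete, here is a minimal pure-Java stand-in for that signature. It only checks that every query term occurs in the text and returns a degenerate 1.0/0.0 score; real Lucene scoring (tf, idf, norms) is deliberately omitted, and the class name is invented:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class NaiveMatcher {
    // Stand-in for "float match(String text, Query query)", with the query
    // reduced to a whitespace-separated list of required terms.
    static float match(String text, String query) {
        Set<String> terms = new HashSet<String>(
                Arrays.asList(text.toLowerCase().split("\\W+")));
        for (String q : query.toLowerCase().split("\\W+")) {
            if (q.length() > 0 && !terms.contains(q)) return 0.0f;
        }
        return 1.0f;
    }
}
```

This shape (no Directory, no writer, no reader lifecycle) is what the streaming loop above would call N times.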
>>>>>>
>>>>>> So there's no possibility of adding many documents and thereafter
>>>>>> running the query. This in turn seems to mean that the IndexWriter
>>>>>> can't be kept open - unless I manually delete each document after each
>>>>>> query to repeatedly reuse the RAMDirectory, which I've also tried
>>>>>> before without any significant performance gain - deletion seems to
>>>>>> have substantial overhead in itself. Perhaps it would be better if
>>>>>> there were a Directory.deleteAllDocuments() or similar. Did you have
>>>>>> some other approach in mind?
>>>>>>
>>>>>> As I said, Lucene's design doesn't seem to fit this streaming use
>>>>>> case
>>>>>> pattern well. In *this* scenario one could easily do without any
>>>>>> locking, and without byte-level organization in RAMDirectory and
>>>>>> RAMFile, etc., because a single small string isn't a large persistent
>>>>>> multi-document index.
>>>>>>
>>>>>> For some background, here's a small example for the kind of XQuery
>>>>>> functionality Nux/Lucene integration enables:
>>>>>>
>>>>>> (: An XQuery that finds all books authored by James that have
>>>>>> something to do with "fish", sorted by relevance :)
>>>>>> declare namespace lucene = "java:nux.xom.xquery.XQueryUtil";
>>>>>> declare variable $query := "fish*~";
>>>>>>
>>>>>> for $book in /books/book[author="James" and
>>>>>> lucene:match(string(.), $query) > 0.0]
>>>>>> let $score := lucene:match(string($book), $query)
>>>>>> order by $score descending
>>>>>> return (<score>{$score}</score>, $book)
>>>>>>
>>>>>> More interestingly, one can use this for classifying and routing XML
>>>>>> messages based on rules (i.e. queries) inspecting their content...
>>>>>>
>>>>>> Any other clues about potential improvements would be greatly
>>>>>> appreciated.
>>>>>>
>>>>>> Wolfgang.
>>>>>>
>>>>>> On Apr 13, 2005, at 10:09 PM, Otis Gospodnetic wrote:
>>>>>>
>>>>>>> It looks like you are calling that IndexWriter code in some loops,
>>>>>>> opening it and closing it in every iteration of the loop and also
>>>>>>> calling optimize. All of those things could be improved.
>>>>>>> Keep your IndexWriter open, don't close it, and optimize the index
>>>>>>> only once you are done adding documents to it.
>>>>>>>
>>>>>>> See the highlights and the snippets in the first hit:
>>>>>>> http://www.lucenebook.com/search?query=when+to+optimize
>>>>>>>
>>>>>>> Otis
>>>>>>>
>>>>>>>
>>>>>>> --- Wolfgang Hoschek <wh...@lbl.gov> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I'm wondering if anyone could let me know how to improve Lucene
>>>>>>>> performance for "streaming main memory indexing of single strings".
>>>>>>>> This would help to effectively integrate Lucene with the Nux XQuery
>>>>>>>> engine.
>>>>>>>>
>>>>>>>> Below is a small microbenchmark simulating STREAMING XQuery fulltext
>>>>>>>> search as typical for XML network routers, message queuing systems,
>>>>>>>> P2P networks, etc. In this on-the-fly main memory indexing scenario,
>>>>>>>> each individual string is immediately matched as soon as it becomes
>>>>>>>> available without any persistence involved. This usage scenario and
>>>>>>>> corresponding performance profile is quite different in comparison to
>>>>>>>> fulltext search over persistent (read-mostly) indexes.
>>>>>>>>
>>>>>>>> The benchmark runs at some 3000 lucene queries/sec (lucene-1.4.3),
>>>>>>>> which is unfortunate news considering the XQuery engine can easily
>>>>>>>> walk hundreds of thousands of XML nodes per second. Ideally I'd like
>>>>>>>> to run at some 100000 queries/sec. Running this through the JDK 1.5
>>>>>>>> profiler, it seems that most time is spent in and below the following
>>>>>>>> calls:
>>>>>>>>
>>>>>>>> writer = new IndexWriter(dir, analyzer, true);
>>>>>>>> writer.addDocument(...);
>>>>>>>> writer.close();
>>>>>>>>
>>>>>>>> I tried quite a few variants of the benchmark with various options,
>>>>>>>> unfortunately with little or no effect.
>>>>>>>> Lucene just does not seem to be designed to do this sort of "transient
>>>>>>>> single string index" thing. All code paths related to opening, closing,
>>>>>>>> reading, writing, querying and object creation seem to be designed for
>>>>>>>> large persistent indexes.
>>>>>>>>
>>>>>>>> Any advice on what I'm missing or what could be done about it would
>>>>>>>> be greatly appreciated.
>>>>>>>>
>>>>>>>> Wolfgang.
>>>>>>>>
>>>>>>>> P.S. the benchmark code is attached as a file below:
>>>>>>>>
>>>> return null;
>>>> }
>>>>
>>>> public Collection getFieldNames(FieldOption fldOption) {
>>>> System.out.println("StringIndexReader.getFieldNames");
>>>> return null;
>>>> }
>>>>
>>>> public static void main(String[] args) {
>>>> IndexReader reader = new StringIndexReader(new String[] {"foo",
>>>> "bar", "baz"});
>>>> IndexSearcher searcher = new IndexSearcher(reader);
>>>>
>>>> Hits hits = null;
>>>> try {
>>>> hits = searcher.search(new WildcardQuery(new
>>>> Term("field","ba*")));
>>>> } catch (IOException e) {
>>>> e.printStackTrace();
>>>> }
>>>> System.out.println("found " + hits.length());
>>>> }
>>>> }
>>>>
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>>>> For additional commands, e-mail: java-dev-help@lucene.apache.org
Re: [Performance] Streaming main memory indexing of single strings
Posted by Wolfgang Hoschek <wh...@lbl.gov>.
The main issue is to enable handling arbitrary queries (anything
derived from o.a.l.search.Query). Yes, there'd be an additional
Analyzer method parameter (to support any analyzer). The use case does
not require field names. One could internally use "content" or
anything else as the default field name, which is the field implicitly
queried.
Wolfgang.
On Apr 15, 2005, at 5:08 PM, Erik Hatcher wrote:
> On Apr 15, 2005, at 6:15 PM, Wolfgang Hoschek wrote:
>> Cool! For my use case it would need to be able to handle arbitrary
>> queries (previously parsed from a general lucene query string).
>> Something like:
>>
>> float match(String text, Query query)
>>
>> it's fine with me if it also works for
>>
>> float[] match(String[] texts, Query query) or
>> float(Document doc, Query query)
>>
>> but that isn't required by the use case.
>
> My implementation is nearly that. The score is available as
> hits.score(0). You would also need an analyzer, I presume, passed to
> your proposed match() method if you want the text broken into terms.
> My current implementation is passed a String[] where each item is
> considered a term for the document. match() would also need a field
> name to be fully accurate - since the analyzer needs a field name and
> terms used for searching need a field name. The Query may contain
> terms for any number of fields - how should that be handled? Should
> only a single field name be passed in and any terms requested for
> other fields be ignored? Or should this utility morph to assume any
> words in the text are in any field being asked of it?
>
> As for Doug's devil advocate questions - I really don't know what I'd
> use it for personally (other than the "match this single string
> against a bunch of queries"), I just thought it was clever that it
> could be done. Clever regexes could come close, but it'd be a lot
> more effort than reusing good ol' QueryParser and this simplistic
> IndexReader, along with an Analyzer.
>
> Erik
>
>>
>> Wolfgang.
>>
>>> I am intrigued by this and decided to mock a quick and dirty example
>>> of such an IndexReader. After a little trial-and-error I got it
>>> working at least for TermQuery and WildcardQuery. I've pasted my
>>> code below as an example, but there is much room for improvement,
>>> especially in terms of performance and in keeping track of term
>>> frequency; it would also be nicer if it handled the analysis
>>> internally.
>>>
>>> I think something like this would make a handy addition to our
>>> contrib area at least. I'd be happy to receive improvements to this
>>> and then add it to a contrib subproject.
>>>
>>> Perhaps this would be a handy way to handle situations where users
>>> have queries saved in a system and need to be alerted whenever a new
>>> document arrives matching the saved queries?
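A toy, Lucene-free sketch of that "saved queries" alerting idea (single-term saved queries only; the class and method names are invented for illustration):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: queries are saved up front, then every arriving
// document string is matched against all of them on the fly.
public class SavedQueryAlerter {
    private final Map<String, String> savedQueries = new LinkedHashMap<String, String>();

    public void save(String name, String term) {
        savedQueries.put(name, term.toLowerCase());
    }

    // Returns the names of all saved queries whose term occurs in the text.
    public List<String> onArrival(String text) {
        Set<String> tokens = new HashSet<String>(
            Arrays.asList(text.toLowerCase().split("\\W+")));
        List<String> alerts = new ArrayList<String>();
        for (Map.Entry<String, String> e : savedQueries.entrySet()) {
            if (tokens.contains(e.getValue())) alerts.add(e.getKey());
        }
        return alerts;
    }

    public static void main(String[] args) {
        SavedQueryAlerter alerter = new SavedQueryAlerter();
        alerter.save("james-watch", "James");
        alerter.save("fish-watch", "fish");
        System.out.println(alerter.onArrival("James is out in the woods")); // [james-watch]
    }
}
```

A real implementation would of course parse full Lucene Query objects rather than single terms, as discussed in this thread.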
>>>
>>> Erik
>>>
>>>
>>>
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Wolfgang Hoschek [mailto:whoschek@lbl.gov]
>>>> Sent: Thursday, April 14, 2005 4:04 PM
>>>> To: java-dev@lucene.apache.org
>>>> Subject: Re: [Performance] Streaming main memory indexing of single
>>>> strings
>>>>
>>>>
>>>> This seems to be a promising avenue worth exploring. My gut feeling
>>>> is that this could easily be 10-100 times faster.
>>>>
>>>> The drawback is that it requires a fair amount of understanding of
>>>> intricate Lucene internals, pulling those pieces together and
>>>> adapting
>>>> them as required for the seemingly simple "float match(String text,
>>>> Query query)".
>>>>
>>>> I might give it a shot but I'm not sure I'll be able to pull this
>>>> off!
>>>> Is there any similar code I could look at as a starting point?
>>>>
>>>> Wolfgang.
>>>>
>>>> On Apr 14, 2005, at 1:13 PM, Robert Engels wrote:
>>>>
>>>>> I think you are not approaching this the correct way.
>>>>>
>>>>> Pseudo code:
>>>>>
>>>>> Subclass IndexReader.
>>>>>
>>>>> Get tokens from String 'document' using Lucene analyzers.
>>>>>
>>>>> Build simple hash-map based data structures using tokens for terms,
>>>>> and term
>>>>> positions.
>>>>>
>>>>> reimplement termDocs() and termPositions() to use the structures
>>>>> from
>>>>> above.
>>>>>
>>>>> run searches.
>>>>>
>>>>> start again with next document.
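A self-contained plain-Java sketch of the hash-map structures this pseudo code describes (the `split`-based tokenizer is a stand-in for a real Lucene Analyzer, and the class name is invented):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Tokenize one "document" string, record term -> positions, and answer
// term lookups; this is what reimplemented termDocs()/termPositions()
// would consult.
public class SingleDocTermIndex {
    private final Map<String, List<Integer>> positions = new HashMap<String, List<Integer>>();

    public SingleDocTermIndex(String text) {
        String[] tokens = text.toLowerCase().split("\\W+"); // stand-in for an Analyzer
        for (int pos = 0; pos < tokens.length; pos++) {
            if (tokens[pos].length() == 0) continue;
            List<Integer> list = positions.get(tokens[pos]);
            if (list == null) positions.put(tokens[pos], list = new ArrayList<Integer>());
            list.add(pos);
        }
    }

    public int freq(String term) { // what termDocs() would report for the one doc
        List<Integer> list = positions.get(term);
        return list == null ? 0 : list.size();
    }

    public List<Integer> positions(String term) { // backs termPositions()
        return positions.get(term);
    }

    public static void main(String[] args) {
        SingleDocTermIndex idx = new SingleDocTermIndex("James is out in the woods with James");
        System.out.println(idx.freq("james"));      // 2
        System.out.println(idx.positions("woods")); // [5]
    }
}
```

Building these maps once per string avoids all of the Directory/segment machinery that the profiler showed dominating the benchmark.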
>>>>>
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: Wolfgang Hoschek [mailto:whoschek@lbl.gov]
>>>>> Sent: Thursday, April 14, 2005 2:56 PM
>>>>> To: java-dev@lucene.apache.org
>>>>> Subject: Re: [Performance] Streaming main memory indexing of single
>>>>> strings
>>>>>
>>>>>
>>>>> Otis, this might be a misunderstanding.
>>>>>
>>>>> - I'm not calling optimize(). That piece is commented out if you
>>>>> look again at the code.
>>>>> - The *streaming* use case requires that for each query I add one
>>>>> (and
>>>>> only one) document (aka string) to an empty index:
>>>>>
>>>>> repeat N times (where N is millions or billions):
>>>>> add a single string (aka document) to an empty index
>>>>> query the index
>>>>> drop index (or delete its document)
>>>>>
>>>>> with the following API being called N times: float match(String
>>>>> text,
>>>>> Query query)
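The repeat-N-times pattern above can be sketched as a self-contained driver; the static match() here is a trivial stand-in for the proposed float match(String text, Query query), since the real implementation is exactly what is in question:

```java
// Hypothetical driver for the streaming pattern: N iterations, each one
// "indexing" a single string and matching it, with no state carried over.
public class StreamingMatchLoop {
    // Trivial stand-in for the proposed match(String text, Query query).
    static float match(String text, String queryTerm) {
        for (String token : text.toLowerCase().split("\\W+")) {
            if (token.equals(queryTerm.toLowerCase())) return 1.0f;
        }
        return 0.0f;
    }

    public static void main(String[] args) {
        int matches = 0;
        for (int i = 0; i < 10000; i++) { // N strings, one match() call each
            if (match("James is out in the woods " + i, "James") > 0.0f) matches++;
        }
        System.out.println("matches=" + matches); // matches=10000
    }
}
```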
>>>>>
>>>>> So there's no possibility of adding many documents and thereafter
>>>>> running the query. This in turn seems to mean that the IndexWriter
>>>>> can't be kept open - unless I manually delete each document after
>>>>> each
>>>>> query to repeatedly reuse the RAMDirectory, which I've also tried
>>>>> before without any significant performance gain - deletion seems to
>>>>> have substantial overhead in itself. Perhaps it would be better if
>>>>> there were a Directory.deleteAllDocuments() or similar. Did you
>>>>> have
>>>>> some other approach in mind?
>>>>>
>>>>> As I said, Lucene's design doesn't seem to fit this streaming use
>>>>> case
>>>>> pattern well. In *this* scenario one could easily do without any
>>>>> locking, and without byte level organization in RAMDirectory and
>>>>> RAMFile, etc because a single small string isn't a large persistent
>>>>> multi-document index.
>>>>>
>>>>> For some background, here's a small example for the kind of XQuery
>>>>> functionality Nux/Lucene integration enables:
>>>>>
>>>>> (: An XQuery that finds all books authored by James that have
>>>>> something
>>>>> to do with "fish", sorted by relevance :)
>>>>> declare namespace lucene = "java:nux.xom.xquery.XQueryUtil";
>>>>> declare variable $query := "fish*~";
>>>>>
>>>>> for $book in /books/book[author="James" and lucene:match(string(.),
>>>>> $query) > 0.0]
>>>>> let $score := lucene:match(string($book), $query)
>>>>> order by $score descending
>>>>> return (<score>{$score}</score>, $book)
>>>>>
>>>>> More interestingly one can use this for classifying and routing XML
>>>>> messages based on rules (i.e. queries) inspecting their content...
>>>>>
>>>>> Any other clues about potential improvements would be greatly
>>>>> appreciated.
>>>>>
>>>>> Wolfgang.
>>>>>
>>>>> On Apr 13, 2005, at 10:09 PM, Otis Gospodnetic wrote:
>>>>>
>>>>>> It looks like you are calling that IndexWriter code in some loops,
>>>>>> opening it and closing it in every iteration of the loop and also
>>>>>> calling optimize. All of those things could be improved.
>>>>>> Keep your IndexWriter open, don't close it, and optimize the index
>>>>>> only
>>>>>> once you are done adding documents to it.
>>>>>>
>>>>>> See the highlights and the snippets in the first hit:
>>>>>> http://www.lucenebook.com/search?query=when+to+optimize
>>>>>>
>>>>>> Otis
>>>>>>
>>>>>>
>>>>>> --- Wolfgang Hoschek <wh...@lbl.gov> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I'm wondering if anyone could let me know how to improve Lucene
>>>>>>> performance for "streaming main memory indexing of single
>>>>>>> strings".
>>>>>>> This would help to effectively integrate Lucene with the Nux
>>>>>>> XQuery
>>>>>>> engine.
>>>>>>>
>>>>>>> Below is a small microbenchmark simulating STREAMING XQuery
>>>>>>> fulltext
>>>>>>> search as typical for XML network routers, message queuing
>>>>>>> systems,
>>>>>>> P2P
>>>>>>> networks, etc. In this on-the-fly main memory indexing scenario,
>>>>>>> each
>>>>>>>
>>>>>>> individual string is immediately matched as soon as it becomes
>>>>>>> available without any persistence involved. This usage scenario
>>>>>>> and
>>>>>>> corresponding performance profile is quite different in
>>>>>>> comparison to
>>>>>>>
>>>>>>> fulltext search over persistent (read-mostly) indexes.
>>>>>>>
>>>>>>> The benchmark runs at some 3000 lucene queries/sec (lucene-1.4.3)
>>>>>>> which
>>>>>>> is unfortunate news considering the XQuery engine can easily walk
>>>>>>> hundreds of thousands of XML nodes per second. Ideally I'd like
>>>>>>> to
>>>>>>> run
>>>>>>> at some 100000 queries/sec. Running this through the JDK 1.5
>>>>>>> profiler
>>>>>>> it seems that most time is spent in and below the following
>>>>>>> calls:
>>>>>>>
>>>>>>> writer = new IndexWriter(dir, analyzer, true);
>>>>>>> writer.addDocument(...);
>>>>>>> writer.close();
>>>>>>>
>>>>>>> I tried quite a few variants of the benchmark with various
>>>>>>> options,
>>>>>>> unfortunately with little or no effect.
>>>>>>> Lucene just does not seem to be designed to do this sort of
>>>>>>> "transient
>>>>>>> single string index" thing. All code paths related to opening,
>>>>>>> closing,
>>>>>>> reading, writing, querying and object creation seem to be
>>>>>>> designed
>>>>>>> for
>>>>>>> large persistent indexes.
>>>>>>>
>>>>>>> Any advice on what I'm missing or what could be done about it
>>>>>>> would
>>>>>>> be
>>>>>>> greatly appreciated.
>>>>>>>
>>>>>>> Wolfgang.
>>>>>>>
>>>>>>> P.S. the benchmark code is attached as a file below:
>>>>>>>
>>>>>>>> package nux.xom.pool;
>>>>>>>
>>>>>>> import java.io.IOException;
>>>>>>> //import java.io.Reader;
>>>>>>>
>>>>>>> import org.apache.lucene.analysis.Analyzer;
>>>>>>> //import org.apache.lucene.analysis.LowerCaseTokenizer;
>>>>>>> //import org.apache.lucene.analysis.PorterStemFilter;
>>>>>>> //import org.apache.lucene.analysis.SimpleAnalyzer;
>>>>>>> //import org.apache.lucene.analysis.TokenStream;
>>>>>>> import org.apache.lucene.analysis.standard.StandardAnalyzer;
>>>>>>> import org.apache.lucene.document.Document;
>>>>>>> import org.apache.lucene.document.Field;
>>>>>>> //import org.apache.lucene.index.IndexReader;
>>>>>>> import org.apache.lucene.index.IndexWriter;
>>>>>>> import org.apache.lucene.queryParser.ParseException;
>>>>>>> import org.apache.lucene.queryParser.QueryParser;
>>>>>>> import org.apache.lucene.search.Hits;
>>>>>>> import org.apache.lucene.search.IndexSearcher;
>>>>>>> import org.apache.lucene.search.Query;
>>>>>>> import org.apache.lucene.search.Searcher;
>>>>>>> import org.apache.lucene.store.Directory;
>>>>>>> import org.apache.lucene.store.RAMDirectory;
>>>>>>>
>>>>>>> public final class LuceneMatcher { // TODO: make non-public
>>>>>>>
>>>>>>> private final Analyzer analyzer;
>>>>>>> // private final Directory dir = new RAMDirectory();
>>>>>>>
>>>>>>> public LuceneMatcher() {
>>>>>>> this(new StandardAnalyzer());
>>>>>>> // this(new SimpleAnalyzer());
>>>>>>> // this(new StopAnalyzer());
>>>>>>> // this(new Analyzer() {
>>>>>>> // public final TokenStream tokenStream(String fieldName,
>>>>>>> Reader
>>>>>>> reader) {
>>>>>>> // return new PorterStemFilter(new
>>>>>>> LowerCaseTokenizer(reader));
>>>>>>> // }
>>>>>>> // });
>>>>>>> }
>>>>>>>
>>>>>>> public LuceneMatcher(Analyzer analyzer) {
>>>>>>> if (analyzer == null)
>>>>>>> throw new IllegalArgumentException("analyzer must not be
>>>>>>> null");
>>>>>>> this.analyzer = analyzer;
>>>>>>> }
>>>>>>>
>>>>>>> public Query parseQuery(String expression) throws
>>>>>>> ParseException {
>>>>>>> QueryParser parser = new QueryParser("content", analyzer);
>>>>>>> // parser.setPhraseSlop(0);
>>>>>>> return parser.parse(expression);
>>>>>>> }
>>>>>>>
>>>>>>> /**
>>>>>>> * Returns the relevance score by matching the given index
>>>>>>> against
>>>>>>> the given
>>>>>>> * Lucene query expression. The index must not contain more
>>>>>>> than one
>>>>>>> Lucene
>>>>>>> * "document" (aka string to be searched).
>>>>>>> */
>>>>>>> public float match(Directory index, Query query) {
>>>>>>> Searcher searcher = null;
>>>>>>> try {
>>>>>>> searcher = new IndexSearcher(index);
>>>>>>> Hits hits = searcher.search(query);
>>>>>>> float score = hits.length() > 0 ? hits.score(0) : 0.0f;
>>>>>>> return score;
>>>>>>> } catch (IOException e) { // should never happen (RAMDirectory)
>>>>>>> throw new RuntimeException(e);
>>>>>>> } finally {
>>>>>>> try {
>>>>>>> if (searcher != null) searcher.close();
>>>>>>> } catch (IOException e) { // should never happen
>>>>>>> (RAMDirectory)
>>>>>>> throw new RuntimeException(e);
>>>>>>> }
>>>>>>> }
>>>>>>> }
>>>>>>>
>>>>>>> // public float match(String text, Query query) {
>>>>>>> // return match(createIndex(text), query);
>>>>>>> // }
>>>>>>>
>>>>>>> public Directory createIndex(String text) {
>>>>>>> Directory dir = new RAMDirectory();
>>>>>>> IndexWriter writer = null;
>>>>>>> try {
>>>>>>> writer = new IndexWriter(dir, analyzer, true);
>>>>>>> // writer.setUseCompoundFile(false);
>>>>>>> // writer.mergeFactor = 2;
>>>>>>> // writer.minMergeDocs = 1;
>>>>>>> // writer.maxMergeDocs = 1;
>>>>>>>
>>>>>>> writer.addDocument(createDocument(text));
>>>>>>> // writer.optimize();
>>>>>>> return dir;
>>>>>>> } catch (IOException e) { // should never happen (RAMDirectory)
>>>>>>> throw new RuntimeException(e);
>>>>>>> } finally {
>>>>>>> try {
>>>>>>> if (writer != null) writer.close();
>>>>>>> } catch (IOException e) { // should never happen
>>>>>>> (RAMDirectory)
>>>>>>> throw new RuntimeException(e);
>>>>>>> }
>>>>>>> }
>>>>>>> }
>>>>>>>
>>>>>>> private Document createDocument(String content) {
>>>>>>> Document doc = new Document();
>>>>>>> doc.add(Field.UnStored("content", content));
>>>>>>> // doc.add(Field.Text("x", content));
>>>>>>> return doc;
>>>>>>> }
>>>>>>>
>>>>>>> /**
>>>>>>> * Lucene microbenchmark simulating STREAMING XQuery fulltext
>>>>>>> search
>>>>>>> as
>>>>>>> * typical for XML network routers, message queuing systems, P2P
>>>>>>> networks,
>>>>>>> * etc. In this on-the-fly main memory indexing scenario, each
>>>>>>> individual
>>>>>>> * string is immediately matched as soon as it becomes available
>>>>>>> without any
>>>>>>> * persistence involved. This usage scenario and corresponding
>>>>>>> performance
>>>>>>> * profile is quite different in comparison to fulltext search
>>>>>>> over
>>>>>>> * persistent (read-mostly) indexes.
>>>>>>> *
>>>>>>> * Example XPath:
>>>>>>> count(/table/row[lucene:match(string(./firstname),
>>>>>>> * "James") > 0.0])
>>>>>>> */
>>>>>>> public static void main(String[] args) throws Exception {
>>>>>>> int k = -1;
>>>>>>> int runs = 5;
>>>>>>> if (args.length > ++k) runs = Integer.parseInt(args[k]);
>>>>>>>
>>>>>>> int nodes = 10000;
>>>>>>> if (args.length > ++k) nodes = Integer.parseInt(args[k]);
>>>>>>>
>>>>>>> String content = "James is out in the woods";
>>>>>>> if (args.length > ++k) content = args[k];
>>>>>>>
>>>>>>> String expression = "James";
>>>>>>> if (args.length > ++k) expression = args[k];
>>>>>>>
>>>>>>> LuceneMatcher matcher = new LuceneMatcher();
>>>>>>> Query query = matcher.parseQuery(expression); // to be reused N
>>>>>>> times
>>>>>>>
>>>>>>> for (int r = 0; r < runs; r++) {
>>>>>>> long start = System.currentTimeMillis();
>>>>>>> int matches = 0;
>>>>>>>
>>>>>>> for (int i = 0; i < nodes; i++) {
>>>>>>> // if (LuceneUtil.match(content + i, expression) > 0.0f) {
>>>>>>> if (matcher.match(matcher.createIndex(content + i), query) >
>>>>>>> 0.0f) {
>>>>>>> matches++;
>>>>>>> }
>>>>>>> }
>>>>>>>
>>>>>>> long end = System.currentTimeMillis();
>>>>>>> System.out.println("matches=" + matches);
>>>>>>> System.out.println("secs=" + ((end-start) / 1000.0f));
>>>>>>> System.out.println("queries/sec=" + (nodes / ((end-start) /
>>>>>>> 1000.0f)));
>>>>>>> System.out.println();
>>>>>>> }
>>>>>>> }
>>>>>>> }
>>>
>>> public class StringIndexReader extends IndexReader {
>>> private List terms;
>>> public StringIndexReader(String strings[]) {
>>> super(null);
>>> terms = Arrays.asList(strings);
>>> Collections.sort(terms);
>>> }
>>>
>>> public TermFreqVector[] getTermFreqVectors(int docNumber) throws
>>> IOException {
>>> System.out.println("StringIndexReader.getTermFreqVectors");
>>> return new TermFreqVector[0];
>>> }
>>>
>>> public TermFreqVector getTermFreqVector(int docNumber, String
>>> field) throws IOException {
>>> System.out.println("StringIndexReader.getTermFreqVector");
>>> return null;
>>> }
>>>
>>> public int numDocs() {
>>> System.out.println("StringIndexReader.numDocs");
>>> return 1;
>>> }
>>>
>>> public int maxDoc() {
>>> System.out.println("StringIndexReader.maxDoc");
>>> return 1;
>>> }
>>>
>>> public Document document(int n) throws IOException {
>>> System.out.println("StringIndexReader.document");
>>> return null;
>>> }
>>>
>>> public boolean isDeleted(int n) {
>>> System.out.println("StringIndexReader.isDeleted");
>>> return false;
>>> }
>>>
>>> public boolean hasDeletions() {
>>> System.out.println("StringIndexReader.hasDeletions");
>>> return false;
>>> }
>>>
>>> public byte[] norms(String field) throws IOException {
>>> // TODO: what value to use for this?
>>> System.out.println("StringIndexReader.norms: " + field);
>>> return new byte[] { 1 };
>>> }
>>>
>>> public void norms(String field, byte[] bytes, int offset) throws
>>> IOException {
>>> System.out.println("StringIndexReader.norms: " + field + "*");
>>> }
>>>
>>> protected void doSetNorm(int doc, String field, byte value) throws
>>> IOException {
>>> System.out.println("StringIndexReader.doSetNorm");
>>>
>>> }
>>>
>>> public TermEnum terms() throws IOException {
>>> System.out.println("StringIndexReader.terms");
>>> return terms(null);
>>> }
>>>
>>> public TermEnum terms(final Term term) throws IOException {
>>> System.out.println("StringIndexReader.terms: " + term);
>>>
>>> TermEnum termEnum = new TermEnum() {
>>> private String currentTerm;
>>> private Iterator iter;
>>>
>>> public boolean next() {
>>> System.out.println("TermEnum.next");
>>> if (!iter.hasNext()) return false;
>>> currentTerm = (String) iter.next();
>>> return true; // report the element just consumed, not the lookahead
>>> }
>>>
>>> public Term term() {
>>> if (iter == null) {
>>> iter = terms.iterator();
>>> while (next()) {
>>> if (currentTerm.startsWith(term.text()))
>>> break;
>>> }
>>> }
>>> System.out.println("TermEnum.term: " + currentTerm);
>>> return new Term(term.field(), currentTerm);
>>> }
>>>
>>> public int docFreq() {
>>> System.out.println("TermEnum.docFreq");
>>> return 1;
>>> }
>>>
>>> public void close() {
>>> System.out.println("TermEnum.close");
>>> }
>>> };
>>> return termEnum;
>>> }
>>>
>>> public int docFreq(Term term) throws IOException {
>>> System.out.println("StringIndexReader.docFreq: " + term);
>>> return terms.contains(term.text()) ? 1 : 0;
>>> }
>>>
>>> public TermDocs termDocs() throws IOException {
>>> System.out.println("StringIndexReader.termDocs");
>>>
>>> TermDocs td = new TermDocs() {
>>> private boolean done = false;
>>> String currentTerm;
>>>
>>> public void seek(Term term) {
>>> System.out.println(".seek: " + term);
>>> currentTerm = term.text();
>>> done = false;
>>> }
>>>
>>> public void seek(TermEnum termEnum) {
>>> seek(termEnum.term());
>>> }
>>>
>>> public int doc() {
>>> System.out.println(".doc");
>>> return 0;
>>> }
>>>
>>> public int freq() {
>>> System.out.println(".freq");
>>> return 1;
>>> }
>>>
>>> public boolean next() {
>>> System.out.println(".next");
>>> return false;
>>> }
>>>
>>> public int read(int[] docs, int[] freqs) {
>>> System.out.println(".read: " + docs.length);
>>>
>>> if (done) return 0;
>>>
>>> done = true;
>>> docs[0] = 0;
>>> freqs[0] = freq();
>>> return 1;
>>> }
>>>
>>> public boolean skipTo(int target) {
>>> System.out.println(".skipTo");
>>> return false;
>>> }
>>>
>>> public void close() {
>>> System.out.println(".close");
>>>
>>> }
>>> };
>>> return td;
>>> }
>>>
>>> public TermPositions termPositions() throws IOException {
>>> System.out.println("StringIndexReader.termPositions");
>>> return null;
>>> }
>>>
>>> protected void doDelete(int docNum) throws IOException {
>>> System.out.println("StringIndexReader.doDelete");
>>>
>>> }
>>>
>>> protected void doUndeleteAll() throws IOException {
>>> System.out.println("StringIndexReader.doUndeleteAll");
>>>
>>> }
>>>
>>> protected void doCommit() throws IOException {
>>> System.out.println("StringIndexReader.doCommit");
>>>
>>> }
>>>
>>> protected void doClose() throws IOException {
>>> System.out.println("StringIndexReader.doClose");
>>>
>>> }
>>>
>>> public Collection getFieldNames() throws IOException {
>>> System.out.println("StringIndexReader.getFieldNames");
>>> return null;
>>> }
>>>
>>> public Collection getFieldNames(boolean indexed) throws
>>> IOException {
>>> System.out.println("StringIndexReader.getFieldNames");
>>> return null;
>>> }
>>>
>>> public Collection getIndexedFieldNames(Field.TermVector tvSpec) {
>>> System.out.println("StringIndexReader.getIndexedFieldNames");
>>> return null;
>>> }
>>>
>>> public Collection getFieldNames(FieldOption fldOption) {
>>> System.out.println("StringIndexReader.getFieldNames");
>>> return null;
>>> }
>>>
>>> public static void main(String[] args) {
>>> IndexReader reader = new StringIndexReader(new String[] {"foo",
>>> "bar", "baz"});
>>> IndexSearcher searcher = new IndexSearcher(reader);
>>>
>>> Hits hits = null;
>>> try {
>>> hits = searcher.search(new WildcardQuery(new
>>> Term("field","ba*")));
>>> } catch (IOException e) {
>>> e.printStackTrace();
>>> }
>>> System.out.println("found " + hits.length());
>>> }
>>> }
Re: [Performance] Streaming main memory indexing of single strings
Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Apr 15, 2005, at 6:15 PM, Wolfgang Hoschek wrote:
> Cool! For my use case it would need to be able to handle arbitrary
> queries (previously parsed from a general lucene query string).
> Something like:
>
> float match(String text, Query query)
>
> it's fine with me if it also works for
>
> float[] match(String[] texts, Query query) or
> float(Document doc, Query query)
>
> but that isn't required by the use case.
My implementation is nearly that. The score is available as
hits.score(0). You would also need an analyzer, I presume, passed to
your proposed match() method if you want the text broken into terms.
My current implementation is passed a String[] where each item is
considered a term for the document. match() would also need a field
name to be fully accurate - since the analyzer needs a field name and
terms used for searching need a field name. The Query may contain
terms for any number of fields - how should that be handled? Should
only a single field name be passed in and any terms requested for other
fields be ignored? Or should this utility morph to assume any words in
the text are in any field being asked of it?
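One hypothetical reading of the second option (treat every word in the text as belonging to every field) is to store terms field-agnostically and simply ignore the field half of each lookup; a plain-Java sketch, with all names invented:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Terms are stored without any field association, so a lookup for
// field:term only inspects the term part; any field name matches.
public class FieldAgnosticMatcher {
    private final Set<String> terms;

    public FieldAgnosticMatcher(String text) {
        terms = new HashSet<String>(Arrays.asList(text.toLowerCase().split("\\W+")));
    }

    // "field" is accepted but deliberately ignored.
    public boolean matches(String field, String term) {
        return terms.contains(term.toLowerCase());
    }

    public static void main(String[] args) {
        FieldAgnosticMatcher m = new FieldAgnosticMatcher("James is out in the woods");
        System.out.println(m.matches("title", "james")); // true
        System.out.println(m.matches("body", "fish"));   // false
    }
}
```

In the IndexReader-based version this would correspond to having termDocs() and docFreq() consult a single term set regardless of Term.field().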
As for Doug's devil advocate questions - I really don't know what I'd
use it for personally (other than the "match this single string against
a bunch of queries"), I just thought it was clever that it could be
done. Clever regexes could come close, but it'd be a lot more effort
than reusing good ol' QueryParser and this simplistic IndexReader,
along with an Analyzer.
Erik
>
> Wolfgang.
>
>> I am intrigued by this and decided to mock a quick and dirty example
>> of such an IndexReader. After a little trial-and-error I got it
>> working at least for TermQuery and WildcardQuery. I've pasted my
>> code below as an example, but there is much room for improvement,
>> especially in terms of performance and in keeping track of term
>> frequency; it would also be nicer if it handled the analysis
>> internally.
>>
>> I think something like this would make a handy addition to our
>> contrib area at least. I'd be happy to receive improvements to this
>> and then add it to a contrib subproject.
>>
>> Perhaps this would be a handy way to handle situations where users
>> have queries saved in a system and need to be alerted whenever a new
>> document arrives matching the saved queries?
>>
>> Erik
>>
>>
>>
>>>
>>>
>>> -----Original Message-----
>>> From: Wolfgang Hoschek [mailto:whoschek@lbl.gov]
>>> Sent: Thursday, April 14, 2005 4:04 PM
>>> To: java-dev@lucene.apache.org
>>> Subject: Re: [Performance] Streaming main memory indexing of single
>>> strings
>>>
>>>
>>> This seems to be a promising avenue worth exploring. My gut feeling is
>>> that this could easily be 10-100 times faster.
>>>
>>> The drawback is that it requires a fair amount of understanding of
>>> intricate Lucene internals, pulling those pieces together and
>>> adapting
>>> them as required for the seemingly simple "float match(String text,
>>> Query query)".
>>>
>>> I might give it a shot but I'm not sure I'll be able to pull this
>>> off!
>>> Is there any similar code I could look at as a starting point?
>>>
>>> Wolfgang.
>>>
>>> On Apr 14, 2005, at 1:13 PM, Robert Engels wrote:
>>>
>>>> I think you are not approaching this the correct way.
>>>>
>>>> Pseudo code:
>>>>
>>>> Subclass IndexReader.
>>>>
>>>> Get tokens from String 'document' using Lucene analyzers.
>>>>
>>>> Build simple hash-map based data structures using tokens for terms,
>>>> and term
>>>> positions.
>>>>
>>>> reimplement termDocs() and termPositions() to use the structures
>>>> from
>>>> above.
>>>>
>>>> run searches.
>>>>
>>>> start again with next document.
>>>>
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Wolfgang Hoschek [mailto:whoschek@lbl.gov]
>>>> Sent: Thursday, April 14, 2005 2:56 PM
>>>> To: java-dev@lucene.apache.org
>>>> Subject: Re: [Performance] Streaming main memory indexing of single
>>>> strings
>>>>
>>>>
>>>> Otis, this might be a misunderstanding.
>>>>
>>>> - I'm not calling optimize(). That piece is commented out if you
>>>> look again at the code.
>>>> - The *streaming* use case requires that for each query I add one
>>>> (and
>>>> only one) document (aka string) to an empty index:
>>>>
>>>> repeat N times (where N is millions or billions):
>>>> add a single string (aka document) to an empty index
>>>> query the index
>>>> drop index (or delete its document)
>>>>
>>>> with the following API being called N times: float match(String
>>>> text,
>>>> Query query)
>>>>
>>>> So there's no possibility of adding many documents and thereafter
>>>> running the query. This in turn seems to mean that the IndexWriter
>>>> can't be kept open - unless I manually delete each document after
>>>> each
>>>> query to repeatedly reuse the RAMDirectory, which I've also tried
>>>> before without any significant performance gain - deletion seems to
>>>> have substantial overhead in itself. Perhaps it would be better if
>>>> there were a Directory.deleteAllDocuments() or similar. Did you have
>>>> some other approach in mind?
>>>>
>>>> As I said, Lucene's design doesn't seem to fit this streaming use
>>>> case
>>>> pattern well. In *this* scenario one could easily do without any
>>>> locking, and without byte level organization in RAMDirectory and
>>>> RAMFile, etc because a single small string isn't a large persistent
>>>> multi-document index.
>>>>
>>>> For some background, here's a small example for the kind of XQuery
>>>> functionality Nux/Lucene integration enables:
>>>>
>>>> (: An XQuery that finds all books authored by James that have
>>>> something
>>>> to do with "fish", sorted by relevance :)
>>>> declare namespace lucene = "java:nux.xom.xquery.XQueryUtil";
>>>> declare variable $query := "fish*~";
>>>>
>>>> for $book in /books/book[author="James" and lucene:match(string(.),
>>>> $query) > 0.0]
>>>> let $score := lucene:match(string($book), $query)
>>>> order by $score descending
>>>> return (<score>{$score}</score>, $book)
>>>>
>>>> More interestingly one can use this for classifying and routing XML
>>>> messages based on rules (i.e. queries) inspecting their content...
>>>>
>>>> Any other clues about potential improvements would be greatly
>>>> appreciated.
>>>>
>>>> Wolfgang.
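[Editor's note: the streaming workflow described in the message above can be sketched in plain Java. The tokenizer and scoring below are hypothetical stand-ins (lowercased whitespace terms, score 1.0 on any hit), not Lucene's; the sketch only illustrates the "one transient document per match() call" pattern that the thread is trying to make fast.]

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public final class StreamingMatchSketch {

    // Stand-in for "float match(String text, Query query)" with a
    // single-term query: build a transient term set for this one
    // string, probe it, then let it be garbage collected.
    public static float match(String text, String queryTerm) {
        Set<String> terms = new HashSet<>(
            Arrays.asList(text.toLowerCase().split("\\s+")));
        return terms.contains(queryTerm.toLowerCase()) ? 1.0f : 0.0f;
    }

    public static void main(String[] args) {
        String[] stream = { "James is out in the woods", "no match here" };
        int matches = 0;
        for (String doc : stream) {      // one transient "index" per string
            if (match(doc, "James") > 0.0f) matches++;
        }
        System.out.println("matches=" + matches); // prints matches=1
    }
}
```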
>>>>
>>>> On Apr 13, 2005, at 10:09 PM, Otis Gospodnetic wrote:
>>>>
>>>>> It looks like you are calling that IndexWriter code in some loops,
>>>>> opening it and closing it in every iteration of the loop and also
>>>>> calling optimize. All of those things could be improved.
>>>>> Keep your IndexWriter open, don't close it, and optimize the index
>>>>> only
>>>>> once you are done adding documents to it.
>>>>>
>>>>> See the highlights and the snippets in the first hit:
>>>>> http://www.lucenebook.com/search?query=when+to+optimize
>>>>>
>>>>> Otis
>>>>>
>>>>>
>>>>> --- Wolfgang Hoschek <wh...@lbl.gov> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I'm wondering if anyone could let me know how to improve Lucene
>>>>>> performance for "streaming main memory indexing of single
>>>>>> strings".
>>>>>> This would help to effectively integrate Lucene with the Nux
>>>>>> XQuery
>>>>>> engine.
>>>>>>
>>>>>> Below is a small microbenchmark simulating STREAMING XQuery
>>>>>> fulltext
>>>>>> search as typical for XML network routers, message queuing systems,
>>>>>> P2P
>>>>>> networks, etc. In this on-the-fly main memory indexing scenario,
>>>>>> each
>>>>>>
>>>>>> individual string is immediately matched as soon as it becomes
>>>>>> available without any persistence involved. This usage scenario
>>>>>> and
>>>>>> corresponding performance profile is quite different in
>>>>>> comparison to
>>>>>>
>>>>>> fulltext search over persistent (read-mostly) indexes.
>>>>>>
>>>>>> The benchmark runs at some 3000 lucene queries/sec (lucene-1.4.3)
>>>>>> which
>>>>>> is unfortunate news considering the XQuery engine can easily walk
>>>>>> hundreds of thousands of XML nodes per second. Ideally I'd like to
>>>>>> run
>>>>>> at some 100000 queries/sec. Running this through the JDK 1.5
>>>>>> profiler
>>>>>> it seems that most time is spent in and below the following calls:
>>>>>>
>>>>>> writer = new IndexWriter(dir, analyzer, true);
>>>>>> writer.addDocument(...);
>>>>>> writer.close();
>>>>>>
>>>>>> I tried quite a few variants of the benchmark with various
>>>>>> options,
>>>>>> unfortunately with little or no effect.
>>>>>> Lucene just does not seem to be designed to do this sort of
>>>>>> "transient
>>>>>> single string index" thing. All code paths related to opening,
>>>>>> closing,
>>>>>> reading, writing, querying and object creation seem to be designed
>>>>>> for
>>>>>> large persistent indexes.
>>>>>>
>>>>>> Any advice on what I'm missing or what could be done about it
>>>>>> would
>>>>>> be
>>>>>> greatly appreciated.
>>>>>>
>>>>>> Wolfgang.
>>>>>>
>>>>>> P.S. the benchmark code is attached as a file below:
>>>>>>
>>>>>>> package nux.xom.pool;
>>>>>>
>>>>>> import java.io.IOException;
>>>>>> //import java.io.Reader;
>>>>>>
>>>>>> import org.apache.lucene.analysis.Analyzer;
>>>>>> //import org.apache.lucene.analysis.LowerCaseTokenizer;
>>>>>> //import org.apache.lucene.analysis.PorterStemFilter;
>>>>>> //import org.apache.lucene.analysis.SimpleAnalyzer;
>>>>>> //import org.apache.lucene.analysis.TokenStream;
>>>>>> import org.apache.lucene.analysis.standard.StandardAnalyzer;
>>>>>> import org.apache.lucene.document.Document;
>>>>>> import org.apache.lucene.document.Field;
>>>>>> //import org.apache.lucene.index.IndexReader;
>>>>>> import org.apache.lucene.index.IndexWriter;
>>>>>> import org.apache.lucene.queryParser.ParseException;
>>>>>> import org.apache.lucene.queryParser.QueryParser;
>>>>>> import org.apache.lucene.search.Hits;
>>>>>> import org.apache.lucene.search.IndexSearcher;
>>>>>> import org.apache.lucene.search.Query;
>>>>>> import org.apache.lucene.search.Searcher;
>>>>>> import org.apache.lucene.store.Directory;
>>>>>> import org.apache.lucene.store.RAMDirectory;
>>>>>>
>>>>>> public final class LuceneMatcher { // TODO: make non-public
>>>>>>
>>>>>> private final Analyzer analyzer;
>>>>>> // private final Directory dir = new RAMDirectory();
>>>>>>
>>>>>> public LuceneMatcher() {
>>>>>> this(new StandardAnalyzer());
>>>>>> // this(new SimpleAnalyzer());
>>>>>> // this(new StopAnalyzer());
>>>>>> // this(new Analyzer() {
>>>>>> // public final TokenStream tokenStream(String fieldName, Reader
>>>>>> reader) {
>>>>>> // return new PorterStemFilter(new LowerCaseTokenizer(reader));
>>>>>> // }
>>>>>> // });
>>>>>> }
>>>>>>
>>>>>> public LuceneMatcher(Analyzer analyzer) {
>>>>>> if (analyzer == null)
>>>>>> throw new IllegalArgumentException("analyzer must not be
>>>>>> null");
>>>>>> this.analyzer = analyzer;
>>>>>> }
>>>>>>
>>>>>> public Query parseQuery(String expression) throws ParseException
>>>>>> {
>>>>>> QueryParser parser = new QueryParser("content", analyzer);
>>>>>> // parser.setPhraseSlop(0);
>>>>>> return parser.parse(expression);
>>>>>> }
>>>>>>
>>>>>> /**
>>>>>> * Returns the relevance score by matching the given index
>>>>>> against
>>>>>> the given
>>>>>> * Lucene query expression. The index must not contain more than
>>>>>> one
>>>>>> Lucene
>>>>>> * "document" (aka string to be searched).
>>>>>> */
>>>>>> public float match(Directory index, Query query) {
>>>>>> Searcher searcher = null;
>>>>>> try {
>>>>>> searcher = new IndexSearcher(index);
>>>>>> Hits hits = searcher.search(query);
>>>>>> float score = hits.length() > 0 ? hits.score(0) : 0.0f;
>>>>>> return score;
>>>>>> } catch (IOException e) { // should never happen (RAMDirectory)
>>>>>> throw new RuntimeException(e);
>>>>>> } finally {
>>>>>> try {
>>>>>> if (searcher != null) searcher.close();
>>>>>> } catch (IOException e) { // should never happen (RAMDirectory)
>>>>>> throw new RuntimeException(e);
>>>>>> }
>>>>>> }
>>>>>> }
>>>>>>
>>>>>> // public float match(String text, Query query) {
>>>>>> // return match(createIndex(text), query);
>>>>>> // }
>>>>>>
>>>>>> public Directory createIndex(String text) {
>>>>>> Directory dir = new RAMDirectory();
>>>>>> IndexWriter writer = null;
>>>>>> try {
>>>>>> writer = new IndexWriter(dir, analyzer, true);
>>>>>> // writer.setUseCompoundFile(false);
>>>>>> // writer.mergeFactor = 2;
>>>>>> // writer.minMergeDocs = 1;
>>>>>> // writer.maxMergeDocs = 1;
>>>>>>
>>>>>> writer.addDocument(createDocument(text));
>>>>>> // writer.optimize();
>>>>>> return dir;
>>>>>> } catch (IOException e) { // should never happen (RAMDirectory)
>>>>>> throw new RuntimeException(e);
>>>>>> } finally {
>>>>>> try {
>>>>>> if (writer != null) writer.close();
>>>>>> } catch (IOException e) { // should never happen (RAMDirectory)
>>>>>> throw new RuntimeException(e);
>>>>>> }
>>>>>> }
>>>>>> }
>>>>>>
>>>>>> private Document createDocument(String content) {
>>>>>> Document doc = new Document();
>>>>>> doc.add(Field.UnStored("content", content));
>>>>>> // doc.add(Field.Text("x", content));
>>>>>> return doc;
>>>>>> }
>>>>>>
>>>>>> /**
>>>>>> * Lucene microbenchmark simulating STREAMING XQuery fulltext
>>>>>> search
>>>>>> as
>>>>>> * typical for XML network routers, message queuing systems, P2P
>>>>>> networks,
>>>>>> * etc. In this on-the-fly main memory indexing scenario, each
>>>>>> individual
>>>>>> * string is immediately matched as soon as it becomes available
>>>>>> without any
>>>>>> * persistence involved. This usage scenario and corresponding
>>>>>> performance
>>>>>> * profile is quite different in comparison to fulltext search
>>>>>> over
>>>>>> * persistent (read-mostly) indexes.
>>>>>> *
>>>>>> * Example XPath:
>>>>>> count(/table/row[lucene:match(string(./firstname),
>>>>>> * "James") > 0.0])
>>>>>> */
>>>>>> public static void main(String[] args) throws Exception {
>>>>>> int k = -1;
>>>>>> int runs = 5;
>>>>>> if (args.length > ++k) runs = Integer.parseInt(args[k]);
>>>>>>
>>>>>> int nodes = 10000;
>>>>>> if (args.length > ++k) nodes = Integer.parseInt(args[k]);
>>>>>>
>>>>>> String content = "James is out in the woods";
>>>>>> if (args.length > ++k) content = args[k];
>>>>>>
>>>>>> String expression = "James";
>>>>>> if (args.length > ++k) expression = args[k];
>>>>>>
>>>>>> LuceneMatcher matcher = new LuceneMatcher();
>>>>>> Query query = matcher.parseQuery(expression); // to be reused N
>>>>>> times
>>>>>>
>>>>>> for (int r = 0; r < runs; r++) {
>>>>>> long start = System.currentTimeMillis();
>>>>>> int matches = 0;
>>>>>>
>>>>>> for (int i = 0; i < nodes; i++) {
>>>>>> // if (LuceneUtil.match(content + i, expression) > 0.0f) {
>>>>>> if (matcher.match(matcher.createIndex(content + i), query) >
>>>>>> 0.0f) {
>>>>>> matches++;
>>>>>> }
>>>>>> }
>>>>>>
>>>>>> long end = System.currentTimeMillis();
>>>>>> System.out.println("matches=" + matches);
>>>>>> System.out.println("secs=" + ((end-start) / 1000.0f));
>>>>>> System.out.println("queries/sec=" + (nodes / ((end-start) /
>>>>>> 1000.0f)));
>>>>>> System.out.println();
>>>>>> }
>>>>>> }
>>>>>> }
>>
>> public class StringIndexReader extends IndexReader {
>> private List terms;
>> public StringIndexReader(String strings[]) {
>> super(null);
>> terms = Arrays.asList(strings);
>> Collections.sort(terms);
>> }
>>
>> public TermFreqVector[] getTermFreqVectors(int docNumber) throws
>> IOException {
>> System.out.println("StringIndexReader.getTermFreqVectors");
>> return new TermFreqVector[0];
>> }
>>
>> public TermFreqVector getTermFreqVector(int docNumber, String
>> field) throws IOException {
>> System.out.println("StringIndexReader.getTermFreqVector");
>> return null;
>> }
>>
>> public int numDocs() {
>> System.out.println("StringIndexReader.numDocs");
>> return 1;
>> }
>>
>> public int maxDoc() {
>> System.out.println("StringIndexReader.maxDoc");
>> return 1;
>> }
>>
>> public Document document(int n) throws IOException {
>> System.out.println("StringIndexReader.document");
>> return null;
>> }
>>
>> public boolean isDeleted(int n) {
>> System.out.println("StringIndexReader.isDeleted");
>> return false;
>> }
>>
>> public boolean hasDeletions() {
>> System.out.println("StringIndexReader.hasDeletions");
>> return false;
>> }
>>
>> public byte[] norms(String field) throws IOException {
>> // TODO: what value to use for this?
>> System.out.println("StringIndexReader.norms: " + field);
>> return new byte[] { 1 };
>> }
>>
>> public void norms(String field, byte[] bytes, int offset) throws
>> IOException {
>> System.out.println("StringIndexReader.norms: " + field + "*");
>> }
>>
>> protected void doSetNorm(int doc, String field, byte value) throws
>> IOException {
>> System.out.println("StringIndexReader.doSetNorm");
>>
>> }
>>
>> public TermEnum terms() throws IOException {
>> System.out.println("StringIndexReader.terms");
>> return terms(null);
>> }
>>
>> public TermEnum terms(final Term term) throws IOException {
>> System.out.println("StringIndexReader.terms: " + term);
>>
>> TermEnum termEnum = new TermEnum() {
>> private String currentTerm;
>> private Iterator iter;
>>
>> public boolean next() {
>> System.out.println("TermEnum.next");
>> if (!iter.hasNext())
>> return false;
>> currentTerm = (String) iter.next();
>> return true;
>> }
>>
>> public Term term() {
>> if (iter == null) {
>> iter = terms.iterator();
>> while (next()) {
>> if (currentTerm.startsWith(term.text()))
>> break;
>> }
>> }
>> System.out.println("TermEnum.term: " + currentTerm);
>> return new Term(term.field(), currentTerm);
>> }
>>
>> public int docFreq() {
>> System.out.println("TermEnum.docFreq");
>> return 1;
>> }
>>
>> public void close() {
>> System.out.println("TermEnum.close");
>> }
>> };
>> return termEnum;
>> }
>>
>> public int docFreq(Term term) throws IOException {
>> System.out.println("StringIndexReader.docFreq: " + term);
>> return terms.contains(term.text()) ? 1 : 0;
>> }
>>
>> public TermDocs termDocs() throws IOException {
>> System.out.println("StringIndexReader.termDocs");
>>
>> TermDocs td = new TermDocs() {
>> private boolean done = false;
>> String currentTerm;
>>
>> public void seek(Term term) {
>> System.out.println(".seek: " + term);
>> currentTerm = term.text();
>> done = false;
>> }
>>
>> public void seek(TermEnum termEnum) {
>> seek(termEnum.term());
>> }
>>
>> public int doc() {
>> System.out.println(".doc");
>> return 0;
>> }
>>
>> public int freq() {
>> System.out.println(".freq");
>> return 1;
>> }
>>
>> public boolean next() {
>> System.out.println(".next");
>> return false;
>> }
>>
>> public int read(int[] docs, int[] freqs) {
>> System.out.println(".read: " + docs.length);
>>
>> if (done) return 0;
>>
>> done = true;
>> docs[0] = 0;
>> freqs[0] = freq();
>> return 1;
>> }
>>
>> public boolean skipTo(int target) {
>> System.out.println(".skipTo");
>> return false;
>> }
>>
>> public void close() {
>> System.out.println(".close");
>>
>> }
>> };
>> return td;
>> }
>>
>> public TermPositions termPositions() throws IOException {
>> System.out.println("StringIndexReader.termPositions");
>> return null;
>> }
>>
>> protected void doDelete(int docNum) throws IOException {
>> System.out.println("StringIndexReader.doDelete");
>>
>> }
>>
>> protected void doUndeleteAll() throws IOException {
>> System.out.println("StringIndexReader.doUndeleteAll");
>>
>> }
>>
>> protected void doCommit() throws IOException {
>> System.out.println("StringIndexReader.doCommit");
>>
>> }
>>
>> protected void doClose() throws IOException {
>> System.out.println("StringIndexReader.doClose");
>>
>> }
>>
>> public Collection getFieldNames() throws IOException {
>> System.out.println("StringIndexReader.getFieldNames");
>> return null;
>> }
>>
>> public Collection getFieldNames(boolean indexed) throws IOException
>> {
>> System.out.println("StringIndexReader.getFieldNames");
>> return null;
>> }
>>
>> public Collection getIndexedFieldNames(Field.TermVector tvSpec) {
>> System.out.println("StringIndexReader.getIndexedFieldNames");
>> return null;
>> }
>>
>> public Collection getFieldNames(FieldOption fldOption) {
>> System.out.println("StringIndexReader.getFieldNames");
>> return null;
>> }
>>
>> public static void main(String[] args) {
>> IndexReader reader = new StringIndexReader(new String[] {"foo",
>> "bar", "baz"});
>> IndexSearcher searcher = new IndexSearcher(reader);
>>
>> Hits hits = null;
>> try {
>> hits = searcher.search(new WildcardQuery(new
>> Term("field","ba*")));
>> System.out.println("found " + hits.length());
>> } catch (IOException e) {
>> e.printStackTrace();
>> }
>> }
>> }
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>
>
Re: [Performance] Streaming main memory indexing of single strings
Posted by Wolfgang Hoschek <wh...@lbl.gov>.
Cool! For my use case it would need to be able to handle arbitrary
queries (previously parsed from a general lucene query string).
Something like:
float match(String text, Query query)
it's fine with me if it also works for
float[] match(String[] texts, Query query) or
float match(Document doc, Query query)
but that isn't required by the use case.
Wolfgang.
> I am intrigued by this and decided to mock a quick and dirty example
> of such an IndexReader. After a little trial-and-error I got it
> working at least for TermQuery and WildcardQuery. I've pasted my code
> below as an example, but there is much room for improvement,
> especially in terms of performance and also in keeping track of term
> frequency, and also it would be nicer if it handled the analysis
> internally.
>
> I think something like this would make a handy addition to our contrib
> area at least. I'd be happy to receive improvements to this and then
> add it to a contrib subproject.
>
> Perhaps this would be a handy way to handle situations where users
> have queries saved in a system and need to be alerted whenever a new
> document arrives matching the saved queries?
>
> Erik
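[Editor's note: Erik's saved-queries alerting idea above can be sketched in plain Java. The query representation (a single required term per saved query) is a hypothetical stand-in for a parsed Lucene Query; the sketch only shows the shape of the loop: keep queries resident, and match each arriving document string against all of them.]

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public final class SavedQueryAlerter {

    // Saved queries, keyed by name. A real version would store parsed
    // Lucene Query objects instead of single terms.
    private final Map<String, String> queriesByName = new HashMap<>();

    public void save(String name, String requiredTerm) {
        queriesByName.put(name, requiredTerm.toLowerCase());
    }

    // Returns the names of all saved queries matched by the new document.
    public List<String> onArrival(String document) {
        Set<String> terms = new HashSet<>(
            Arrays.asList(document.toLowerCase().split("\\s+")));
        List<String> alerts = new ArrayList<>();
        for (Map.Entry<String, String> e : queriesByName.entrySet()) {
            if (terms.contains(e.getValue())) alerts.add(e.getKey());
        }
        return alerts;
    }

    public static void main(String[] args) {
        SavedQueryAlerter alerter = new SavedQueryAlerter();
        alerter.save("james-watch", "James");
        alerter.save("fish-watch", "fish");
        // Both watches match this document (iteration order unspecified).
        System.out.println(alerter.onArrival("James caught a fish"));
    }
}
```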
>
>
>
>>
>>
>> -----Original Message-----
>> From: Wolfgang Hoschek [mailto:whoschek@lbl.gov]
>> Sent: Thursday, April 14, 2005 4:04 PM
>> To: java-dev@lucene.apache.org
>> Subject: Re: [Performance] Streaming main memory indexing of single
>> strings
>>
>>
>> This seems to be a promising avenue worth exploring. My gut feeling is
>> that this could easily be 10-100 times faster.
>>
>> The drawback is that it requires a fair amount of understanding of
>> intricate Lucene internals, pulling those pieces together and adapting
>> them as required for the seemingly simple "float match(String text,
>> Query query)".
>>
>> I might give it a shot but I'm not sure I'll be able to pull this off!
>> Is there any similar code I could look at as a starting point?
>>
>> Wolfgang.
>>
>> On Apr 14, 2005, at 1:13 PM, Robert Engels wrote:
>>
>>> I think you are not approaching this the correct way.
>>>
>>> Pseudo code:
>>>
>>> Subclass IndexReader.
>>>
>>> Get tokens from String 'document' using Lucene analyzers.
>>>
>>> Build simple hash-map based data structures using tokens for terms,
>>> and term
>>> positions.
>>>
>>> reimplement termDocs() and termPositions() to use the structures from
>>> above.
>>>
>>> run searches.
>>>
>>> start again with next document.
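[Editor's note: Robert's pseudo code above can be sketched in plain Java. The whitespace tokenizer is a hypothetical stand-in for a real Lucene Analyzer; the point is the data structure: one transient term -> positions map per String "document", from which termDocs()/termPositions()-style lookups can be answered without any Directory at all.]

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public final class SingleDocTermIndex {

    // Term -> token positions within the single document.
    private final Map<String, List<Integer>> positions = new HashMap<>();

    public SingleDocTermIndex(String document) {
        String[] tokens = document.toLowerCase().split("\\s+");
        for (int pos = 0; pos < tokens.length; pos++) {
            positions.computeIfAbsent(tokens[pos], t -> new ArrayList<>())
                     .add(pos);
        }
    }

    // Analogous to termDocs(): doc frequency is 0 or 1 for one document.
    public int docFreq(String term) {
        return positions.containsKey(term.toLowerCase()) ? 1 : 0;
    }

    // Analogous to termPositions(): token offsets of the term.
    public List<Integer> termPositions(String term) {
        return positions.getOrDefault(term.toLowerCase(), List.of());
    }

    public static void main(String[] args) {
        SingleDocTermIndex index =
            new SingleDocTermIndex("James is out in the woods");
        System.out.println(index.docFreq("james"));       // 1
        System.out.println(index.termPositions("woods")); // [5]
        // Drop the index and start again with the next document.
    }
}
```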
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: Wolfgang Hoschek [mailto:whoschek@lbl.gov]
>>> Sent: Thursday, April 14, 2005 2:56 PM
>>> To: java-dev@lucene.apache.org
>>> Subject: Re: [Performance] Streaming main memory indexing of single
>>> strings
>>>
>>>
>>> Otis, this might be a misunderstanding.
>>>
>>> - I'm not calling optimize(). That piece is commented out if you look
>>> again at the code.
>>> - The *streaming* use case requires that for each query I add one
>>> (and
>>> only one) document (aka string) to an empty index:
>>>
>>> repeat N times (where N is millions or billions):
>>> add a single string (aka document) to an empty index
>>> query the index
>>> drop index (or delete its document)
>>>
>>> with the following API being called N times: float match(String text,
>>> Query query)
>>>
>>> So there's no possibility of adding many documents and thereafter
>>> running the query. This in turn seems to mean that the IndexWriter
>>> can't be kept open - unless I manually delete each document after
>>> each
>>> query to repeatedly reuse the RAMDirectory, which I've also tried
>>> before without any significant performance gain - deletion seems to
>>> have substantial overhead in itself. Perhaps it would be better if
>>> there were a Directory.deleteAllDocuments() or similar. Did you have
>>> some other approach in mind?
>>>
>>> As I said, Lucene's design doesn't seem to fit this streaming use
>>> case
>>> pattern well. In *this* scenario one could easily do without any
>>> locking, and without byte level organization in RAMDirectory and
>>> RAMFile, etc because a single small string isn't a large persistent
>>> multi-document index.
>>>
>>> For some background, here's a small example for the kind of XQuery
>>> functionality Nux/Lucene integration enables:
>>>
>>> (: An XQuery that finds all books authored by James that have
>>> something
>>> to do with "fish", sorted by relevance :)
>>> declare namespace lucene = "java:nux.xom.xquery.XQueryUtil";
>>> declare variable $query := "fish*~";
>>>
>>> for $book in /books/book[author="James" and lucene:match(string(.),
>>> $query) > 0.0]
>>> let $score := lucene:match(string($book), $query)
>>> order by $score descending
>>> return (<score>{$score}</score>, $book)
>>>
>>> More interestingly one can use this for classifying and routing XML
>>> messages based on rules (i.e. queries) inspecting their content...
>>>
>>> Any other clues about potential improvements would be greatly
>>> appreciated.
>>>
>>> Wolfgang.
>>>
>>> On Apr 13, 2005, at 10:09 PM, Otis Gospodnetic wrote:
>>>
>>>> It looks like you are calling that IndexWriter code in some loops,
>>>> opening it and closing it in every iteration of the loop and also
>>>> calling optimize. All of those things could be improved.
>>>> Keep your IndexWriter open, don't close it, and optimize the index
>>>> only
>>>> once you are done adding documents to it.
>>>>
>>>> See the highlights and the snippets in the first hit:
>>>> http://www.lucenebook.com/search?query=when+to+optimize
>>>>
>>>> Otis
>>>>
>>>>
>>>> --- Wolfgang Hoschek <wh...@lbl.gov> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I'm wondering if anyone could let me know how to improve Lucene
>>>>> performance for "streaming main memory indexing of single strings".
>>>>> This would help to effectively integrate Lucene with the Nux XQuery
>>>>> engine.
>>>>>
>>>>> Below is a small microbenchmark simulating STREAMING XQuery
>>>>> fulltext
>>>>> search as typical for XML network routers, message queuing systems,
>>>>> P2P
>>>>> networks, etc. In this on-the-fly main memory indexing scenario,
>>>>> each
>>>>>
>>>>> individual string is immediately matched as soon as it becomes
>>>>> available without any persistence involved. This usage scenario and
>>>>> corresponding performance profile is quite different in comparison
>>>>> to
>>>>>
>>>>> fulltext search over persistent (read-mostly) indexes.
>>>>>
>>>>> The benchmark runs at some 3000 lucene queries/sec (lucene-1.4.3)
>>>>> which
>>>>> is unfortunate news considering the XQuery engine can easily walk
>>>>> hundreds of thousands of XML nodes per second. Ideally I'd like to
>>>>> run
>>>>> at some 100000 queries/sec. Running this through the JDK 1.5
>>>>> profiler
>>>>> it seems that most time is spent in and below the following calls:
>>>>>
>>>>> writer = new IndexWriter(dir, analyzer, true);
>>>>> writer.addDocument(...);
>>>>> writer.close();
>>>>>
>>>>> I tried quite a few variants of the benchmark with various options,
>>>>> unfortunately with little or no effect.
>>>>> Lucene just does not seem to be designed to do this sort of "transient
>>>>> single string index" thing. All code paths related to opening,
>>>>> closing,
>>>>> reading, writing, querying and object creation seem to be designed
>>>>> for
>>>>> large persistent indexes.
>>>>>
>>>>> Any advice on what I'm missing or what could be done about it would
>>>>> be
>>>>> greatly appreciated.
>>>>>
>>>>> Wolfgang.
>>>>>
>>>>> P.S. the benchmark code is attached as a file below:
>>>>>
>>>>>> package nux.xom.pool;
>>>>>
>>>>> import java.io.IOException;
>>>>> //import java.io.Reader;
>>>>>
>>>>> import org.apache.lucene.analysis.Analyzer;
>>>>> //import org.apache.lucene.analysis.LowerCaseTokenizer;
>>>>> //import org.apache.lucene.analysis.PorterStemFilter;
>>>>> //import org.apache.lucene.analysis.SimpleAnalyzer;
>>>>> //import org.apache.lucene.analysis.TokenStream;
>>>>> import org.apache.lucene.analysis.standard.StandardAnalyzer;
>>>>> import org.apache.lucene.document.Document;
>>>>> import org.apache.lucene.document.Field;
>>>>> //import org.apache.lucene.index.IndexReader;
>>>>> import org.apache.lucene.index.IndexWriter;
>>>>> import org.apache.lucene.queryParser.ParseException;
>>>>> import org.apache.lucene.queryParser.QueryParser;
>>>>> import org.apache.lucene.search.Hits;
>>>>> import org.apache.lucene.search.IndexSearcher;
>>>>> import org.apache.lucene.search.Query;
>>>>> import org.apache.lucene.search.Searcher;
>>>>> import org.apache.lucene.store.Directory;
>>>>> import org.apache.lucene.store.RAMDirectory;
>>>>>
>>>>> public final class LuceneMatcher { // TODO: make non-public
>>>>>
>>>>> private final Analyzer analyzer;
>>>>> // private final Directory dir = new RAMDirectory();
>>>>>
>>>>> public LuceneMatcher() {
>>>>> this(new StandardAnalyzer());
>>>>> // this(new SimpleAnalyzer());
>>>>> // this(new StopAnalyzer());
>>>>> // this(new Analyzer() {
>>>>> // public final TokenStream tokenStream(String fieldName, Reader
>>>>> reader) {
>>>>> // return new PorterStemFilter(new LowerCaseTokenizer(reader));
>>>>> // }
>>>>> // });
>>>>> }
>>>>>
>>>>> public LuceneMatcher(Analyzer analyzer) {
>>>>> if (analyzer == null)
>>>>> throw new IllegalArgumentException("analyzer must not be null");
>>>>> this.analyzer = analyzer;
>>>>> }
>>>>>
>>>>> public Query parseQuery(String expression) throws ParseException {
>>>>> QueryParser parser = new QueryParser("content", analyzer);
>>>>> // parser.setPhraseSlop(0);
>>>>> return parser.parse(expression);
>>>>> }
>>>>>
>>>>> /**
>>>>> * Returns the relevance score by matching the given index against
>>>>> the given
>>>>> * Lucene query expression. The index must not contain more than
>>>>> one
>>>>> Lucene
>>>>> * "document" (aka string to be searched).
>>>>> */
>>>>> public float match(Directory index, Query query) {
>>>>> Searcher searcher = null;
>>>>> try {
>>>>> searcher = new IndexSearcher(index);
>>>>> Hits hits = searcher.search(query);
>>>>> float score = hits.length() > 0 ? hits.score(0) : 0.0f;
>>>>> return score;
>>>>> } catch (IOException e) { // should never happen (RAMDirectory)
>>>>> throw new RuntimeException(e);
>>>>> } finally {
>>>>> try {
>>>>> if (searcher != null) searcher.close();
>>>>> } catch (IOException e) { // should never happen (RAMDirectory)
>>>>> throw new RuntimeException(e);
>>>>> }
>>>>> }
>>>>> }
>>>>>
>>>>> // public float match(String text, Query query) {
>>>>> // return match(createIndex(text), query);
>>>>> // }
>>>>>
>>>>> public Directory createIndex(String text) {
>>>>> Directory dir = new RAMDirectory();
>>>>> IndexWriter writer = null;
>>>>> try {
>>>>> writer = new IndexWriter(dir, analyzer, true);
>>>>> // writer.setUseCompoundFile(false);
>>>>> // writer.mergeFactor = 2;
>>>>> // writer.minMergeDocs = 1;
>>>>> // writer.maxMergeDocs = 1;
>>>>>
>>>>> writer.addDocument(createDocument(text));
>>>>> // writer.optimize();
>>>>> return dir;
>>>>> } catch (IOException e) { // should never happen (RAMDirectory)
>>>>> throw new RuntimeException(e);
>>>>> } finally {
>>>>> try {
>>>>> if (writer != null) writer.close();
>>>>> } catch (IOException e) { // should never happen (RAMDirectory)
>>>>> throw new RuntimeException(e);
>>>>> }
>>>>> }
>>>>> }
>>>>>
>>>>> private Document createDocument(String content) {
>>>>> Document doc = new Document();
>>>>> doc.add(Field.UnStored("content", content));
>>>>> // doc.add(Field.Text("x", content));
>>>>> return doc;
>>>>> }
>>>>>
>>>>> /**
>>>>> * Lucene microbenchmark simulating STREAMING XQuery fulltext
>>>>> search
>>>>> as
>>>>> * typical for XML network routers, message queuing systems, P2P
>>>>> networks,
>>>>> * etc. In this on-the-fly main memory indexing scenario, each
>>>>> individual
>>>>> * string is immediately matched as soon as it becomes available
>>>>> without any
>>>>> * persistence involved. This usage scenario and corresponding
>>>>> performance
>>>>> * profile is quite different in comparison to fulltext search
>>>>> over
>>>>> * persistent (read-mostly) indexes.
>>>>> *
>>>>> * Example XPath:
>>>>> count(/table/row[lucene:match(string(./firstname),
>>>>> * "James") > 0.0])
>>>>> */
>>>>> public static void main(String[] args) throws Exception {
>>>>> int k = -1;
>>>>> int runs = 5;
>>>>> if (args.length > ++k) runs = Integer.parseInt(args[k]);
>>>>>
>>>>> int nodes = 10000;
>>>>> if (args.length > ++k) nodes = Integer.parseInt(args[k]);
>>>>>
>>>>> String content = "James is out in the woods";
>>>>> if (args.length > ++k) content = args[k];
>>>>>
>>>>> String expression = "James";
>>>>> if (args.length > ++k) expression = args[k];
>>>>>
>>>>> LuceneMatcher matcher = new LuceneMatcher();
>>>>> Query query = matcher.parseQuery(expression); // to be reused N
>>>>> times
>>>>>
>>>>> for (int r = 0; r < runs; r++) {
>>>>> long start = System.currentTimeMillis();
>>>>> int matches = 0;
>>>>>
>>>>> for (int i = 0; i < nodes; i++) {
>>>>> // if (LuceneUtil.match(content + i, expression) > 0.0f) {
>>>>> if (matcher.match(matcher.createIndex(content + i), query) >
>>>>> 0.0f) {
>>>>> matches++;
>>>>> }
>>>>> }
>>>>>
>>>>> long end = System.currentTimeMillis();
>>>>> System.out.println("matches=" + matches);
>>>>> System.out.println("secs=" + ((end-start) / 1000.0f));
>>>>> System.out.println("queries/sec=" + (nodes / ((end-start) /
>>>>> 1000.0f)));
>>>>> System.out.println();
>>>>> }
>>>>> }
>>>>> }
>
> public class StringIndexReader extends IndexReader {
> private List terms;
> public StringIndexReader(String strings[]) {
> super(null);
> terms = Arrays.asList(strings);
> Collections.sort(terms);
> }
>
> public TermFreqVector[] getTermFreqVectors(int docNumber) throws
> IOException {
> System.out.println("StringIndexReader.getTermFreqVectors");
> return new TermFreqVector[0];
> }
>
> public TermFreqVector getTermFreqVector(int docNumber, String field)
> throws IOException {
> System.out.println("StringIndexReader.getTermFreqVector");
> return null;
> }
>
> public int numDocs() {
> System.out.println("StringIndexReader.numDocs");
> return 1;
> }
>
> public int maxDoc() {
> System.out.println("StringIndexReader.maxDoc");
> return 1;
> }
>
> public Document document(int n) throws IOException {
> System.out.println("StringIndexReader.document");
> return null;
> }
>
> public boolean isDeleted(int n) {
> System.out.println("StringIndexReader.isDeleted");
> return false;
> }
>
> public boolean hasDeletions() {
> System.out.println("StringIndexReader.hasDeletions");
> return false;
> }
>
> public byte[] norms(String field) throws IOException {
> // TODO: what value to use for this?
> System.out.println("StringIndexReader.norms: " + field);
> return new byte[] { 1 };
> }
>
> public void norms(String field, byte[] bytes, int offset) throws
> IOException {
> System.out.println("StringIndexReader.norms: " + field + "*");
> }
>
> protected void doSetNorm(int doc, String field, byte value) throws
> IOException {
> System.out.println("StringIndexReader.doSetNorm");
>
> }
>
> public TermEnum terms() throws IOException {
> System.out.println("StringIndexReader.terms");
> return terms(null);
> }
>
> public TermEnum terms(final Term term) throws IOException {
> System.out.println("StringIndexReader.terms: " + term);
>
> TermEnum termEnum = new TermEnum() {
> private String currentTerm;
> private Iterator iter;
>
> public boolean next() {
> System.out.println("TermEnum.next");
> if (!iter.hasNext())
> return false;
> currentTerm = (String) iter.next();
> return true;
> }
>
> public Term term() {
> if (iter == null) {
> iter = terms.iterator();
> while (next()) {
> if (currentTerm.startsWith(term.text()))
> break;
> }
> }
> System.out.println("TermEnum.term: " + currentTerm);
> return new Term(term.field(), currentTerm);
> }
>
> public int docFreq() {
> System.out.println("TermEnum.docFreq");
> return 1;
> }
>
> public void close() {
> System.out.println("TermEnum.close");
> }
> };
> return termEnum;
> }
>
> public int docFreq(Term term) throws IOException {
> System.out.println("StringIndexReader.docFreq: " + term);
> return terms.contains(term.text()) ? 1 : 0;
> }
>
> public TermDocs termDocs() throws IOException {
> System.out.println("StringIndexReader.termDocs");
>
> TermDocs td = new TermDocs() {
> private boolean done = false;
> String currentTerm;
>
> public void seek(Term term) {
> System.out.println(".seek: " + term);
> currentTerm = term.text();
> done = false;
> }
>
> public void seek(TermEnum termEnum) {
> seek(termEnum.term());
> }
>
> public int doc() {
> System.out.println(".doc");
> return 0;
> }
>
> public int freq() {
> System.out.println(".freq");
> return 1;
> }
>
> public boolean next() {
> System.out.println(".next");
> return false;
> }
>
> public int read(int[] docs, int[] freqs) {
> System.out.println(".read: " + docs.length);
>
> if (done) return 0;
>
> done = true;
> docs[0] = 0;
> freqs[0] = freq();
> return 1;
> }
>
> public boolean skipTo(int target) {
> System.out.println(".skipTo");
> return false;
> }
>
> public void close() {
> System.out.println(".close");
>
> }
> };
> return td;
> }
>
> public TermPositions termPositions() throws IOException {
> System.out.println("StringIndexReader.termPositions");
> return null;
> }
>
> protected void doDelete(int docNum) throws IOException {
> System.out.println("StringIndexReader.doDelete");
>
> }
>
> protected void doUndeleteAll() throws IOException {
> System.out.println("StringIndexReader.doUndeleteAll");
>
> }
>
> protected void doCommit() throws IOException {
> System.out.println("StringIndexReader.doCommit");
>
> }
>
> protected void doClose() throws IOException {
> System.out.println("StringIndexReader.doClose");
>
> }
>
> public Collection getFieldNames() throws IOException {
> System.out.println("StringIndexReader.getFieldNames");
> return null;
> }
>
> public Collection getFieldNames(boolean indexed) throws IOException {
> System.out.println("StringIndexReader.getFieldNames");
> return null;
> }
>
> public Collection getIndexedFieldNames(Field.TermVector tvSpec) {
> System.out.println("StringIndexReader.getIndexedFieldNames");
> return null;
> }
>
> public Collection getFieldNames(FieldOption fldOption) {
> System.out.println("StringIndexReader.getFieldNames");
> return null;
> }
>
> public static void main(String[] args) {
> IndexReader reader = new StringIndexReader(new String[] {"foo",
> "bar", "baz"});
> IndexSearcher searcher = new IndexSearcher(reader);
>
> Hits hits = null;
> try {
> hits = searcher.search(new WildcardQuery(new Term("field", "ba*")));
> System.out.println("found " + hits.length());
> } catch (IOException e) {
> e.printStackTrace();
> }
> }
> }
>
>
---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org
Re: [Performance] Streaming main memory indexing of single strings
Posted by Wolfgang Hoschek <wh...@lbl.gov>.
On Apr 15, 2005, at 4:15 PM, Doug Cutting wrote:
> Wolfgang Hoschek wrote:
>> The classic fuzzy fulltext search and similarity matching that Lucene
>> is good for :-)
>
> So you need a score that can be compared to other matches? This will
> be based on nothing but term frequency, which a regex can compute.
> With a single document there'll be no IDFs, so you could simply sum
> sqrt() of term regex match counts, and divide by the sqrt of the
> length of the string.
Is there a function f that can translate any lucene query (with all its
syntax and fuzzy features) to a regex? E.g. how to translate
StandardAnalyzer or stemming into a regex? If so, yes, but that seems
unlikely, no?
My particular interest is to use XQuery for *precisely* locating
information subsets in networked XML messages, and then to use Lucene's
fulltext functionality for *fuzzy* searches within such a precise
subset. Messages are classified and routed/forwarded accordingly. See
http://dsd.lbl.gov/nux/ for background. [BTW, XQuery already has
regexes built-in].
>
> Yes, I'm playing devil's advocate...
Always a good thing to check assumptions :-)
RE: [Performance] Streaming main memory indexing of single strings
Posted by Robert Engels <re...@ix.netcom.com>.
I think one of the advantages may be the analyzers and processors that are
already available for several document types.
Using regex with these is nearly impossible.
-----Original Message-----
From: Doug Cutting [mailto:cutting@apache.org]
Sent: Friday, April 15, 2005 6:16 PM
To: java-dev@lucene.apache.org
Subject: Re: [Performance] Streaming main memory indexing of single
strings
Wolfgang Hoschek wrote:
> The classic fuzzy fulltext search and similarity matching that Lucene is
> good for :-)
So you need a score that can be compared to other matches? This will be
based on nothing but term frequency, which a regex can compute. With a
single document there'll be no IDFs, so you could simply sum sqrt() of
term regex match counts, and divide by the sqrt of the length of the string.
Yes, I'm playing devil's advocate...
Doug
Re: [Performance] Streaming main memory indexing of single strings
Posted by Doug Cutting <cu...@apache.org>.
Wolfgang Hoschek wrote:
> The classic fuzzy fulltext search and similarity matching that Lucene is
> good for :-)
So you need a score that can be compared to other matches? This will be
based on nothing but term frequency, which a regex can compute. With a
single document there'll be no IDFs, so you could simply sum sqrt() of
term regex match counts, and divide by the sqrt of the length of the string.
Yes, I'm playing devil's advocate...
Doug
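Doug's single-document scoring formula can be sketched with plain java.util.regex, no Lucene involved (a rough illustration only; the class and method names are invented, and a whitespace split stands in for real tokenization):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of the suggested score: with only one document there are no IDFs,
// so sum sqrt(matchCount) over the query term regexes and normalize by the
// sqrt of the document length (approximated here by its token count).
public class RegexScorer {

    public static float score(String text, String[] termRegexes) {
        float sum = 0.0f;
        for (String regex : termRegexes) {
            Matcher m = Pattern.compile(regex).matcher(text);
            int count = 0;
            while (m.find()) count++;           // term frequency via regex
            sum += (float) Math.sqrt(count);
        }
        int length = text.split("\\s+").length; // crude token count
        return sum / (float) Math.sqrt(length);
    }

    public static void main(String[] args) {
        // roughly 0.408 for one hit in a six-token string
        System.out.println(score("James is out in the woods", new String[] {"James"}));
    }
}
```

A production version would have to mirror the analyzer's tokenization (lowercasing, stemming, stop words), which is exactly the difficulty raised elsewhere in this thread.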
Re: [Performance] Streaming main memory indexing of single strings
Posted by Wolfgang Hoschek <wh...@lbl.gov>.
On Apr 15, 2005, at 4:00 PM, Doug Cutting wrote:
> Erik Hatcher wrote:
>> I think something like this would make a handy addition to our
>> contrib area at least.
>
> Perhaps.
>
> What use cases cannot be met by regular expression matching?
>
> Doug
>
The classic fuzzy fulltext search and similarity matching that Lucene
is good for :-)
Wolfgang.
Re: [Performance] Streaming main memory indexing of single strings
Posted by Doug Cutting <cu...@apache.org>.
Erik Hatcher wrote:
> I think something like this would make a handy addition to our contrib
> area at least.
Perhaps.
What use cases cannot be met by regular expression matching?
Doug
Re: [Performance] Streaming main memory indexing of single strings
Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Apr 14, 2005, at 5:11 PM, Robert Engels wrote:
> It is really not that involved. Just implement the abstract methods of
> IndexReader. And many can be no-op'd because they will never be
> called in a
> "read only" situation.
>
> Methods related to normalization and such can also be no-op'd because
> you
> are only dealing with a single document.
>
> I would think you will find this approach at least an order of
> magnitude
> faster, if not 2.
I am intrigued by this and decided to mock a quick and dirty example of
such an IndexReader. After a little trial-and-error I got it working
at least for TermQuery and WildcardQuery. I've pasted my code below as
an example, but there is much room for improvement, especially in
performance and in keeping track of term frequency; it would also be
nicer if it handled the analysis internally.
I think something like this would make a handy addition to our contrib
area at least. I'd be happy to receive improvements to this and then
add it to a contrib subproject.
Perhaps this would be a handy way to handle situations where users have
queries saved in a system and need to be alerted whenever a new
document arrives matching the saved queries?
Erik
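That saved-queries scenario might be sketched as follows (hypothetical names throughout; plain string-term sets stand in for Lucene Query objects, and a real version would run each saved Query against a per-document in-memory reader like the StringIndexReader example in this thread):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Each arriving document is matched against every saved query; the names
// of matching queries are returned so their owners can be alerted.
public class SavedQueryAlerter {

    private final Map<String, Set<String>> savedQueries =
            new HashMap<String, Set<String>>();

    public void save(String name, Set<String> requiredTerms) {
        savedQueries.put(name, requiredTerms);
    }

    public List<String> onNewDocument(String text) {
        Set<String> tokens = new HashSet<String>(
                Arrays.asList(text.toLowerCase().split("\\W+")));
        List<String> alerts = new ArrayList<String>();
        for (Map.Entry<String, Set<String>> e : savedQueries.entrySet()) {
            if (tokens.containsAll(e.getValue())) alerts.add(e.getKey());
        }
        return alerts;
    }
}
```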
>
>
> -----Original Message-----
> From: Wolfgang Hoschek [mailto:whoschek@lbl.gov]
> Sent: Thursday, April 14, 2005 4:04 PM
> To: java-dev@lucene.apache.org
> Subject: Re: [Performance] Streaming main memory indexing of single
> strings
>
>
> This seems to be a promising avenue worth exploring. My gut feeling is
> that this could easily be 10-100 times faster.
>
> The drawback is that it requires a fair amount of understanding of
> intricate Lucene internals, pulling those pieces together and adapting
> them as required for the seemingly simple "float match(String text,
> Query query)".
>
> I might give it a shot but I'm not sure I'll be able to pull this off!
> Is there any similar code I could look at as a starting point?
>
> Wolfgang.
>
> On Apr 14, 2005, at 1:13 PM, Robert Engels wrote:
>
>> I think you are not approaching this the correct way.
>>
>> Pseudo code:
>>
>> Subclass IndexReader.
>>
>> Get tokens from String 'document' using Lucene analyzers.
>>
>> Build simple hash-map based data structures using tokens for terms,
>> and term
>> positions.
>>
>> reimplement termDocs() and termPositions() to use the structures from
>> above.
>>
>> run searches.
>>
>> start again with next document.
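The hash-map based structures in that pseudocode might look roughly like this (a hedged sketch: the class and method names are invented, and a lowercase whitespace split stands in for tokens from a real Lucene Analyzer's TokenStream):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Per-document term -> positions map: the structure that termDocs() and
// termPositions() would be reimplemented on top of.
public class SingleDocTermIndex {

    private final Map<String, List<Integer>> termPositions =
            new HashMap<String, List<Integer>>();

    public SingleDocTermIndex(String document) {
        String[] tokens = document.toLowerCase().split("\\s+");
        for (int pos = 0; pos < tokens.length; pos++) {
            List<Integer> positions = termPositions.get(tokens[pos]);
            if (positions == null) {
                positions = new ArrayList<Integer>();
                termPositions.put(tokens[pos], positions);
            }
            positions.add(pos); // record each occurrence position
        }
    }

    public int freq(String term) {
        List<Integer> p = termPositions.get(term);
        return p == null ? 0 : p.size();
    }

    public List<Integer> positions(String term) {
        List<Integer> p = termPositions.get(term);
        return p == null ? Collections.<Integer>emptyList() : p;
    }
}
```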
>>
>>
>>
>> -----Original Message-----
>> From: Wolfgang Hoschek [mailto:whoschek@lbl.gov]
>> Sent: Thursday, April 14, 2005 2:56 PM
>> To: java-dev@lucene.apache.org
>> Subject: Re: [Performance] Streaming main memory indexing of single
>> strings
>>
>>
>> Otis, this might be a misunderstanding.
>>
>> - I'm not calling optimize(). That piece is commented out if you look
>> again at the code.
>> - The *streaming* use case requires that for each query I add one (and
>> only one) document (aka string) to an empty index:
>>
>> repeat N times (where N is millions or billions):
>> add a single string (aka document) to an empty index
>> query the index
>> drop index (or delete its document)
>>
>> with the following API being called N times: float match(String text,
>> Query query)
>>
>> So there's no possibility of adding many documents and thereafter
>> running the query. This in turn seems to mean that the IndexWriter
>> can't be kept open - unless I manually delete each document after each
>> query to repeatedly reuse the RAMDirectory, which I've also tried
>> before without any significant performance gain - deletion seems to
>> have substantial overhead in itself. Perhaps it would be better if
>> there were a Directory.deleteAllDocuments() or similar. Did you have
>> some other approach in mind?
>>
>> As I said, Lucene's design doesn't seem to fit this streaming use case
>> pattern well. In *this* scenario one could easily do without any
>> locking, and without byte level organization in RAMDirectory and
>> RAMFile, etc because a single small string isn't a large persistent
>> multi-document index.
>>
>> For some background, here's a small example for the kind of XQuery
>> functionality Nux/Lucene integration enables:
>>
>> (: An XQuery that finds all books authored by James that have
>> something
>> to do with "fish", sorted by relevance :)
>> declare namespace lucene = "java:nux.xom.xquery.XQueryUtil";
>> declare variable $query := "fish*~";
>>
>> for $book in /books/book[author="James" and lucene:match(string(.),
>> $query) > 0.0]
>> let $score := lucene:match(string($book), $query)
>> order by $score descending
>> return (<score>{$score}</score>, $book)
>>
>> More interestingly one can use this for classifying and routing XML
>> messages based on rules (i.e. queries) inspecting their content...
>>
>> Any other clues about potential improvements would be greatly
>> appreciated.
>>
>> Wolfgang.
>>
>> On Apr 13, 2005, at 10:09 PM, Otis Gospodnetic wrote:
>>
>>> It looks like you are calling that IndexWriter code in some loops,
>>> opening it and closing it in every iteration of the loop and also
>>> calling optimize. All of those things could be improved.
>>> Keep your IndexWriter open, don't close it, and optimize the index
>>> only
>>> once you are done adding documents to it.
>>>
>>> See the highlights and the snippets in the first hit:
>>> http://www.lucenebook.com/search?query=when+to+optimize
>>>
>>> Otis
>>>
>>>
>>> --- Wolfgang Hoschek <wh...@lbl.gov> wrote:
>>>
>>>> Hi,
>>>>
>>>> I'm wondering if anyone could let me know how to improve Lucene
>>>> performance for "streaming main memory indexing of single strings".
>>>> This would help to effectively integrate Lucene with the Nux XQuery
>>>> engine.
>>>>
>>>> Below is a small microbenchmark simulating STREAMING XQuery fulltext
>>>> search as typical for XML network routers, message queuing systems,
>>>> P2P
>>>> networks, etc. In this on-the-fly main memory indexing scenario,
>>>> each
>>>>
>>>> individual string is immediately matched as soon as it becomes
>>>> available without any persistence involved. This usage scenario and
>>>> corresponding performance profile are quite different in comparison
>>>> to
>>>>
>>>> fulltext search over persistent (read-mostly) indexes.
>>>>
>>>> The benchmark runs at some 3000 lucene queries/sec (lucene-1.4.3)
>>>> which
>>>> is unfortunate news considering the XQuery engine can easily walk
>>>> hundreds of thousands of XML nodes per second. Ideally I'd like to
>>>> run
>>>> at some 100000 queries/sec. Running this through the JDK 1.5
>>>> profiler
>>>> it seems that most time is spent in and below the following calls:
>>>>
>>>> writer = new IndexWriter(dir, analyzer, true);
>>>> writer.addDocument(...);
>>>> writer.close();
>>>>
>>>> I tried quite a few variants of the benchmark with various options,
>>>> unfortunately with little or no effect.
>>>> Lucene just does not seem to be designed to do this sort of "transient
>>>> single string index" thing. All code paths related to opening,
>>>> closing,
>>>> reading, writing, querying and object creation seem to be designed
>>>> for
>>>> large persistent indexes.
>>>>
>>>> Any advice on what I'm missing or what could be done about it would
>>>> be
>>>> greatly appreciated.
>>>>
>>>> Wolfgang.
>>>>
>>>> P.S. the benchmark code is attached as a file below:
>>>>
>>>>> package nux.xom.pool;
>>>>
>>>> import java.io.IOException;
>>>> //import java.io.Reader;
>>>>
>>>> import org.apache.lucene.analysis.Analyzer;
>>>> //import org.apache.lucene.analysis.LowerCaseTokenizer;
>>>> //import org.apache.lucene.analysis.PorterStemFilter;
>>>> //import org.apache.lucene.analysis.SimpleAnalyzer;
>>>> //import org.apache.lucene.analysis.TokenStream;
>>>> import org.apache.lucene.analysis.standard.StandardAnalyzer;
>>>> import org.apache.lucene.document.Document;
>>>> import org.apache.lucene.document.Field;
>>>> //import org.apache.lucene.index.IndexReader;
>>>> import org.apache.lucene.index.IndexWriter;
>>>> import org.apache.lucene.queryParser.ParseException;
>>>> import org.apache.lucene.queryParser.QueryParser;
>>>> import org.apache.lucene.search.Hits;
>>>> import org.apache.lucene.search.IndexSearcher;
>>>> import org.apache.lucene.search.Query;
>>>> import org.apache.lucene.search.Searcher;
>>>> import org.apache.lucene.store.Directory;
>>>> import org.apache.lucene.store.RAMDirectory;
>>>>
>>>> public final class LuceneMatcher { // TODO: make non-public
>>>>
>>>> private final Analyzer analyzer;
>>>> // private final Directory dir = new RAMDirectory();
>>>>
>>>> public LuceneMatcher() {
>>>> this(new StandardAnalyzer());
>>>> // this(new SimpleAnalyzer());
>>>> // this(new StopAnalyzer());
>>>> // this(new Analyzer() {
>>>> // public final TokenStream tokenStream(String fieldName, Reader
>>>> reader) {
>>>> // return new PorterStemFilter(new LowerCaseTokenizer(reader));
>>>> // }
>>>> // });
>>>> }
>>>>
>>>> public LuceneMatcher(Analyzer analyzer) {
>>>> if (analyzer == null)
>>>> throw new IllegalArgumentException("analyzer must not be null");
>>>> this.analyzer = analyzer;
>>>> }
>>>>
>>>> public Query parseQuery(String expression) throws ParseException {
>>>> QueryParser parser = new QueryParser("content", analyzer);
>>>> // parser.setPhraseSlop(0);
>>>> return parser.parse(expression);
>>>> }
>>>>
>>>> /**
>>>> * Returns the relevance score by matching the given index against
>>>> the given
>>>> * Lucene query expression. The index must not contain more than
>>>> one
>>>> Lucene
>>>> * "document" (aka string to be searched).
>>>> */
>>>> public float match(Directory index, Query query) {
>>>> Searcher searcher = null;
>>>> try {
>>>> searcher = new IndexSearcher(index);
>>>> Hits hits = searcher.search(query);
>>>> float score = hits.length() > 0 ? hits.score(0) : 0.0f;
>>>> return score;
>>>> } catch (IOException e) { // should never happen (RAMDirectory)
>>>> throw new RuntimeException(e);
>>>> } finally {
>>>> try {
>>>> if (searcher != null) searcher.close();
>>>> } catch (IOException e) { // should never happen (RAMDirectory)
>>>> throw new RuntimeException(e);
>>>> }
>>>> }
>>>> }
>>>>
>>>> // public float match(String text, Query query) {
>>>> // return match(createIndex(text), query);
>>>> // }
>>>>
>>>> public Directory createIndex(String text) {
>>>> Directory dir = new RAMDirectory();
>>>> IndexWriter writer = null;
>>>> try {
>>>> writer = new IndexWriter(dir, analyzer, true);
>>>> // writer.setUseCompoundFile(false);
>>>> // writer.mergeFactor = 2;
>>>> // writer.minMergeDocs = 1;
>>>> // writer.maxMergeDocs = 1;
>>>>
>>>> writer.addDocument(createDocument(text));
>>>> // writer.optimize();
>>>> return dir;
>>>> } catch (IOException e) { // should never happen (RAMDirectory)
>>>> throw new RuntimeException(e);
>>>> } finally {
>>>> try {
>>>> if (writer != null) writer.close();
>>>> } catch (IOException e) { // should never happen (RAMDirectory)
>>>> throw new RuntimeException(e);
>>>> }
>>>> }
>>>> }
>>>>
>>>> private Document createDocument(String content) {
>>>> Document doc = new Document();
>>>> doc.add(Field.UnStored("content", content));
>>>> // doc.add(Field.Text("x", content));
>>>> return doc;
>>>> }
>>>>
>>>> /**
>>>> * Lucene microbenchmark simulating STREAMING XQuery fulltext
>>>> search
>>>> as
>>>> * typical for XML network routers, message queuing systems, P2P
>>>> networks,
>>>> * etc. In this on-the-fly main memory indexing scenario, each
>>>> individual
>>>> * string is immediately matched as soon as it becomes available
>>>> without any
>>>> * persistence involved. This usage scenario and corresponding
>>>> performance
>>>> * profile is quite different in comparison to fulltext search over
>>>> * persistent (read-mostly) indexes.
>>>> *
>>>> * Example XPath:
>>>> count(/table/row[lucene:match(string(./firstname),
>>>> * "James") > 0.0])
>>>> */
>>>> public static void main(String[] args) throws Exception {
>>>> int k = -1;
>>>> int runs = 5;
>>>> if (args.length > ++k) runs = Integer.parseInt(args[k]);
>>>>
>>>> int nodes = 10000;
>>>> if (args.length > ++k) nodes = Integer.parseInt(args[k]);
>>>>
>>>> String content = "James is out in the woods";
>>>> if (args.length > ++k) content = args[k];
>>>>
>>>> String expression = "James";
>>>> if (args.length > ++k) expression = args[k];
>>>>
>>>> LuceneMatcher matcher = new LuceneMatcher();
>>>> Query query = matcher.parseQuery(expression); // to be reused N
>>>> times
>>>>
>>>> for (int r = 0; r < runs; r++) {
>>>> long start = System.currentTimeMillis();
>>>> int matches = 0;
>>>>
>>>> for (int i = 0; i < nodes; i++) {
>>>> // if (LuceneUtil.match(content + i, expression) > 0.0f) {
>>>> if (matcher.match(matcher.createIndex(content + i), query) >
>>>> 0.0f) {
>>>> matches++;
>>>> }
>>>> }
>>>>
>>>> long end = System.currentTimeMillis();
>>>> System.out.println("matches=" + matches);
>>>> System.out.println("secs=" + ((end-start) / 1000.0f));
>>>> System.out.println("queries/sec=" + (nodes / ((end-start) /
>>>> 1000.0f)));
>>>> System.out.println();
>>>> }
>>>> }
>>>> }
public class StringIndexReader extends IndexReader {
private List terms;
public StringIndexReader(String strings[]) {
super(null);
terms = Arrays.asList(strings);
Collections.sort(terms);
}
public TermFreqVector[] getTermFreqVectors(int docNumber) throws
IOException {
System.out.println("StringIndexReader.getTermFreqVectors");
return new TermFreqVector[0];
}
public TermFreqVector getTermFreqVector(int docNumber, String field)
throws IOException {
System.out.println("StringIndexReader.getTermFreqVector");
return null;
}
public int numDocs() {
System.out.println("StringIndexReader.numDocs");
return 1;
}
public int maxDoc() {
System.out.println("StringIndexReader.maxDoc");
return 1;
}
public Document document(int n) throws IOException {
System.out.println("StringIndexReader.document");
return null;
}
public boolean isDeleted(int n) {
System.out.println("StringIndexReader.isDeleted");
return false;
}
public boolean hasDeletions() {
System.out.println("StringIndexReader.hasDeletions");
return false;
}
public byte[] norms(String field) throws IOException {
// TODO: what value to use for this?
System.out.println("StringIndexReader.norms: " + field);
return new byte[] { 1 };
}
public void norms(String field, byte[] bytes, int offset) throws
IOException {
System.out.println("StringIndexReader.norms: " + field + "*");
}
protected void doSetNorm(int doc, String field, byte value) throws
IOException {
System.out.println("StringIndexReader.doSetNorm");
}
public TermEnum terms() throws IOException {
System.out.println("StringIndexReader.terms");
return terms(null);
}
public TermEnum terms(final Term term) throws IOException {
System.out.println("StringIndexReader.terms: " + term);
TermEnum termEnum = new TermEnum() {
private String currentTerm;
private Iterator iter;
public boolean next() {
System.out.println("TermEnum.next");
// advance to the next term, if any; report whether one was consumed
if (!iter.hasNext())
return false;
currentTerm = (String) iter.next();
return true;
}
public Term term() {
if (iter == null) {
iter = terms.iterator();
while (next()) {
if (currentTerm.startsWith(term.text()))
break;
}
}
System.out.println("TermEnum.term: " + currentTerm);
return new Term(term.field(), currentTerm);
}
public int docFreq() {
System.out.println("TermEnum.docFreq");
return 1;
}
public void close() {
System.out.println("TermEnum.close");
}
};
return termEnum;
}
public int docFreq(Term term) throws IOException {
System.out.println("StringIndexReader.docFreq: " + term);
return terms.contains(term.text()) ? 1 : 0;
}
public TermDocs termDocs() throws IOException {
System.out.println("StringIndexReader.termDocs");
TermDocs td = new TermDocs() {
private boolean done = false;
String currentTerm;
public void seek(Term term) {
System.out.println(".seek: " + term);
currentTerm = term.text();
done = false;
}
public void seek(TermEnum termEnum) {
seek(termEnum.term());
}
public int doc() {
System.out.println(".doc");
return 0;
}
public int freq() {
System.out.println(".freq");
return 1;
}
public boolean next() {
System.out.println(".next");
return false;
}
public int read(int[] docs, int[] freqs) {
System.out.println(".read: " + docs.length);
if (done) return 0;
done = true;
docs[0] = 0;
freqs[0] = freq();
return 1;
}
public boolean skipTo(int target) {
System.out.println(".skipTo");
return false;
}
public void close() {
System.out.println(".close");
}
};
return td;
}
public TermPositions termPositions() throws IOException {
System.out.println("StringIndexReader.termPositions");
return null;
}
protected void doDelete(int docNum) throws IOException {
System.out.println("StringIndexReader.doDelete");
}
protected void doUndeleteAll() throws IOException {
System.out.println("StringIndexReader.doUndeleteAll");
}
protected void doCommit() throws IOException {
System.out.println("StringIndexReader.doCommit");
}
protected void doClose() throws IOException {
System.out.println("StringIndexReader.doClose");
}
public Collection getFieldNames() throws IOException {
System.out.println("StringIndexReader.getFieldNames");
return null;
}
public Collection getFieldNames(boolean indexed) throws IOException {
System.out.println("StringIndexReader.getFieldNames");
return null;
}
public Collection getIndexedFieldNames(Field.TermVector tvSpec) {
System.out.println("StringIndexReader.getIndexedFieldNames");
return null;
}
public Collection getFieldNames(FieldOption fldOption) {
System.out.println("StringIndexReader.getFieldNames");
return null;
}
public static void main(String[] args) {
IndexReader reader = new StringIndexReader(new String[] {"foo",
"bar", "baz"});
IndexSearcher searcher = new IndexSearcher(reader);
Hits hits = null;
try {
hits = searcher.search(new WildcardQuery(new Term("field", "ba*")));
System.out.println("found " + hits.length());
} catch (IOException e) {
e.printStackTrace();
}
}
}
RE: [Performance] Streaming main memory indexing of single strings
Posted by Robert Engels <re...@ix.netcom.com>.
It is really not that involved. Just implement the abstract methods of
IndexReader. And many can be no-op'd because they will never be called in a
"read only" situation.
Methods related to normalization and such can also be no-op'd because you
are only dealing with a single document.
I would think you will find this approach at least an order of magnitude
faster, if not 2.
-----Original Message-----
From: Wolfgang Hoschek [mailto:whoschek@lbl.gov]
Sent: Thursday, April 14, 2005 4:04 PM
To: java-dev@lucene.apache.org
Subject: Re: [Performance] Streaming main memory indexing of single
strings
This seems to be a promising avenue worth exploring. My gut feeling is
that this could easily be 10-100 times faster.
The drawback is that it requires a fair amount of understanding of
intricate Lucene internals, pulling those pieces together and adapting
them as required for the seemingly simple "float match(String text,
Query query)".
I might give it a shot but I'm not sure I'll be able to pull this off!
Is there any similar code I could look at as a starting point?
Wolfgang.
On Apr 14, 2005, at 1:13 PM, Robert Engels wrote:
> I think you are not approaching this the correct way.
>
> Pseudo code:
>
> Subclass IndexReader.
>
> Get tokens from String 'document' using Lucene analyzers.
>
> Build simple hash-map based data structures using tokens for terms,
> and term
> positions.
>
> reimplement termDocs() and termPositions() to use the structures from
> above.
>
> run searches.
>
> start again with next document.
>
>
>
> -----Original Message-----
> From: Wolfgang Hoschek [mailto:whoschek@lbl.gov]
> Sent: Thursday, April 14, 2005 2:56 PM
> To: java-dev@lucene.apache.org
> Subject: Re: [Performance] Streaming main memory indexing of single
> strings
>
>
> Otis, this might be a misunderstanding.
>
> - I'm not calling optimize(). That piece is commented out if you look
> again at the code.
> - The *streaming* use case requires that for each query I add one (and
> only one) document (aka string) to an empty index:
>
> repeat N times (where N is millions or billions):
> add a single string (aka document) to an empty index
> query the index
> drop index (or delete its document)
>
> with the following API being called N times: float match(String text,
> Query query)
>
> So there's no possibility of adding many documents and thereafter
> running the query. This in turn seems to mean that the IndexWriter
> can't be kept open - unless I manually delete each document after each
> query to repeatedly reuse the RAMDirectory, which I've also tried
> before without any significant performance gain - deletion seems to
> have substantial overhead in itself. Perhaps it would be better if
> there were a Directory.deleteAllDocuments() or similar. Did you have
> some other approach in mind?
>
> As I said, Lucene's design doesn't seem to fit this streaming use case
> pattern well. In *this* scenario one could easily do without any
> locking, and without byte level organization in RAMDirectory and
> RAMFile, etc because a single small string isn't a large persistent
> multi-document index.
>
> For some background, here's a small example for the kind of XQuery
> functionality Nux/Lucene integration enables:
>
> (: An XQuery that finds all books authored by James that have something
> to do with "fish", sorted by relevance :)
> declare namespace lucene = "java:nux.xom.xquery.XQueryUtil";
> declare variable $query := "fish*~";
>
> for $book in /books/book[author="James" and lucene:match(string(.),
> $query) > 0.0]
> let $score := lucene:match(string($book), $query)
> order by $score descending
> return (<score>{$score}</score>, $book)
>
> More interestingly one can use this for classifying and routing XML
> messages based on rules (i.e. queries) inspecting their content...
>
> Any other clues about potential improvements would be greatly
> appreciated.
>
> Wolfgang.
>
> On Apr 13, 2005, at 10:09 PM, Otis Gospodnetic wrote:
>
>> It looks like you are calling that IndexWriter code in some loops,
>> opening it and closing it in every iteration of the loop and also
>> calling optimize. All of those things could be improved.
>> Keep your IndexWriter open, don't close it, and optimize the index
>> only
>> once you are done adding documents to it.
>>
>> See the highlights and the snippets in the first hit:
>> http://www.lucenebook.com/search?query=when+to+optimize
>>
>> Otis
>>
>>
>> --- Wolfgang Hoschek <wh...@lbl.gov> wrote:
>>
>>> Hi,
>>>
>>> I'm wondering if anyone could let me know how to improve Lucene
>>> performance for "streaming main memory indexing of single strings".
>>> This would help to effectively integrate Lucene with the Nux XQuery
>>> engine.
>>>
>>> Below is a small microbenchmark simulating STREAMING XQuery fulltext
>>> search as typical for XML network routers, message queuing systems,
>>> P2P
>>> networks, etc. In this on-the-fly main memory indexing scenario, each
>>>
>>> individual string is immediately matched as soon as it becomes
>>> available without any persistence involved. This usage scenario and
>>> corresponding performance profile are quite different in comparison to
>>>
>>> fulltext search over persistent (read-mostly) indexes.
>>>
>>> The benchmark runs at some 3000 lucene queries/sec (lucene-1.4.3)
>>> which
>>> is unfortunate news considering the XQuery engine can easily walk
>>> hundreds of thousands of XML nodes per second. Ideally I'd like to
>>> run
>>> at some 100000 queries/sec. Running this through the JDK 1.5
>>> profiler
>>> it seems that most time is spent in and below the following calls:
>>>
>>> writer = new IndexWriter(dir, analyzer, true);
>>> writer.addDocument(...);
>>> writer.close();
>>>
>>> I tried quite a few variants of the benchmark with various options,
>>> unfortunately with little or no effect.
>>> Lucene just does not seem to be designed to do this sort of "transient
>>> single string index" thing. All code paths related to opening,
>>> closing,
>>> reading, writing, querying and object creation seem to be designed
>>> for
>>> large persistent indexes.
>>>
>>> Any advice on what I'm missing or what could be done about it would
>>> be
>>> greatly appreciated.
>>>
>>> Wolfgang.
>>>
>>> P.S. the benchmark code is attached as a file below:
>>>
>>>> package nux.xom.pool;
>>>
>>> import java.io.IOException;
>>> //import java.io.Reader;
>>>
>>> import org.apache.lucene.analysis.Analyzer;
>>> //import org.apache.lucene.analysis.LowerCaseTokenizer;
>>> //import org.apache.lucene.analysis.PorterStemFilter;
>>> //import org.apache.lucene.analysis.SimpleAnalyzer;
>>> //import org.apache.lucene.analysis.TokenStream;
>>> //import org.apache.lucene.analysis.StopAnalyzer;
>>> import org.apache.lucene.analysis.standard.StandardAnalyzer;
>>> import org.apache.lucene.document.Document;
>>> import org.apache.lucene.document.Field;
>>> //import org.apache.lucene.index.IndexReader;
>>> import org.apache.lucene.index.IndexWriter;
>>> import org.apache.lucene.queryParser.ParseException;
>>> import org.apache.lucene.queryParser.QueryParser;
>>> import org.apache.lucene.search.Hits;
>>> import org.apache.lucene.search.IndexSearcher;
>>> import org.apache.lucene.search.Query;
>>> import org.apache.lucene.search.Searcher;
>>> import org.apache.lucene.store.Directory;
>>> import org.apache.lucene.store.RAMDirectory;
>>>
>>> public final class LuceneMatcher { // TODO: make non-public
>>>
>>> private final Analyzer analyzer;
>>> // private final Directory dir = new RAMDirectory();
>>>
>>> public LuceneMatcher() {
>>> this(new StandardAnalyzer());
>>> // this(new SimpleAnalyzer());
>>> // this(new StopAnalyzer());
>>> // this(new Analyzer() {
>>> // public final TokenStream tokenStream(String fieldName, Reader
>>> reader) {
>>> // return new PorterStemFilter(new LowerCaseTokenizer(reader));
>>> // }
>>> // });
>>> }
>>>
>>> public LuceneMatcher(Analyzer analyzer) {
>>> if (analyzer == null)
>>> throw new IllegalArgumentException("analyzer must not be null");
>>> this.analyzer = analyzer;
>>> }
>>>
>>> public Query parseQuery(String expression) throws ParseException {
>>> QueryParser parser = new QueryParser("content", analyzer);
>>> // parser.setPhraseSlop(0);
>>> return parser.parse(expression);
>>> }
>>>
>>> /**
>>> * Returns the relevance score by matching the given index against
>>> the given
>>> * Lucene query expression. The index must not contain more than one
>>> Lucene
>>> * "document" (aka string to be searched).
>>> */
>>> public float match(Directory index, Query query) {
>>> Searcher searcher = null;
>>> try {
>>> searcher = new IndexSearcher(index);
>>> Hits hits = searcher.search(query);
>>> float score = hits.length() > 0 ? hits.score(0) : 0.0f;
>>> return score;
>>> } catch (IOException e) { // should never happen (RAMDirectory)
>>> throw new RuntimeException(e);
>>> } finally {
>>> try {
>>> if (searcher != null) searcher.close();
>>> } catch (IOException e) { // should never happen (RAMDirectory)
>>> throw new RuntimeException(e);
>>> }
>>> }
>>> }
>>>
>>> // public float match(String text, Query query) {
>>> // return match(createIndex(text), query);
>>> // }
>>>
>>> public Directory createIndex(String text) {
>>> Directory dir = new RAMDirectory();
>>> IndexWriter writer = null;
>>> try {
>>> writer = new IndexWriter(dir, analyzer, true);
>>> // writer.setUseCompoundFile(false);
>>> // writer.mergeFactor = 2;
>>> // writer.minMergeDocs = 1;
>>> // writer.maxMergeDocs = 1;
>>>
>>> writer.addDocument(createDocument(text));
>>> // writer.optimize();
>>> return dir;
>>> } catch (IOException e) { // should never happen (RAMDirectory)
>>> throw new RuntimeException(e);
>>> } finally {
>>> try {
>>> if (writer != null) writer.close();
>>> } catch (IOException e) { // should never happen (RAMDirectory)
>>> throw new RuntimeException(e);
>>> }
>>> }
>>> }
>>>
>>> private Document createDocument(String content) {
>>> Document doc = new Document();
>>> doc.add(Field.UnStored("content", content));
>>> // doc.add(Field.Text("x", content));
>>> return doc;
>>> }
>>>
>>> /**
>>> * Lucene microbenchmark simulating STREAMING XQuery fulltext search
>>> as
>>> * typical for XML network routers, message queuing systems, P2P
>>> networks,
>>> * etc. In this on-the-fly main memory indexing scenario, each
>>> individual
>>> * string is immediately matched as soon as it becomes available
>>> without any
>>> * persistence involved. This usage scenario and corresponding
>>> performance
>>> * profile is quite different in comparison to fulltext search over
>>> * persistent (read-mostly) indexes.
>>> *
>>> * Example XPath: count(/table/row[lucene:match(string(./firstname),
>>> * "James") > 0.0])
>>> */
>>> public static void main(String[] args) throws Exception {
>>> int k = -1;
>>> int runs = 5;
>>> if (args.length > ++k) runs = Integer.parseInt(args[k]);
>>>
>>> int nodes = 10000;
>>> if (args.length > ++k) nodes = Integer.parseInt(args[k]);
>>>
>>> String content = "James is out in the woods";
>>> if (args.length > ++k) content = args[k];
>>>
>>> String expression = "James";
>>> if (args.length > ++k) expression = args[k];
>>>
>>> LuceneMatcher matcher = new LuceneMatcher();
>>> Query query = matcher.parseQuery(expression); // to be reused N
>>> times
>>>
>>> for (int r = 0; r < runs; r++) {
>>> long start = System.currentTimeMillis();
>>> int matches = 0;
>>>
>>> for (int i = 0; i < nodes; i++) {
>>> // if (LuceneUtil.match(content + i, expression) > 0.0f) {
>>> if (matcher.match(matcher.createIndex(content + i), query) >
>>> 0.0f) {
>>> matches++;
>>> }
>>> }
>>>
>>> long end = System.currentTimeMillis();
>>> System.out.println("matches=" + matches);
>>> System.out.println("secs=" + ((end-start) / 1000.0f));
>>> System.out.println("queries/sec=" + (nodes / ((end-start) /
>>> 1000.0f)));
>>> System.out.println();
>>> }
>>> }
>>> }
>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-dev-help@lucene.apache.org
Re: [Performance] Streaming main memory indexing of single strings
Posted by Wolfgang Hoschek <wh...@lbl.gov>.
This seems to be a promising avenue worth exploring. My gut feeling is
that this could easily be 10-100 times faster.
The drawback is that it requires a fair amount of understanding of
intricate Lucene internals, pulling those pieces together and adapting
them as required for the seemingly simple "float match(String text,
Query query)".
I might give it a shot but I'm not sure I'll be able to pull this off!
Is there any similar code I could look at as a starting point?
Wolfgang.
On Apr 14, 2005, at 1:13 PM, Robert Engels wrote:
> I think you are not approaching this the correct way.
>
> Pseudo code:
>
> Subclass IndexReader.
>
> Get tokens from String 'document' using Lucene analyzers.
>
> Build simple hash-map based data structures using tokens for terms,
> and term
> positions.
>
> reimplement termDocs() and termPositions() to use the structures from
> above.
>
> run searches.
>
> start again with next document.
>
>
>
> -----Original Message-----
> From: Wolfgang Hoschek [mailto:whoschek@lbl.gov]
> Sent: Thursday, April 14, 2005 2:56 PM
> To: java-dev@lucene.apache.org
> Subject: Re: [Performance] Streaming main memory indexing of single
> strings
>
>
> Otis, this might be a misunderstanding.
>
> - I'm not calling optimize(). That piece is commented out if you look
> again at the code.
> - The *streaming* use case requires that for each query I add one (and
> only one) document (aka string) to an empty index:
>
> repeat N times (where N is millions or billions):
> add a single string (aka document) to an empty index
> query the index
> drop index (or delete its document)
>
> with the following API being called N times: float match(String text,
> Query query)
>
> So there's no possibility of adding many documents and thereafter
> running the query. This in turn seems to mean that the IndexWriter
> can't be kept open - unless I manually delete each document after each
> query to repeatedly reuse the RAMDirectory, which I've also tried
> before without any significant performance gain - deletion seems to
> have substantial overhead in itself. Perhaps it would be better if
> there were a Directory.deleteAllDocuments() or similar. Did you have
> some other approach in mind?
>
> As I said, Lucene's design doesn't seem to fit this streaming use case
> pattern well. In *this* scenario one could easily do without any
> locking, and without byte level organization in RAMDirectory and
> RAMFile, etc because a single small string isn't a large persistent
> multi-document index.
>
> For some background, here's a small example for the kind of XQuery
> functionality Nux/Lucene integration enables:
>
> (: An XQuery that finds all books authored by James that have something
> to do with "fish", sorted by relevance :)
> declare namespace lucene = "java:nux.xom.xquery.XQueryUtil";
> declare variable $query := "fish*~";
>
> for $book in /books/book[author="James" and lucene:match(string(.),
> $query) > 0.0]
> let $score := lucene:match(string($book), $query)
> order by $score descending
> return (<score>{$score}</score>, $book)
>
> More interestingly one can use this for classifying and routing XML
> messages based on rules (i.e. queries) inspecting their content...
>
> Any other clues about potential improvements would be greatly
> appreciated.
>
> Wolfgang.
>
> On Apr 13, 2005, at 10:09 PM, Otis Gospodnetic wrote:
>
>> It looks like you are calling that IndexWriter code in some loops,
>> opening it and closing it in every iteration of the loop and also
>> calling optimize. All of those things could be improved.
>> Keep your IndexWriter open, don't close it, and optimize the index
>> only
>> once you are done adding documents to it.
>>
>> See the highlights and the snippets in the first hit:
>> http://www.lucenebook.com/search?query=when+to+optimize
>>
>> Otis
>>
>>
>> --- Wolfgang Hoschek <wh...@lbl.gov> wrote:
>>> [original message and benchmark code snipped; quoted in full above]
RE: [Performance] Streaming main memory indexing of single strings
Posted by Robert Engels <re...@ix.netcom.com>.
I think you are not approaching this the correct way.
Pseudo code:
Subclass IndexReader.
Get tokens from String 'document' using Lucene analyzers.
Build simple hash-map based data structures using tokens for terms, and term
positions.
reimplement termDocs() and termPositions() to use the structures from above.
run searches.
start again with next document.
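For what it's worth, that recipe can be sketched in plain Java. Everything below is illustrative only: the lowercase/whitespace tokenizer is a simplified stand-in for a real Lucene Analyzer/TokenStream, and SingleDocIndex, termFreq() and termPositions() are made-up names mirroring what a custom IndexReader subclass would serve, not Lucene APIs:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch: a transient, single-document "index" built from
// tokens, i.e. the hash-map structures a custom IndexReader subclass
// could serve from its termDocs()/termPositions() implementations.
public final class SingleDocIndex {

    // term -> list of token positions within the single document
    private final Map<String, List<Integer>> positions =
        new HashMap<String, List<Integer>>();

    public SingleDocIndex(String text) {
        // Simplified tokenizer (lowercase + whitespace split) standing in
        // for a Lucene Analyzer/TokenStream.
        String[] tokens = text.toLowerCase().split("\\s+");
        for (int pos = 0; pos < tokens.length; pos++) {
            if (tokens[pos].length() == 0) continue;
            List<Integer> list = positions.get(tokens[pos]);
            if (list == null) {
                list = new ArrayList<Integer>();
                positions.put(tokens[pos], list);
            }
            list.add(pos);
        }
    }

    // Roughly what termDocs() needs: the term's frequency in the document.
    public int termFreq(String term) {
        List<Integer> list = positions.get(term);
        return list == null ? 0 : list.size();
    }

    // Roughly what termPositions() needs: the term's token positions.
    public List<Integer> termPositions(String term) {
        List<Integer> list = positions.get(term);
        return list == null ? new ArrayList<Integer>() : list;
    }

    public static void main(String[] args) {
        SingleDocIndex idx = new SingleDocIndex("James is out in the woods");
        System.out.println(idx.termFreq("james"));      // prints 1
        System.out.println(idx.termPositions("woods")); // prints [5]
    }
}
```

Because the whole structure is thrown away after each query, no locking, merging or byte-level Directory bookkeeping is needed, which is exactly the overhead the benchmark is paying for.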
-----Original Message-----
From: Wolfgang Hoschek [mailto:whoschek@lbl.gov]
Sent: Thursday, April 14, 2005 2:56 PM
To: java-dev@lucene.apache.org
Subject: Re: [Performance] Streaming main memory indexing of single
strings
Otis, this might be a misunderstanding.
- I'm not calling optimize(). That piece is commented out if you look
again at the code.
- The *streaming* use case requires that for each query I add one (and
only one) document (aka string) to an empty index:
repeat N times (where N is millions or billions):
add a single string (aka document) to an empty index
query the index
drop index (or delete its document)
with the following API being called N times: float match(String text,
Query query)
So there's no possibility of adding many documents and thereafter
running the query. This in turn seems to mean that the IndexWriter
can't be kept open - unless I manually delete each document after each
query to repeatedly reuse the RAMDirectory, which I've also tried
before without any significant performance gain - deletion seems to
have substantial overhead in itself. Perhaps it would be better if
there were a Directory.deleteAllDocuments() or similar. Did you have
some other approach in mind?
As I said, Lucene's design doesn't seem to fit this streaming use case
pattern well. In *this* scenario one could easily do without any
locking, and without byte level organization in RAMDirectory and
RAMFile, etc because a single small string isn't a large persistent
multi-document index.
For some background, here's a small example for the kind of XQuery
functionality Nux/Lucene integration enables:
(: An XQuery that finds all books authored by James that have something
to do with "fish", sorted by relevance :)
declare namespace lucene = "java:nux.xom.xquery.XQueryUtil";
declare variable $query := "fish*~";
for $book in /books/book[author="James" and lucene:match(string(.),
$query) > 0.0]
let $score := lucene:match(string($book), $query)
order by $score descending
return (<score>{$score}</score>, $book)
More interestingly one can use this for classifying and routing XML
messages based on rules (i.e. queries) inspecting their content...
Any other clues about potential improvements would be greatly
appreciated.
Wolfgang.
On Apr 13, 2005, at 10:09 PM, Otis Gospodnetic wrote:
> It looks like you are calling that IndexWriter code in some loops,
> opening it and closing it in every iteration of the loop and also
> calling optimize. All of those things could be improved.
> Keep your IndexWriter open, don't close it, and optimize the index only
> once you are done adding documents to it.
>
> See the highlights and the snippets in the first hit:
> http://www.lucenebook.com/search?query=when+to+optimize
>
> Otis
>
>
> --- Wolfgang Hoschek <wh...@lbl.gov> wrote:
>> [original message and benchmark code snipped; quoted in full above]
Re: [Performance] Streaming main memory indexing of single strings
Posted by Wolfgang Hoschek <wh...@lbl.gov>.
Otis, this might be a misunderstanding.
- I'm not calling optimize(). That piece is commented out if you look
again at the code.
- The *streaming* use case requires that for each query I add one (and
only one) document (aka string) to an empty index:
repeat N times (where N is millions or billions):
add a single string (aka document) to an empty index
query the index
drop index (or delete its document)
with the following API being called N times: float match(String text,
Query query)
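To pin down that contract, a toy version of the N-times loop might look like this (purely illustrative: StringMatcher is a hypothetical interface, and the substring test merely stands in for a real per-string Lucene match):

```java
// Hypothetical per-string matching API: "parse" the query once, then call
// match() once per incoming string; no persistent index is involved.
interface StringMatcher {
    float match(String text);
}

public final class StreamingMatchDemo {

    // Toy stand-in for query parsing + matching: scores 1.0 when the term
    // occurs in the text, 0.0 otherwise.
    static StringMatcher compile(final String term) {
        final String needle = term.toLowerCase();
        return new StringMatcher() {
            public float match(String text) {
                return text.toLowerCase().indexOf(needle) >= 0 ? 1.0f : 0.0f;
            }
        };
    }

    public static void main(String[] args) {
        StringMatcher matcher = compile("James"); // reused N times
        String[] stream = { "James is out in the woods", "no match here" };
        int matches = 0;
        for (int i = 0; i < stream.length; i++) { // repeat N times: match, discard
            if (matcher.match(stream[i]) > 0.0f) matches++;
        }
        System.out.println("matches=" + matches); // prints matches=1
    }
}
```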
So there's no possibility of adding many documents and thereafter
running the query. This in turn seems to mean that the IndexWriter
can't be kept open - unless I manually delete each document after each
query to repeatedly reuse the RAMDirectory, which I've also tried
before without any significant performance gain - deletion seems to
have substantial overhead in itself. Perhaps it would be better if
there were a Directory.deleteAllDocuments() or similar. Did you have
some other approach in mind?
As I said, Lucene's design doesn't seem to fit this streaming use case
pattern well. In *this* scenario one could easily do without any
locking, and without byte level organization in RAMDirectory and
RAMFile, etc because a single small string isn't a large persistent
multi-document index.
For some background, here's a small example for the kind of XQuery
functionality Nux/Lucene integration enables:
(: An XQuery that finds all books authored by James that have something
to do with "fish", sorted by relevance :)
declare namespace lucene = "java:nux.xom.xquery.XQueryUtil";
declare variable $query := "fish*~";
for $book in /books/book[author="James" and lucene:match(string(.),
$query) > 0.0]
let $score := lucene:match(string($book), $query)
order by $score descending
return (<score>{$score}</score>, $book)
More interestingly one can use this for classifying and routing XML
messages based on rules (i.e. queries) inspecting their content...
Any other clues about potential improvements would be greatly
appreciated.
Wolfgang.
On Apr 13, 2005, at 10:09 PM, Otis Gospodnetic wrote:
> It looks like you are calling that IndexWriter code in some loops,
> opening it and closing it in every iteration of the loop and also
> calling optimize. All of those things could be improved.
> Keep your IndexWriter open, don't close it, and optimize the index only
> once you are done adding documents to it.
>
> See the highlights and the snippets in the first hit:
> http://www.lucenebook.com/search?query=when+to+optimize
>
> Otis
>
>
> --- Wolfgang Hoschek <wh...@lbl.gov> wrote:
>
>> Hi,
>>
>> I'm wondering if anyone could let me know how to improve Lucene
>> performance for "streaming main memory indexing of single strings".
>> This would help to effectively integrate Lucene with the Nux XQuery
>> engine.
>>
>> Below is a small microbenchmark simulating STREAMING XQuery fulltext
>> search as typical for XML network routers, message queuing systems,
>> P2P
>> networks, etc. In this on-the-fly main memory indexing scenario, each
>>
>> individual string is immediately matched as soon as it becomes
>> available without any persistence involved. This usage scenario and
>> corresponding performance profile is quite different in comparison to
>>
>> fulltext search over persistent (read-mostly) indexes.
>>
>> The benchmark runs at some 3000 lucene queries/sec (lucene-1.4.3)
>> which
>> is unfortunate news considering the XQuery engine can easily walk
>> hundreds of thousands of XML nodes per second. Ideally I'd like to
>> run
>> at some 100000 queries/sec. Running this through the JDK 1.5
>> profiler
>> it seems that most time is spent in and below the following calls:
>>
>> writer = new IndexWriter(dir, analyzer, true);
>> writer.addDocument(...);
>> writer.close();
>>
>> I tried quite a few variants of the benchmark with various options,
>> unfortunately with little or no effect.
>> Lucene just does not seem to be designed to do this sort of "transient
>> single string index" thing. All code paths related to opening,
>> closing,
>> reading, writing, querying and object creation seem to be designed
>> for
>> large persistent indexes.
>>
>> Any advice on what I'm missing or what could be done about it would
>> be
>> greatly appreciated.
>>
>> Wolfgang.
>>
>> P.S. the benchmark code is attached as a file below:
>>
>>> package nux.xom.pool;
>>
>> import java.io.IOException;
>> //import java.io.Reader;
>>
>> import org.apache.lucene.analysis.Analyzer;
>> //import org.apache.lucene.analysis.LowerCaseTokenizer;
>> //import org.apache.lucene.analysis.PorterStemFilter;
>> //import org.apache.lucene.analysis.SimpleAnalyzer;
>> //import org.apache.lucene.analysis.TokenStream;
>> import org.apache.lucene.analysis.standard.StandardAnalyzer;
>> import org.apache.lucene.document.Document;
>> import org.apache.lucene.document.Field;
>> //import org.apache.lucene.index.IndexReader;
>> import org.apache.lucene.index.IndexWriter;
>> import org.apache.lucene.queryParser.ParseException;
>> import org.apache.lucene.queryParser.QueryParser;
>> import org.apache.lucene.search.Hits;
>> import org.apache.lucene.search.IndexSearcher;
>> import org.apache.lucene.search.Query;
>> import org.apache.lucene.search.Searcher;
>> import org.apache.lucene.store.Directory;
>> import org.apache.lucene.store.RAMDirectory;
>>
>> public final class LuceneMatcher { // TODO: make non-public
>>
>> private final Analyzer analyzer;
>> // private final Directory dir = new RAMDirectory();
>>
>> public LuceneMatcher() {
>> this(new StandardAnalyzer());
>> // this(new SimpleAnalyzer());
>> // this(new StopAnalyzer());
>> // this(new Analyzer() {
>> // public final TokenStream tokenStream(String fieldName, Reader
>> reader) {
>> // return new PorterStemFilter(new LowerCaseTokenizer(reader));
>> // }
>> // });
>> }
>>
>> public LuceneMatcher(Analyzer analyzer) {
>> if (analyzer == null)
>> throw new IllegalArgumentException("analyzer must not be null");
>> this.analyzer = analyzer;
>> }
>>
>> public Query parseQuery(String expression) throws ParseException {
>> QueryParser parser = new QueryParser("content", analyzer);
>> // parser.setPhraseSlop(0);
>> return parser.parse(expression);
>> }
>>
>> /**
>> * Returns the relevance score by matching the given index against
>> the given
>> * Lucene query expression. The index must not contain more than one
>> Lucene
>> * "document" (aka string to be searched).
>> */
>> public float match(Directory index, Query query) {
>> Searcher searcher = null;
>> try {
>> searcher = new IndexSearcher(index);
>> Hits hits = searcher.search(query);
>> float score = hits.length() > 0 ? hits.score(0) : 0.0f;
>> return score;
>> } catch (IOException e) { // should never happen (RAMDirectory)
>> throw new RuntimeException(e);
>> } finally {
>> try {
>> if (searcher != null) searcher.close();
>> } catch (IOException e) { // should never happen (RAMDirectory)
>> throw new RuntimeException(e);
>> }
>> }
>> }
>>
>> // public float match(String text, Query query) {
>> // return match(createIndex(text), query);
>> // }
>>
>> public Directory createIndex(String text) {
>> Directory dir = new RAMDirectory();
>> IndexWriter writer = null;
>> try {
>> writer = new IndexWriter(dir, analyzer, true);
>> // writer.setUseCompoundFile(false);
>> // writer.mergeFactor = 2;
>> // writer.minMergeDocs = 1;
>> // writer.maxMergeDocs = 1;
>>
>> writer.addDocument(createDocument(text));
>> // writer.optimize();
>> return dir;
>> } catch (IOException e) { // should never happen (RAMDirectory)
>> throw new RuntimeException(e);
>> } finally {
>> try {
>> if (writer != null) writer.close();
>> } catch (IOException e) { // should never happen (RAMDirectory)
>> throw new RuntimeException(e);
>> }
>> }
>> }
>>
>> private Document createDocument(String content) {
>> Document doc = new Document();
>> doc.add(Field.UnStored("content", content));
>> // doc.add(Field.Text("x", content));
>> return doc;
>> }
>>
>> /**
>> * Lucene microbenchmark simulating STREAMING XQuery fulltext search
>> as
>> * typical for XML network routers, message queuing system, P2P
>> networks,
>> * etc. In this on-the-fly main memory indexing scenario, each
>> individual
>> * string is immediately matched as soon as it becomes available
>> without any
>> * persistence involved. This usage scenario and corresponding
>> performance
>> * profile is quite different in comparison to fulltext search over
>> * persistent (read-mostly) indexes.
>> *
>> * Example XPath: count(/table/row[lucene:match(string(./firstname),
>> * "James") > 0.0])
>> */
>> public static void main(String[] args) throws Exception {
>> int k = -1;
>> int runs = 5;
>> if (args.length > ++k) runs = Integer.parseInt(args[k]);
>>
>> int nodes = 10000;
>> if (args.length > ++k) nodes = Integer.parseInt(args[k]);
>>
>> String content = "James is out in the woods";
>> if (args.length > ++k) content = args[k];
>>
>> String expression = "James";
>> if (args.length > ++k) expression = args[k];
>>
>> LuceneMatcher matcher = new LuceneMatcher();
>> Query query = matcher.parseQuery(expression); // to be reused N
>> times
>>
>> for (int r = 0; r < runs; r++) {
>> long start = System.currentTimeMillis();
>> int matches = 0;
>>
>> for (int i = 0; i < nodes; i++) {
>> // if (LuceneUtil.match(content + i, expression) > 0.0f) {
>> if (matcher.match(matcher.createIndex(content + i), query) >
>> 0.0f) {
>> matches++;
>> }
>> }
>>
>> long end = System.currentTimeMillis();
>> System.out.println("matches=" + matches);
>> System.out.println("secs=" + ((end-start) / 1000.0f));
>> System.out.println("queries/sec=" + (nodes / ((end-start) /
>> 1000.0f)));
>> System.out.println();
>> }
>> }
>> }
---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org
Re: [Performance] Streaming main memory indexing of single strings
Posted by Otis Gospodnetic <ot...@yahoo.com>.
It looks like you are calling that IndexWriter code in some loops,
opening it and closing it in every iteration of the loop and also
calling optimize. All of those things could be improved.
Keep your IndexWriter open, don't close it, and optimize the index only
once you are done adding documents to it.
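The pattern described above, sketched against the 1.4 API (dir, analyzer,
and docs are placeholders standing in for your own setup):

// Open the writer once, add everything, optimize a single time, then close.
IndexWriter writer = new IndexWriter(dir, analyzer, true);
for (int i = 0; i < docs.length; i++) {
    writer.addDocument(docs[i]);
}
writer.optimize(); // once, after all documents are added
writer.close();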
See the highlights and the snippets in the first hit:
http://www.lucenebook.com/search?query=when+to+optimize
Otis
--- Wolfgang Hoschek <wh...@lbl.gov> wrote:
> Hi,
>
> I'm wondering if anyone could let me know how to improve Lucene
> performance for "streaming main memory indexing of single strings".
> This would help to effectively integrate Lucene with the Nux XQuery
> engine.
>
> Below is a small microbenchmark simulating STREAMING XQuery fulltext
> search as typical for XML network routers, message queuing systems,
> P2P
> networks, etc. In this on-the-fly main memory indexing scenario, each
>
> individual string is immediately matched as soon as it becomes
> available without any persistence involved. This usage scenario and
> corresponding performance profile is quite different in comparison to
>
> fulltext search over persistent (read-mostly) indexes.
>
> The benchmark runs at some 3000 lucene queries/sec (lucene-1.4.3)
> which
> is unfortunate news considering the XQuery engine can easily walk
> hundreds of thousands of XML nodes per second. Ideally I'd like to
> run
> at some 100000 queries/sec. Running this through the JDK 1.5
> profiler
> it seems that most time is spent in and below the following calls:
>
> writer = new IndexWriter(dir, analyzer, true);
> writer.addDocument(...);
> writer.close();
>
> I tried quite a few variants of the benchmark with various options,
> unfortunately with little or no effect.
> Lucene just does not seem to be designed to do this sort of "transient
> single string index" thing. All code paths related to opening,
> closing,
> reading, writing, querying and object creation seem to be designed
> for
> large persistent indexes.
>
> Any advice on what I'm missing or what could be done about it would
> be
> greatly appreciated.
>
> Wolfgang.
>
> P.S. the benchmark code is attached as a file below:
>
> > package nux.xom.pool;
>
> import java.io.IOException;
> //import java.io.Reader;
>
> import org.apache.lucene.analysis.Analyzer;
> //import org.apache.lucene.analysis.LowerCaseTokenizer;
> //import org.apache.lucene.analysis.PorterStemFilter;
> //import org.apache.lucene.analysis.SimpleAnalyzer;
> //import org.apache.lucene.analysis.TokenStream;
> import org.apache.lucene.analysis.standard.StandardAnalyzer;
> import org.apache.lucene.document.Document;
> import org.apache.lucene.document.Field;
> //import org.apache.lucene.index.IndexReader;
> import org.apache.lucene.index.IndexWriter;
> import org.apache.lucene.queryParser.ParseException;
> import org.apache.lucene.queryParser.QueryParser;
> import org.apache.lucene.search.Hits;
> import org.apache.lucene.search.IndexSearcher;
> import org.apache.lucene.search.Query;
> import org.apache.lucene.search.Searcher;
> import org.apache.lucene.store.Directory;
> import org.apache.lucene.store.RAMDirectory;
>
> public final class LuceneMatcher { // TODO: make non-public
>
> private final Analyzer analyzer;
> // private final Directory dir = new RAMDirectory();
>
> public LuceneMatcher() {
> this(new StandardAnalyzer());
> // this(new SimpleAnalyzer());
> // this(new StopAnalyzer());
> // this(new Analyzer() {
> // public final TokenStream tokenStream(String fieldName, Reader
> reader) {
> // return new PorterStemFilter(new LowerCaseTokenizer(reader));
> // }
> // });
> }
>
> public LuceneMatcher(Analyzer analyzer) {
> if (analyzer == null)
> throw new IllegalArgumentException("analyzer must not be null");
> this.analyzer = analyzer;
> }
>
> public Query parseQuery(String expression) throws ParseException {
> QueryParser parser = new QueryParser("content", analyzer);
> // parser.setPhraseSlop(0);
> return parser.parse(expression);
> }
>
> /**
> * Returns the relevance score by matching the given index against
> the given
> * Lucene query expression. The index must not contain more than one
> Lucene
> * "document" (aka string to be searched).
> */
> public float match(Directory index, Query query) {
> Searcher searcher = null;
> try {
> searcher = new IndexSearcher(index);
> Hits hits = searcher.search(query);
> float score = hits.length() > 0 ? hits.score(0) : 0.0f;
> return score;
> } catch (IOException e) { // should never happen (RAMDirectory)
> throw new RuntimeException(e);
> } finally {
> try {
> if (searcher != null) searcher.close();
> } catch (IOException e) { // should never happen (RAMDirectory)
> throw new RuntimeException(e);
> }
> }
> }
>
> // public float match(String text, Query query) {
> // return match(createIndex(text), query);
> // }
>
> public Directory createIndex(String text) {
> Directory dir = new RAMDirectory();
> IndexWriter writer = null;
> try {
> writer = new IndexWriter(dir, analyzer, true);
> // writer.setUseCompoundFile(false);
> // writer.mergeFactor = 2;
> // writer.minMergeDocs = 1;
> // writer.maxMergeDocs = 1;
>
> writer.addDocument(createDocument(text));
> // writer.optimize();
> return dir;
> } catch (IOException e) { // should never happen (RAMDirectory)
> throw new RuntimeException(e);
> } finally {
> try {
> if (writer != null) writer.close();
> } catch (IOException e) { // should never happen (RAMDirectory)
> throw new RuntimeException(e);
> }
> }
> }
>
> private Document createDocument(String content) {
> Document doc = new Document();
> doc.add(Field.UnStored("content", content));
> // doc.add(Field.Text("x", content));
> return doc;
> }
>
> /**
> * Lucene microbenchmark simulating STREAMING XQuery fulltext search
> as
> * typical for XML network routers, message queuing system, P2P
> networks,
> * etc. In this on-the-fly main memory indexing scenario, each
> individual
> * string is immediately matched as soon as it becomes available
> without any
> * persistence involved. This usage scenario and corresponding
> performance
> * profile is quite different in comparison to fulltext search over
> * persistent (read-mostly) indexes.
> *
> * Example XPath: count(/table/row[lucene:match(string(./firstname),
> * "James") > 0.0])
> */
> public static void main(String[] args) throws Exception {
> int k = -1;
> int runs = 5;
> if (args.length > ++k) runs = Integer.parseInt(args[k]);
>
> int nodes = 10000;
> if (args.length > ++k) nodes = Integer.parseInt(args[k]);
>
> String content = "James is out in the woods";
> if (args.length > ++k) content = args[k];
>
> String expression = "James";
> if (args.length > ++k) expression = args[k];
>
> LuceneMatcher matcher = new LuceneMatcher();
> Query query = matcher.parseQuery(expression); // to be reused N
> times
>
> for (int r = 0; r < runs; r++) {
> long start = System.currentTimeMillis();
> int matches = 0;
>
> for (int i = 0; i < nodes; i++) {
> // if (LuceneUtil.match(content + i, expression) > 0.0f) {
> if (matcher.match(matcher.createIndex(content + i), query) >
> 0.0f) {
> matches++;
> }
> }
>
> long end = System.currentTimeMillis();
> System.out.println("matches=" + matches);
> System.out.println("secs=" + ((end-start) / 1000.0f));
> System.out.println("queries/sec=" + (nodes / ((end-start) /
> 1000.0f)));
> System.out.println();
> }
> }
> }