You are viewing a plain text version of this content. The canonical link for it is here.
Posted to java-user@lucene.apache.org by Preetham Kajekar <pr...@cisco.com> on 2009/01/22 21:10:43 UTC
Re: Combining results of multiple indexes
Hi,
Just thought of sharing some more progress I made on this.
This time I created multiple (2) indexWriter writing different
documents (based on if it is odd or even based on an id - not doc-id) to
different indexes and the performance seems to scale up based on the
number of threads (and the number of CPU's.
So while querying, I will use all these indexes to get matches.
What do you think about this ? Will querying etc be considerable slower ?
Thanks,
~preetham
Preetham Kajekar wrote:
> Hi,
> I noticed that the doc id is the same. So, if I have HitCollector,
> just collect the doc-ids of both Searchers (for the two indexes) and
> find the intersection between them, it would work. Also, get the doc
> is even where there are large number of hits is fast.
>
> Of course, I am using something undocumented of Lucene.
>
> Thanks,
> ~preetham
>
> Preetham Kajekar wrote:
>> Thanks. Yep the code is very easy. However, it take about 3 mins to
>> complete merging.
>>
>> Looks like I will need to have an out of band merging of indexes once
>> they are closed (planning to store about 50mil entries in each index
>> partition)
>>
>>
>> However, as the data is being indexed, is there any other way to
>> combine results ?
>>
>> I could get the results of one index, get all the hits and then apply
>> this as a filter for the next index. But if there are large number of
>> hits (which is likely to be the case), this would not perform too well.
>>
>> Do you think the document id can be used in anyway. How is the
>> document id generated ? After all, i have the two indexes operating
>> on a common List of objects. Would the doc is in index1 and index2
>> for object N in the list be the same ?
>>
>>
>> Thanks,
>> ~preetham
>>
>> Erick Erickson wrote:
>>> You will be stunned at how easy it is. The merging code should be
>>> a dozen lines (and that only if you are merging 6 or so indexes)....
>>>
>>> See IndexWriter.addIndexes or
>>> IndexWriter.addIndexesNoOptimize
>>>
>>> Best
>>> Erick
>>>
>>> On Thu, Dec 18, 2008 at 5:03 AM, Preetham Kajekar
>>> <pr...@cisco.com>wrote:
>>>
>>>
>>>> Hi,
>>>> I tried out a single IndexWriter used by two threads to index
>>>> different
>>>> fields. It is slower than using two separate IndexWriters. These
>>>> are my
>>>> findings
>>>>
>>>> All Fields (9) using 1 IndexWriter 1 Thread - 38,000 object per sec
>>>> 5 Fields using 1 IndexWriter 1 Thread - 62,000 object per sec
>>>> All Fields (9) using 1 IndexWriter 2 Thread - 29,000 object per sec
>>>> All Fields (9) using 2 IndexWriter 2 Thread - 55,000 object per sec
>>>>
>>>> So, it looks like I will have figure how to combine results of
>>>> multiple
>>>> indexes.
>>>>
>>>> Thanks,
>>>> ~preetham
>>>>
>>>>
>>>> Preetham Kajekar wrote:
>>>>
>>>>
>>>>> Thanks Erick and Michael.
>>>>> I will try out these suggestions and post my findings.
>>>>>
>>>>> ~preetham
>>>>>
>>>>> Erick Erickson wrote:
>>>>>
>>>>>
>>>>>> Well, maybe if I'd read the original post more carefully I'd have
>>>>>> figured
>>>>>> that out,
>>>>>> sorry 'bout that.
>>>>>>
>>>>>> I *think* I remember reading somewhere on the email lists that your
>>>>>> indexing
>>>>>> speed goes up pretty linearly as the number of indexing tasks
>>>>>> approaches
>>>>>> the number of CPUs. Are you, perhaps, on a dual-core machine? But do
>>>>>> search
>>>>>> the mail archives because my memory may not be accurate.
>>>>>>
>>>>>> You can easily combine indexes by IndexWriter.addIndexes BTW.
>>>>>> Personally
>>>>>> I prefer fewer indexes if you can get away with it. But I'd only
>>>>>> try this
>>>>>> after
>>>>>> Michael's suggestion of using multiple threads on a single
>>>>>> underlying
>>>>>> writer.
>>>>>>
>>>>>> You could even think about using N machines to create M fragments
>>>>>> then
>>>>>> combining them all afterwards if your logs are static enough to
>>>>>> make that
>>>>>> reasonable. Combining indexes may take a while though.....
>>>>>>
>>>>>> Best
>>>>>> Erick
>>>>>>
>>>>>> On Wed, Dec 17, 2008 at 10:46 AM, Preetham Kajekar
>>>>>> <preetham@cisco.com
>>>>>>
>>>>>>> wrote:
>>>>>>>
>>>>>>
>>>>>>
>>>>>>> Hi Erick,
>>>>>>> Thanks for the response. Replies inline.
>>>>>>>
>>>>>>> Erick Erickson wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> The very first question is always "are you opening a new searcher
>>>>>>>> each time you query"? But you've looked at the Wiki so I assume
>>>>>>>> not.
>>>>>>>> This question is closely tied to what kind of latency you can
>>>>>>>> tolerate.
>>>>>>>>
>>>>>>>> A few more details, please. What's slow? Queries? Indexing?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>> Indexing. Again, it is not slow. It is just faster with two
>>>>>>> separate
>>>>>>> indexers in two threads.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> How slow? 100ms? 100s? What are your target times and
>>>>>>>> what are you seeing?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>> With a single indexer in a single thread, I can index about
>>>>>>> 20,000 event
>>>>>>> objects per second. With 2 thread and 2 indexers, it is close to
>>>>>>> 50,000.
>>>>>>> :-)
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> How big is your index? 100M? 100G? What kind of VM
>>>>>>>> parameters are you specifying?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>> The index will have about 20mil entries. The size of the index
>>>>>>> lands up
>>>>>>> being about 500M.
>>>>>>> I start the VM with 1G of heap. No other options for GC etc is
>>>>>>> used.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> As an aside, do note that there's no requirement in Lucene that
>>>>>>>> each document have the same fields, so it's unclear why you
>>>>>>>> need two indexes, but perhaps some of the answers to the above
>>>>>>>> will help us understand.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>> Like I mentioned, Lucene does the job much faster with two indexes.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> Also, be very very careful what you measure when you measure
>>>>>>>> queries. You absolutely *have* to put some instrumentation in
>>>>>>>> the code since "slow queries" can result from things other than
>>>>>>>> searching. For instance, iterating over a Hits object for 100s of
>>>>>>>> documents....
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>> The Query speeds are much faster than what I need :-) So no
>>>>>>> complains
>>>>>>> here.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> Show the code, man <G>!
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>> Code below. EvIndexer is the base class. There are two
>>>>>>> subclasses which
>>>>>>> implement addEvFieldsToIndexDoc() (template pattern) to add
>>>>>>> different
>>>>>>> fields
>>>>>>> to the index. that code is also pasted below
>>>>>>>
>>>>>>> --Code ---
>>>>>>>
>>>>>>> BaseClass
>>>>>>>
>>>>>>> public EvIndexer(String indexName) throws Exception {
>>>>>>> this.name = indexName;
>>>>>>> a = new KeywordAnalyzer();
>>>>>>> INDEX_PATH = System.getProperty(StoreManager.PROP_DB_DB_LOC,
>>>>>>> "./index/");
>>>>>>> FSDirectory directory = FSDirectory.getDirectory(INDEX_PATH +
>>>>>>> File.separatorChar + indexName, NoLockFactory.getNoLockFactory());
>>>>>>> indexWriter = new IndexWriter(directory, a,
>>>>>>> IndexWriter.MaxFieldLength.LIMITED);
>>>>>>> //indexWriter.setUseCompoundFile(false);
>>>>>>> //indexWriter.setRAMBufferSizeMB(256);
>>>>>>> }
>>>>>>> /** Method implemented by extending classes to add data
>>>>>>> into the
>>>>>>> index document for the
>>>>>>> * given event
>>>>>>> *
>>>>>>> * @param d
>>>>>>> */
>>>>>>> protected abstract void addEvFieldsToIndexDoc(Document d, Ev
>>>>>>> event);
>>>>>>> public void addToIndex(Ev ev) throws Exception {
>>>>>>> noOfEventsIndexed++;
>>>>>>> Document d = new Document();
>>>>>>> addEvFieldsToIndexDoc(d,
>>>>>>> ev);
>>>>>>> indexWriter.addDocument(d);
>>>>>>> if ((noOfEventsIndexed % COMMIT_INTERVAL) == 0) {
>>>>>>> System.out.println(name + " indexed " +
>>>>>>> NumberFormat.getInstance().format(noOfEventsIndexed) + " Commiting
>>>>>>> them");
>>>>>>> commit();
>>>>>>> } }
>>>>>>>
>>>>>>> DerievdClass1
>>>>>>> protected void addEvFieldsToIndexDoc(Document d, Ev ev) {
>>>>>>> //noOfEventsIndexed++;
>>>>>>> Field id = new Field(EV_ID, Long.toString(ev.getId()),
>>>>>>> Field.Store.YES, Field.Index.NO);
>>>>>>> Field src = new Field(EV_SRC, Long.toString(ev.getSrcId()),
>>>>>>> Field.Store.NO, Field.Index.NOT_ANALYZED);
>>>>>>> Field type = new Field(EV_TYPE,
>>>>>>> Integer.toString(ev.getEventTypeId()), Field.Store.NO,
>>>>>>> Field.Index.NOT_ANALYZED);
>>>>>>> Field pri = new Field(EV_PRI,
>>>>>>> Short.toString(ev.getPriority()) ,
>>>>>>> Field.Store.NO, Field.Index.NOT_ANALYZED);
>>>>>>> Field time = new Field(EV_TIME,
>>>>>>> getHexString(ev.getRecvTime()) ,
>>>>>>> Field.Store.NO, Field.Index.NOT_ANALYZED);
>>>>>>> d.add(id);
>>>>>>> d.add(src);
>>>>>>> d.add(type);
>>>>>>> d.add(pri);
>>>>>>> d.add(time);
>>>>>>> //noOfFieldsIndexed += 4;
>>>>>>> }
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Thanks for the support.
>>>>>>> ~preetham
>>>>>>>
>>>>>>>
>>>>>>> Best
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> Erick
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Dec 17, 2008 at 9:40 AM, Preetham Kajekar
>>>>>>>> <preetham@cisco.com
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> Hi Grant,
>>>>>>>>> Thanks four response. Replies inline.
>>>>>>>>>
>>>>>>>>> Grant Ingersoll wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> On Dec 17, 2008, at 12:57 AM, Preetham Kajekar wrote:
>>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> I am new to Lucene. I am not using it as a pure text indexer.
>>>>>>>>>>>
>>>>>>>>>>> I am trying to index a Java object which has about 10 fields
>>>>>>>>>>> (like
>>>>>>>>>>> id,
>>>>>>>>>>> time, srcIp, dstIp) - most of them being numerical values.
>>>>>>>>>>> In order to speed up indexing, I figured that having two
>>>>>>>>>>> separate
>>>>>>>>>>> indexers, each of them indexing different set of fields
>>>>>>>>>>> works great.
>>>>>>>>>>> So
>>>>>>>>>>> I
>>>>>>>>>>> have the first 5 fields in index1 and the remaining in index2.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>> Can you explain this a bit more? Are those two fields really
>>>>>>>>>> large
>>>>>>>>>> org
>>>>>>>>>> something? How are you obtaining them? How are you
>>>>>>>>>> correlating the
>>>>>>>>>> documents between the two indexes? Did you actually try a
>>>>>>>>>> single
>>>>>>>>>> index
>>>>>>>>>> and
>>>>>>>>>> it was too slow?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>> I have a java object which has about 10 fields. However, the
>>>>>>>>> fields
>>>>>>>>> are
>>>>>>>>> not
>>>>>>>>> fixed. The java object is essentially a representation of
>>>>>>>>> Syslogs from
>>>>>>>>> network devices. So different syslogs have different fields. Each
>>>>>>>>> field
>>>>>>>>> has
>>>>>>>>> a unique id and a value (mostly numeric types, so i convert it to
>>>>>>>>> string).
>>>>>>>>> There are some fixed fields. So the object is a list of fields
>>>>>>>>> which
>>>>>>>>> is
>>>>>>>>> produced by a parser.
>>>>>>>>> I am trying to index using two indexers in two separate
>>>>>>>>> threads- one
>>>>>>>>> for
>>>>>>>>> fixed and another for the non-fixed fields. Except for a
>>>>>>>>> unique id, I
>>>>>>>>> do
>>>>>>>>> not
>>>>>>>>> store the fields in Lucene - i just index them. From the
>>>>>>>>> index, i get
>>>>>>>>> the
>>>>>>>>> unique id which is all I care about. (the objects are stored
>>>>>>>>> elsewhere
>>>>>>>>> and
>>>>>>>>> can be looked up based on this unique id).
>>>>>>>>> I did try using a single indexer, but things were quite slow.
>>>>>>>>> Getting
>>>>>>>>> high
>>>>>>>>> throughput is crucial and having two indexers seemed to do
>>>>>>>>> very well.
>>>>>>>>> (more
>>>>>>>>> than twice as fast)
>>>>>>>>>
>>>>>>>>> Further, the index will never be modified and I can have just one
>>>>>>>>> thread
>>>>>>>>> writing to the index. If there are any other performance tips
>>>>>>>>> would be
>>>>>>>>> very
>>>>>>>>> helpful. I have already looked at the wiki link regarding
>>>>>>>>> performance
>>>>>>>>> and
>>>>>>>>> using some of them.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> ~preetham
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Now, I want to have boolean AND query's looking for values in
>>>>>>>>>> both
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> indexes. Like f1=1234 AND f7=ABCD.f1 and f7 and present in two
>>>>>>>>>>> separate
>>>>>>>>>>> indexes. Would using the MultiIndexReader help ? Since I am
>>>>>>>>>>> doing an
>>>>>>>>>>> AND, I
>>>>>>>>>>> dont expect that it would work.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> ~preetham
>>>>>>>>>>>
>>>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>>>>>>>>> For additional commands, e-mail:
>>>>>>>>>>> java-user-help@lucene.apache.org
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>> --------------------------
>>>>>>>>>> Grant Ingersoll
>>>>>>>>>>
>>>>>>>>>> Lucene Helpful Hints:
>>>>>>>>>> http://wiki.apache.org/lucene-java/BasicsOfPerformance
>>>>>>>>>> http://wiki.apache.org/lucene-java/LuceneFAQ
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>>>>>>>> For additional commands, e-mail:
>>>>>>>>>> java-user-help@lucene.apache.org
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>
>>>>>>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>>>>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>> ---------------------------------------------------------------------
>>>>>>>
>>>>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>
>>>>
>>>>
>>>
>>>
>>
>
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org