Posted to java-user@lucene.apache.org by Shruthi <ss...@imedx.com> on 2014/05/19 12:40:27 UTC

NewBie To Lucene || Perfect configuration on a 64 bit server

Hi,

We are using Lucene 4.7 in our server application to search documents stored on a NAS share. We have 10+ million documents and have decided not to index all of them up front. The strategy we apply is as follows:

1. A client makes a request with a search phrase. The Lucene application indexes a list of at most 500 documents and searches for the phrase on the index just constructed.

2. After the search completes and results are returned to the client, we delete the index.

3. When a second client makes a request, we build, search, and delete the index all over again.

4. Requests can also arrive in parallel.



We have decided to use MMapDirectory for above requirement. Given below is our server configuration.


System architecture - 64 bit.
OS - Windows 2008, 64 bit Virtual Machine
Java JRE - 1.6
RAM - 4 GB
CPU - 4 CPU

Can you guide us on the best way/configuration to use MMapDirectory?

Thanks,
Shruthi Sethi
SR. SOFTWARE ENGINEER
iMedX
OFFICE: 033-4001-5789 ext. N/A
MOBILE: 91-9903957546
EMAIL: ssethi@imedx.com
WEB: www.imedx.com




RE: NewBie To Lucene || Perfect configuration on a 64 bit server

Posted by Shruthi <ss...@imedx.com>.
Hi,

I have used the data import handler feature to import 500 RTFs into Solr and have been able to search the query string as well. I have used the fq parameter as a filter query to limit my search space to only the document ids that I am interested in. However, I am stuck on two questions:
1. Does the filter query feature actually limit the search space, or does it scan the entire multi-million document index and then apply a filter? I am confused about the internals of the fq implementation. Please shed some light on it.
2. Our input documents are RTFs and we want to display the Solr response as HTML in the browser. All the resultant RTFs should be displayed in the browser with their formatting retained. I tried using the XSLTResponseWriter on the XML output given by Solr to produce HTML, but the formatting is all lost. Is there any other way?
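
For context, this is roughly the kind of request we issue (the collection name, field names and ids here are hypothetical):

curl "http://localhost:8983/solr/collection1/select?q=text:phrase&fq=id:(doc1+OR+doc2+OR+doc3)&wt=xml"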

Thanks,

Shruthi Sethi



-----Original Message-----
From: Erick Erickson [mailto:erickerickson@gmail.com] 
Sent: Saturday, May 31, 2014 1:21 AM
To: java-user
Subject: Re: NewBie To Lucene || Perfect configuration on a 64 bit server

Hmmm, you might want to move this over to the Solr users list. This list
is for Lucene, which doesn't have anything to do with post.jar ;)...


On Fri, May 30, 2014 at 8:25 AM, Erick Erickson <er...@gmail.com>
wrote:

> Try a cURL statement like:
>
> curl "
> http://localhost:8983/solr/update/extract?literal.id=doc33&captureAttr=true&defaultField=text"
> -F "myfile=@testRTFVarious.rtf
>
>
> first, then work up to the post.jar bits...
>
>
> Two cautions:
>
> 1> make sure to commit afterwards. Something like
>
> http://localhost:8983/solr/collection1/update?commit=true
>
> will work
>
> 2> uncomment the line:
>
> <dynamicField name="*" type="string" multiValued="true" />
>
>
> in your schema.xml (and restart solr).
>
>
> Also, track your output to see if it went through successfully.
>
>
> Best,
> Erick
>
>
> On Fri, May 30, 2014 at 7:35 AM, Shruthi <ss...@imedx.com> wrote:
>
>> Hi,
>>
>> Finally I have been able to convince my team to index all the documents
>> first and then search, rather than do on-the-fly indexing. I have set up
>> Solr on my machine but am unable to index RTFs. I have followed the
>> tutorial, but RTF is mentioned nowhere. Can someone please help me?
>> I tried the following options:
>> java -Dtype=application/RTF -jar post.jar *.RTF
>> java -Dtype=application/rtf -jar post.jar *.RTF
>> java -Dtype=text/RTF -jar post.jar *.RTF
>>
>> Thanks,
>> Shruthi Sethi
>>
>>
>>
>> -----Original Message-----
>> From: Ralf Heyde [mailto:xoodrenalin@gmx.de]
>> Sent: Tuesday, May 27, 2014 11:56 AM
>> To: java-user@lucene.apache.org
>> Subject: AW: NewBie To Lucene || Perfect configuration on a 64 bit server
>>
>> Hey,
>>
>> I have several notes about your process.
>>
>> 1st: How do you select the documents that you pass to the index for
>> further searching? Maybe it is more straightforward to "find" them in
>> your programming language?
>> 2nd: Storage is cheap; buy a hard disk and store the overall index. The
>> most expensive operations are the indexing and the first read access
>> (caching at the Lucene / OS level). Imagine what happens when you build
>> the index and delete it afterwards, just for a "simple" search operation
>> on a subset of your documents.
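>>
>> A rough sketch of that approach (Lucene 4.x API; the path and the
>> "docId" field are hypothetical): build the index once on disk, then
>> restrict each search to the requested subset with a filter instead of
>> rebuilding an index per request:
>>
>> import java.io.File;
>> import org.apache.lucene.index.DirectoryReader;
>> import org.apache.lucene.index.Term;
>> import org.apache.lucene.queries.TermsFilter;
>> import org.apache.lucene.search.*;
>> import org.apache.lucene.store.MMapDirectory;
>>
>> public class SubsetSearch {
>>     public static void main(String[] args) throws Exception {
>>         // Open the persistent on-disk index that was built once.
>>         DirectoryReader reader = DirectoryReader.open(
>>                 new MMapDirectory(new File("/data/index")));
>>         IndexSearcher searcher = new IndexSearcher(reader);
>>         // Limit this request to a known subset of document ids.
>>         TermsFilter subset = new TermsFilter(
>>                 new Term("docId", "doc1"), new Term("docId", "doc2"));
>>         TopDocs hits = searcher.search(
>>                 new TermQuery(new Term("body", "phrase")), subset, 10);
>>         System.out.println(hits.totalHits + " matches in the subset");
>>         reader.close();
>>     }
>> }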
>>
>> Cheers, Ralf
>>
>>
>>
>> -----Ursprüngliche Nachricht-----
>> Von: rulinma [mailto:rulinma@gmail.com]
>> Gesendet: Dienstag, 27. Mai 2014 03:14
>> An: java-user@lucene.apache.org
>> Betreff: RE: NewBie To Lucene || Perfect configuration on a 64 bit server
>>
>> 1000+ is Solr; Lucene is faster.
>>
>>
>>
>>
>>
>>
>



Re: search performance

Posted by Jamie <ja...@mailarchiva.com>.
I assume you meant 1000 documents. Yes, the page size is in fact 
configurable. However, it only retrieves the page size * 3: it preloads 
the following and previous pages too. The point is, it only retrieves 
the documents that are needed.


On 2014/06/02, 3:03 PM, Tincu Gabriel wrote:
> My bad. It's using the RAMDirectory as a cache and a delegate directory
> that you pass in the constructor to do the disk operations, limiting the
> use of the RAMDirectory to files that fit a certain size. So I guess the
> underlying Directory implementation will be whatever you choose it to be.
> I'd still try using an MMapDirectory and see if that improves performance.
> Also, regarding the pagination, you said you're retrieving 1000 documents
> at a time. Does that mean that if a query matches 10000 documents you want
> all of them retrieved?
>
>
> On Mon, Jun 2, 2014 at 12:51 PM, Jamie <ja...@mailarchiva.com> wrote:
>
>> I was under the impression that NRTCachingDirectory will instantiate an
>> MMapDirectory if a 64 bit platform is detected? Is this not the case?
>>
>>
>> On 2014/06/02, 2:09 PM, Tincu Gabriel wrote:
>>
>>> MMapDirectory will do the job for you. RAMDirectory has a big warning in
>>> the class description stating that performance will get killed by an
>>> index larger than a few hundred MB, and NRTCachingDirectory is a wrapper
>>> for RAMDirectory and suitable for low update rates. MMap will use the
>>> system RAM to cache as much of the index as it can and only hit disk when
>>> the portion of the index you're trying to access isn't cached. I'd put my
>>> money on switching directory implementations and see what kind of
>>> performance gains that brings to the table.
>>>
>>>




Re: search performance

Posted by Tincu Gabriel <ti...@gmail.com>.
My bad. It's using the RAMDirectory as a cache and a delegate directory
that you pass in the constructor to do the disk operations, limiting the
use of the RAMDirectory to files that fit a certain size. So I guess the
underlying Directory implementation will be whatever you choose it to be.
I'd still try using an MMapDirectory and see if that improves performance.
Also, regarding the pagination, you said you're retrieving 1000 documents
at a time. Does that mean that if a query matches 10000 documents you want
all of them retrieved?
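
A minimal sketch of that combination, reusing the 5 MB / 60 MB thresholds quoted elsewhere in this thread (the index path is hypothetical):

import java.io.File;
import java.io.IOException;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.MMapDirectory;
import org.apache.lucene.store.NRTCachingDirectory;

public class DirectorySetup {
    static Directory open(File path) throws IOException {
        // Small, freshly flushed segments are cached in RAM; everything
        // else is delegated to the wrapped MMapDirectory.
        return new NRTCachingDirectory(new MMapDirectory(path), 5.0, 60.0);
    }
}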


On Mon, Jun 2, 2014 at 12:51 PM, Jamie <ja...@mailarchiva.com> wrote:

> I was under the impression that NRTCachingDirectory will instantiate an
> MMapDirectory if a 64 bit platform is detected? Is this not the case?
>
>
> On 2014/06/02, 2:09 PM, Tincu Gabriel wrote:
>
>> MMapDirectory will do the job for you. RAMDirectory has a big warning in
>> the class description stating that performance will get killed by an
>> index larger than a few hundred MB, and NRTCachingDirectory is a wrapper
>> for RAMDirectory and suitable for low update rates. MMap will use the
>> system RAM to cache as much of the index as it can and only hit disk when
>> the portion of the index you're trying to access isn't cached. I'd put my
>> money on switching directory implementations and see what kind of
>> performance gains that brings to the table.
>>
>>
>

Re: search performance

Posted by Jamie <ja...@mailarchiva.com>.
I was under the impression that NRTCachingDirectory will instantiate an 
MMapDirectory if a 64 bit platform is detected? Is this not the case?

On 2014/06/02, 2:09 PM, Tincu Gabriel wrote:
> MMapDirectory will do the job for you. RAMDirectory has a big warning in
> the class description stating that performance will get killed by an
> index larger than a few hundred MB, and NRTCachingDirectory is a wrapper
> for RAMDirectory and suitable for low update rates. MMap will use the
> system RAM to cache as much of the index as it can and only hit disk when
> the portion of the index you're trying to access isn't cached. I'd put my
> money on switching directory implementations and see what kind of
> performance gains that brings to the table.
>


Re: search performance

Posted by Tincu Gabriel <ti...@gmail.com>.
MMapDirectory will do the job for you. RAMDirectory has a big warning in
the class description stating that performance will get killed by an
index larger than a few hundred MB, and NRTCachingDirectory is a wrapper
for RAMDirectory and suitable for low update rates. MMap will use the
system RAM to cache as much of the index as it can and only hit disk when
the portion of the index you're trying to access isn't cached. I'd put my
money on switching directory implementations and see what kind of
performance gains that brings to the table.
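
A minimal sketch of the switch (Lucene 4.x; the path is hypothetical). Note that FSDirectory.open already picks MMapDirectory on 64-bit platforms, or you can force it explicitly:

import java.io.File;
import java.io.IOException;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.MMapDirectory;

public class OpenIndexDirectory {
    static Directory open(File path, boolean forceMMap) throws IOException {
        // FSDirectory.open() chooses MMapDirectory on 64-bit JREs;
        // passing forceMMap = true makes the choice explicit.
        return forceMMap ? new MMapDirectory(path) : FSDirectory.open(path);
    }
}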


On Mon, Jun 2, 2014 at 11:50 AM, Jamie <ja...@mailarchiva.com> wrote:

> Jack
>
> First off, thanks for applying your mind to our performance problem.
>
>
> On 2014/06/02, 1:34 PM, Jack Krupansky wrote:
>
>> Do you have enough system memory to fit the entire index in OS system
>> memory so that the OS can fully cache it instead of thrashing with I/O? Do
>> you see a lot of I/O or are the queries compute-bound?
>>
> Nice idea. The index is 200GB, the machine currently has 128GB RAM. We are
> using SSDs, but disappointingly, installing them didn't reduce search times
> to acceptable levels. I'll have to check your last question regarding
> I/O... I assume it is I/O bound, though will double check...
>
> Currently, we are using
>
> fsDirectory = new NRTCachingDirectory(fsDir, 5.0, 60.0);
>
> Are you proposing we increase maxCachedMB or use the RAMDirectory? With
> the latter, we will still need to persist the index data to disk, as it
> is undergoing constant updates.
>
>
>> You said you have a 128GB machine, so that sounds small for your index.
>> Have you tried a 256GB machine?
>>
> Nope.. didn't think it would make much of a difference. I suppose, assuming
> we could store the entire index in RAM it would be helpful. How does one do
> this with Lucene, while still persisting the data?
>
>
>> How frequent are your commits for updates while doing queries?
>>
> Around ten to fifteen documents are being constantly added per second.
>
> Thanks again
>
>
> Jamie
>
>
>
>

Re: search performance

Posted by Jamie <ja...@mailarchiva.com>.
Jack

First off, thanks for applying your mind to our performance problem.

On 2014/06/02, 1:34 PM, Jack Krupansky wrote:
> Do you have enough system memory to fit the entire index in OS system 
> memory so that the OS can fully cache it instead of thrashing with 
> I/O? Do you see a lot of I/O or are the queries compute-bound?
Nice idea. The index is 200GB, the machine currently has 128GB RAM. We 
are using SSDs, but disappointingly, installing them didn't reduce 
search times to acceptable levels. I'll have to check your last question 
regarding I/O... I assume it is I/O bound, though will double check...

Currently, we are using

fsDirectory = new NRTCachingDirectory(fsDir, 5.0, 60.0);

Are you proposing we increase maxCachedMB or use the RAMDirectory? With 
the latter, we will still need to persist the index data to disk, as 
it is undergoing constant updates.
>
> You said you have a 128GB machine, so that sounds small for your 
> index. Have you tried a 256GB machine?
Nope.. didn't think it would make much of a difference. I suppose, 
assuming we could store the entire index in RAM it would be helpful. How 
does one do this with Lucene, while still persisting the data?
>
> How frequent are your commits for updates while doing queries?
Around ten to fifteen documents are being constantly added per second.

Thanks again

Jamie




Re: search performance

Posted by Jack Krupansky <ja...@basetechnology.com>.
Do you have enough system memory to fit the entire index in OS system memory 
so that the OS can fully cache it instead of thrashing with I/O? Do you see 
a lot of I/O or are the queries compute-bound?

You said you have a 128GB machine, so that sounds small for your index. Have 
you tried a 256GB machine?

How frequent are your commits for updates while doing queries?

-- Jack Krupansky

-----Original Message----- 
From: Jamie
Sent: Monday, June 2, 2014 2:51 AM
To: java-user@lucene.apache.org
Subject: search performance

Greetings

Despite following all the recommended optimizations (as described at
http://wiki.apache.org/lucene-java/ImproveSearchingSpeed), in some of
our installations, search performance has reached the point where it is
unacceptably slow. For instance, in one environment, the total index
size is 200GB, with 150 million documents indexed. With NRT enabled,
search time is roughly 5 minutes on average. The server resources are:
2x6 Core Intel CPU, 128GB, 2 SSD for index and RAID 0, with Linux.

The only thing we haven't yet done, is to upgrade Lucene from 4.7.x to
4.8.x. Is this likely to make any noticeable difference in performance?

Clearly, longer term, we need to move to a distributed search model. We
thought to take advantage of the distributed search features offered in
Solr, however, our solution is very tightly integrated into Lucene
directly (since Solr didn't exist when we started out). Moving to Solr
now seems like a daunting prospect. We've also been following the Katta
project with interest, but it doesn't appear to support distributed
indexing, and development on it seems to have stalled. It would be nice
if there were a distributed search project on the Lucene level that we
could use.

I realize this is a rather vague question, but are there any further
suggestions on ways to improve search performance? We need cheap and
dirty ideas, as well as longer term advice on a possible path forward.

Much appreciated

Jamie





RE: search performance

Posted by Toke Eskildsen <te...@statsbiblioteket.dk>.
Jamie [jamie@mailarchiva.com] wrote:
> It would be nice if, in future, the Lucene API could provide a
> searchAfter that takes a position (int).

It would not really help with large result sets. At least not with the current underlying implementations. This is tied into your current performance problem, if I understand it correctly.

We seem to have isolated your performance problems to large (10M+) result sets, right?

Requesting the top X results in Lucene works internally by adding to a priority queue. The problem with PQs is that they work really well for small result sets and really badly for large result sets (note that "result set" refers to the collected documents, not to the number of matching documents). A PQ rearranges its internal structure each time a hit is entered that has a score >= the lowest known score. With millions of documents in the result set, this happens all the time. Abstractly there is little difference between small result sets and large ones: O(n * log n) is fine scaling. In reality, the rearrangement of the internal heap structure only works well while the heap fits in the CPU cache.
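
A toy illustration of the collection pattern described above, with java.util.PriorityQueue standing in for Lucene's internal hit queue:

import java.util.PriorityQueue;

public class TopNDemo {
    // Keep only the top n scores seen so far; the smallest retained score
    // sits at the head. For huge n the heap no longer fits in CPU cache,
    // and every sift touches scattered memory.
    static PriorityQueue<Float> topN(float[] scores, int n) {
        PriorityQueue<Float> pq = new PriorityQueue<Float>(n);
        for (float s : scores) {
            if (pq.size() < n) {
                pq.add(s);
            } else if (pq.peek() < s) {
                pq.poll();
                pq.add(s);
            }
        }
        return pq;
    }
}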


To test this, I created the tiny project https://github.com/tokee/luso 

It simulates the workflow (for an extremely loose value of 'simulates') you described with extraction of a large result set by filling a PQ of a given size with docIDs (ints) and scores (floats) and then extracting the ordered docIDs. Running it with different sizes shows how the PQ deteriorates on a 4 core i7 with 8MB level 2 cache:

MAVEN_OPTS=-Xmx4g mvn -q exec:java -Dexec.args="pq 1 1000 10000 100000 500000 1000000 5000000 10000000 20000000 30000000 40000000"

Starting 1 threads with extraction method pq
       1,000 docs in mean      15 ms,    66 docs/ms.
      10,000 docs in mean      47 ms,   212 docs/ms.
     100,000 docs in mean      65 ms, 1,538 docs/ms.
     500,000 docs in mean     385 ms, 1,298 docs/ms.
   1,000,000 docs in mean     832 ms, 1,201 docs/ms.
   5,000,000 docs in mean   7,566 ms,   660 docs/ms.
  10,000,000 docs in mean  16,482 ms,   606 docs/ms.
  20,000,000 docs in mean  39,481 ms,   506 docs/ms.
  30,000,000 docs in mean  80,293 ms,   373 docs/ms.
  40,000,000 docs in mean 109,537 ms,   365 docs/ms.

As can be seen, relative performance (docs/ms) drops significantly when the document count increases. To add insult to injury, this deterioration pattern is optimistic, as the test was the only heavy job on my computer. Running 4 of these tests in parallel (1 per core) we would ideally expect about the same speed, but instead we get

MAVEN_OPTS=-Xmx4g mvn -q exec:java -Dexec.args="pq 4 1000 10000 100000 500000 1000000 5000000 10000000 20000000 30000000 40000000"

Starting 4 threads with extraction method pq
       1,000 docs in mean      34 ms,    29 docs/ms.
      10,000 docs in mean      70 ms,   142 docs/ms.
     100,000 docs in mean     102 ms,   980 docs/ms.
     500,000 docs in mean   1,340 ms,   373 docs/ms.
   1,000,000 docs in mean   2,564 ms,   390 docs/ms.
   5,000,000 docs in mean  19,464 ms,   256 docs/ms.
  10,000,000 docs in mean  49,985 ms,   200 docs/ms.
  20,000,000 docs in mean 112,321 ms,   178 docs/ms.
(I got tired of waiting and stopped after 20M docs)

The conclusion seems clear enough: Using PQ for millions of results will take a long time.


So what can be done? I added an alternative implementation where all the docIDs and scores are collected in two parallel arrays, then merge sorted after collection. That gave the results

MAVEN_OPTS=-Xmx4g mvn -q exec:java -Dexec.args="ip 1 1000 10000 100000 500000 1000000 5000000 10000000 20000000 30000000 40000000"
Starting 1 threads with extraction method ip
       1,000 docs in mean      15 ms,    66 docs/ms.
      10,000 docs in mean      52 ms,   192 docs/ms.
     100,000 docs in mean      73 ms, 1,369 docs/ms.
     500,000 docs in mean     363 ms, 1,377 docs/ms.
   1,000,000 docs in mean     780 ms, 1,282 docs/ms.
   5,000,000 docs in mean   4,634 ms, 1,078 docs/ms.
  10,000,000 docs in mean   9,708 ms, 1,030 docs/ms.
  20,000,000 docs in mean  20,818 ms,   960 docs/ms.
  30,000,000 docs in mean  32,413 ms,   925 docs/ms.
  40,000,000 docs in mean  44,235 ms,   904 docs/ms.

Notice how the deterioration of relative speed is a lot less than for PQ. Running this with 4 threads gets us

MAVEN_OPTS=-Xmx4g mvn -q exec:java -Dexec.args="ip 4 1000 10000 100000 500000 1000000 5000000 10000000 20000000 30000000 40000000"
Starting 4 threads with extraction method ip
       1,000 docs in mean      35 ms,    28 docs/ms.
      10,000 docs in mean     221 ms,    45 docs/ms.
     100,000 docs in mean     162 ms,   617 docs/ms.
     500,000 docs in mean     639 ms,   782 docs/ms.
   1,000,000 docs in mean   1,388 ms,   720 docs/ms.
   5,000,000 docs in mean   8,372 ms,   597 docs/ms.
  10,000,000 docs in mean  17,933 ms,   557 docs/ms.
  20,000,000 docs in mean  36,031 ms,   555 docs/ms.
  30,000,000 docs in mean  58,257 ms,   514 docs/ms.
  40,000,000 docs in mean  76,763 ms,   521 docs/ms.

The speedup of the merge sorter relative to PQ increases with the size of the collected result. Unfortunately we're still talking minute-class with 60M documents. It all points to the conclusion that collecting millions of sorted document IDs should be avoided if at all possible. A searchAfter that takes a position would either need to use some clever caching or perform the giant sorted collection when called.

- Toke Eskildsen


Re: search performance

Posted by Jamie <ja...@mailarchiva.com>.
Robert

FYI: I've modified the code to utilize the experimental function:

DirectoryReader dirReader = DirectoryReader.openIfChanged(cachedDirectoryReader, writer, true);

In this case, the IndexReader won't be reopened on each search unless 
absolutely necessary.
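
Worth noting: openIfChanged returns null when the reader is already current, so the call needs a guard along these lines (a sketch, not the actual application code):

DirectoryReader newReader = DirectoryReader.openIfChanged(cachedDirectoryReader, writer, true);
if (newReader != null) {
    cachedDirectoryReader.close();      // release the stale reader
    cachedDirectoryReader = newReader;  // swap in the fresh NRT reader
}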

Regards

Jamie






Re: search performance

Posted by Jamie <ja...@mailarchiva.com>.
Greetings Lucene Users

As a follow-up to my earlier mail:

We are also using Lucene segment warmers, as recommended; segments per 
tier is now set to five, and the indexing RAM buffer is set to 
(Runtime.getRuntime().totalMemory()*.08)/1024/1024 MB.

See below for code used to instantiate writer:

LimitTokenCountAnalyzer limitAnalyzer = new LimitTokenCountAnalyzer(
        application.getAnalyzerFactory().getAnalyzer(language, AnalyzerFactory.Operation.INDEX),
        maxPerFieldTokens);
IndexWriterConfig conf = new IndexWriterConfig(Version.LUCENE_46, limitAnalyzer);
TieredMergePolicy mergePolicy = new TieredMergePolicy();
mergePolicy.setSegmentsPerTier(5);
conf.setMergePolicy(mergePolicy);
conf.setRAMBufferSizeMB(bufferMemoryMB);
writer = new IndexWriter(fsDirectory, conf);
writer.getConfig().setMergedSegmentWarmer(readerWarmer);


This particular monster 24-core machine has 110G of RAM. I suppose one 
possibility is to load the indexes that aren't being changed into RAM on 
startup. However, the indexes already reside on fast SSD drives.

We're using the following JRE parameters:

-XX:+UseG1GC -XX:MaxGCPauseMillis=100 -XX:SurvivorRatio=3 -XX:+AggressiveOpts

Let me know if there is anything else we can try to obtain performance 
gains.

Much appreciated

Jamie
On 2014/06/20, 9:51 AM, Jamie wrote:
> Hi All
>
> Thank you for all your suggestions. Some of the recommendations hadn't
> yet been implemented, as our code base was using older versions of
> Lucene with reduced capabilities. Thus far, all the recommendations
> for fast search have been implemented (e.g. using pagination with
> searchAfter, DirectoryReader.openIfChanged, avoiding wrapping lucene
> scoreDoc results, option to disable sorting, etc.).
>
> While, in some environments, search performance has improved
> significantly, in other larger ones we are, unfortunately, still seeing
> 1 minute - 5 minute search times. For instance, in one site, the total
> index size is 500GB with 190 million documents indexed. They are
> running a machine with 24 cores and 4 SSD drives to house the indexes.
> New emails are being added to the indexes at a rate of 10 messages/sec.
>
> One possible area for improvement: searching is conducted
> across several indexes. To accomplish this, on each search, a
> MultiReader is constructed that consists of several subreaders
> created by the DirectoryReader.openIfChanged method. Only one of the
> indexes is updated frequently; the others are never updated. For each
> search, a new IndexSearcher is created, passing the MultiReader to the
> constructor. From what I've read, MultiReader and IndexSearcher are
> relatively lightweight and should not impact search performance. Is
> this correct? Is there a faster way to handle searching across
> multiple indexes? What is the performance impact of searching across
> multiple indexes?
>
> Am I correct that a SearcherManager can't be used with a MultiReader
> and NRT?  I would appreciate all suggestions on how to optimize our
> search performance further. Search time has become a usability issue.
>
> Much appreciated
>
> Jamie




Re: search performance

Posted by Vitaly Funstein <vf...@gmail.com>.
If you are using stored fields in your index, consider playing with
compression settings, or perhaps turning stored field compression off
altogether. Ways to do this have been discussed in this forum on numerous
occasions. This is highly use case dependent though, as your indexing
performance may or may not suffer, as a tradeoff.


On Fri, Jun 20, 2014 at 1:19 AM, Uwe Schindler <uw...@thetaphi.de> wrote:

> Hi,
>
> > Am I correct that a SearcherManager can't be used with a MultiReader
> > and NRT?  I would appreciate all suggestions on how to optimize our
> > search performance further. Search time has become a usability issue.
>
> Just have a SearcherManager for every index. MultiReader construction is
> cheap (it is just a wrapper; there is no overhead), so you can ask all
> SearcherManagers for the actual IndexReader and build the MultiReader on
> every search request.
>
> Uwe
>
>
>
>

RE: search performance

Posted by Uwe Schindler <uw...@thetaphi.de>.
Hi,

> Am I correct that a SearcherManager can't be used with a MultiReader and
> NRT?  I would appreciate all suggestions on how to optimize our search
> performance further. Search time has become a usability issue.

Just have a SearcherManager for every index. MultiReader construction is cheap (it is just a wrapper; there is no overhead), so you can ask all SearcherManagers for the actual IndexReader and build the MultiReader on every search request.
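
A sketch of that pattern for two indexes (the SearcherManager instances are assumed to be created once at startup and refreshed elsewhere via maybeRefresh()):

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiReader;
import org.apache.lucene.search.*;

public class MultiIndexSearch {
    static TopDocs search(SearcherManager m1, SearcherManager m2,
                          Query query, int n) throws IOException {
        IndexSearcher s1 = m1.acquire();
        IndexSearcher s2 = m2.acquire();
        try {
            // Rebuild the cheap MultiReader wrapper per request.
            MultiReader multi = new MultiReader(new IndexReader[] {
                    s1.getIndexReader(), s2.getIndexReader() }, false);
            return new IndexSearcher(multi).search(query, n);
        } finally {
            m1.release(s1);
            m2.release(s2);
        }
    }
}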

Uwe




Re: search performance

Posted by Jamie <ja...@mailarchiva.com>.
Hi All

Thank you for all your suggestions. Some of the recommendations hadn't 
yet been implemented, as our code base was using older versions of 
Lucene with reduced capabilities. Thus far, all the recommendations 
for fast search have been implemented (e.g. using pagination with 
searchAfter, DirectoryReader.openIfChanged, avoiding wrapping lucene 
scoreDoc results, option to disable sorting, etc.).

While, in some environments, search performance has improved 
significantly, in other larger ones we are, unfortunately, still seeing 1 
minute - 5 minute search times. For instance, in one site, the total 
index size is 500GB with 190 million documents indexed. They are running 
a machine with 24 cores and 4 SSD drives to house the indexes. New emails 
are being added to the indexes at a rate of 10 messages/sec.

One possible area for improvement: searching is conducted 
across several indexes. To accomplish this, on each search, a 
MultiReader is constructed that consists of several subreaders created 
by the DirectoryReader.openIfChanged method. Only one of the indexes is 
updated frequently; the others are never updated. For each search, a 
new IndexSearcher is created, passing the MultiReader to the constructor. 
From what I've read, MultiReader and IndexSearcher are relatively 
lightweight and should not impact search performance. Is this correct? 
Is there a faster way to handle searching across multiple indexes? What 
is the performance impact of searching across multiple indexes?

Am I correct that a SearcherManager can't be used with a MultiReader 
and NRT?  I would appreciate all suggestions on how to optimize our 
search performance further. Search time has become a usability issue.

Much appreciated

Jamie



Re: search performance

Posted by Jamie <ja...@mailarchiva.com>.
Jon

I ended up adapting your approach. The solution involves keeping an LRU 
cache of page-boundary ScoreDocs and their respective positions. New 
positions are added to the cache as new pages are discovered. To cut 
down on searches, when scrolling backwards and forwards, the search 
begins from the nearest cached position.
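
A sketch of such a cache (LinkedHashMap in access order doing the LRU eviction; sizes and names are illustrative, not the production code):

import java.util.LinkedHashMap;
import java.util.Map;
import org.apache.lucene.search.ScoreDoc;

public class PageBoundaryCache {
    private static final int MAX_ENTRIES = 128;

    // Maps result-set position -> ScoreDoc at that position; the least
    // recently used entry is evicted once MAX_ENTRIES is exceeded.
    private final LinkedHashMap<Integer, ScoreDoc> cache =
            new LinkedHashMap<Integer, ScoreDoc>(16, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<Integer, ScoreDoc> e) {
                    return size() > MAX_ENTRIES;
                }
            };

    public void put(int position, ScoreDoc doc) {
        cache.put(position, doc);
    }

    // Highest cached position <= target; searchAfter() resumes from there
    // instead of re-collecting from the start of the result set.
    public Map.Entry<Integer, ScoreDoc> floor(int target) {
        Map.Entry<Integer, ScoreDoc> best = null;
        for (Map.Entry<Integer, ScoreDoc> e : cache.entrySet()) {
            if (e.getKey() <= target && (best == null || e.getKey() > best.getKey())) {
                best = e;
            }
        }
        return best;
    }
}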

Cheers

Jamie

On 2014/06/03, 3:24 PM, Jon Stewart wrote:
> With regards to pagination, is there a way for you to cache the
> IndexSearcher, Query, and TopDocs between user pagination requests (a
> lot of webapp frameworks have object caching mechanisms)? If so, you
> may have luck with code like this:
>
>    void ensureTopDocs(final int rank) throws IOException {
>      if (StartDocIndex > rank) {
>        Docs = Searcher.search(SearchQuery, TOP_DOCS_WINDOW);
>        StartDocIndex = 0;
>      }
>      int len = Docs.scoreDocs.length;
>      while (StartDocIndex + len <= rank) {
>        StartDocIndex += len;
>        Docs = Searcher.searchAfter(Docs.scoreDocs[len - 1],
> SearchQuery, TOP_DOCS_WINDOW);
>        len = Docs.scoreDocs.length;
>      }
>    }
>




Re: search performance

Posted by Jamie <ja...@mailarchiva.com>.
Thanks Jon

I'll investigate your idea further.

It would be nice if, in future, the Lucene API could provide a 
searchAfter that takes a position (int).

Regards

Jamie

On 2014/06/03, 3:24 PM, Jon Stewart wrote:
> With regards to pagination, is there a way for you to cache the
> IndexSearcher, Query, and TopDocs between user pagination requests (a
> lot of webapp frameworks have object caching mechanisms)? If so, you
> may have luck with code like this:
>
>    void ensureTopDocs(final int rank) throws IOException {
>      if (StartDocIndex > rank) {
>        Docs = Searcher.search(SearchQuery, TOP_DOCS_WINDOW);
>        StartDocIndex = 0;
>      }
>      int len = Docs.scoreDocs.length;
>      while (StartDocIndex + len <= rank) {
>        StartDocIndex += len;
>        Docs = Searcher.searchAfter(Docs.scoreDocs[len - 1],
> SearchQuery, TOP_DOCS_WINDOW);
>        len = Docs.scoreDocs.length;
>      }
>    }
>
> StartDocIndex is a member variable denoting the current rank of the
> first item in TopDocs ("Docs") window. I call this function before
> each Document retrieval. The common case--of the user looking at the
> first page of results or the user advancing to the next page--is quite
> fast. But it still supports random access, albeit not in constant
> time. OTOH, if your app is concurrent, most search queries will
> probably be returned very quickly so the odd query that wants to jump
> deep into the result set will have more of the server's resources
> available to it.
>
> Also, given the size of your result sets, you have to allocate a lot
> of memory upfront which will then get gc'd after some time. From query
> to query, you will have a decent amount of memory churn. This isn't
> free. My guess is using Lucene's linear (search() & searchAfter())
> pagination will perform faster than your current approach just based
> upon not having to create such large arrays.
>
> I'm not the Lucene expert that Robert is, but this has worked alright for me.
>




Re: search performance

Posted by Jon Stewart <jo...@lightboxtechnologies.com>.
With regards to pagination, is there a way for you to cache the
IndexSearcher, Query, and TopDocs between user pagination requests (a
lot of webapp frameworks have object caching mechanisms)? If so, you
may have luck with code like this:

  void ensureTopDocs(final int rank) throws IOException {
    if (StartDocIndex > rank) {
      Docs = Searcher.search(SearchQuery, TOP_DOCS_WINDOW);
      StartDocIndex = 0;
    }
    int len = Docs.scoreDocs.length;
    while (StartDocIndex + len <= rank) {
      StartDocIndex += len;
      Docs = Searcher.searchAfter(Docs.scoreDocs[len - 1],
SearchQuery, TOP_DOCS_WINDOW);
      len = Docs.scoreDocs.length;
    }
  }

StartDocIndex is a member variable denoting the current rank of the
first item in TopDocs ("Docs") window. I call this function before
each Document retrieval. The common case--of the user looking at the
first page of results or the user advancing to the next page--is quite
fast. But it still supports random access, albeit not in constant
time. OTOH, if your app is concurrent, most search queries will
probably be returned very quickly so the odd query that wants to jump
deep into the result set will have more of the server's resources
available to it.

Also, given the size of your result sets, you have to allocate a lot
of memory upfront which will then get gc'd after some time. From query
to query, you will have a decent amount of memory churn. This isn't
free. My guess is using Lucene's linear (search() & searchAfter())
pagination will perform faster than your current approach just based
upon not having to create such large arrays.

I'm not the Lucene expert that Robert is, but this has worked alright for me.

cheers,

Jon


On Tue, Jun 3, 2014 at 8:47 AM, Jamie <ja...@mailarchiva.com> wrote:
> Robert. Thanks, I've already done a similar thing. Results on my test
> platform are encouraging..
>
>
> On 2014/06/03, 2:41 PM, Robert Muir wrote:
>>
>> Reopening for every search is not a good idea. This will have an
>> extremely high cost (not as high as what you are doing with "paging",
>> but still not good).
>>
>> Instead consider making it near-realtime, by doing this every second
>> or so instead. Look at SearcherManager for code that helps you do
>> this.
>>
>
>
>



-- 
Jon Stewart, Principal
(646) 719-0317 | jon@lightboxtechnologies.com | Arlington, VA



Re: search performance

Posted by Jamie <ja...@mailarchiva.com>.
Robert. Thanks, I've already done a similar thing. Results on my test 
platform are encouraging..

On 2014/06/03, 2:41 PM, Robert Muir wrote:
> Reopening for every search is not a good idea. This will have an
> extremely high cost (not as high as what you are doing with "paging",
> but still not good).
>
> Instead consider making it near-realtime, by doing this every second
> or so instead. Look at SearcherManager for code that helps you do
> this.
>




Re: search performance

Posted by Robert Muir <rc...@gmail.com>.
Reopening for every search is not a good idea. This will have an
extremely high cost (not as high as what you are doing with "paging",
but still not good).

Instead consider making it near-realtime, by doing this every second
or so instead. Look at SearcherManager for code that helps you do
this.
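
A sketch of the periodic-refresh pattern (the one-second interval follows the suggestion above; the SearcherManager is assumed to have been created from the IndexWriter at startup):

import java.io.IOException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import org.apache.lucene.search.SearcherManager;

public class NrtRefresher {
    static void start(final SearcherManager manager) {
        ScheduledExecutorService refresher =
                Executors.newSingleThreadScheduledExecutor();
        // Refresh in the background once per second; searches simply
        // acquire()/release() and never pay the reopen cost themselves.
        refresher.scheduleAtFixedRate(new Runnable() {
            public void run() {
                try {
                    manager.maybeRefresh();
                } catch (IOException e) {
                    // log and keep serving the old searcher
                }
            }
        }, 1, 1, TimeUnit.SECONDS);
    }
}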

On Tue, Jun 3, 2014 at 7:25 AM, Jamie <ja...@mailarchiva.com> wrote:
> Robert
>
> Hmmm..... why did Mike go to all the trouble of implementing NRT search, if
> we are not supposed to be using it?
>
> The user simply wants the latest result set. To me, this doesn't appear out
> of scope for the Lucene project.
>
> Jamie
>
>
> On 2014/06/03, 1:17 PM, Robert Muir wrote:
>>
>> No, you are incorrect. The point of a search engine is to return top-N
>> most relevant.
>>
>> If you insist you need to open an indexreader on every single search,
>> and then return huge amounts of docs, maybe you should use a database
>> instead.
>>
>> On Tue, Jun 3, 2014 at 6:42 AM, Jamie <ja...@mailarchiva.com> wrote:
>>>
>>> Vitaly / Robert
>>>
>>> I wouldn't go so far as to call our pagination naive!? Sub-optimal, yes.
>>> Unless I am mistaken, the Lucene library's pagination mechanism makes the
>>> assumption that you will cache the ScoreDocs for the entire result set.
>>> This is not practical when you have a result set that exceeds 60M. As
>>> stated earlier, in any case, it is the first query that is slow.
>>>
>>> We do open index readers, since we are using NRT search and documents
>>> are being added to the indexes on a continuous basis. When the user
>>> clicks on the Search button, the user will expect to see the latest
>>> result set. With regards to NRT search, my understanding is that we do
>>> need to open the index readers on each search operation to see the
>>> latest changes.
>>>
>>> Thus, on each search, we combine the index readers into a MultiReader,
>>> opening each reader from its corresponding writer:
>>>
>>> protected IndexReader initIndexReader() throws IOException {
>>>     List<IndexReader> readers = new LinkedList<IndexReader>();
>>>     for (Writer writer : writers) {
>>>         readers.add(DirectoryReader.open(writer, true));
>>>     }
>>>     return new MultiReader(readers.toArray(new IndexReader[0]), true);
>>> }
>>>
>>> Thank you for your ideas/suggestions.
>>>
>>> Regards
>>>
>>> Jamie
>
>
>
>



Re: search performance

Posted by Jamie <ja...@mailarchiva.com>.
Robert

Hmmm..... why did Mike go to all the trouble of implementing NRT search, 
if we are not supposed to be using it?

The user simply wants the latest result set. To me, this doesn't appear 
out of scope for the Lucene project.

Jamie

On 2014/06/03, 1:17 PM, Robert Muir wrote:
> No, you are incorrect. The point of a search engine is to return top-N
> most relevant.
>
> If you insist you need to open an indexreader on every single search,
> and then return huge amounts of docs, maybe you should use a database
> instead.
>
> On Tue, Jun 3, 2014 at 6:42 AM, Jamie <ja...@mailarchiva.com> wrote:
>> Vitaly / Robert
>>
>> I wouldn't go so far as to call our pagination naive!? Sub-optimal, yes.
>> Unless I am mistaken, the Lucene library's pagination mechanism makes the
>> assumption that you will cache the ScoreDocs for the entire result set.
>> This is not practical when you have a result set that exceeds 60M. As
>> stated earlier, in any case, it is the first query that is slow.
>>
>> We do open index readers, since we are using NRT search and documents
>> are being added to the indexes on a continuous basis. When the user
>> clicks on the Search button, the user will expect to see the latest
>> result set. With regards to NRT search, my understanding is that we do
>> need to open the index readers on each search operation to see the
>> latest changes.
>>
>> Thus, on each search, we combine the index readers into a MultiReader,
>> opening each reader from its corresponding writer:
>>
>> protected IndexReader initIndexReader() throws IOException {
>>     List<IndexReader> readers = new LinkedList<IndexReader>();
>>     for (Writer writer : writers) {
>>         readers.add(DirectoryReader.open(writer, true));
>>     }
>>     return new MultiReader(readers.toArray(new IndexReader[0]), true);
>> }
>>
>> Thank you for your ideas/suggestions.
>>
>> Regards
>>
>> Jamie




Re: search performance

Posted by Robert Muir <rc...@gmail.com>.
No, you are incorrect. The point of a search engine is to return top-N
most relevant.

If you insist you need to open an indexreader on every single search,
and then return huge amounts of docs, maybe you should use a database
instead.

On Tue, Jun 3, 2014 at 6:42 AM, Jamie <ja...@mailarchiva.com> wrote:
> Vitaly / Robert
>
> I wouldn't go so far as to call our pagination naive!? Sub-optimal, yes.
> Unless I am mistaken, the Lucene library's pagination mechanism makes the
> assumption that you will cache the ScoreDocs for the entire result set.
> This is not practical when you have a result set that exceeds 60M. As
> stated earlier, in any case, it is the first query that is slow.
>
> We do open index readers, since we are using NRT search and documents
> are being added to the indexes on a continuous basis. When the user
> clicks on the Search button, the user will expect to see the latest
> result set. With regards to NRT search, my understanding is that we do
> need to open the index readers on each search operation to see the
> latest changes.
>
> Thus, on each search, we combine the index readers into a MultiReader,
> opening each reader from its corresponding writer:
>
> protected IndexReader initIndexReader() throws IOException {
>     List<IndexReader> readers = new LinkedList<IndexReader>();
>     for (Writer writer : writers) {
>         readers.add(DirectoryReader.open(writer, true));
>     }
>     return new MultiReader(readers.toArray(new IndexReader[0]), true);
> }
>
> Thank you for your ideas/suggestions.
>
> Regards
>
> Jamie
>
> On 2014/06/03, 12:29 PM, Vitaly Funstein wrote:
>>
>> Jamie,
>>
>> What if you were to forget for a moment the whole pagination idea, and
>> always capped your search at 1000 results for testing purposes only? This
>> is just to try and pinpoint the bottleneck here; if, regardless of the
>> query parameters, the search latency stays roughly the same and well below
>> 5 min, you now have the answer - the problem is your naive implementation
>> of pagination which results in snowballing result numbers and search
>> times,
>> the closer you get to the end of the results range. Otherwise, I would
>> focus on your query and filter next.
>>
>
>
>



Re: search performance

Posted by Jamie <ja...@mailarchiva.com>.
Vitaly / Robert

I wouldn't go so far as to call our pagination naive!? Sub-optimal, yes. 
Unless I am mistaken, the Lucene library's pagination mechanism makes 
the assumption that you will cache the ScoreDocs for the entire result 
set. This is not practical when you have a result set that exceeds 60M. 
As stated earlier, in any case, it is the first query that is slow.

We do open index readers, since we are using NRT search and documents 
are being added to the indexes on a continuous basis. When the user 
clicks on the Search button, the user will expect to see the latest 
result set. With regards to NRT search, my understanding is that we do 
need to open the index readers on each search operation to see the 
latest changes.

Thus, on each search, we combine the index readers into a MultiReader, 
opening each reader from its corresponding writer:

protected IndexReader initIndexReader() throws IOException {
    List<IndexReader> readers = new LinkedList<IndexReader>();
    for (Writer writer : writers) {
        readers.add(DirectoryReader.open(writer, true));
    }
    return new MultiReader(readers.toArray(new IndexReader[0]), true);
}

Thank you for your ideas/suggestions.

Regards

Jamie
On 2014/06/03, 12:29 PM, Vitaly Funstein wrote:
> Jamie,
>
> What if you were to forget for a moment the whole pagination idea, and
> always capped your search at 1000 results for testing purposes only? This
> is just to try and pinpoint the bottleneck here; if, regardless of the
> query parameters, the search latency stays roughly the same and well below
> 5 min, you now have the answer - the problem is your naive implementation
> of pagination which results in snowballing result numbers and search times,
> the closer you get to the end of the results range. Otherwise, I would
> focus on your query and filter next.
>




Re: search performance

Posted by Vitaly Funstein <vf...@gmail.com>.
Jamie,

What if you were to forget for a moment the whole pagination idea, and
always capped your search at 1000 results for testing purposes only? This
is just to try and pinpoint the bottleneck here; if, regardless of the
query parameters, the search latency stays roughly the same and well below
5 min, you now have the answer - the problem is your naive implementation
of pagination which results in snowballing result numbers and search times,
the closer you get to the end of the results range. Otherwise, I would
focus on your query and filter next.
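
A sketch of that capped test, reusing the names from Jamie's earlier snippet (testing only; it bypasses the pagination branch entirely):

// Always collect a fixed top 1000, regardless of start/length, to see
// whether the latency is in the query itself or in the pagination.
TopFieldCollector capped = TopFieldCollector.create(sort, 1000, true, false, false, true);
searcher.search(query, queryFilter, capped);
TopDocs top = capped.topDocs();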


On Tue, Jun 3, 2014 at 3:21 AM, Jamie <ja...@mailarchiva.com> wrote:

> Vitaly
>
> See below:
>
>
> On 2014/06/03, 12:09 PM, Vitaly Funstein wrote:
>
>> A couple of questions.
>>
>> 1. What are you trying to achieve by setting the current thread's priority
>> to max possible value? Is it grabbing as much CPU time as possible? In my
>> experience, mucking with thread priorities like this is at best futile,
>> and
>> at worst quite detrimental to responsiveness and overall performance of
>> the
>> system as a whole. I would remove that line.
>>
> Yes,  you are right to be worried about this, especially since thread
> priorities behave differently on different platforms.
>
>
>
>> 2. This seems suspicious:
>>
>> if (getPagination()) {
>>                  max = start + length;
>>              } else {
>>                  max = getMaxResults();
>>              }
>>
>> If start is at 100M, and length is 1000 - what do you think Lucene will
>> try
>> and do when you pass this max to the collector?
>>
> I don't see the problem here. The collector will collect from zero to max
> results. I agree that, from a performance perspective, it's not ideal to
> return all results from the beginning of the search, but the Lucene API
> leaves us with no choice. I simply do not know the ScoreDoc to start from.
> If I did keep a record of it, then I would need to store all ScoreDocs for
> the entire result set. When there are 60M+ results, this can be problematic
> in terms of memory consumption. It would be far nicer if there was a
> searchAfter function that took a position as an integer.
>
> Regards
>
> Jamie
>
>
>>
>>
>
>
>

Re: search performance

Posted by Jamie <ja...@mailarchiva.com>.
Vitaly

See below:

On 2014/06/03, 12:09 PM, Vitaly Funstein wrote:
> A couple of questions.
>
> 1. What are you trying to achieve by setting the current thread's priority
> to max possible value? Is it grabbing as much CPU time as possible? In my
> experience, mucking with thread priorities like this is at best futile, and
> at worst quite detrimental to responsiveness and overall performance of the
> system as a whole. I would remove that line.
Yes,  you are right to be worried about this, especially since thread 
priorities behave differently on different platforms.

>
> 2. This seems suspicious:
>
> if (getPagination()) {
>                  max = start + length;
>              } else {
>                  max = getMaxResults();
>              }
>
> If start is at 100M, and length is 1000 - what do you think Lucene will try
> and do when you pass this max to the collector?
I don't see the problem here. The collector will collect from zero to max 
results. I agree that, from a performance perspective, it's not ideal to 
return all results from the beginning of the search, but the Lucene API 
leaves us with no choice. I simply do not know the ScoreDoc to start from. 
If I did keep a record of it, then I would need to store all ScoreDocs for 
the entire result set. When there are 60M+ results, this can be 
problematic in terms of memory consumption. It would be far nicer if 
there was a searchAfter function that took a position as an integer.

Regards

Jamie
>
>




Re: search performance

Posted by Vitaly Funstein <vf...@gmail.com>.
A couple of questions.

1. What are you trying to achieve by setting the current thread's priority
to max possible value? Is it grabbing as much CPU time as possible? In my
experience, mucking with thread priorities like this is at best futile, and
at worst quite detrimental to responsiveness and overall performance of the
system as a whole. I would remove that line.

2. This seems suspicious:

if (getPagination()) {
                max = start + length;
            } else {
                max = getMaxResults();
            }

If start is at 100M, and length is 1000 - what do you think Lucene will try
and do when you pass this max to the collector?


On Tue, Jun 3, 2014 at 2:55 AM, Jamie <ja...@mailarchiva.com> wrote:

> FYI: We are also using a multireader to search over multiple index readers.
>
> Search under a million documents yields good response times. When you get
> into the 60M territory, search slows to a crawl.
>
> On 2014/06/03, 11:47 AM, Jamie wrote:
>
>> Sure... see below:
>>
>
>
>
>

Re: search performance

Posted by Jamie <ja...@mailarchiva.com>.
FYI: We are also using a multireader to search over multiple index readers.

Search under a million documents yields good response times. When you 
get into the 60M territory, search slows to a crawl.

On 2014/06/03, 11:47 AM, Jamie wrote:
> Sure... see below:


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: search performance

Posted by Jamie <ja...@mailarchiva.com>.
Sure... see below:

    protected void search(Query query, Filter queryFilter, Sort sort)
            throws BlobSearchException {
        try {
            logger.debug("start search {searchquery='" + getSearchQuery()
                    + "',query='" + query.toString() + "',filterQuery='" + queryFilter
                    + "',sort='" + sort + "'}");
            Thread.currentThread().setPriority(Thread.MAX_PRIORITY);
            results.clear();

            int max;
            if (getPagination()) {
                max = start + length;
            } else {
                max = getMaxResults();
            }

            // release the old volume searchers and open fresh NRT readers
            IndexReader indexReader = initIndexReader();
            searcher = new IndexSearcher(indexReader, executor);
            TopFieldCollector fieldCollector =
                    TopFieldCollector.create(sort, max, true, false, false, true);

            searcher.search(query, queryFilter, fieldCollector);

            TopDocs topDocs;
            if (getPagination()) {
                topDocs = fieldCollector.topDocs(start, length);
            } else {
                topDocs = fieldCollector.topDocs();
            }

            int count = 0;
            for (int i = 0; i < topDocs.scoreDocs.length; i++) {
                if ((getMaxResults() > 0 && count > getMaxResults())
                        || (getPagination() && count++ >= length)) {
                    break;
                }
                results.add(topDocs.scoreDocs[i]);
            }

            totalHits = fieldCollector.getTotalHits();

            logger.debug("search executed successfully {query='" + getSearchQuery()
                    + "',returnedresults='" + results.size() + "'}");
        } catch (Exception io) {
            throw new BlobSearchException("failed to execute search query {searchquery='"
                    + getSearchQuery() + "}", io, logger, ChainedException.Level.DEBUG);
        }
    }
On 2014/06/03, 11:41 AM, Rob Audenaerde wrote:
> Hi Jamie,
>
> What is included in the 5 minutes?
>
> Just the call to the searcher?
>
> searcher.search(...) ?
>
> Can you show a bit more of the code you use?
>
>
>
> On Tue, Jun 3, 2014 at 11:32 AM, Jamie <ja...@mailarchiva.com> wrote:
>
>> Vitaly
>>
>> Thanks for the contribution. Unfortunately, we cannot use Lucene's
>> pagination function, because in reality the user can skip pages to start
>> the search at any point, not just from the end of the previous search. Even
>> the
>> first search (without any pagination), with a max of 1000 hits, takes 5
>> minutes to complete.
>>
>> Regards
>>
>> Jamie
>>
>> On 2014/06/03, 10:54 AM, Vitaly Funstein wrote:
>>
>>> Something doesn't quite add up.
>>>
>>> TopFieldCollector fieldCollector = TopFieldCollector.create(sort,
>>> max,true,
>>>
>>>> false, false, true);
>>>>
>>>> We use pagination, so only returning 1000 documents or so at a time.
>>>>
>>>>
>>>>   You say you are using pagination, yet the API you are using to create
>>> your
>>> collector isn't how you would utilize Lucene's built-in "pagination"
>>> feature (unless I misunderstand the API). If the max in the snippet above is
>>> 1000, then you're simply returning top 1000 docs every time you execute
>>> your search. Otherwise... well, could you actually post a bit more of your
>>> code that runs the search here, in particular?
>>>
>>> Assuming that the max is much larger than 1000, however, you could call
>>> fieldCollector.topDocs(int, int) after accumulating hits using this
>>> collector, but this won't work multiple times per query execution,
>>> according to the javadoc. So you either have to re-execute the full
>>> search,
>>> and then get the next chunk of ScoreDocs, or use the proper API for this,
>>> one that accepts as a parameter the end of the previous page of results,
>>> i.e. IndexSearcher.searchAfter(ScoreDoc, ...)
>>>
>>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: search performance

Posted by Rob Audenaerde <ro...@gmail.com>.
Hi Jamie,

What is included in the 5 minutes?

Just the call to the searcher?

searcher.search(...)?

Can you show a bit more of the code you use?



On Tue, Jun 3, 2014 at 11:32 AM, Jamie <ja...@mailarchiva.com> wrote:

> Vitaly
>
> Thanks for the contribution. Unfortunately, we cannot use Lucene's
> pagination function, because in reality the user can skip pages to start
> the search at any point, not just from the end of the previous search. Even
> the
> first search (without any pagination), with a max of 1000 hits, takes 5
> minutes to complete.
>
> Regards
>
> Jamie
>
> On 2014/06/03, 10:54 AM, Vitaly Funstein wrote:
>
>> Something doesn't quite add up.
>>
>> TopFieldCollector fieldCollector = TopFieldCollector.create(sort,
>> max,true,
>>
>>> false, false, true);
>>>
>>> We use pagination, so only returning 1000 documents or so at a time.
>>>
>>>
>>>  You say you are using pagination, yet the API you are using to create
>> your
>> collector isn't how you would utilize Lucene's built-in "pagination"
>> feature (unless I misunderstand the API). If the max in the snippet above is
>> 1000, then you're simply returning top 1000 docs every time you execute
>> your search. Otherwise... well, could you actually post a bit more of your
>> code that runs the search here, in particular?
>>
>> Assuming that the max is much larger than 1000, however, you could call
>> fieldCollector.topDocs(int, int) after accumulating hits using this
>> collector, but this won't work multiple times per query execution,
>> according to the javadoc. So you either have to re-execute the full
>> search,
>> and then get the next chunk of ScoreDocs, or use the proper API for this,
>> one that accepts as a parameter the end of the previous page of results,
>> i.e. IndexSearcher.searchAfter(ScoreDoc, ...)
>>
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Re: search performance

Posted by Jamie <ja...@mailarchiva.com>.
Vitaly

Thanks for the contribution. Unfortunately, we cannot use Lucene's 
pagination function, because in reality the user can skip pages to start 
the search at any point, not just from the end of the previous search. 
Even the
first search (without any pagination), with a max of 1000 hits, takes 5 
minutes to complete.

Regards

Jamie
On 2014/06/03, 10:54 AM, Vitaly Funstein wrote:
> Something doesn't quite add up.
>
> TopFieldCollector fieldCollector = TopFieldCollector.create(sort, max,true,
>> false, false, true);
>>
>> We use pagination, so only returning 1000 documents or so at a time.
>>
>>
> You say you are using pagination, yet the API you are using to create your
> collector isn't how you would utilize Lucene's built-in "pagination"
> feature (unless I misunderstand the API). If the max in the snippet above is
> 1000, then you're simply returning top 1000 docs every time you execute
> your search. Otherwise... well, could you actually post a bit more of your
> code that runs the search here, in particular?
>
> Assuming that the max is much larger than 1000, however, you could call
> fieldCollector.topDocs(int, int) after accumulating hits using this
> collector, but this won't work multiple times per query execution,
> according to the javadoc. So you either have to re-execute the full search,
> and then get the next chunk of ScoreDocs, or use the proper API for this,
> one that accepts as a parameter the end of the previous page of results,
> i.e. IndexSearcher.searchAfter(ScoreDoc, ...)
>


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: search performance

Posted by Vitaly Funstein <vf...@gmail.com>.
Something doesn't quite add up.

TopFieldCollector fieldCollector = TopFieldCollector.create(sort, max,true,
> false, false, true);
>
> We use pagination, so only returning 1000 documents or so at a time.
>
>
You say you are using pagination, yet the API you are using to create your
collector isn't how you would utilize Lucene's built-in "pagination"
feature (unless I misunderstand the API). If the max in the snippet above is
1000, then you're simply returning top 1000 docs every time you execute
your search. Otherwise... well, could you actually post a bit more of your
code that runs the search here, in particular?

Assuming that the max is much larger than 1000, however, you could call
fieldCollector.topDocs(int, int) after accumulating hits using this
collector, but this won't work multiple times per query execution,
according to the javadoc. So you either have to re-execute the full search,
and then get the next chunk of ScoreDocs, or use the proper API for this,
one that accepts as a parameter the end of the previous page of results,
i.e. IndexSearcher.searchAfter(ScoreDoc, ...)
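
For illustration, a minimal sketch of that searchAfter-style paging 
(Lucene 4.x; the query, page size, and processPage handler are 
assumptions, and with a Sort the 'after' document must be the FieldDoc 
returned by the previous sorted search):

     TopDocs page = searcher.search(query, 1000);                 // first page
     while (page.scoreDocs.length > 0) {
         processPage(page.scoreDocs);                             // hypothetical handler
         ScoreDoc last = page.scoreDocs[page.scoreDocs.length - 1];
         page = searcher.searchAfter(last, query, 1000);          // next page
     }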

Re: search performance

Posted by Jamie <ja...@mailarchiva.com>.
Tom

Thanks for the offer of assistance.

On 2014/06/02, 12:02 PM, Tincu Gabriel wrote:
> What kind of queries are you pushing into the index?
We are indexing regular emails + attachments.

Typical query is something like:
filter: to:mbox000008 from:mbox000008 cc:mbox000008 bcc:mbox000008 
deliveredto:mbox000008 sender:mbox000008 recipient:mbox000008
combined with filter query "cat:email"

We also use range queries based on date.
> Do they match a lot of documents ?
Yes, although we are using a collector...

TopFieldCollector fieldCollector = TopFieldCollector.create(sort, 
max,true, false, false, true);

We use pagination, so only returning 1000 documents or so at a time.

>   Do you do any sorting on the result set?
Yes
>   What is the average
> document size ?
approx. 100 KB. We are indexing email body + attachment content.
> Do you have a lot of update traffic?
Yes, we have a lot of update traffic, particularly in the environment I 
referred to. Is there a way to prioritize searching as opposed to updating?

I suppose we could block all indexing while searching is on the go? Is 
there such an option in Lucene, or should we implement this?
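Something like the following is what I have in mind (a hypothetical 
application-level sketch; searches share a read lock and indexing takes 
the write lock, so updates wait while searches run, though this can 
starve indexing under heavy search load):

     ReentrantReadWriteLock gate = new ReentrantReadWriteLock();

     // search path: many searches may run concurrently, indexing is held off
     gate.readLock().lock();
     try { searcher.search(query, fieldCollector); } finally { gate.readLock().unlock(); }

     // indexing path: waits until no search holds the gate
     gate.writeLock().lock();
     try { writer.addDocument(doc); } finally { gate.writeLock().unlock(); }
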
> What kind of schema
> does your index use ?
Not sure exactly what you are referring to here. We do have a lot of 
stored fields (to, from, bcc, cc, etc.). The body and attachments are 
analyzed.

Regards

Jamie
>
>


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: search performance

Posted by Tincu Gabriel <ti...@gmail.com>.
What kind of queries are you pushing into the index? Do they match a lot of
documents? Do you do any sorting on the result set? What is the average
document size? Do you have a lot of update traffic? What kind of schema
does your index use?


On Mon, Jun 2, 2014 at 6:51 AM, Jamie <ja...@mailarchiva.com> wrote:

> Greetings
>
> Despite following all the recommended optimizations (as described at
> http://wiki.apache.org/lucene-java/ImproveSearchingSpeed) , in some of
> our installations, search performance has reached the point where it is
> unacceptably slow. For instance, in one environment, the total index size
> is 200GB, with 150 million documents indexed. With NRT enabled, search
> speed is roughly 5 minutes on average. The server resources are: 2x6 Core
> Intel CPU, 128GB, 2 SSD for index and RAID 0, with Linux.
>
> The only thing we haven't yet done is to upgrade Lucene from 4.7.x to
> 4.8.x. Is this likely to make any noticeable difference in performance?
>
> Clearly, longer term, we need to move to a distributed search model. We
> thought to take advantage of the distributed search features offered in
> Solr, however, our solution is very tightly integrated into Lucene directly
> (since Solr didn't exist when we started out). Moving to Solr now seems
> like a daunting prospect. We've also been following the Katta project with
> interest, but it doesn't appear to support distributed indexing, and
> development on it seems to have stalled. It would be nice if there were a
> distributed search project on the Lucene level that we could use.
>
> I realize this is a rather vague question, but are there any further
> suggestions on ways to improve search performance? We need cheap and dirty
> ideas, as well as longer term advice on a possible path forward.
>
> Much appreciated
>
> Jamie
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Re: search performance

Posted by Christoph Kaser <lu...@iconparc.de>.
Can you take thread stacktraces (repeatedly) during those 5 minute 
searches? That might give you (or someone on the mailing list) a clue 
where all that time is spent.
You could try using jstack for that: 
http://docs.oracle.com/javase/7/docs/technotes/tools/share/jstack.html
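
For example, something along these lines (the pid is a placeholder; 
adjust the count and interval to taste):

     for i in 1 2 3 4 5 6; do jstack -l <pid> > stack-$i.txt; sleep 10; done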

Regards
Christoph

On 03.06.2014 08:17, Jamie wrote:
> Toke
>
> Thanks for the comment.
>
> Unfortunately, in this instance, it is a live production system, so we 
> cannot conduct experiments. The number is definitely accurate.
>
> We have many different systems with a similar load that observe the 
> same performance issue. To my knowledge, the Lucene integration code 
> is fairly well optimized.
>
> I've requested access to the indexes so that we can perform further 
> testing.
>
> Regards
>
> Jamie
>
> On 2014/06/03, 8:09 AM, Toke Eskildsen wrote:
>> On Mon, 2014-06-02 at 08:51 +0200, Jamie wrote:
>>
>> [200GB, 150M documents]
>>
>>> With NRT enabled, search speed is roughly 5 minutes on average.
>>> The server resources are:
>>> 2x6 Core Intel CPU, 128GB, 2 SSD for index and RAID 0, with Linux.
>> 5 minutes is extremely long. Is that really the right number? I do not
>> see a hardware upgrade changing that with the fine machine you're using.
>>
>> What is your search speed if you disable continuous updates?
>>
>> When you restart the searcher, how long does the first search take?
>>
>>
>> - Toke Eskildsen, State and University Library, Denmark
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: search performance

Posted by Jamie <ja...@mailarchiva.com>.
Toke

Thanks for the response. See below:

On 2014/06/03, 9:17 AM, Toke Eskildsen wrote:
> On Tue, 2014-06-03 at 08:17 +0200, Jamie wrote:
>> Unfortunately, in this instance, it is a live production system, so we
>> cannot conduct experiments. The number is definitely accurate.
>>
>> We have many different systems with a similar load that observe the same
>> performance issue. To my knowledge, the Lucene integration code is
>> fairly well optimized.
> It is possible that the extreme slowness is a combination of factors,
> but with a bit of luck it will boil down to a single thing. Standard
> procedure is to disable features until it performs well, so
>
> - Disable running updates
No can do.
> - Limit page size
Done this.
> - Limit lookup of returned fields
Done this.
> - Disable highlighting
No highlighting.
> - Simpler queries
They are as simple as possible.
> - Whatever else you might think of
Our application has been using Lucene for seven years. It has been 
constantly optimized over that period.

I'll conduct further testing...
>
> At some point along the way I would expect a sharp increase in
> performance.
>
>> I've requested access to the indexes so that we can perform further testing.
> Great.
>
> - Toke Eskildsen, State and University Library, Denmark
>
>
>


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: search performance

Posted by Toke Eskildsen <te...@statsbiblioteket.dk>.
On Tue, 2014-06-03 at 08:17 +0200, Jamie wrote:
> Unfortunately, in this instance, it is a live production system, so we 
> cannot conduct experiments. The number is definitely accurate.
> 
> We have many different systems with a similar load that observe the same 
> performance issue. To my knowledge, the Lucene integration code is 
> fairly well optimized.

It is possible that the extreme slowness is a combination of factors,
but with a bit of luck it will boil down to a single thing. Standard
procedure is to disable features until it performs well, so

- Disable running updates
- Limit page size
- Limit lookup of returned fields
- Disable highlighting
- Simpler queries
- Whatever else you might think of

At some point along the way I would expect a sharp increase in
performance.

> I've requested access to the indexes so that we can perform further testing.

Great.

- Toke Eskildsen, State and University Library, Denmark



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: search performance

Posted by Jamie <ja...@mailarchiva.com>.
Toke

Thanks for the comment.

Unfortunately, in this instance, it is a live production system, so we 
cannot conduct experiments. The number is definitely accurate.

We have many different systems with a similar load that observe the same 
performance issue. To my knowledge, the Lucene integration code is 
fairly well optimized.

I've requested access to the indexes so that we can perform further testing.

Regards

Jamie

On 2014/06/03, 8:09 AM, Toke Eskildsen wrote:
> On Mon, 2014-06-02 at 08:51 +0200, Jamie wrote:
>
> [200GB, 150M documents]
>
>> With NRT enabled, search speed is roughly 5 minutes on average.
>> The server resources are:
>> 2x6 Core Intel CPU, 128GB, 2 SSD for index and RAID 0, with Linux.
> 5 minutes is extremely long. Is that really the right number? I do not
> see a hardware upgrade changing that with the fine machine you're using.
>
> What is your search speed if you disable continuous updates?
>
> When you restart the searcher, how long does the first search take?
>
>
> - Toke Eskildsen, State and University Library, Denmark
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: search performance

Posted by Toke Eskildsen <te...@statsbiblioteket.dk>.
On Mon, 2014-06-02 at 08:51 +0200, Jamie wrote:

[200GB, 150M documents]

> With NRT enabled, search speed is roughly 5 minutes on average.
> The server resources are: 
> 2x6 Core Intel CPU, 128GB, 2 SSD for index and RAID 0, with Linux.

5 minutes is extremely long. Is that really the right number? I do not
see a hardware upgrade changing that with the fine machine you're using.

What is your search speed if you disable continuous updates?

When you restart the searcher, how long does the first search take?


- Toke Eskildsen, State and University Library, Denmark



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: search performance

Posted by Robert Muir <rc...@gmail.com>.
Check and make sure you are not opening an IndexReader for every
search. Be sure you don't do that.
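
For example, a minimal sketch with SearcherManager (Lucene 4.x; the 
writer and query variables are assumptions):

    // open once at startup, reuse for every search
    SearcherManager mgr = new SearcherManager(writer, true, null);

    // per search:
    mgr.maybeRefresh();                  // optionally pick up recent index changes
    IndexSearcher s = mgr.acquire();     // borrow a searcher, don't open a new reader
    try {
        TopDocs hits = s.search(query, 10);
    } finally {
        mgr.release(s);                  // always release what you acquire
    }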

On Mon, Jun 2, 2014 at 2:51 AM, Jamie <ja...@mailarchiva.com> wrote:
> Greetings
>
> Despite following all the recommended optimizations (as described at
> http://wiki.apache.org/lucene-java/ImproveSearchingSpeed) , in some of our
> installations, search performance has reached the point where it is
> unacceptably slow. For instance, in one environment, the total index size is
> 200GB, with 150 million documents indexed. With NRT enabled, search speed is
> roughly 5 minutes on average. The server resources are: 2x6 Core Intel CPU,
> 128GB, 2 SSD for index and RAID 0, with Linux.
>
> The only thing we haven't yet done is to upgrade Lucene from 4.7.x to
> 4.8.x. Is this likely to make any noticeable difference in performance?
>
> Clearly, longer term, we need to move to a distributed search model. We
> thought to take advantage of the distributed search features offered in
> Solr, however, our solution is very tightly integrated into Lucene directly
> (since Solr didn't exist when we started out). Moving to Solr now seems like
> a daunting prospect. We've also been following the Katta project with interest,
> but it doesn't appear to support distributed indexing, and development on it
> seems to have stalled. It would be nice if there were a distributed search
> project on the Lucene level that we could use.
>
> I realize this is a rather vague question, but are there any further
> suggestions on ways to improve search performance? We need cheap and dirty
> ideas, as well as longer term advice on a possible path forward.
>
> Much appreciated
>
> Jamie
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


search performance

Posted by Jamie <ja...@mailarchiva.com>.
Greetings

Despite following all the recommended optimizations (as described at 
http://wiki.apache.org/lucene-java/ImproveSearchingSpeed) , in some of 
our installations, search performance has reached the point where it is 
unacceptably slow. For instance, in one environment, the total index 
size is 200GB, with 150 million documents indexed. With NRT enabled, 
search speed is roughly 5 minutes on average. The server resources are: 
2x6 Core Intel CPU, 128GB, 2 SSD for index and RAID 0, with Linux.

The only thing we haven't yet done is to upgrade Lucene from 4.7.x to 
4.8.x. Is this likely to make any noticeable difference in performance?

Clearly, longer term, we need to move to a distributed search model. We 
thought to take advantage of the distributed search features offered in 
Solr, however, our solution is very tightly integrated into Lucene 
directly (since Solr didn't exist when we started out). Moving to Solr 
now seems like a daunting prospect. We've also been following the Katta 
project with interest, but it doesn't appear to support distributed 
indexing, and development on it seems to have stalled. It would be nice 
if there were a distributed search project on the Lucene level that we 
could use.

I realize this is a rather vague question, but are there any further 
suggestions on ways to improve search performance? We need cheap and 
dirty ideas, as well as longer term advice on a possible path forward.

Much appreciated

Jamie

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: NewBie To Lucene || Perfect configuration on a 64 bit server

Posted by Erick Erickson <er...@gmail.com>.
Hmmm, you might want to move this over to the Solr user's list. This list
is lucene, which doesn't have anything to do with post.jar ;)...


On Fri, May 30, 2014 at 8:25 AM, Erick Erickson <er...@gmail.com>
wrote:

> Try a cURL statement like:
>
> curl "
> http://localhost:8983/solr/update/extract?literal.id=doc33&captureAttr=true&defaultField=text"
> -F "myfile=@testRTFVarious.rtf
>
>
> first, then work up to the post.jar bits...
>
>
> Two cautions:
>
> 1> make sure to commit afterwards. Something like
>
> http://localhost:8983/solr/collection1/update?commit=true
>
> will work
>
> 2> uncomment the line:
>
> <dynamicField name="*" type="string" multiValued="true" />
>
>
> in your schema.xml (and restart solr).
>
>
> Also, track your output to see if it went through successfully.
>
>
> Best,
> Erick
>
>
> On Fri, May 30, 2014 at 7:35 AM, Shruthi <ss...@imedx.com> wrote:
>
>> Hi ,
>>
>> Finally I have been able to convince my team to index all documents
>> first and then search, rather than do on-the-fly indexing. I have
>> set up Solr on my machine but am unable to index RTFs. I have followed the
>> tutorial, but RTF is mentioned nowhere. Can someone please help me?
>> I tried the following options:
>> java -Dtype=application/RTF -jar post.jar *.RTF
>> java -Dtype=application/rtf -jar post.jar *.RTF
>> java -Dtype=text/RTF -jar post.jar *.RTF
>>
>> Thanks,
>> Shruthi Sethi
>> SR. SOFTWARE ENGINEER
>> iMedX
>> OFFICE:
>> 033-4001-5789 ext. N/A
>> MOBILE:
>> 91-9903957546
>> EMAIL:
>> ssethi@imedx.com
>> WEB:
>> www.imedx.com
>>
>>
>>
>> -----Original Message-----
>> From: Ralf Heyde [mailto:xoodrenalin@gmx.de]
>> Sent: Tuesday, May 27, 2014 11:56 AM
>> To: java-user@lucene.apache.org
>> Subject: RE: NewBie To Lucene || Perfect configuration on a 64 bit server
>>
>> Hey,
>>
>> I have several notes about your process.
>>
>> 1st: How do you select the documents you are passing to the index for further
>> searching? Maybe it is more straightforward to "find" them in your
>> programming language?
>> 2nd: Storage is cheap, buy a hard-disk and store the overall index. The
>> most
>> expensive operation is the indexing and the first read access (caching on
>> Lucene / OS level). Imagine what happens when you build the index and
>> delete
>> it afterwards just for a "simple" search operation on a subset of your
>> documents.
>>
>> Cheers, Ralf
>>
>>
>>
>> -----Original Message-----
>> From: rulinma [mailto:rulinma@gmail.com]
>> Sent: Tuesday, 27 May 2014 03:14
>> To: java-user@lucene.apache.org
>> Subject: RE: NewBie To Lucene || Perfect configuration on a 64 bit server
>>
>> 1000+ docs/sec is Solr; plain Lucene is even faster.
>>
>>
>>
>> --
>> View this message in context:
>>
>> http://lucene.472066.n3.nabble.com/NewBie-To-Lucene-Perfect-configuration-on-a-64-bit-server-tp4136871p4138215.html
>> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>

Re: NewBie To Lucene || Perfect configuration on a 64 bit server

Posted by Erick Erickson <er...@gmail.com>.
Try a cURL statement like:

curl "
http://localhost:8983/solr/update/extract?literal.id=doc33&captureAttr=true&defaultField=text"
-F "myfile=@testRTFVarious.rtf


first, then work up to the post.jar bits...


Two cautions:

1> make sure to commit afterwards. Something like

http://localhost:8983/solr/collection1/update?commit=true

will work

2> uncomment the line:

<dynamicField name="*" type="string" multiValued="true" />


in your schema.xml (and restart solr).


Also, track your output to see if it went through successfully.


Best,
Erick


On Fri, May 30, 2014 at 7:35 AM, Shruthi <ss...@imedx.com> wrote:

> Hi ,
>
> Finally I have been able to convince my team to index all documents
> first and then search, rather than do on-the-fly indexing. I have
> set up Solr on my machine but am unable to index RTFs. I have followed the
> tutorial, but RTF is mentioned nowhere. Can someone please help me?
> I tried the following options:
> java -Dtype=application/RTF -jar post.jar *.RTF
> java -Dtype=application/rtf -jar post.jar *.RTF
> java -Dtype=text/RTF -jar post.jar *.RTF
>
> Thanks,
> Shruthi Sethi
> SR. SOFTWARE ENGINEER
> iMedX
> OFFICE:
> 033-4001-5789 ext. N/A
> MOBILE:
> 91-9903957546
> EMAIL:
> ssethi@imedx.com
> WEB:
> www.imedx.com
>
>
>
> -----Original Message-----
> From: Ralf Heyde [mailto:xoodrenalin@gmx.de]
> Sent: Tuesday, May 27, 2014 11:56 AM
> To: java-user@lucene.apache.org
> Subject: RE: NewBie To Lucene || Perfect configuration on a 64 bit server
>
> Hey,
>
> I have several notes about your process.
>
> 1st: How do you select the documents you are passing to the index for further
> searching? Maybe it is more straightforward to "find" them in your
> programming language?
> 2nd: Storage is cheap, buy a hard-disk and store the overall index. The
> most
> expensive operation is the indexing and the first read access (caching on
> Lucene / OS level). Imagine what happens when you build the index and
> delete
> it afterwards just for a "simple" search operation on a subset of your
> documents.
>
> Cheers, Ralf
>
>
>
> -----Original Message-----
> From: rulinma [mailto:rulinma@gmail.com]
> Sent: Tuesday, 27 May 2014 03:14
> To: java-user@lucene.apache.org
> Subject: RE: NewBie To Lucene || Perfect configuration on a 64 bit server
>
> 1000+ docs/sec is Solr; plain Lucene is even faster.
>
>
>
> --
> View this message in context:
>
> http://lucene.472066.n3.nabble.com/NewBie-To-Lucene-Perfect-configuration-on
> -a-64-bit-server-tp4136871p4138215.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

RE: NewBie To Lucene || Perfect configuration on a 64 bit server

Posted by Shruthi <ss...@imedx.com>.
Hi ,

Finally I have been able to convince my team to index all documents first and then search, rather than do on-the-fly indexing. I have set up Solr on my machine but am unable to index RTFs. I have followed the tutorial, but RTF is mentioned nowhere. Can someone please help me?
I tried the following options:
java -Dtype=application/RTF -jar post.jar *.RTF
java -Dtype=application/rtf -jar post.jar *.RTF
java -Dtype=text/RTF -jar post.jar *.RTF

Thanks,
Shruthi Sethi
SR. SOFTWARE ENGINEER
iMedX
OFFICE:  
033-4001-5789 ext. N/A
MOBILE:  
91-9903957546
EMAIL:  
ssethi@imedx.com
WEB:  
www.imedx.com



-----Original Message-----
From: Ralf Heyde [mailto:xoodrenalin@gmx.de] 
Sent: Tuesday, May 27, 2014 11:56 AM
To: java-user@lucene.apache.org
Subject: RE: NewBie To Lucene || Perfect configuration on a 64 bit server

Hey,

I have several notes about your process.

1st: How do you select the documents you are passing to the index for further
searching? Maybe it is more straightforward to "find" them in your
programming language?
2nd: Storage is cheap, buy a hard-disk and store the overall index. The most
expensive operation is the indexing and the first read access (caching on
Lucene / OS level). Imagine what happens when you build the index and delete
it afterwards just for a "simple" search operation on a subset of your
documents.

Cheers, Ralf



-----Original Message-----
From: rulinma [mailto:rulinma@gmail.com] 
Sent: Tuesday, 27 May 2014 03:14
To: java-user@lucene.apache.org
Subject: RE: NewBie To Lucene || Perfect configuration on a 64 bit server

1000+ docs/sec is Solr; plain Lucene is even faster.



--
View this message in context:
http://lucene.472066.n3.nabble.com/NewBie-To-Lucene-Perfect-configuration-on
-a-64-bit-server-tp4136871p4138215.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: NewBie To Lucene || Perfect configuration on a 64 bit server

Posted by Erick Erickson <er...@gmail.com>.
bq: We did a detailed analysis of each step and observed that
indexing of each RTF file (i.e. using path and content, with a
FileReader) happened within the same millisecond. On average it took
95 ms for each file to get indexed, and anywhere between 200 and
500 ms for a file to get converted to text using Aspose.

Do I misunderstand, or is between 2/3 and 5/6 of your time spent in
acquiring the text? In which case, even if you get Lucene down to 0 ms to
index a doc (impossible of course), you'll gain at best 33%.

Which is another argument for indexing all the docs and just using
filters or a TermsFilter as Arjen suggests.

Best,
Erick



On Mon, May 26, 2014 at 11:25 PM, Ralf Heyde <xo...@gmx.de> wrote:
> Hey,
>
> I have several notes about your process.
>
> 1st: How do you select the documents you are passing to the index for further
> searching? Maybe it is more straightforward to "find" them in your
> programming language?
> 2nd: Storage is cheap, buy a hard-disk and store the overall index. The most
> expensive operation is the indexing and the first read access (caching on
> Lucene / OS level). Imagine what happens when you build the index and delete
> it afterwards just for a "simple" search operation on a subset of your
> documents.
>
> Cheers, Ralf
>
>
>
> -----Original Message-----
> From: rulinma [mailto:rulinma@gmail.com]
> Sent: Tuesday, 27 May 2014 03:14
> To: java-user@lucene.apache.org
> Subject: RE: NewBie To Lucene || Perfect configuration on a 64 bit server
>
> 1000+ docs/sec is Solr; plain Lucene is even faster.
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/NewBie-To-Lucene-Perfect-configuration-on
> -a-64-bit-server-tp4136871p4138215.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


RE: NewBie To Lucene || Perfect configuration on a 64 bit server

Posted by Ralf Heyde <xo...@gmx.de>.
Hey,

I have several notes about your process.

1st: How do you select the documents you are passing to the index for further
searching? Maybe it is more straightforward to "find" them in your
programming language?
2nd: Storage is cheap, buy a hard-disk and store the overall index. The most
expensive operation is the indexing and the first read access (caching on
Lucene / OS level). Imagine what happens when you build the index and delete
it afterwards just for a "simple" search operation on a subset of your
documents.

Cheers, Ralf



-----Original Message-----
From: rulinma [mailto:rulinma@gmail.com] 
Sent: Tuesday, 27 May 2014 03:14
To: java-user@lucene.apache.org
Subject: RE: NewBie To Lucene || Perfect configuration on a 64 bit server

1000+ docs/sec is Solr; plain Lucene is even faster.



--
View this message in context:
http://lucene.472066.n3.nabble.com/NewBie-To-Lucene-Perfect-configuration-on
-a-64-bit-server-tp4136871p4138215.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


RE: NewBie To Lucene || Perfect configuration on a 64 bit server

Posted by rulinma <ru...@gmail.com>.
1000+ docs/sec is Solr; plain Lucene is even faster.



--
View this message in context: http://lucene.472066.n3.nabble.com/NewBie-To-Lucene-Perfect-configuration-on-a-64-bit-server-tp4136871p4138215.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: NewBie To Lucene || Perfect configuration on a 64 bit server

Posted by Arjen van der Meijden <ac...@tweakers.net>.
You don't need to worry about the 1024 maxBooleanClauses; just use a 
TermsFilter.

https://lucene.apache.org/core/4_8_0/queries/org/apache/lucene/queries/TermsFilter.html

I use it for a similar scenario, where we have a data structure that 
determines a subset of 1.5 million documents from outside Lucene. And to 
make the search (much) faster, I convert a list of IDs (primary keys in 
the database) to a bunch of 'id:X' terms.

If you have other criteria (say some category id or other grouped 
selection) besides those IDs, you could index those alongside your 
documents and use TermsFilters (and/or BooleanFilter with several other 
filters) to eventually make a pretty fast subset-selection.
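
A minimal sketch of that approach (Lucene 4.x; the "id" field name and 
the list of selected IDs are assumptions):

     List<Term> idTerms = new ArrayList<Term>();
     for (String id : selectedIds) {             // e.g. the IDs from the database
         idTerms.add(new Term("id", id));
     }
     Filter idFilter = new TermsFilter(idTerms);
     TopDocs hits = searcher.search(userQuery, idFilter, 500);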

It'll not be faster than having a dedicated 500-document database, but 
if you have to recreate that on the fly... I'd expect you to easily beat 
the time of the total procedure by a few orders of magnitude.

Best regards,

Arjen

On 26-5-2014 18:15 Erick Erickson wrote:
> bq: We don’t want to search on the complete document store
>
> Why not? Alexandre's comment is spot on. For 500 docs you could easily
> form a filter query like
> &fq=id1 OR id2 OR id3.... (solr-style, but easily done in Lucene). You
> get these IDs from the DB
> search. This will still be MUCH faster than indexing on the fly.
>
> The default maxBooleanClauses of 1024 is just a configuration problem;
> I've seen it at 10 times that.
>
> And you could cache the filter if you wanted and that fit your use case.
>
> Unless you _really_ can show that this solution is untenable, I think
> you're making this problem far
> too hard for yourself.
>
> If you insist on indexing these docs on the fly, you'll have to live
> with the performance hit. There's no
> real magic bullet to make your indexing sub-second. As others have
> said, indexing 500 docs seems
> like it shouldn't take as long as you're reporting. I personally
> suspect that your problem is
> somewhere in the acquisition phase. What happens if you just comment
> out all the
> code that actually does anything with Lucene and just go through the
> motions of getting
> the doc from the system-of-record in your code? My bet is that if you
> comment out the indexing
> part,  you'll find you spend 18 of your 20 seconds (SWAG).
>
> If my bet is correct, then there's _nothing_ you can do to make this
> case work as far as Lucene
> is concerned; Lucene had nothing to do with the speed issues, it's
> acquiring the docs in the first place.
>
> And if I'm wrong, then there's also virtually nothing you can do.
> Lucene is fast, very fast. You're
> apparently indexing things that are big/complex/whatever.
>
> Really, explain please why indexing all the docs and using a filter of
> the IDs from the DB
> won't work. This really, really smells like an XY problem and you have
> a flawed approach
> that is best scrapped.
>
> Best,
> Erick
>
>
> On Mon, May 26, 2014 at 6:08 AM, Alexandre Patry
> <al...@keatext.com> wrote:
>> On 26/05/2014 05:40, Shruthi wrote:
>>>
>>> Hi All,
>>>
>>> Thanks for the suggestions. But there is a slight difference in the
>>> requirements.
>>> 1. We don't  index/ search 10 million documents for a keyword; instead we
>>> do it on only 500 documents because we are supposed to get the final result
>>> only from the 500 set of documents.
>>> 2.We have already filtered 500 documents from the 10M+ documents based on
>>> a DB Stored Procedure which has nothing to do with any kind of search
>>> keywords .
>>> 3.Our search algorithm plays a vital role on this new set of 500
>>> documents.
>>> 4. We can't avoid on-the-fly indexing because the document set to be
>>> indexed is random and ever-changing.
>>>          Although we could index the existing 10M+ docs beforehand and keep
>>> the indexes ready, we don't want to search the complete document store.
>>> Instead we only want to search the 500 documents obtained above.
>>>
>>> Is there a better alternative for this requirement?
>>
>> You could index all 10 million documents and use a custom filter[1] with
>> your queries to specify which 500 documents to look at.
>>
>> Hope this helps,
>>
>> Alexandre
>>
>> [1]
>> http://lucene.apache.org/core/4_8_1/core/org/apache/lucene/search/Filter.html
>>>
>>>
>>> Thanks,
>>>
>>> Shruthi Sethi
>>> SR. SOFTWARE ENGINEER
>>> iMedX
>>> OFFICE:
>>> 033-4001-5789 ext. N/A
>>> MOBILE:
>>> 91-9903957546
>>> EMAIL:
>>> ssethi@imedx.com
>>> WEB:
>>> www.imedx.com
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: shashi.mit@gmail.com [mailto:shashi.mit@gmail.com] On Behalf Of
>>> Shashi Kant
>>> Sent: Saturday, May 24, 2014 5:55 AM
>>> To: java-user@lucene.apache.org
>>> Subject: Re: NewBie To Lucene || Perfect configuration on a 64 bit server
>>>
>>> To 2nd  Vitaly's suggestion. You should consider using Apache Solr
>>> instead - it handles such issues OOTB .
>>>
>>>
>>> On Fri, May 23, 2014 at 7:52 PM, Vitaly Funstein <vf...@gmail.com>
>>> wrote:
>>>>
>>>> At the risk of sounding overly critical here, I would say you need to
>>>> scrap
>>>> your entire approach of building one small index per request, and just
>>>> build your entire searchable data store in Lucene/Solr. This is the
>>>> simplest and probably most maintainable and scalable solution. Even if
>>>> your
>>>> index contains 10M+ documents, returning at most 500 search results
>>>> should
>>>> be lightning fast compared to the latencies you're seeing right now. To
>>>> facilitate data export from the DB, take a look at this:
>>>> http://wiki.apache.org/solr/DataImportHandler
>>>>
>>>>
>>>> On Tue, May 20, 2014 at 7:36 AM, Shruthi <ss...@imedx.com> wrote:
>>>>
>>>>>
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: Toke Eskildsen [mailto:te@statsbiblioteket.dk]
>>>>> Sent: Tuesday, May 20, 2014 3:48 PM
>>>>> To: java-user@lucene.apache.org
>>>>> Subject: Re: NewBie To Lucene || Perfect configuration on a 64 bit
>>>>> server
>>>>>
>>>>> On Tue, 2014-05-20 at 11:56 +0200, Shruthi wrote:
>>>>>
>>>>> Toke:
>>>>>>
>>>>>> Is 20 seconds an acceptable response time for your users?
>>>>>>
>>>>>> Shruthi: It's definitely not acceptable. PFA the piece of code that we
>>>>>> are using; it's taking 20 seconds. That's why I drafted this ticket, to
>>>>>> see where I was going wrong.
>>>>>
>>>>> Indexing 1000 documents/sec in Lucene is quite common, so even taking
>>>>> into account large documents, 20 seconds sounds like quite a bit.
>>>>> Shruthi: I had attached the code snippet in the previous mail. Do you
>>>>> suspect foul play there?
>>>>>
>>>>>> Shruthi: Well, it's a two-stage process: the client is looking at
>>>>>> historical data based on parameters like names, dates, MRN, fields,
>>>>>> etc. So the query actually gets the data set fulfilling the
>>>>>> requirements.
>>>>>>
>>>>>> If the client is interested in doing a text search, then he would pass
>>>>>> the search phrase over the result set.
>>>>>
>>>>> So it is not possible for a client to perform a broad phrase search to
>>>>> start with. And it sounds like your DB queries are all simple matching?
>>>>> No complex joins and such? If so, this calls even more for a full
>>>>> Lucene-index solution, which handles all aspects of the search process.
>>>>> Shruthi: We call a DB stored procedure to get us the result set for
>>>>> working with.
>>>>> We will be using the highlighter API, and I don't think MemoryIndex can
>>>>> be used with the highlighter.
>>>>>
>>>>> - Toke Eskildsen, State and University Library, Denmark
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>>
>>>>>
>>>
>>>
>>
>>
>> --
>> Alexandre Patry, Ph.D
>> Chercheur / Researcher
>> http://KeaText.com
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


RE: NewBie To Lucene || Perfect configuration on a 64 bit server

Posted by Shruthi <ss...@imedx.com>.
Hi Erick,

Here are a few statistics we got lately.
It takes 11 seconds on a 64-bit server, using the RAMDirectory implementation, for 500 RTF documents to get indexed on the fly.
It takes 25 seconds for the same scenario if each RTF is first converted to a text file using Aspose and the text files are then indexed.
We then resorted to the Apache Tika toolkit, but somehow it is failing for 1% of the RTFs that we have, so we haven't yet gained confidence in it. (That 1% of files was successfully parsed through Aspose.)

We did a detailed analysis of each step and observed that indexing of each RTF file (i.e. using path and content, with a FileReader) happened within the same millisecond. On average it took 95 ms for each file to get indexed, and anywhere between 200 and 500 ms for a file to get converted to text using Aspose.


Thanks,

Shruthi Sethi
SR. SOFTWARE ENGINEER
iMedX
OFFICE:  
033-4001-5789 ext. N/A
MOBILE:  
91-9903957546
EMAIL:  
ssethi@imedx.com
WEB:  
www.imedx.com



-----Original Message-----
From: Erick Erickson [mailto:erickerickson@gmail.com] 
Sent: Monday, May 26, 2014 9:46 PM
To: java-user
Subject: Re: NewBie To Lucene || Perfect configuration on a 64 bit server

bq: We don’t want to search on the complete document store

Why not? Alexandre's comment is spot on. For 500 docs you could easily
form a filter query like
&fq=id1 OR id2 OR id3.... (solr-style, but easily done in Lucene). You
get these IDs from the DB
search. This will still be MUCH faster than indexing on the fly.

The default maxBooleanClauses of 1024 is just a configuration problem;
I've seen it at 10 times that.

And you could cache the filter if you wanted and that fit your use case.

Unless you _really_ can show that this solution is untenable, I think
you're making this problem far
too hard for yourself.

If you insist on indexing these docs on the fly, you'll have to live
with the performance hit. There's no
real magic bullet to make your indexing sub-second. As others have
said, indexing 500 docs seems
like it shouldn't take as long as you're reporting. I personally
suspect that your problem is
somewhere in the acquisition phase. What happens if you just comment
out all the
code that actually does anything with Lucene and just go through the
motions of getting
the doc from the system-of-record in your code? My bet is that if you
comment out the indexing
part,  you'll find you spend 18 of your 20 seconds (SWAG).

If my bet is correct, then there's _nothing_ you can do to make this
case work as far as Lucene
is concerned; Lucene had nothing to do with the speed issues, it's
acquiring the docs in the first place.

And if I'm wrong, then there's also virtually nothing you can do.
Lucene is fast, very fast. You're
apparently indexing things that are big/complex/whatever.

Really, explain please why indexing all the docs and using a filter of
the IDs from the DB
won't work. This really, really smells like an XY problem and you have
a flawed approach
that is best scrapped.

Best,
Erick


On Mon, May 26, 2014 at 6:08 AM, Alexandre Patry
<al...@keatext.com> wrote:
> On 26/05/2014 05:40, Shruthi wrote:
>>
>> Hi All,
>>
>> Thanks for the suggestions. But there is a slight difference in the
>> requirements.
>> 1. We don't  index/ search 10 million documents for a keyword; instead we
>> do it on only 500 documents because we are supposed to get the final result
>> only from the 500 set of documents.
>> 2.We have already filtered 500 documents from the 10M+ documents based on
>> a DB Stored Procedure which has nothing to do with any kind of search
>> keywords .
>> 3.Our search algorithm plays a vital role on this new set of 500
>> documents.
>> 4. We can't avoid on-the-fly indexing because the document set to be
>> indexed is random and ever-changing.
>>         Although we could index the existing 10M+ docs beforehand and keep
>> the indexes ready, we don't want to search the complete document store.
>> Instead we only want to search the 500 documents obtained above.
>>
>> Is there a better alternative for this requirement?
>
> You could index all 10 million documents and use a custom filter[1] with
> your queries to specify which 500 documents to look at.
>
> Hope this helps,
>
> Alexandre
>
> [1]
> http://lucene.apache.org/core/4_8_1/core/org/apache/lucene/search/Filter.html
>>
>>
>> Thanks,
>>
>> Shruthi Sethi
>> SR. SOFTWARE ENGINEER
>> iMedX
>> OFFICE:
>> 033-4001-5789 ext. N/A
>> MOBILE:
>> 91-9903957546
>> EMAIL:
>> ssethi@imedx.com
>> WEB:
>> www.imedx.com
>>
>>
>>
>> -----Original Message-----
>> From: shashi.mit@gmail.com [mailto:shashi.mit@gmail.com] On Behalf Of
>> Shashi Kant
>> Sent: Saturday, May 24, 2014 5:55 AM
>> To: java-user@lucene.apache.org
>> Subject: Re: NewBie To Lucene || Perfect configuration on a 64 bit server
>>
>> To 2nd  Vitaly's suggestion. You should consider using Apache Solr
>> instead - it handles such issues OOTB .
>>
>>
>> On Fri, May 23, 2014 at 7:52 PM, Vitaly Funstein <vf...@gmail.com>
>> wrote:
>>>
>>> At the risk of sounding overly critical here, I would say you need to
>>> scrap
>>> your entire approach of building one small index per request, and just
>>> build your entire searchable data store in Lucene/Solr. This is the
>>> simplest and probably most maintainable and scalable solution. Even if
>>> your
>>> index contains 10M+ documents, returning at most 500 search results
>>> should
>>> be lightning fast compared to the latencies you're seeing right now. To
>>> facilitate data export from the DB, take a look at this:
>>> http://wiki.apache.org/solr/DataImportHandler
>>>
>>>
>>> On Tue, May 20, 2014 at 7:36 AM, Shruthi <ss...@imedx.com> wrote:
>>>
>>>>
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Toke Eskildsen [mailto:te@statsbiblioteket.dk]
>>>> Sent: Tuesday, May 20, 2014 3:48 PM
>>>> To: java-user@lucene.apache.org
>>>> Subject: Re: NewBie To Lucene || Perfect configuration on a 64 bit
>>>> server
>>>>
>>>> On Tue, 2014-05-20 at 11:56 +0200, Shruthi wrote:
>>>>
>>>> Toke:
>>>>>
>>>>> Is 20 seconds an acceptable response time for your users?
>>>>>
>>>>> Shruthi: It's definitely not acceptable. PFA the piece of code that we
>>>>> are using; it's taking 20 seconds. That's why I drafted this ticket, to
>>>>> see where I was going wrong.
>>>>
>>>> Indexing 1000 documents/sec in Lucene is quite common, so even taking
>>>> into account large documents, 20 seconds sounds like quite a bit.
>>>> Shruthi: I had attached the code snippet in the previous mail. Do you
>>>> suspect foul play there?
>>>>
>>>>> Shruthi: Well, it's a two-stage process: the client is looking at
>>>>> historical data based on parameters like names, dates, MRN, fields,
>>>>> etc. So the query actually gets the data set fulfilling the
>>>>> requirements.
>>>>>
>>>>> If the client is interested in doing a text search, then he would pass
>>>>> the search phrase over the result set.
>>>>
>>>> So it is not possible for a client to perform a broad phrase search to
>>>> start with. And it sounds like your DB queries are all simple matching?
>>>> No complex joins and such? If so, this calls even more for a full
>>>> Lucene-index solution, which handles all aspects of the search process.
>>>> Shruthi: We call a DB stored procedure to get us the result set for
>>>> working with.
>>>> We will be using the highlighter API, and I don't think MemoryIndex can
>>>> be used with the highlighter.
>>>>
>>>> - Toke Eskildsen, State and University Library, Denmark
>>>>
>>>>
>>>>
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>
>>>>
>>
>>
>
>
> --
> Alexandre Patry, Ph.D
> Chercheur / Researcher
> http://KeaText.com
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: NewBie To Lucene || Perfect configuration on a 64 bit server

Posted by Erick Erickson <er...@gmail.com>.
bq: We don’t want to search on the complete document store

Why not? Alexandre's comment is spot on. For 500 docs you could easily
form a filter query like
&fq=id1 OR id2 OR id3.... (solr-style, but easily done in Lucene). You
get these IDs from the DB
search. This will still be MUCH faster than indexing on the fly.

The default maxBooleanClauses of 1024 is just a configuration problem;
I've seen it at 10 times that.

And you could cache the filter if you wanted and that fit your use case.
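
A hedged sketch of building and caching such an ID filter (Lucene 4.x; 
the "id" field and the idsFromDb list are assumptions):

     BooleanQuery ids = new BooleanQuery();
     for (String id : idsFromDb) {
         ids.add(new TermQuery(new Term("id", id)), BooleanClause.Occur.SHOULD);
     }
     // BooleanQuery.setMaxClauseCount(10240);  // raise the 1024 default if needed
     Filter cached = new CachingWrapperFilter(new QueryWrapperFilter(ids));
     TopDocs hits = searcher.search(userQuery, cached, 500);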

Unless you _really_ can show that this solution is untenable, I think
you're making this problem far
too hard for yourself.

If you insist on indexing these docs on the fly, you'll have to live
with the performance hit. There's no
real magic bullet to make your indexing sub-second. As others have
said, indexing 500 docs seems
like it shouldn't take as long as you're reporting. I personally
suspect that your problem is
somewhere in the acquisition phase. What happens if you just comment
out all the
code that actually does anything with Lucene and just go through the
motions of getting
the doc from the system-of-record in your code? My bet is that if you
comment out the indexing
part,  you'll find you spend 18 of your 20 seconds (SWAG).

If my bet is correct, then there's _nothing_ you can do to make this
case work as far as Lucene
is concerned; Lucene had nothing to do with the speed issues, it's
acquiring the docs in the first place.

And if I'm wrong, then there's also virtually nothing you can do.
Lucene is fast, very fast. You're
apparently indexing things that are big/complex/whatever.

Really, explain please why indexing all the docs and using a filter of
the IDs from the DB
won't work. This really, really smells like an XY problem and you have
a flawed approach
that is best scrapped.

Best,
Erick


On Mon, May 26, 2014 at 6:08 AM, Alexandre Patry
<al...@keatext.com> wrote:
> On 26/05/2014 05:40, Shruthi wrote:
>>
>> Hi All,
>>
>> Thanks for the suggestions. But there is a slight difference in the
>> requirements.
>> 1. We don't  index/ search 10 million documents for a keyword; instead we
>> do it on only 500 documents because we are supposed to get the final result
>> only from the 500 set of documents.
>> 2.We have already filtered 500 documents from the 10M+ documents based on
>> a DB Stored Procedure which has nothing to do with any kind of search
>> keywords .
>> 3.Our search algorithm plays a vital role on this new set of 500
>> documents.
>> 4. We can't avoid on-the-fly indexing because the document set to be
>> indexed is random and ever-changing.
>>         Although we could index the existing 10M+ docs beforehand and keep
>> the indexes ready, we don't want to search the complete document store.
>> Instead we only want to search the 500 documents obtained above.
>>
>> Is there a better alternative for this requirement?
>
> You could index all 10 million documents and use a custom filter[1] with
> your queries to specify which 500 documents to look at.
>
> Hope this helps,
>
> Alexandre
>
> [1]
> http://lucene.apache.org/core/4_8_1/core/org/apache/lucene/search/Filter.html
>>
>>
>> Thanks,
>>
>> Shruthi Sethi
>> SR. SOFTWARE ENGINEER
>> iMedX
>> OFFICE:
>> 033-4001-5789 ext. N/A
>> MOBILE:
>> 91-9903957546
>> EMAIL:
>> ssethi@imedx.com
>> WEB:
>> www.imedx.com
>>
>>
>>
>> -----Original Message-----
>> From: shashi.mit@gmail.com [mailto:shashi.mit@gmail.com] On Behalf Of
>> Shashi Kant
>> Sent: Saturday, May 24, 2014 5:55 AM
>> To: java-user@lucene.apache.org
>> Subject: Re: NewBie To Lucene || Perfect configuration on a 64 bit server
>>
>> To 2nd  Vitaly's suggestion. You should consider using Apache Solr
>> instead - it handles such issues OOTB .
>>
>>
>> On Fri, May 23, 2014 at 7:52 PM, Vitaly Funstein <vf...@gmail.com>
>> wrote:
>>>
>>> At the risk of sounding overly critical here, I would say you need to
>>> scrap
>>> your entire approach of building one small index per request, and just
>>> build your entire searchable data store in Lucene/Solr. This is the
>>> simplest and probably most maintainable and scalable solution. Even if
>>> your
>>> index contains 10M+ documents, returning at most 500 search results
>>> should
>>> be lightning fast compared to the latencies you're seeing right now. To
>>> facilitate data export from the DB, take a look at this:
>>> http://wiki.apache.org/solr/DataImportHandler
>>>
>>>
>>> On Tue, May 20, 2014 at 7:36 AM, Shruthi <ss...@imedx.com> wrote:
>>>
>>>>
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Toke Eskildsen [mailto:te@statsbiblioteket.dk]
>>>> Sent: Tuesday, May 20, 2014 3:48 PM
>>>> To: java-user@lucene.apache.org
>>>> Subject: Re: NewBie To Lucene || Perfect configuration on a 64 bit
>>>> server
>>>>
>>>> On Tue, 2014-05-20 at 11:56 +0200, Shruthi wrote:
>>>>
>>>> Toke:
>>>>>
>>>>> Is 20 seconds an acceptable response time for your users?
>>>>>
>>>>> Shruthi: It's definitely not acceptable. PFA the piece of code that we
>>>>> are using. It's taking 20 seconds. That's why I drafted this ticket to
>>>>> see where I was going wrong.
>>>>
>>>> Indexing 1000 documents/sec in Lucene is quite common, so even taking
>>>> into account large documents, 20 seconds sounds like quite a bit.
>>>> Shruthi: I had attached the code snippet in the previous mail. Do you
>>>> suspect foul play there?
>>>>
>>>>> Shruthi: Well, it's a two-stage process: the client is looking at
>>>>> historical data based on parameters like names, dates, MRN, fields,
>>>>> etc. So the query actually gets the data set fulfilling the
>>>>> requirements.
>>>>>
>>>>> If the client is interested in doing a text search then he would pass
>>>>> the search phrase on the result set.
>>>>
>>>> So it is not possible for a client to perform a broad phrase search to
>>>> start with. And it sounds like your DB-queries are all simple matching?
>>>> No complex joins and such? If so, this calls even more for a full
>>>> Lucene-index solution, which handles all aspects of the search process.
>>>> Shruthi: We call a DB stored procedure to get us the result set to work
>>>> with.
>>>> We will be using the Highlighter API, and I don't think MemoryIndex can
>>>> be used with the highlighter.
>>>>
>>>> - Toke Eskildsen, State and University Library, Denmark
>>>>
>>>>
>>>>
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>
>>>>
>>
>>
>
>
> --
> Alexandre Patry, Ph.D
> Chercheur / Researcher
> http://KeaText.com
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: NewBie To Lucene || Perfect configuration on a 64 bit server

Posted by Alexandre Patry <al...@keatext.com>.
On 26/05/2014 05:40, Shruthi wrote:
> Hi All,
>
> Thanks for the suggestions. But there is a slight difference in the requirements.
> 1. We don't index/search 10 million documents for a keyword; instead we do it on only 500 documents because we are supposed to get the final result only from that set of 500 documents.
> 2. We have already filtered 500 documents out of the 10M+ documents with a DB stored procedure which has nothing to do with any kind of search keywords.
> 3. Our search algorithm plays a vital role on this new set of 500 documents.
> 4. We can't avoid on-the-fly indexing because the document set to be indexed is random and ever-changing.
> 	Although we could index the existing 10M+ docs beforehand and keep the indexes ready, we don't want to search the complete document store. Instead we only want to search the 500 documents obtained above.
>
> Is there a better alternative for this requirement?
You could index all 10 million documents and use a custom filter[1] with 
your queries to specify which 500 documents to look at.
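
A rough sketch of that approach on Lucene 4.x, assuming every document
was indexed with a unique, non-analyzed "id" field (the field name is an
assumption; TermsFilter lives in the lucene-queries module):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.index.Term;
import org.apache.lucene.queries.TermsFilter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;

public class SubsetSearch {
    // ids: the ~500 document ids returned by the stored procedure.
    static TopDocs search(IndexSearcher searcher, Query query,
                          List<String> ids) throws IOException {
        List<Term> terms = new ArrayList<Term>();
        for (String id : ids) {
            terms.add(new Term("id", id));
        }
        // Only documents whose id is in the list are even considered.
        return searcher.search(query, new TermsFilter(terms), 500);
    }
}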

Hope this helps,

Alexandre

[1] 
http://lucene.apache.org/core/4_8_1/core/org/apache/lucene/search/Filter.html 

>
> Thanks,
>
> Shruthi Sethi
> SR. SOFTWARE ENGINEER
> iMedX
> OFFICE:
> 033-4001-5789 ext. N/A
> MOBILE:
> 91-9903957546
> EMAIL:
> ssethi@imedx.com
> WEB:
> www.imedx.com
>
>
>
> -----Original Message-----
> From: shashi.mit@gmail.com [mailto:shashi.mit@gmail.com] On Behalf Of Shashi Kant
> Sent: Saturday, May 24, 2014 5:55 AM
> To: java-user@lucene.apache.org
> Subject: Re: NewBie To Lucene || Perfect configuration on a 64 bit server
>
> To second Vitaly's suggestion: you should consider using Apache Solr
> instead - it handles such issues out of the box.
>
>
> On Fri, May 23, 2014 at 7:52 PM, Vitaly Funstein <vf...@gmail.com> wrote:
>> At the risk of sounding overly critical here, I would say you need to scrap
>> your entire approach of building one small index per request, and just
>> build your entire searchable data store in Lucene/Solr. This is the
>> simplest and probably most maintainable and scalable solution. Even if your
>> index contains 10M+ documents, returning at most 500 search results should
>> be lightning fast compared to the latencies you're seeing right now. To
>> facilitate data export from the DB, take a look at this:
>> http://wiki.apache.org/solr/DataImportHandler
>>
>>
>> On Tue, May 20, 2014 at 7:36 AM, Shruthi <ss...@imedx.com> wrote:
>>
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: Toke Eskildsen [mailto:te@statsbiblioteket.dk]
>>> Sent: Tuesday, May 20, 2014 3:48 PM
>>> To: java-user@lucene.apache.org
>>> Subject: Re: NewBie To Lucene || Perfect configuration on a 64 bit server
>>>
>>> On Tue, 2014-05-20 at 11:56 +0200, Shruthi wrote:
>>>
>>> Toke:
>>>> Is 20 seconds an acceptable response time for your users?
>>>>
>>>> Shruthi: It's definitely not acceptable. PFA the piece of code that we
>>>> are using. It's taking 20 seconds. That's why I drafted this ticket to
>>>> see where I was going wrong.
>>> Indexing 1000 documents/sec in Lucene is quite common, so even taking
>>> into account large documents, 20 seconds sounds like quite a bit.
>>> Shruthi: I had attached the code snippet in the previous mail. Do you
>>> suspect foul play there?
>>>
>>>> Shruthi: Well, it's a two-stage process: the client is looking at
>>>> historical data based on parameters like names, dates, MRN, fields,
>>>> etc. So the query actually gets the data set fulfilling the
>>>> requirements.
>>>>
>>>> If the client is interested in doing a text search then he would pass
>>>> the search phrase on the result set.
>>> So it is not possible for a client to perform a broad phrase search to
>>> start with. And it sounds like your DB-queries are all simple matching?
>>> No complex joins and such? If so, this calls even more for a full
>>> Lucene-index solution, which handles all aspects of the search process.
>>> Shruthi: We call a DB stored procedure to get us the result set to work
>>> with.
>>> We will be using the Highlighter API, and I don't think MemoryIndex can
>>> be used with the highlighter.
>>>
>>> - Toke Eskildsen, State and University Library, Denmark
>>>
>>>
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>
>>>
>
>


-- 
Alexandre Patry, Ph.D
Chercheur / Researcher
http://KeaText.com


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


RE: NewBie To Lucene || Perfect configuration on a 64 bit server

Posted by Shruthi <ss...@imedx.com>.
Hi All,

Thanks for the suggestions. But there is a slight difference in the requirements.
1. We don't index/search 10 million documents for a keyword; instead we do it on only 500 documents because we are supposed to get the final result only from that set of 500 documents.
2. We have already filtered 500 documents out of the 10M+ documents with a DB stored procedure which has nothing to do with any kind of search keywords.
3. Our search algorithm plays a vital role on this new set of 500 documents.
4. We can't avoid on-the-fly indexing because the document set to be indexed is random and ever-changing.
	Although we could index the existing 10M+ docs beforehand and keep the indexes ready, we don't want to search the complete document store. Instead we only want to search the 500 documents obtained above.

Is there a better alternative for this requirement?

Thanks,

Shruthi Sethi
SR. SOFTWARE ENGINEER
iMedX
OFFICE:  
033-4001-5789 ext. N/A
MOBILE:  
91-9903957546
EMAIL:  
ssethi@imedx.com
WEB:  
www.imedx.com



-----Original Message-----
From: shashi.mit@gmail.com [mailto:shashi.mit@gmail.com] On Behalf Of Shashi Kant
Sent: Saturday, May 24, 2014 5:55 AM
To: java-user@lucene.apache.org
Subject: Re: NewBie To Lucene || Perfect configuration on a 64 bit server

To second Vitaly's suggestion: you should consider using Apache Solr
instead - it handles such issues out of the box.


On Fri, May 23, 2014 at 7:52 PM, Vitaly Funstein <vf...@gmail.com> wrote:
> At the risk of sounding overly critical here, I would say you need to scrap
> your entire approach of building one small index per request, and just
> build your entire searchable data store in Lucene/Solr. This is the
> simplest and probably most maintainable and scalable solution. Even if your
> index contains 10M+ documents, returning at most 500 search results should
> be lightning fast compared to the latencies you're seeing right now. To
> facilitate data export from the DB, take a look at this:
> http://wiki.apache.org/solr/DataImportHandler
>
>
> On Tue, May 20, 2014 at 7:36 AM, Shruthi <ss...@imedx.com> wrote:
>
>>
>>
>>
>>
>> -----Original Message-----
>> From: Toke Eskildsen [mailto:te@statsbiblioteket.dk]
>> Sent: Tuesday, May 20, 2014 3:48 PM
>> To: java-user@lucene.apache.org
>> Subject: Re: NewBie To Lucene || Perfect configuration on a 64 bit server
>>
>> On Tue, 2014-05-20 at 11:56 +0200, Shruthi wrote:
>>
>> Toke:
>> > Is 20 seconds an acceptable response time for your users?
>> >
>> > Shruthi: It's definitely not acceptable. PFA the piece of code that we
>> > are using. It's taking 20 seconds. That's why I drafted this ticket to
>> > see where I was going wrong.
>>
>> Indexing 1000 documents/sec in Lucene is quite common, so even taking
>> into account large documents, 20 seconds sounds like quite a bit.
>> Shruthi: I had attached the code snippet in the previous mail. Do you
>> suspect foul play there?
>>
>> > Shruthi: Well, it's a two-stage process: the client is looking at
>> > historical data based on parameters like names, dates, MRN, fields,
>> > etc. So the query actually gets the data set fulfilling the
>> > requirements.
>> >
>> > If the client is interested in doing a text search then he would pass
>> > the search phrase on the result set.
>>
>> So it is not possible for a client to perform a broad phrase search to
>> start with. And it sounds like your DB-queries are all simple matching?
>> No complex joins and such? If so, this calls even more for a full
>> Lucene-index solution, which handles all aspects of the search process.
>> Shruthi: We call a DB stored procedure to get us the result set to work
>> with.
>> We will be using the Highlighter API, and I don't think MemoryIndex can
>> be used with the highlighter.
>>
>> - Toke Eskildsen, State and University Library, Denmark
>>
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>



-- 
skant@alum.mit.edu
(617) 595-5946

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: NewBie To Lucene || Perfect configuration on a 64 bit server

Posted by Shashi Kant <sk...@sloan.mit.edu>.
To second Vitaly's suggestion: you should consider using Apache Solr
instead - it handles such issues out of the box.


On Fri, May 23, 2014 at 7:52 PM, Vitaly Funstein <vf...@gmail.com> wrote:
> At the risk of sounding overly critical here, I would say you need to scrap
> your entire approach of building one small index per request, and just
> build your entire searchable data store in Lucene/Solr. This is the
> simplest and probably most maintainable and scalable solution. Even if your
> index contains 10M+ documents, returning at most 500 search results should
> be lightning fast compared to the latencies you're seeing right now. To
> facilitate data export from the DB, take a look at this:
> http://wiki.apache.org/solr/DataImportHandler
>
>
> On Tue, May 20, 2014 at 7:36 AM, Shruthi <ss...@imedx.com> wrote:
>
>>
>>
>>
>>
>> -----Original Message-----
>> From: Toke Eskildsen [mailto:te@statsbiblioteket.dk]
>> Sent: Tuesday, May 20, 2014 3:48 PM
>> To: java-user@lucene.apache.org
>> Subject: Re: NewBie To Lucene || Perfect configuration on a 64 bit server
>>
>> On Tue, 2014-05-20 at 11:56 +0200, Shruthi wrote:
>>
>> Toke:
>> > Is 20 seconds an acceptable response time for your users?
>> >
>> > Shruthi: It's definitely not acceptable. PFA the piece of code that we
>> > are using. It's taking 20 seconds. That's why I drafted this ticket to
>> > see where I was going wrong.
>>
>> Indexing 1000 documents/sec in Lucene is quite common, so even taking
>> into account large documents, 20 seconds sounds like quite a bit.
>> Shruthi: I had attached the code snippet in the previous mail. Do you
>> suspect foul play there?
>>
>> > Shruthi: Well, it's a two-stage process: the client is looking at
>> > historical data based on parameters like names, dates, MRN, fields,
>> > etc. So the query actually gets the data set fulfilling the
>> > requirements.
>> >
>> > If the client is interested in doing a text search then he would pass
>> > the search phrase on the result set.
>>
>> So it is not possible for a client to perform a broad phrase search to
>> start with. And it sounds like your DB-queries are all simple matching?
>> No complex joins and such? If so, this calls even more for a full
>> Lucene-index solution, which handles all aspects of the search process.
>> Shruthi: We call a DB stored procedure to get us the result set to work
>> with.
>> We will be using the Highlighter API, and I don't think MemoryIndex can
>> be used with the highlighter.
>>
>> - Toke Eskildsen, State and University Library, Denmark
>>
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>



-- 
skant@alum.mit.edu
(617) 595-5946

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: NewBie To Lucene || Perfect configuration on a 64 bit server

Posted by Vitaly Funstein <vf...@gmail.com>.
At the risk of sounding overly critical here, I would say you need to scrap
your entire approach of building one small index per request, and just
build your entire searchable data store in Lucene/Solr. This is the
simplest and probably most maintainable and scalable solution. Even if your
index contains 10M+ documents, returning at most 500 search results should
be lightning fast compared to the latencies you're seeing right now. To
facilitate data export from the DB, take a look at this:
http://wiki.apache.org/solr/DataImportHandler
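
If you go that route, querying the big index from Java is a few lines
with SolrJ. A sketch, assuming the stock single-core setup at the
default local URL and a "content" field - adjust to your schema:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class SolrQuerySketch {
    public static void main(String[] args) throws SolrServerException {
        HttpSolrServer solr =
            new HttpSolrServer("http://localhost:8983/solr/collection1");
        SolrQuery q = new SolrQuery("content:\"search phrase\"");
        q.setRows(500);  // cap the result set at 500 documents
        QueryResponse rsp = solr.query(q);
        System.out.println("hits: " + rsp.getResults().getNumFound());
    }
}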


On Tue, May 20, 2014 at 7:36 AM, Shruthi <ss...@imedx.com> wrote:

>
>
>
>
> -----Original Message-----
> From: Toke Eskildsen [mailto:te@statsbiblioteket.dk]
> Sent: Tuesday, May 20, 2014 3:48 PM
> To: java-user@lucene.apache.org
> Subject: Re: NewBie To Lucene || Perfect configuration on a 64 bit server
>
> On Tue, 2014-05-20 at 11:56 +0200, Shruthi wrote:
>
> Toke:
> > Is 20 seconds an acceptable response time for your users?
> >
> > Shruthi: It's definitely not acceptable. PFA the piece of code that we
> > are using. It's taking 20 seconds. That's why I drafted this ticket to
> > see where I was going wrong.
>
> Indexing 1000 documents/sec in Lucene is quite common, so even taking
> into account large documents, 20 seconds sounds like quite a bit.
> Shruthi: I had attached the code snippet in the previous mail. Do you
> suspect foul play there?
>
> > Shruthi: Well, it's a two-stage process: the client is looking at
> > historical data based on parameters like names, dates, MRN, fields,
> > etc. So the query actually gets the data set fulfilling the
> > requirements.
> >
> > If the client is interested in doing a text search then he would pass
> > the search phrase on the result set.
>
> So it is not possible for a client to perform a broad phrase search to
> start with. And it sounds like your DB-queries are all simple matching?
> No complex joins and such? If so, this calls even more for a full
> Lucene-index solution, which handles all aspects of the search process.
> Shruthi: We call a DB stored procedure to get us the result set to work
> with.
> We will be using the Highlighter API, and I don't think MemoryIndex can
> be used with the highlighter.
>
> - Toke Eskildsen, State and University Library, Denmark
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

RE: NewBie To Lucene || Perfect configuration on a 64 bit server

Posted by Shruthi <ss...@imedx.com>.



-----Original Message-----
From: Toke Eskildsen [mailto:te@statsbiblioteket.dk] 
Sent: Tuesday, May 20, 2014 3:48 PM
To: java-user@lucene.apache.org
Subject: Re: NewBie To Lucene || Perfect configuration on a 64 bit server

On Tue, 2014-05-20 at 11:56 +0200, Shruthi wrote:

Toke:
> Is 20 seconds an acceptable response time for your users?
>
> Shruthi: It's definitely not acceptable. PFA the piece of code that we
> are using. It's taking 20 seconds. That's why I drafted this ticket to
> see where I was going wrong.

Indexing 1000 documents/sec in Lucene is quite common, so even taking
into account large documents, 20 seconds sounds like quite a bit.
Shruthi: I had attached the code snippet in the previous mail. Do you
suspect foul play there?

> Shruthi: Well, it's a two-stage process: the client is looking at
> historical data based on parameters like names, dates, MRN, fields,
> etc. So the query actually gets the data set fulfilling the
> requirements.
>
> If the client is interested in doing a text search then he would pass
> the search phrase on the result set.

So it is not possible for a client to perform a broad phrase search to
start with. And it sounds like your DB-queries are all simple matching?
No complex joins and such? If so, this calls even more for a full
Lucene-index solution, which handles all aspects of the search process.
Shruthi: We call a DB stored procedure to get us the result set to work
with.
We will be using the Highlighter API, and I don't think MemoryIndex can
be used with the highlighter.

- Toke Eskildsen, State and University Library, Denmark




---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: NewBie To Lucene || Perfect configuration on a 64 bit server

Posted by Toke Eskildsen <te...@statsbiblioteket.dk>.
On Tue, 2014-05-20 at 11:56 +0200, Shruthi wrote:

Toke:
> Is 20 seconds an acceptable response time for your users?
>
> Shruthi: It's definitely not acceptable. PFA the piece of code that we
> are using. It's taking 20 seconds. That's why I drafted this ticket to
> see where I was going wrong.

Indexing 1000 documents/sec in Lucene is quite common, so even taking
into account large documents, 20 seconds sounds like quite a bit.

> Shruthi: Well, it's a two-stage process: the client is looking at
> historical data based on parameters like names, dates, MRN, fields,
> etc. So the query actually gets the data set fulfilling the
> requirements.
>
> If the client is interested in doing a text search then he would pass
> the search phrase on the result set.

So it is not possible for a client to perform a broad phrase search to
start with. And it sounds like your DB-queries are all simple matching?
No complex joins and such? If so, this calls even more for a full
Lucene-index solution, which handles all aspects of the search process.

- Toke Eskildsen, State and University Library, Denmark




---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


RE: NewBie To Lucene || Perfect configuration on a 64 bit server

Posted by Shruthi <ss...@imedx.com>.
-----Original Message-----
From: Toke Eskildsen [mailto:te@statsbiblioteket.dk]
Sent: Tuesday, May 20, 2014 3:01 PM
To: java-user@lucene.apache.org
Subject: Re: NewBie To Lucene || Perfect configuration on a 64 bit server

On Tue, 2014-05-20 at 10:40 +0200, Shruthi wrote:

> Just the indexing took 20 seconds ☹

That's more than I expected, but it leaves the same question:
Is 20 seconds an acceptable response time for your users?

Shruthi: It's definitely not acceptable. PFA the piece of code that we are using. It's taking 20 seconds. That's why I drafted this ticket to see where I was going wrong.

I don't know your document size, but unless they are very large, the
response times from a full 10M document index will be way better than 20
seconds. Even on a low-RAM machine with spinning drives.

> We are yet to try on a 64 bit server to check if that would change
> drastically.

I doubt it will.

Toke:
> RAMDirectory seems a better choice.
>
> Shruthi: But RAMDirectory has bad concurrency in multithreaded
> environments.

I assumed you would be creating a dedicated index for each request,
thereby effectively having single-threaded usage for each separate
index.

Shruthi: Yes, we are creating a dedicated index for each request. OK, so RAMDirectory holds good for our use case then. By the way, we will also be using the Highlighter API; we just found out that using that API increased the index size by 4 times.
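
As an aside: if the 4x index growth came from storing term vectors, note
that the plain Highlighter re-analyzes the original text and needs no
term vectors at all. A minimal Lucene 4.x sketch - the "content" field
name is just an example:

import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.InvalidTokenOffsetsException;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;

public class HighlightSketch {
    // Returns the best fragment of originalText for the given query,
    // re-analyzing the text on the fly (no term vectors required).
    static String bestFragment(Query query, Analyzer analyzer,
                               String originalText)
            throws IOException, InvalidTokenOffsetsException {
        Highlighter highlighter = new Highlighter(
            new SimpleHTMLFormatter(), new QueryScorer(query));
        return highlighter.getBestFragment(analyzer, "content", originalText);
    }
}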



I just remembered that Lucene has an implementation dedicated to fast
indexing. Take a look at
http://lucene.apache.org/core/4_8_0/memory/org/apache/lucene/index/memory/MemoryIndex.html
It seems like just the thing for your use case.

Shruthi: Thank you, I will definitely try this.

> Shruthi: The same user from the same client will not be searching for
> the same phrase again unless he has amnesia. This was already discussed
> with our architects.

If your architects base their decisions on observed user behaviour, then
fine. At our library, many users refine their queries, meaning that a
common pattern is 2-4 queries that are very much alike.

Shruthi: I will put forward this approach. We search medical transcripts and most of the time users search for drug names. I'm not sure if we can generalize this query.

> Shruthi: Actually we have a DB query that runs prior to indexing
> which fetches max. 500 docs from 10 million+ docs on the NAS share. We
> then have to apply the search phrase only on the resultant set. So this
> way the set is just limited to 500-1000.

Frankly, the combination of a pre-selection with a DB query and the
add-on of a heavy index + search with Lucene seems like the absolute
worst of both worlds.

Does the DB-selector do anything that cannot easily be replicated in
Lucene?

Shruthi: Well, it's a two-stage process: the client is looking at historical data based on parameters like names, dates, MRN, fields, etc. So the query actually gets the data set fulfilling the requirements.

If the client is interested in doing a text search then he would pass the search phrase on the result set.

- Toke Eskildsen, State and University Library, Denmark

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org



Re: NewBie To Lucene || Perfect configuration on a 64 bit server

Posted by Toke Eskildsen <te...@statsbiblioteket.dk>.
On Tue, 2014-05-20 at 10:40 +0200, Shruthi wrote:
> Just the indexing took 20 seconds ☹

That's more than I expected, but it leaves the same question:
Is 20 seconds an acceptable response time for your users?

I don't know your document size, but unless they are very large, the
response times from a full 10M document index will be way better than 20
seconds. Even on a low-RAM machine with spinning drives.

> We are yet to try on 64 bit server to check if that would change
> drastically.

I doubt it will.

Toke:
> RAMDirectory seems a better choice.
> 
> Shruthi: But RAMDirectory has bad concurrency in multithreaded
> environments.

I assumed you would be creating a dedicated index for each request,
thereby effectively having single-threaded usage for each separate
index.

I just remembered that Lucene has an implementation dedicated to fast
indexing. Take a look at
http://lucene.apache.org/core/4_8_0/memory/org/apache/lucene/index/memory/MemoryIndex.html
It seems like just the thing for your use case.
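
For reference, a minimal sketch of that pattern - a MemoryIndex holds
exactly one document, so you would loop over the ~500 texts; the
analyzer and field name are examples only:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.memory.MemoryIndex;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.Version;

public class MemoryIndexSketch {
    // True if this document's text matches the already-parsed query.
    static boolean matches(String text, Query query) {
        MemoryIndex index = new MemoryIndex();
        index.addField("content", text,
                new StandardAnalyzer(Version.LUCENE_47));
        return index.search(query) > 0.0f;  // score > 0 means a hit
    }
}

Nothing is written anywhere; the whole index lives and dies inside the
method call.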

> Shruthi: The same user from the same client will not be searching for
> the same phrase again unless he has amnesia. This was already discussed
> with our architects.

If your architects base their decisions on observed user behaviour, then
fine. At our library, many users refine their queries, meaning that a
common pattern is 2-4 queries that are very much alike.

> Shruthi: Actually we have a DB query that runs prior to indexing
> which fetches max. 500 docs from 10 million+ docs on the NAS share. We
> then have to apply the search phrase only on the resultant set. So this
> way the set is just limited to 500-1000.

Frankly, the combination of a pre-selection with a DB query and the
add-on of a heavy index + search with Lucene seems like the absolute
worst of both worlds.

Does the DB-selector do anything that cannot easily be replicated in
Lucene?

- Toke Eskildsen, State and University Library, Denmark



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


RE: NewBie To Lucene || Perfect configuration on a 64 bit server

Posted by Shruthi <ss...@imedx.com>.

-----Original Message-----
From: Toke Eskildsen [mailto:te@statsbiblioteket.dk]
Sent: Tuesday, May 20, 2014 12:57 PM
To: java-user@lucene.apache.org
Subject: Re: NewBie To Lucene || Perfect configuration on a 64 bit server

On Mon, 2014-05-19 at 12:40 +0200, Shruthi wrote:

> 1.       Client makes a request with a search phrase. Lucene
> application indexes a list of 500 documents (at max.) and searches the
> phrase on the index constructed.

Fetching from NAS + indexing sounds like something that would take a
second or two. Have you tried this?

Shruthi: We haven't yet tried from NAS, but we kept local storage of 500 documents (all are RTFs, so we used Aspose to convert them to text before indexing), and on a 4 GB machine with a RAMDirectory implementation,
just the indexing took 20 seconds ☹

We are yet to try on a 64 bit server to check if that would change drastically.

> We have decided to use MMapDirectory for above requirement.

As your index data are extremely transient and the datasets small,
RAMDirectory seems a better choice.

Shruthi: But RAMDirectory has bad concurrency in multithreaded environments.

You state that you delete the index when the search has finished.
Wouldn't it be better to keep it a couple of minutes? That way further
searches from the same client would be fast.

Shruthi: The same user from the same client will not be searching for the same phrase again unless he has amnesia. This was already discussed with our architects. I did not have any selling point on Lucene in this aspect.

Overall, I worry about your architecture. It scales badly with the
number of documents/client. You might not have any clients with more
than 500 documents right now, but can you be sure that this will not
change?

Shruthi: Actually we have a DB query that runs prior to indexing which fetches max. 500 docs from 10 million+ docs on the NAS share. We then have to apply the search phrase only on the resultant set. So this way
the set is just limited to 500-1000.

Thanks a lot for taking an interest. Wish to hear more from you.

- Toke Eskildsen, State and University Library, Denmark

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org



Re: NewBie To Lucene || Perfect configuration on a 64 bit server

Posted by Toke Eskildsen <te...@statsbiblioteket.dk>.
On Mon, 2014-05-19 at 12:40 +0200, Shruthi wrote:
> 1.       Client makes a request with a search phrase. Lucene
> application indexes a list of 500 documents (at max.) and searches the
> phrase on the index constructed.

Fetching from NAS + indexing sounds like something that would take a
second or two. Have you tried this?

> We have decided to use MMapDirectory for above requirement.

As your index data are extremely transient and the datasets small,
RAMDirectory seems a better choice.
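
Something like this per request, as a sketch - the analyzer and field
name are examples, not recommendations:

import java.io.IOException;
import java.util.List;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class PerRequestIndex {
    // Builds a throwaway in-memory index over the <= 500 extracted texts.
    static IndexSearcher build(List<String> texts) throws IOException {
        RAMDirectory dir = new RAMDirectory();  // lives only for this request
        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(
                Version.LUCENE_47, new StandardAnalyzer(Version.LUCENE_47)));
        for (String text : texts) {
            Document doc = new Document();
            doc.add(new TextField("content", text, Field.Store.NO));
            writer.addDocument(doc);
        }
        writer.close();
        // Run the phrase query against this searcher, then simply drop
        // dir afterwards; nothing ever touches disk.
        return new IndexSearcher(DirectoryReader.open(dir));
    }
}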

You state that you delete the index when the search has finished.
Wouldn't it be better to keep it a couple of minutes? That way further
searches from the same client would be fast.

Overall, I worry about your architecture. It scales badly with the
number of documents/client. You might not have any clients with more
than 500 documents right now, but can you be sure that this will not
change?

- Toke Eskildsen, State and University Library, Denmark



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org