Posted to java-user@lucene.apache.org by apoorv gupta <ap...@gmail.com> on 2016/06/09 07:11:47 UTC

Optimizing Lucene search for whitespace analyzer.

Hi

I am using a Lucene-based index to solve the following problem:

1. I have a document with the following structure:

    docName: <something>
    includeKeywords: Space separated set of keywords.
    excludeKeywords: Space separated set of keywords.
    result: to be returned as response
    ...
    ..

2. I will receive a set of keywords in a request, and I have to find
    all the documents for which every include keyword (space separated in
    the document) is in the request and none of the exclude keywords
    (space separated in the document) is in the request.

Example:
Doc: docName:"xyz"
        includeKeywords:["ABC" , "AB" , "XYZ" , "Z"]
        excludeKeywords:["KL"]

Requests that will match the doc:
    1. [ "ABC", "AB", "XYZ", "Z", "OP", "QR" ]
    2. [ "ABC", "AB", "XYZ", "Z" ]

Requests that will not match the doc:
    1. [ "ABC", "AB", "XYZ" ]
    2. [ "ABC", "AB", "XYZ", "Z", "KL" ]
    3. [ "ABC", "AB", "XYZ", "Z", "OP", "QR", "KL" ]

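To make the matching rule concrete, here is a minimal plain-Java sketch of
it, independent of Lucene. The class and method names (KeywordMatch,
matches, tokens) are made up for illustration and are not part of any
library or of my actual code.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

public class KeywordMatch {

    // Hypothetical helper: splits a space-separated keyword field into a set.
    public static Set<String> tokens(String spaceSeparated) {
        String trimmed = spaceSeparated.trim();
        if (trimmed.isEmpty()) {
            return new HashSet<>();
        }
        return new HashSet<>(Arrays.asList(trimmed.split("\\s+")));
    }

    // A document matches when every include keyword is in the request
    // and no exclude keyword is in the request.
    public static boolean matches(Set<String> include, Set<String> exclude,
                                  Set<String> request) {
        return request.containsAll(include)
                && Collections.disjoint(exclude, request);
    }

    public static void main(String[] args) {
        Set<String> include = tokens("ABC AB XYZ Z");
        Set<String> exclude = tokens("KL");
        String[] requests = {
            "ABC AB XYZ Z OP QR",
            "ABC AB XYZ Z",
            "ABC AB XYZ",
            "ABC AB XYZ Z KL",
            "ABC AB XYZ Z OP QR KL"
        };
        for (String r : requests) {
            System.out.println("[" + r + "] -> "
                    + matches(include, exclude, tokens(r)));
        }
    }
}
```

The first two requests satisfy the rule for the example doc and the last
three do not, which is the behaviour I need from the index.
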
I have used the whitespace analyzer to create the index, as it properly
tokenizes the space-separated keywords in the include and exclude keyword
lists.

Also, for search I am using a BooleanQuery that combines a MatchAllDocsQuery
with other boolean queries carrying MUST_NOT clauses.
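The shape of such a query can be imitated over a toy in-memory inverted
index. This is only a sketch of one possible reading, not the actual query:
it assumes one "must not" step per include keyword that is absent from the
request, and one per request keyword against the exclude field. It uses no
Lucene API, and the name MustNotSketch is made up.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class MustNotSketch {
    // term -> ids of the documents that list the term in that field
    private final Map<String, Set<Integer>> includePostings = new HashMap<>();
    private final Map<String, Set<Integer>> excludePostings = new HashMap<>();
    private int docCount = 0;

    // Indexes one document and returns its id (0, 1, 2, ...).
    public int addDoc(String includeKeywords, String excludeKeywords) {
        int id = docCount++;
        index(includePostings, includeKeywords, id);
        index(excludePostings, excludeKeywords, id);
        return id;
    }

    // Whitespace tokenization, as a whitespace analyzer would do.
    private static void index(Map<String, Set<Integer>> postings,
                              String field, int id) {
        for (String term : field.trim().split("\\s+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, k -> new HashSet<>()).add(id);
            }
        }
    }

    public Set<Integer> search(Set<String> request) {
        // "Match all docs": start from every document id.
        Set<Integer> hits = new TreeSet<>();
        for (int id = 0; id < docCount; id++) {
            hits.add(id);
        }
        // Must not: docs that require an include keyword the request lacks.
        for (Map.Entry<String, Set<Integer>> e : includePostings.entrySet()) {
            if (!request.contains(e.getKey())) {
                hits.removeAll(e.getValue());
            }
        }
        // Must not: docs whose exclude list contains a request keyword.
        for (String term : request) {
            Set<Integer> excluded = excludePostings.get(term);
            if (excluded != null) {
                hits.removeAll(excluded);
            }
        }
        return hits;
    }
}
```

Under this reading, the first loop touches every include keyword in the
index that is missing from the request, so its cost grows with the size of
the include vocabulary rather than with the size of the request.
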

Now, with 120000 such documents indexed in Lucene, I am getting a response
time of 7 ms on a machine with the following configuration:

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
CPU(s):                8
On-line CPU(s) list:   0-7
Thread(s) per core:    1
Core(s) per socket:    1
Socket(s):             8
CPU MHz:               2294.686
BogoMIPS:              4589.37
Virtualization type:   full
L2 cache:              256K
L3 cache:              25600K

I need help optimizing this, as I feel that 7 ms is too much for such a
simple use case.

Is there a way to optimize Lucene search for this case and bring my
response time down?
Please help me here.

Also, please let me know if anything about the problem is unclear. I will
explain it again with more examples.
Thanks in advance.

Regards
-Apurv