You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by "Dawid Weiss (JIRA)" <ji...@apache.org> on 2017/01/17 13:11:26 UTC

[jira] [Commented] (SOLR-9974) Use Suffix Arrays for fast search with leading asterisks

    [ https://issues.apache.org/jira/browse/SOLR-9974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15826021#comment-15826021 ] 

Dawid Weiss commented on SOLR-9974:
-----------------------------------

Wouldn't it be better to create an fst for all reversed forms of all terms and lookup the leading wildcard's matches by its concrete suffix instead (you'd need to reverse it too)? This wouldn't work for (single) infix wildcards, but these could then be implemented as an intersection between the two (terms matching the leading/trailing set).

Also, there are better (as in: linear) ways of constructing suffix arrays compared to sorting. We used to implement a few of them, if you wanted to reuse the code (it should be optimized for strings/ byte sequences though) [1].

[1] https://github.com/carrotsearch/jsuffixarrays

> Use Suffix Arrays for fast search with leading asterisks
> --------------------------------------------------------
>
>                 Key: SOLR-9974
>                 URL: https://issues.apache.org/jira/browse/SOLR-9974
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Yakov Sirotkin
>         Attachments: suffix-array.patch
>
>
> If query term starts with asterisks FST checks all words in the dictionary so request processing speed falls down. This problem can be solved with Suffix Array approach. Luckily, Suffix Array can be constructed after Lucene start from existing index. Unfortunately, Suffix Arrays requires a lot of RAM so we can use it only when special flag is set:
> -Dsolr.suffixArray.enable=true
> It is possible to  speed up Suffix Array initialization using several threads, so we can control number of threads with 
> -Dsolr.suffixArray.initialization_treads_count=5
> This system property can be omitted, the default value is 5.  
> Attached patch is the suggested implementation for SuffixArray support, it works for all terms starting with asterisks with at least 3 consequent non-wildcard characters. This patch do not change search results and  affects only performance issues.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org