Posted to solr-user@lucene.apache.org by Boon Low <bo...@dctfh.com> on 2014/12/30 18:24:18 UTC

Re: Suggester: weight (term frequency) and 'mm' feasibility (allTermsRequired)

Hi,

Re. AND/OR boolean lookup for ‘infix’ suggestions: I checked, and Lucene does have underlying support for this via the “allTermsRequired” boolean. However, this feature, along with highlighting (on/off), is currently hardwired in Lucene and hidden in Solr.

This issue has previously been reported:
https://issues.apache.org/jira/browse/SOLR-6648

I have created a patch for both of the infix suggesters, so that suggestion highlighting and lookup behaviour can be configured in solrconfig.xml.

<lst name="suggester">
<str name="name">..</str>
<str name="lookupImpl">BlendedInfixLookupFactory</str>
<str name="dictionaryImpl">DocumentDictionaryFactory</str>
<str name="field">..</str>
..
<str name="allTermsRequired">false</str>
<str name="highlighting">true</str>
</lst>
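
To make the two switches concrete, here is a minimal sketch of how they change lookup results. It is a toy model in plain Python, not the Lucene or Solr code: the function names are hypothetical, every query term is treated as a word prefix (a simplification), and the <b>..</b> marker style is an assumption.

```python
def toy_infix_lookup(query, entries, all_terms_required=True, highlight=True):
    """Toy model of an infix lookup; NOT the Lucene/Solr implementation."""
    terms = query.lower().split()
    results = []
    for entry in entries:
        words = entry.split()
        # A query term matches if it is a prefix of any word in the entry.
        matched = [t for t in terms
                   if any(w.lower().startswith(t) for w in words)]
        if all_terms_required:
            ok = len(matched) == len(terms)   # AND: every term must match
        else:
            ok = bool(matched)                # OR: one matching term is enough
        if not ok:
            continue
        if highlight:
            entry = " ".join(_mark(w, terms) for w in words)
        results.append(entry)
    return results


def _mark(word, terms):
    # Wrap the longest matching prefix in <b>..</b> (marker style is assumed).
    hits = [t for t in terms if word.lower().startswith(t)]
    if not hits:
        return word
    n = len(max(hits, key=len))
    return "<b>" + word[:n] + "</b>" + word[n:]
```

With all_terms_required=False, a query such as "Apache Sol" also returns entries that contain only one of the two terms; with highlight=False, entries come back unmarked.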

Boon

-----
Boon Low
Search Engineer / Lead Big Data
DCT Family History
http://uk.linkedin.com/in/boonlow/

On 10 Dec 2014, at 21:24, Boon Low <bo...@dctfh.com> wrote:

Hi,

Solr suggester is wonderful. We have been testing the built-in dictionary implementations on some large-ish datasets (36m, 132m), and getting response times from single-digit milliseconds to the teens with 9 dictionaries per request. Most of the resulting dictionaries have millions of entries too. Intrigued by the "finite-state machines" in the prefix/fuzzy suggesters too. Can’t wait to load test this properly.

Now I have some questions:

1. Term frequency, weight/count
The suggesters derive suggestions from a field in the index. What’s the feasibility of creating a custom dictionary that can automatically populate the weight/count field using term frequency (tf) during build time?

Autosuggest is, in most cases, ranked by popularity, e.g. “Apache Solr - 3” (say the term occurred 3 times in a field). Why must we specify a weight field explicitly for popularity ranking when tf data is readily available from the index?

In our tests, we had to index the datasets twice. In the second pass, tf is looked up per doc via Solr’s term component, and coded in a weight field (for the suggester). The lookup is also necessary for each dictionary field. We used 4 fields for 9 dictionaries. This really provides extra “incentive” to do things in parallel!
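
As an illustration of the idea in question 1, the weight could in principle be derived by counting occurrences at build time. The sketch below is a plain Python toy, not a Solr dictionary implementation: the function name is hypothetical, and it assumes each field value is treated as one whole suggestion (as with an untokenized string field).

```python
from collections import Counter


def tf_weighted_entries(field_values):
    """Return (suggestion, weight) pairs, where weight = occurrence count."""
    # Ignore empty values; count how often each distinct value occurs.
    counts = Counter(v.strip() for v in field_values if v.strip())
    # Most frequent first; ties are broken alphabetically for stable output.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
```

For the example above, three occurrences of "Apache Solr" in the field would yield the pair ("Apache Solr", 3) without a separately indexed weight field.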

2. "mm" lookup
The lookup logic for multiple terms is currently “AND boolean”, i.e. a suggestion must match all terms in suggest.q. However, in our use case, we need “OR boolean” for one dictionary. This is a bit like "suggest.q infix", e.g.:

suggest.q:
Apache Sol
suggestions:
Apache Solr
Solr
SolrCloud
We love Apache

I couldn’t get any of the existing lookup implementations to find the last three suggestions. Perhaps it’s time to dig into the code to see whether a “minimum should match” (mm) feature (50% in the above case) is a possibility?
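
A rough sketch of the behaviour being asked for, in plain Python: it is a toy, not an existing Solr or Lucene feature. The function name is hypothetical, every query term is treated as a word prefix, and rounding the threshold up is a choice made for this sketch only.

```python
import math


def mm_lookup(query, entries, mm=0.5):
    """Keep entries that match at least a fraction `mm` of the query terms."""
    terms = query.lower().split()
    if not terms:
        return []
    # Number of terms that must match; always at least one.
    required = max(1, math.ceil(mm * len(terms)))
    scored = []
    for entry in entries:
        words = [w.lower() for w in entry.split()]
        # Count query terms that are a prefix of some word in the entry.
        hits = sum(1 for t in terms if any(w.startswith(t) for w in words))
        if hits >= required:
            scored.append((hits, entry))
    # Entries matching more terms come first; the sort is stable otherwise.
    scored.sort(key=lambda pair: -pair[0])
    return [entry for _, entry in scored]
```

With suggest.q "Apache Sol" and mm=0.5, one matching term out of two is enough, so all four suggestions listed above are returned; with mm=1.0 the behaviour falls back to the current AND lookup and only "Apache Solr" remains.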

Thanks,

Boon