
Posted to solr-user@lucene.apache.org by Darko Todoric <to...@mdpi.com> on 2017/08/25 15:49:05 UTC

Search by similarity?

Hi,


I have 90,000,000 documents in Solr, and I need to compare the "title" of
a given document against the others and get all documents whose titles
are more than 80% similar. PHP has "similar_text", but loading 90M
documents into an array is not a smart approach...
Is there a query in Solr that will return the documents with more than 80% similarity?


Kind regards,
Darko Todoric

-- 
Darko Todoric
Web Engineer, MDPI DOO
Veljka Dugosevica 54, 11060 Belgrade, Serbia
+381 65 43 90 620
www.mdpi.com

Disclaimer: The information and files contained in this message are confidential
and intended solely for the use of the individual or entity to whom they are addressed.
If you have received this message in error, please notify me and delete this message from your system.
You may not copy this message in its entirety or in part, or disclose its contents to anyone.

Re: Search by similarity?

Posted by Josh Lincoln <jo...@gmail.com>.

I reviewed the dismax docs, and dismax doesn't support the fieldname:term
portion of the Lucene syntax.
To restrict a search to a field and still use mm, you can either
A) use edismax exactly as you're currently trying to use dismax, or
B) use dismax, with the following changes:
* remove the title: portion of the query and just pass
  q="title-123123123-end"
* set qf=title
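
A minimal sketch of the two options above as request URLs, in Python. The host, core name (SciLit), title value and mm=99% are taken from Darko's earlier query; the helper name build_url and the debug=query parameter are assumptions added only for illustration. urlencode takes care of escaping the quotes, the colon and the percent sign.

```python
from urllib.parse import urlencode

# Host and core are the ones used earlier in this thread.
BASE_URL = "http://localhost:8983/solr/SciLit/select"
TITLE = "title-123123123-end"


def build_url(params):
    """Hypothetical helper: append URL-encoded parameters to the select handler."""
    return BASE_URL + "?" + urlencode(params)


# Option A: edismax understands the fieldname:term syntax.
edismax_url = build_url({
    "defType": "edismax",
    "q": f'title:"{TITLE}"',
    "mm": "99%",
    "wt": "json",
    "debug": "query",
})

# Option B: dismax with no field prefix in q; the field goes into qf instead.
dismax_url = build_url({
    "defType": "dismax",
    "q": f'"{TITLE}"',
    "qf": "title",
    "mm": "99%",
    "wt": "json",
    "debug": "query",
})

print(edismax_url)
print(dismax_url)
```

With debug=query in place, the parsedquery in either response should show whether the search is really restricted to the title field.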

On Tue, Aug 29, 2017 at 10:25 AM Josh Lincoln <jo...@gmail.com>
wrote:

> Darko,
> Can you use edismax instead?
>
> When using dismax, Solr parses the field name in title:"..." as if it were
> a query term. E.g. the query seems to be interpreted as
> title "title-123123123-end"
> (note the lack of a colon), which results in querying all your qf fields
> for both "title" and "title-123123123-end".
> I haven't used dismax in a very long time, so I don't know if this is
> intentional, but it's not what I expected.
>
> I'm able to reproduce the issue in 6.4.2 using the default techproducts
> example. Notice that in the output below the parsedquery expands to both
> text:title and text:name (df=text):
> http://localhost:8983/solr/techproducts/select?indent=on&q=title:"name"&wt=json&debug=true&defType=dismax
> rawquerystring: "title:"name"",
> querystring: "title:"name"",
> parsedquery: "(+(DisjunctionMaxQuery(((text:title)^1.0)) DisjunctionMaxQuery(((text:name)^1.0))) ())/no_coord",
> parsedquery_toString: "+(((text:title)^1.0) ((text:name)^1.0)) ()"
>
> But it's not an issue if you use edismax:
> http://localhost:8983/solr/techproducts/select?indent=on&q=title:"name"&wt=json&debug=true&defType=edismax
> rawquerystring: "title:"name"",
> querystring: "title:"name"",
> parsedquery: "(+title:name)/no_coord",
> parsedquery_toString: "+title:name",
>

Re: Search by similarity?

Posted by Josh Lincoln <jo...@gmail.com>.

Darko,
Can you use edismax instead?

When using dismax, Solr parses the field name in title:"..." as if it were
a query term. E.g. the query seems to be interpreted as
title "title-123123123-end"
(note the lack of a colon), which results in querying all your qf fields
for both "title" and "title-123123123-end".
I haven't used dismax in a very long time, so I don't know if this is
intentional, but it's not what I expected.

I'm able to reproduce the issue in 6.4.2 using the default techproducts
example. Notice that in the output below the parsedquery expands to both
text:title and text:name (df=text):
http://localhost:8983/solr/techproducts/select?indent=on&q=title:"name"&wt=json&debug=true&defType=dismax
rawquerystring: "title:"name"",
querystring: "title:"name"",
parsedquery: "(+(DisjunctionMaxQuery(((text:title)^1.0)) DisjunctionMaxQuery(((text:name)^1.0))) ())/no_coord",
parsedquery_toString: "+(((text:title)^1.0) ((text:name)^1.0)) ()"

But it's not an issue if you use edismax:
http://localhost:8983/solr/techproducts/select?indent=on&q=title:"name"&wt=json&debug=true&defType=edismax
rawquerystring: "title:"name"",
querystring: "title:"name"",
parsedquery: "(+title:name)/no_coord",
parsedquery_toString: "+title:name",
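
The comparison above could be scripted roughly as follows. This is only a sketch: it assumes a local Solr with the techproducts example running at the URL used above, and the function names (debug_url, parsed_query, fetch_parsed_query) are made up for illustration. The "debug" and "parsedquery" keys are the ones visible in the JSON debug output quoted in this thread.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Same techproducts select handler as in the URLs above.
SELECT_URL = "http://localhost:8983/solr/techproducts/select"


def debug_url(def_type, query='title:"name"'):
    """Build a select URL that asks Solr for debug output with the given parser."""
    params = {"q": query, "defType": def_type, "wt": "json", "debug": "true"}
    return SELECT_URL + "?" + urlencode(params)


def parsed_query(response):
    """Pull the parsedquery string out of a decoded JSON response."""
    return response["debug"]["parsedquery"]


def fetch_parsed_query(def_type):
    """Run the query against a live Solr (network call) and return parsedquery."""
    with urlopen(debug_url(def_type)) as resp:
        return parsed_query(json.load(resp))
```

Calling fetch_parsed_query("dismax") and fetch_parsed_query("edismax") against a running instance would let you compare the two parsed queries side by side.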



On Tue, Aug 29, 2017 at 8:44 AM Darko Todoric <to...@mdpi.com> wrote:

> Hi Erick,
>
> "debug":{ "rawquerystring":"title:\"title-123123123-end\"",
> "querystring":"title:\"title-123123123-end\"",
> "parsedquery":"(+(DisjunctionMaxQuery(((author_full:title)^7.0 |
> (abstract:titl)^2.0 | (title:titl)^3.0 | (keywords:titl)^5.0 |
> (authors:title)^4.0 | (doi:title:)^1.0))
> DisjunctionMaxQuery(((author_full:\"title 123123123 end\"~1)^7.0 |
> (abstract:\"titl 123123123 end\"~1)^2.0 | (title:\"titl 123123123
> end\"~1)^3.0 | (keywords:\"titl 123123123 end\"~1)^5.0 |
> (authors:\"title 123123123 end\"~1)^4.0 |
> (doi:title-123123123-end)^1.0)))~1 ())/no_coord",
> "parsedquery_toString":"+((((author_full:title)^7.0 |
> (abstract:titl)^2.0 | (title:titl)^3.0 | (keywords:titl)^5.0 |
> (authors:title)^4.0 | (doi:title:)^1.0) ((author_full:\"title 123123123
> end\"~1)^7.0 | (abstract:\"titl 123123123 end\"~1)^2.0 | (title:\"titl
> 123123123 end\"~1)^3.0 | (keywords:\"titl 123123123 end\"~1)^5.0 |
> (authors:\"title 123123123 end\"~1)^4.0 |
> (doi:title-123123123-end)^1.0))~1) ()", "explain":{ "23251":"\n16.848969
> = sum of:\n 16.848969 = sum of:\n 16.848969 = max of:\n 16.848969 =
> weight(abstract:titl in 23194) [], result of:\n 16.848969 =
> score(doc=23194,freq=1.0 = termFreq=1.0\n), product of:\n 2.0 = boost\n
> 5.503748 = idf(docFreq=74, docCount=18297)\n 1.5306814 = tfNorm,
> computed from:\n 1.0 = termFreq=1.0\n 1.2 = parameter k1\n 0.75 =
> parameter b\n 186.49593 = avgFieldLength\n 28.444445 = fieldLength\n
> 3.816711E-5 = weight(title:titl in 23194) [], result of:\n 3.816711E-5 =
> score(doc=23194,freq=1.0 = termFreq=1.0\n), product of:\n 3.0 = boost\n
> 1.4457239E-5 = idf(docFreq=34584, docCount=34584)\n 0.88 = tfNorm,
> computed from:\n 1.0 = termFreq=1.0\n 1.2 = parameter k1\n 0.75 =
> parameter b\n 3.0 = avgFieldLength\n 4.0 = fieldLength\n",
> "20495":"\n16.169483 = sum of:\n 16.169483 = sum of:\n 16.169483 = max
> of:\n 16.169483 = weight(abstract:titl in 20489) [], result of:\n
> 16.169483 = score(doc=20489,freq=1.0 = termFreq=1.0\n), product of:\n
> 2.0 = boost\n 5.503748 = idf(docFreq=74, docCount=18297)\n 1.468952 =
> tfNorm, computed from:\n 1.0 = termFreq=1.0\n 1.2 = parameter k1\n 0.75
> = parameter b\n 186.49593 = avgFieldLength\n 40.96 = fieldLength\n
> 3.816711E-5 = weight(title:titl in 20489) [], result of:\n 3.816711E-5 =
> score(doc=20489,freq=1.0 = termFreq=1.0\n), product of:\n 3.0 = boost\n
> 1.4457239E-5 = idf(docFreq=34584, docCount=34584)\n 0.88 = tfNorm,
> computed from:\n 1.0 = termFreq=1.0\n 1.2 = parameter k1\n 0.75 =
> parameter b\n 3.0 = avgFieldLength\n 4.0 = fieldLength\n",
> "28227":"\n15.670726 = sum of:\n 15.670726 = sum of:\n 15.670726 = max
> of:\n 15.670726 = weight(abstract:titl in 28156) [], result of:\n
> 15.670726 = score(doc=28156,freq=2.0 = termFreq=2.0\n), product of:\n
> 2.0 = boost\n 5.503748 = idf(docFreq=74, docCount=18297)\n 1.4236413 =
> tfNorm, computed from:\n 2.0 = termFreq=2.0\n 1.2 = parameter k1\n 0.75
> = parameter b\n 186.49593 = avgFieldLength\n 163.84 = fieldLength\n
> 3.816711E-5 = weight(title:titl in 28156) [], result of:\n 3.816711E-5 =
> score(doc=28156,freq=1.0 = termFreq=1.0\n), product of:\n 3.0 = boost\n
> 1.4457239E-5 = idf(docFreq=34584, docCount=34584)\n 0.88 = tfNorm,
> computed from:\n 1.0 = termFreq=1.0\n 1.2 = parameter k1\n 0.75 =
> parameter b\n 3.0 = avgFieldLength\n 4.0 = fieldLength\n",
> "20375":"\n15.052014 = sum of:\n 15.052014 = sum of:\n 15.052014 = max
> of:\n 15.052014 = weight(abstract:titl in 20369) [], result of:\n
> 15.052014 = score(doc=20369,freq=1.0 = termFreq=1.0\n), product of:\n
> 2.0 = boost\n 5.503748 = idf(docFreq=74, docCount=18297)\n 1.3674331 =
> tfNorm, computed from:\n 1.0 = termFreq=1.0\n 1.2 = parameter k1\n 0.75
> = parameter b\n 186.49593 = avgFieldLength\n 64.0 = fieldLength\n
> 3.816711E-5 = weight(title:titl in 20369) [], result of:\n 3.816711E-5 =
> score(doc=20369,freq=1.0 = termFreq=1.0\n), product of:\n 3.0 = boost\n
> 1.4457239E-5 = idf(docFreq=34584, docCount=34584)\n 0.88 = tfNorm,
> computed from:\n 1.0 = termFreq=1.0\n 1.2 = parameter k1\n 0.75 =
> parameter b\n 3.0 = avgFieldLength\n 4.0 = fieldLength\n",
> "20381":"\n15.052014 = sum of:\n 15.052014 = sum of:\n 15.052014 = max
> of:\n 15.052014 = weight(abstract:titl in 20375) [], result of:\n
> 15.052014 = score(doc=20375,freq=1.0 = termFreq=1.0\n), product of:\n
> 2.0 = boost\n 5.503748 = idf(docFreq=74, docCount=18297)\n 1.3674331 =
> tfNorm, computed from:\n 1.0 = termFreq=1.0\n 1.2 = parameter k1\n 0.75
> = parameter b\n 186.49593 = avgFieldLength\n 64.0 = fieldLength\n
> 3.816711E-5 = weight(title:titl in 20375) [], result of:\n 3.816711E-5 =
> score(doc=20375,freq=1.0 = termFreq=1.0\n), product of:\n 3.0 = boost\n
> 1.4457239E-5 = idf(docFreq=34584, docCount=34584)\n 0.88 = tfNorm,
> computed from:\n 1.0 = termFreq=1.0\n 1.2 = parameter k1\n 0.75 =
> parameter b\n 3.0 = avgFieldLength\n 4.0 = fieldLength\n",
> "29030":"\n13.699375 = sum of:\n 13.699375 = sum of:\n 13.699375 = max
> of:\n 13.699375 = weight(abstract:titl in 28959) [], result of:\n
> 13.699375 = score(doc=28959,freq=2.0 = termFreq=2.0\n), product of:\n
> 2.0 = boost\n 5.503748 = idf(docFreq=74, docCount=18297)\n 1.2445496 =
> tfNorm, computed from:\n 2.0 = termFreq=2.0\n 1.2 = parameter k1\n 0.75
> = parameter b\n 186.49593 = avgFieldLength\n 256.0 = fieldLength\n
> 3.816711E-5 = weight(title:titl in 28959) [], result of:\n 3.816711E-5 =
> score(doc=28959,freq=1.0 = termFreq=1.0\n), product of:\n 3.0 = boost\n
> 1.4457239E-5 = idf(docFreq=34584, docCount=34584)\n 0.88 = tfNorm,
> computed from:\n 1.0 = termFreq=1.0\n 1.2 = parameter k1\n 0.75 =
> parameter b\n 3.0 = avgFieldLength\n 4.0 = fieldLength\n",
> "31444":"\n13.699375 = sum of:\n 13.699375 = sum of:\n 13.699375 = max
> of:\n 13.699375 = weight(abstract:titl in 31373) [], result of:\n
> 13.699375 = score(doc=31373,freq=2.0 = termFreq=2.0\n), product of:\n
> 2.0 = boost\n 5.503748 = idf(docFreq=74, docCount=18297)\n 1.2445496 =
> tfNorm, computed from:\n 2.0 = termFreq=2.0\n 1.2 = parameter k1\n 0.75
> = parameter b\n 186.49593 = avgFieldLength\n 256.0 = fieldLength\n
> 3.816711E-5 = weight(title:titl in 31373) [], result of:\n 3.816711E-5 =
> score(doc=31373,freq=1.0 = termFreq=1.0\n), product of:\n 3.0 = boost\n
> 1.4457239E-5 = idf(docFreq=34584, docCount=34584)\n 0.88 = tfNorm,
> computed from:\n 1.0 = termFreq=1.0\n 1.2 = parameter k1\n 0.75 =
> parameter b\n 3.0 = avgFieldLength\n 4.0 = fieldLength\n",
> "30621":"\n13.096554 = sum of:\n 13.096554 = sum of:\n 13.096554 = max
> of:\n 13.096554 = weight(abstract:titl in 30550) [], result of:\n
> 13.096554 = score(doc=30550,freq=1.0 = termFreq=1.0\n), product of:\n
> 2.0 = boost\n 5.503748 = idf(docFreq=74, docCount=18297)\n 1.189785 =
> tfNorm, computed from:\n 1.0 = termFreq=1.0\n 1.2 = parameter k1\n 0.75
> = parameter b\n 186.49593 = avgFieldLength\n 113.77778 = fieldLength\n
> 3.816711E-5 = weight(title:titl in 30550) [], result of:\n 3.816711E-5 =
> score(doc=30550,freq=1.0 = termFreq=1.0\n), product of:\n 3.0 = boost\n
> 1.4457239E-5 = idf(docFreq=34584, docCount=34584)\n 0.88 = tfNorm,
> computed from:\n 1.0 = termFreq=1.0\n 1.2 = parameter k1\n 0.75 =
> parameter b\n 3.0 = avgFieldLength\n 4.0 = fieldLength\n",
> "32067":"\n13.096554 = sum of:\n 13.096554 = sum of:\n 13.096554 = max
> of:\n 13.096554 = weight(abstract:titl in 31996) [], result of:\n
> 13.096554 = score(doc=31996,freq=1.0 = termFreq=1.0\n), product of:\n
> 2.0 = boost\n 5.503748 = idf(docFreq=74, docCount=18297)\n 1.189785 =
> tfNorm, computed from:\n 1.0 = termFreq=1.0\n 1.2 = parameter k1\n 0.75
> = parameter b\n 186.49593 = avgFieldLength\n 113.77778 = fieldLength\n
> 3.816711E-5 = weight(title:titl in 31996) [], result of:\n 3.816711E-5 =
> score(doc=31996,freq=1.0 = termFreq=1.0\n), product of:\n 3.0 = boost\n
> 1.4457239E-5 = idf(docFreq=34584, docCount=34584)\n 0.88 = tfNorm,
> computed from:\n 1.0 = termFreq=1.0\n 1.2 = parameter k1\n 0.75 =
> parameter b\n 3.0 = avgFieldLength\n 4.0 = fieldLength\n",
> "1935":"\n11.583146 = sum of:\n 11.583146 = sum of:\n 11.583146 = max
> of:\n 11.583146 = weight(abstract:titl in 1934) [], result of:\n
> 11.583146 = score(doc=1934,freq=1.0 = termFreq=1.0\n), product of:\n 2.0
> = boost\n 5.503748 = idf(docFreq=74, docCount=18297)\n 1.0522962 =
> tfNorm, computed from:\n 1.0 = termFreq=1.0\n 1.2 = parameter k1\n 0.75
> = parameter b\n 186.49593 = avgFieldLength\n 163.84 = fieldLength\n
> 3.816711E-5 = weight(title:titl in 1934) [], result of:\n 3.816711E-5 =
> score(doc=1934,freq=1.0 = termFreq=1.0\n), product of:\n 3.0 = boost\n
> 1.4457239E-5 = idf(docFreq=34584, docCount=34584)\n 0.88 = tfNorm,
> computed from:\n 1.0 = termFreq=1.0\n 1.2 = parameter k1\n 0.75 =
> parameter b\n 3.0 = avgFieldLength\n 4.0 = fieldLength\n"},
> "QParser":"DisMaxQParser", "altquerystring":null, "boostfuncs":null,
>
> Kind regards,
> Darko Todoric
>
> On 08/28/2017 06:35 PM, Erick Erickson wrote:
> > What are the results of adding &debug=query to the URL? The parsed
> > query will be especially illuminating.
> >
> > Best,
> > Erick
> >
> > On Mon, Aug 28, 2017 at 4:37 AM, Emir Arnautovic
> > <em...@sematext.com> wrote:
> >> Hi Darko,
> >>
> >> The issue is wrong expectations: title-1-end is parsed into 3 tokens
> >> (guessing), and mm=99% of 3 tokens is 2.97, which is rounded down to 2.
> >> Since all your documents have the 'title' and 'end' tokens, all match. If
> >> you want to round up, you can use mm=-1% - that will result in zero matches
> >> (or one match if you do not filter out the original document).
> >>
> >> You have to play with your tokenizers and define what a similarity match
> >> percentage means (if you want to stick with mm).
> >>
> >> Regards,
> >> Emir
> >>
> >>
> >>
> >> On 28.08.2017 09:17, Darko Todoric wrote:
> >>> Hm... I cannot make this DisMax approach work on my Solr...
> >>>
> >>> In Solr I have documents with these titles:
> >>>   - "title-1-end"
> >>>   - "title-2-end"
> >>>   - "title-3-end"
> >>>   - ...
> >>>   - ...
> >>>   - "title-312-end"
> >>>
> >>> and when I run the query
> >>> http://localhost:8983/solr/SciLit/select?defType=dismax&indent=on&mm=99%&q=title:"title-123123123-end"&wt=json
> >>> I get all documents from Solr :\
> >>> What am I doing wrong?
> >>>
> >>> Also, I don't know if it affects the results, but on the "title" field I use
> >>> "WhitespaceTokenizerFactory".
> >>>
> >>> Kind regards,
> >>> Darko
> >>>
> >>>
> >>> On 08/25/2017 06:38 PM, Junte Zhang wrote:
> >>>> If you already have the title of the document, then you could run that
> >>>> title as a new query against the whole index and exclude the source
> >>>> document from the results as a filter.
> >>>>
> >>>> You could use the DisMax query parser:
> >>>> https://cwiki.apache.org/confluence/display/solr/The+DisMax+Query+Parser
> >>>>
> >>>> And then set the minimum match ratio of the OR clauses to 90%.
> >>>>
> >>>> /JZ
> >> --
> >> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> >> Solr & Elasticsearch Support * http://sematext.com/
> >>
>
>
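
The mm rounding that Emir describes in the quoted reply above can be checked with a small sketch. It models only the single-percentage form of mm and assumes the computed count is truncated toward zero before use, which is how the "rounded down" behaviour is described; the function name min_should_match is hypothetical and this is not Solr's own code.

```python
def min_should_match(optional_clauses, percent):
    """Number of optional clauses that must match for a single-percentage mm.

    Positive percent: that share of the clauses, rounded down.
    Negative percent: that share of the clauses (rounded down) may be missing.
    """
    # Integer arithmetic avoids floating-point surprises; // on the absolute
    # value truncates toward zero.
    truncated = abs(optional_clauses * percent) // 100
    if percent < 0:
        return optional_clauses - truncated
    return truncated


# "title-1-end" split into 3 tokens, as guessed in the quoted reply:
print(min_should_match(3, 99))   # 2: 2.97 is rounded down, so 'title' + 'end' suffice
print(min_should_match(3, -1))   # 3: 0.03 rounds down to 0 clauses allowed missing
```

Under these assumptions, mm=99% on three clauses requires only two matches, while mm=-1% requires all three, which matches the behaviour reported in the thread.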

Re: Search by similarity?

Posted by Darko Todoric <to...@mdpi.com>.

Hi Erick,

"debug":{ "rawquerystring":"title:\"title-123123123-end\"", 
"querystring":"title:\"title-123123123-end\"", 
"parsedquery":"(+(DisjunctionMaxQuery(((author_full:title)^7.0 | 
(abstract:titl)^2.0 | (title:titl)^3.0 | (keywords:titl)^5.0 | 
(authors:title)^4.0 | (doi:title:)^1.0)) 
DisjunctionMaxQuery(((author_full:\"title 123123123 end\"~1)^7.0 | 
(abstract:\"titl 123123123 end\"~1)^2.0 | (title:\"titl 123123123 
end\"~1)^3.0 | (keywords:\"titl 123123123 end\"~1)^5.0 | 
(authors:\"title 123123123 end\"~1)^4.0 | 
(doi:title-123123123-end)^1.0)))~1 ())/no_coord", 
"parsedquery_toString":"+((((author_full:title)^7.0 | 
(abstract:titl)^2.0 | (title:titl)^3.0 | (keywords:titl)^5.0 | 
(authors:title)^4.0 | (doi:title:)^1.0) ((author_full:\"title 123123123 
end\"~1)^7.0 | (abstract:\"titl 123123123 end\"~1)^2.0 | (title:\"titl 
123123123 end\"~1)^3.0 | (keywords:\"titl 123123123 end\"~1)^5.0 | 
(authors:\"title 123123123 end\"~1)^4.0 | 
(doi:title-123123123-end)^1.0))~1) ()", "explain":{ "23251":"\n16.848969 
= sum of:\n 16.848969 = sum of:\n 16.848969 = max of:\n 16.848969 = 
weight(abstract:titl in 23194) [], result of:\n 16.848969 = 
score(doc=23194,freq=1.0 = termFreq=1.0\n), product of:\n 2.0 = boost\n 
5.503748 = idf(docFreq=74, docCount=18297)\n 1.5306814 = tfNorm, 
computed from:\n 1.0 = termFreq=1.0\n 1.2 = parameter k1\n 0.75 = 
parameter b\n 186.49593 = avgFieldLength\n 28.444445 = fieldLength\n 
3.816711E-5 = weight(title:titl in 23194) [], result of:\n 3.816711E-5 = 
score(doc=23194,freq=1.0 = termFreq=1.0\n), product of:\n 3.0 = boost\n 
1.4457239E-5 = idf(docFreq=34584, docCount=34584)\n 0.88 = tfNorm, 
computed from:\n 1.0 = termFreq=1.0\n 1.2 = parameter k1\n 0.75 = 
parameter b\n 3.0 = avgFieldLength\n 4.0 = fieldLength\n", 
"20495":"\n16.169483 = sum of:\n 16.169483 = sum of:\n 16.169483 = max 
of:\n 16.169483 = weight(abstract:titl in 20489) [], result of:\n 
16.169483 = score(doc=20489,freq=1.0 = termFreq=1.0\n), product of:\n 
2.0 = boost\n 5.503748 = idf(docFreq=74, docCount=18297)\n 1.468952 = 
tfNorm, computed from:\n 1.0 = termFreq=1.0\n 1.2 = parameter k1\n 0.75 
= parameter b\n 186.49593 = avgFieldLength\n 40.96 = fieldLength\n 
3.816711E-5 = weight(title:titl in 20489) [], result of:\n 3.816711E-5 = 
score(doc=20489,freq=1.0 = termFreq=1.0\n), product of:\n 3.0 = boost\n 
1.4457239E-5 = idf(docFreq=34584, docCount=34584)\n 0.88 = tfNorm, 
computed from:\n 1.0 = termFreq=1.0\n 1.2 = parameter k1\n 0.75 = 
parameter b\n 3.0 = avgFieldLength\n 4.0 = fieldLength\n", 
"28227":"\n15.670726 = sum of:\n 15.670726 = sum of:\n 15.670726 = max 
of:\n 15.670726 = weight(abstract:titl in 28156) [], result of:\n 
15.670726 = score(doc=28156,freq=2.0 = termFreq=2.0\n), product of:\n 
2.0 = boost\n 5.503748 = idf(docFreq=74, docCount=18297)\n 1.4236413 = 
tfNorm, computed from:\n 2.0 = termFreq=2.0\n 1.2 = parameter k1\n 0.75 
= parameter b\n 186.49593 = avgFieldLength\n 163.84 = fieldLength\n 
3.816711E-5 = weight(title:titl in 28156) [], result of:\n 3.816711E-5 = 
score(doc=28156,freq=1.0 = termFreq=1.0\n), product of:\n 3.0 = boost\n 
1.4457239E-5 = idf(docFreq=34584, docCount=34584)\n 0.88 = tfNorm, 
computed from:\n 1.0 = termFreq=1.0\n 1.2 = parameter k1\n 0.75 = 
parameter b\n 3.0 = avgFieldLength\n 4.0 = fieldLength\n", 
"20375":"\n15.052014 = sum of:\n 15.052014 = sum of:\n 15.052014 = max 
of:\n 15.052014 = weight(abstract:titl in 20369) [], result of:\n 
15.052014 = score(doc=20369,freq=1.0 = termFreq=1.0\n), product of:\n 
2.0 = boost\n 5.503748 = idf(docFreq=74, docCount=18297)\n 1.3674331 = 
tfNorm, computed from:\n 1.0 = termFreq=1.0\n 1.2 = parameter k1\n 0.75 
= parameter b\n 186.49593 = avgFieldLength\n 64.0 = fieldLength\n 
3.816711E-5 = weight(title:titl in 20369) [], result of:\n 3.816711E-5 = 
score(doc=20369,freq=1.0 = termFreq=1.0\n), product of:\n 3.0 = boost\n 
1.4457239E-5 = idf(docFreq=34584, docCount=34584)\n 0.88 = tfNorm, 
computed from:\n 1.0 = termFreq=1.0\n 1.2 = parameter k1\n 0.75 = 
parameter b\n 3.0 = avgFieldLength\n 4.0 = fieldLength\n", 
"20381":"\n15.052014 = sum of:\n 15.052014 = sum of:\n 15.052014 = max 
of:\n 15.052014 = weight(abstract:titl in 20375) [], result of:\n 
15.052014 = score(doc=20375,freq=1.0 = termFreq=1.0\n), product of:\n 
2.0 = boost\n 5.503748 = idf(docFreq=74, docCount=18297)\n 1.3674331 = 
tfNorm, computed from:\n 1.0 = termFreq=1.0\n 1.2 = parameter k1\n 0.75 
= parameter b\n 186.49593 = avgFieldLength\n 64.0 = fieldLength\n 
3.816711E-5 = weight(title:titl in 20375) [], result of:\n 3.816711E-5 = 
score(doc=20375,freq=1.0 = termFreq=1.0\n), product of:\n 3.0 = boost\n 
1.4457239E-5 = idf(docFreq=34584, docCount=34584)\n 0.88 = tfNorm, 
computed from:\n 1.0 = termFreq=1.0\n 1.2 = parameter k1\n 0.75 = 
parameter b\n 3.0 = avgFieldLength\n 4.0 = fieldLength\n", 
"29030":"\n13.699375 = sum of:\n 13.699375 = sum of:\n 13.699375 = max 
of:\n 13.699375 = weight(abstract:titl in 28959) [], result of:\n 
13.699375 = score(doc=28959,freq=2.0 = termFreq=2.0\n), product of:\n 
2.0 = boost\n 5.503748 = idf(docFreq=74, docCount=18297)\n 1.2445496 = 
tfNorm, computed from:\n 2.0 = termFreq=2.0\n 1.2 = parameter k1\n 0.75 
= parameter b\n 186.49593 = avgFieldLength\n 256.0 = fieldLength\n 
3.816711E-5 = weight(title:titl in 28959) [], result of:\n 3.816711E-5 = 
score(doc=28959,freq=1.0 = termFreq=1.0\n), product of:\n 3.0 = boost\n 
1.4457239E-5 = idf(docFreq=34584, docCount=34584)\n 0.88 = tfNorm, 
computed from:\n 1.0 = termFreq=1.0\n 1.2 = parameter k1\n 0.75 = 
parameter b\n 3.0 = avgFieldLength\n 4.0 = fieldLength\n", 
"31444":"\n13.699375 = sum of:\n 13.699375 = sum of:\n 13.699375 = max 
of:\n 13.699375 = weight(abstract:titl in 31373) [], result of:\n 
13.699375 = score(doc=31373,freq=2.0 = termFreq=2.0\n), product of:\n 
2.0 = boost\n 5.503748 = idf(docFreq=74, docCount=18297)\n 1.2445496 = 
tfNorm, computed from:\n 2.0 = termFreq=2.0\n 1.2 = parameter k1\n 0.75 
= parameter b\n 186.49593 = avgFieldLength\n 256.0 = fieldLength\n 
3.816711E-5 = weight(title:titl in 31373) [], result of:\n 3.816711E-5 = 
score(doc=31373,freq=1.0 = termFreq=1.0\n), product of:\n 3.0 = boost\n 
1.4457239E-5 = idf(docFreq=34584, docCount=34584)\n 0.88 = tfNorm, 
computed from:\n 1.0 = termFreq=1.0\n 1.2 = parameter k1\n 0.75 = 
parameter b\n 3.0 = avgFieldLength\n 4.0 = fieldLength\n", 
"30621":"\n13.096554 = sum of:\n 13.096554 = sum of:\n 13.096554 = max 
of:\n 13.096554 = weight(abstract:titl in 30550) [], result of:\n 
13.096554 = score(doc=30550,freq=1.0 = termFreq=1.0\n), product of:\n 
2.0 = boost\n 5.503748 = idf(docFreq=74, docCount=18297)\n 1.189785 = 
tfNorm, computed from:\n 1.0 = termFreq=1.0\n 1.2 = parameter k1\n 0.75 
= parameter b\n 186.49593 = avgFieldLength\n 113.77778 = fieldLength\n 
3.816711E-5 = weight(title:titl in 30550) [], result of:\n 3.816711E-5 = 
score(doc=30550,freq=1.0 = termFreq=1.0\n), product of:\n 3.0 = boost\n 
1.4457239E-5 = idf(docFreq=34584, docCount=34584)\n 0.88 = tfNorm, 
computed from:\n 1.0 = termFreq=1.0\n 1.2 = parameter k1\n 0.75 = 
parameter b\n 3.0 = avgFieldLength\n 4.0 = fieldLength\n", 
"32067":"\n13.096554 = sum of:\n 13.096554 = sum of:\n 13.096554 = max 
of:\n 13.096554 = weight(abstract:titl in 31996) [], result of:\n 
13.096554 = score(doc=31996,freq=1.0 = termFreq=1.0\n), product of:\n 
2.0 = boost\n 5.503748 = idf(docFreq=74, docCount=18297)\n 1.189785 = 
tfNorm, computed from:\n 1.0 = termFreq=1.0\n 1.2 = parameter k1\n 0.75 
= parameter b\n 186.49593 = avgFieldLength\n 113.77778 = fieldLength\n 
3.816711E-5 = weight(title:titl in 31996) [], result of:\n 3.816711E-5 = 
score(doc=31996,freq=1.0 = termFreq=1.0\n), product of:\n 3.0 = boost\n 
1.4457239E-5 = idf(docFreq=34584, docCount=34584)\n 0.88 = tfNorm, 
computed from:\n 1.0 = termFreq=1.0\n 1.2 = parameter k1\n 0.75 = 
parameter b\n 3.0 = avgFieldLength\n 4.0 = fieldLength\n", 
"1935":"\n11.583146 = sum of:\n 11.583146 = sum of:\n 11.583146 = max 
of:\n 11.583146 = weight(abstract:titl in 1934) [], result of:\n 
11.583146 = score(doc=1934,freq=1.0 = termFreq=1.0\n), product of:\n 2.0 
= boost\n 5.503748 = idf(docFreq=74, docCount=18297)\n 1.0522962 = 
tfNorm, computed from:\n 1.0 = termFreq=1.0\n 1.2 = parameter k1\n 0.75 
= parameter b\n 186.49593 = avgFieldLength\n 163.84 = fieldLength\n 
3.816711E-5 = weight(title:titl in 1934) [], result of:\n 3.816711E-5 = 
score(doc=1934,freq=1.0 = termFreq=1.0\n), product of:\n 3.0 = boost\n 
1.4457239E-5 = idf(docFreq=34584, docCount=34584)\n 0.88 = tfNorm, 
computed from:\n 1.0 = termFreq=1.0\n 1.2 = parameter k1\n 0.75 = 
parameter b\n 3.0 = avgFieldLength\n 4.0 = fieldLength\n"}, 
"QParser":"DisMaxQParser", "altquerystring":null, "boostfuncs":null,

Kind regards,
Darko Todoric

On 08/28/2017 06:35 PM, Erick Erickson wrote:
> What are the results of adding &debug=query to the URL? The parsed
> query will be especially illuminating.
>
> Best,
> Erick

Re: Search by similarity?

Posted by Erick Erickson <er...@gmail.com>.

What are the results of adding &debug=query to the URL? The parsed
query will be especially illuminating.

Best,
Erick


Re: Search by similarity?

Posted by Emir Arnautovic <em...@sematext.com>.

Hi Darko,

The problem is a mismatch in expectations: title-1-end is parsed into 3 tokens 
(I am guessing), and mm=99% of 3 tokens is 2.97, which is rounded down to 2. 
Since all your documents contain the 'title' and 'end' tokens, they all match. 
If you want to round up instead, use mm=-1%: at most 1% of the clauses may 
then be missing, which rounds down to none, so every token must match. That 
will return zero results (or one, if you do not filter out the original 
document).
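
To make the rounding concrete, here is a minimal Python sketch of how a 
percentage-style mm value turns into a required clause count. It follows the 
rule described in the Solr documentation (round the computed number down; for 
a negative percentage, subtract the rounded number of "allowed missing" 
clauses from the total). The function name min_should_match is made up for 
this illustration and is not a Solr API.

```python
def min_should_match(optional_clauses: int, mm_percent: int) -> int:
    """Number of optional clauses that must match for mm=<mm_percent>%.

    Illustrative helper only; the name is hypothetical, not part of Solr.
    """
    calc = optional_clauses * mm_percent / 100.0
    if calc < 0:
        # Negative percentage: that share of the clauses may be missing.
        # int() truncates toward zero, so -0.03 means 0 clauses may be missing.
        return optional_clauses + int(calc)
    # Positive percentage: the computed number is rounded down.
    return int(calc)


for mm in (99, -1):
    print(f"mm={mm}% of 3 clauses: {min_should_match(3, mm)} must match")
```

With 3 clauses, mm=99% requires only 2 of them, while mm=-1% requires all 3.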

You will have to experiment with your tokenizers and decide what a 
similarity percentage should mean (if you want to stick with mm).

Regards,
Emir



-- 
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/

Re: Search by similarity?

Posted by Darko Todoric <to...@mdpi.com>.

Hm... I cannot get DisMax to work on my Solr...

In Solr I have documents with these titles:
  - "title-1-end"
  - "title-2-end"
  - "title-3-end"
  - ...
  - ...
  - "title-312-end"

and when I run the query 
http://localhost:8983/solr/SciLit/select?defType=dismax&indent=on&mm=99%&q=title:"title-123123123-end"&wt=json 
I get all documents from Solr :\
What am I doing wrong?

Also, I don't know whether it affects the results, but on the "title" field I use 
"WhitespaceTokenizerFactory".
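
For reference, a small Python sketch of how the same request could be built 
with the parameters URL-encoded. It follows the advice given elsewhere in this 
thread (dismax does not understand the title: prefix, so the field goes into 
qf instead). It also encodes the % in mm as %25, since a bare % is not valid 
in a URL query string and may be mangled or rejected depending on the client. 
The host and core name are the ones from the query above; nothing is sent, the 
URL is only printed.

```python
from urllib.parse import urlencode

# Endpoint taken from the query quoted above.
BASE_URL = "http://localhost:8983/solr/SciLit/select"

params = {
    "defType": "dismax",
    "q": '"title-123123123-end"',  # no "title:" prefix; dismax treats it as text
    "qf": "title",                 # restrict the search to the title field
    "mm": "99%",                   # urlencode() turns the % into %25
    "indent": "on",
    "wt": "json",
}

url = f"{BASE_URL}?{urlencode(params)}"
print(url)
```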

Kind regards,
Darko



RE: Search by similarity?

Posted by Junte Zhang <Ju...@localsearch.ch>.

If you already have the title of the document, then you could run that title as a new query against the whole index and exclude the source document from the results as a filter.

You could use the DisMax query parser: https://cwiki.apache.org/confluence/display/solr/The+DisMax+Query+Parser

And then set the minimum match ratio of the OR clauses to 90%.
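
A rough Python sketch of what such a request could look like, under a few 
assumptions that are not stated in this thread: the unique key field is 
called "id", the title is searched through qf=title, and the helper name 
similar_title_query is invented for the example. It only builds the query 
string; it does not contact Solr.

```python
from urllib.parse import urlencode


def similar_title_query(title: str, source_id: str, mm: str = "90%") -> str:
    """Build a dismax query string that searches for `title` and filters
    out the document the title came from (hypothetical helper)."""
    params = [
        ("defType", "dismax"),
        ("q", title),                  # the source document's title as the query
        ("qf", "title"),               # assumed: match against the title field only
        ("mm", mm),                    # minimum share of OR clauses that must match
        ("fq", f'-id:"{source_id}"'),  # assumed unique key "id"; excludes the source
    ]
    return urlencode(params)


# "23251" is just a placeholder document id for the illustration.
print(similar_title_query("title-1-end", "23251"))
```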

/JZ


RE: Search by similarity?

Posted by Markus Jelsma <ma...@openindex.io>.

Yes, that is roughly how MLT works as well. You can also do a full OR-search on the terms using LuceneQParser.

Markus

 
 

Re: Search by similarity?

Posted by "alessandro.benedetti" <a....@sease.io>.

In addition to that, I still believe More Like This is a better option for
you.
The reason is that MLT is able to evaluate the interesting terms from
your document (title being the only field of interest for you) and boost them
accordingly.

Regarding your "80% similarity" requirement, this is trickier.
You could potentially calculate the score of the identical document and then
express the scores of the similar ones as a fraction of it.

Normally it is useless to show the score value per se, but in the case of MLT
it actually makes sense to present the result as a percentage score.
Indeed, it could be a good addition to MLT.
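
A minimal Python sketch of that normalisation idea, assuming you already 
have the score of the document matched against itself and a list of 
(id, score) pairs for the other hits. The function name and the sample 
numbers are invented for the illustration. Lucene scores are not designed as 
a bounded similarity measure, so treat the resulting ratio as a heuristic 
rather than a true "80% similar" guarantee.

```python
def similar_by_relative_score(self_score, hits, threshold=0.8):
    """Keep hits whose score is at least `threshold` of the self-match score.

    hits: iterable of (doc_id, score) pairs.
    Returns a list of (doc_id, ratio) pairs in the original order.
    """
    if self_score <= 0:
        raise ValueError("self_score must be positive")
    kept = []
    for doc_id, score in hits:
        ratio = score / self_score  # 1.0 would mean "scores like the identical doc"
        if ratio >= threshold:
            kept.append((doc_id, ratio))
    return kept


# Made-up scores, purely for illustration.
sample_hits = [("a", 9.0), ("b", 8.5), ("c", 5.0)]
print(similar_by_relative_score(10.0, sample_hits))
```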

Regards





-----
---------------
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html