Posted to users@solr.apache.org by Marco D'Ambra <m....@volocom.it> on 2022/03/10 17:19:04 UTC

Question regarding the MoreLikeThis features

Hi all,
This is my first time writing to this mailing list and I would like to thank you in advance for your attention.
I am writing because I am having problems using the "MoreLikeThis" features.
I am working with a Solr cluster (version 8.11.1) consisting of multiple nodes, each of which hosts multiple shards.

It is quite a big cluster: data is sharded using implicit routing, and documents are distributed by date across monthly shards.

Here are the fields that I'm using:

  *   UniqueReference: the unique reference of a document
  *   DocumentDate: the date of a document (in the standard Solr format)
  *   DataType: the data type of the document (let's say that can be A or B)
  *   Content: the content of a document (a string)
Here is what my managed schema looks like:
...
<field name="UniqueReference" type="string" indexed="true" stored="true" required="true" />

<field name="DocumentDate" type="pdate" indexed="true" stored="false" required="true" />

<field name="DataType" type="string" indexed="true" stored="false" required="true" />

<field name="Content_en" type="text_en" indexed="true" stored="true" required="false" />
...


The task that I want to perform is the following:
Given the unique reference of a document of data type A, I want to find the documents of data type B, within a fixed time interval, that have the most similar content.
Here are the first questions:

  1.  Which is the best Solr request to perform this task?
  2.  Is there a parameter that allows me to restrict the corpus of documents that are analyzed when returning similar content? It should be noted that this corpus of documents may not contain the initial document from which I am starting.
Initially I thought about using the "mlt" endpoint, but since the documentation lists no parameter that would let me select the shard on which to direct the query (I absolutely need this, otherwise I risk putting a strain on my cluster), I opted for the "select" endpoint, with the "mlt" parameter set to true, and the "shards" parameter.
These are the parameters that I am using:

  *   q: "UniqueReference:doc_id"
  *   fq: "(DocumentDate:[2022-01-22T00:00:00Z TO 2022-01-26T00:00:00Z] AND DataType:B) OR (UniqueReference:doc_id)"
  *   mlt: true
  *   mlt.fl: "Content"
  *   shards: "shard_202201"
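For reference, the parameters above assemble into a request like the following (a sketch using only Python's standard library; the host, port and collection names are placeholders matching the examples later in this thread):

```python
from urllib.parse import urlencode

# The select + MLT-component request described above.
# "solrnode", "my_collection" and "doc_id" are placeholders.
params = {
    "q": "UniqueReference:doc_id",
    "fq": "(DocumentDate:[2022-01-22T00:00:00Z TO 2022-01-26T00:00:00Z]"
          " AND DataType:B) OR (UniqueReference:doc_id)",
    "mlt": "true",
    "mlt.fl": "Content",
    "shards": "shard_202201",
}
url = "http://solrnode:8080/solr/my_collection/select?" + urlencode(params)
print(url)
```

(One thing worth double-checking: the schema excerpt above defines Content_en, while mlt.fl references Content.)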
I realize that the "fq" parameter is used in a bizarre way. In theory it should target the documents of the main query (in my case the source document); it is an attempt to solve problem (2), which didn't work, actually.
Anyway, my doubts are not limited to this. What really surprises me is the structure of the response that Solr returns.
The content of the response looks like this:
{
  "response" : {
    "docs" : [],
    ...
  },
  "moreLikeThis" : ...
}
The weird stuff appears in the "moreLikeThis" part. Sometimes Solr returns a list, other times a dictionary. Repeating the same call several times, the two possibilities keep alternating, apparently without a logical pattern, and I have not been able to understand why.
To be precise, in both cases the documents contained in the answer are not necessarily of data type B, as requested with the "fq" parameter.
In the "dictionary" case, there is only one key, which is the UniqueReference of the source document, and the corresponding value is the list of similar documents.
In the "list" case, the second element contains the required documents.
So, here is the last question:

  1.  I am perfectly aware that I am lost; therefore, what am I missing?
I thank everyone for the attention you have dedicated to me. Greetings from Italy.
I'm available for clarifications,

Marco


Re: Question regarding the MoreLikeThis features

Posted by Alessandro Benedetti <a....@sease.io>.
Yes, Marco, I think you are on the right track!
The bug you linked is relevant; feel free to fix it, and I'll be glad to
help with review and commit.
If you need someone else to fix it, let me know!

Cheers
--------------------------
Alessandro Benedetti
Apache Lucene/Solr PMC member and Committer
Director, R&D Software Engineer, Search Consultant

www.sease.io


On Mon, 21 Mar 2022 at 11:09, Marco D'Ambra <m....@volocom.it> wrote:

> Hi Tim and Alessandro,
>
> thank you very much for your helpful answers. I finally begin to
> understand how it works.
>
> Following the advice of Alessandro, I did some tests using the !mlt query
> parser.
> If I'm not mistaken, right now I am facing another bug.
>
> To try to explain my situation, here is a query:
>
>
> http://solrnode:8080/solr/my_collection/select?defType=lucene&q={!mlt&qf=Content}DOC_ID
>
> And this is the response:
>
> {
>   "error":{
>     "metadata":[
>       "error-class","org.apache.solr.common.SolrException",
>       "root-error-class","org.apache.solr.common.SolrException"],
>     "msg":"Error completing MLT request. Could not fetch document with id
> [DOC_ID]",
>     "code":400
>   }
> }
>
> If I understand correctly, the problem is that I don't use the default
> compositeId router (I use a Time Routed Alias instead).
> I also found an open issue on the matter:
>
> https://issues.apache.org/jira/browse/SOLR-15615
>
> Do you think I'm on the right track? Do you have any advice on this?
>
> Thank you so much and have a nice day.
>
> Marco
>
> -----Original Message-----
> From: Alessandro Benedetti <a....@sease.io>
> Sent: giovedì 17 marzo 2022 13:05
> To: users@solr.apache.org
> Subject: Re: Question regarding the MoreLikeThis features
>
> Hi Marco,
> I have been working for a long time on the Apache Lucene More Like This
> component and its integration in Apache Solr as a committer.
> Let me try to summarise a bit how it works to help you with your use case.
> You can find benefits from a presentation I gave in Tokyo for the Open
> Source summit in 2017(
> https://www.slideshare.net/SeaseLtd/how-the-lucene-more-like-this-works)
> and a London Lucene/Solr meetup from 2019 (
> https://www.youtube.com/watch?v=jkaj89XwHHw partial).
>
> First of all, the More Like This implementation is at the Lucene level:
>
>    - it takes as input a document id (fetched from the index) or a
>    stream of text
>    - it takes various parameters regarding the fields to use and the
>    minimum frequencies to consider
>    - it uses TF/IDF to assign a score to each term of the original
>    document *(considering the whole local index as the corpus)*
>    - it returns *a query with the list of terms*, potentially boosted by
>    the term importance (this is the boost parameter)
>
> In the past, I produced plenty of work to restructure the module, to
> support BM25 for scoring the terms and make the component more
> readable/maintainable but didn't get enough traction, and the contribution
> stalled (I am open to resuscitating it if there's interest).
>
> In Apache Solr you have the More Like This integrated in three ways (
> https://issues.apache.org/jira/browse/SOLR-13172 adds some details):
> 1) query component (this is what happens when you add mlt=true as a
> request parameter and include the mlt component in a request handler).
> This calculates and runs an MLT query for each document in the search
> results.
> 2) request handler
> 3) query parser -> just builds the MLT query
>
> Coming back to your question:
>
> *1.  Which is the best Solr request to perform this task?* If you want to
> find documents starting from a document ID, the best option is to use
> the *!mlt query parser*; it is compatible with SolrCloud.
> The request is processed by a Solr node, which fetches the document
> (potentially from another shard) and then builds the MLT query *locally*.
> Bear in mind that the entire local shard is used for the document
> frequency calculations; if your shards are skewed, you should use global IDF.
> Using this approach you can build your final query as you like, so you can
> add additional boolean clauses and filters on top of the MLT query.
>
> *  2.  Is there a parameter that allows me to restrict the corpus of
> documents that are analyzed for the return of similar contents? it should
> be noted that this corpus of documents may not contain the initial document
> from which I am starting*
> No, at the moment the entire corpus of the local core that is processing
> the request is used to calculate the importance of the terms in the seed
> document.
> As I said before, it's not that important if the document is there locally
> or not, as the more like this query parser is SolrCloud compatible.
> But if you want to limit the corpus for the term importance calculations,
> some Lucene customizations are needed.
>
> Hope this helps,
>
> Cheers
>
> --------------------------
> Alessandro Benedetti
> Apache Lucene/Solr PMC member and Committer
> Director, R&D Software Engineer, Search Consultant
>
> www.sease.io
>
>
> On Mon, 14 Mar 2022 at 22:05, Tim Casey <tc...@gmail.com> wrote:
>
> > Hi,
> >
> > > Regarding the specific problem on the existence of a specific
> > > parameter to restrict the corpus of documents that are analyzed for
> > > the return of similar contents
> >
> > If you can get this to be a query, and one which might be ordered in a
> > useful way, then you are very likely to see what you need in the top
> > 500 results.  This would be enough for most usage.
> > The 'likely' would need to be computed and measured as you produce
> > results.
> >
> >
> > In any event, to restrict the corpus you build a query bit set and use
> > that as a filter.  This is fairly easy to code so you can see the
> > results and give yourself a way to experiment on what you would do,
> > before deciding how/what to do any one particular way.
> >
> > Or, you directly query and allow solr to do the needed computations
> > within each shard.  At this point, I would recommend people who are
> > more versed in solr specifics for this kind of computation.
> >
> > On Mon, Mar 14, 2022 at 12:56 AM Marco D'Ambra <m....@volocom.it>
> > wrote:
> >
> > > Hi Tim,
> > >
> > > thank you very much for the answer, full of useful advice.
> > > I will try to put into practice what you told me to improve the
> > > output of the calls.
> > > Regarding the specific problem on the existence of a specific
> > > parameter to restrict the corpus of documents that are analyzed for
> > > the return of similar contents, I must admit that I have not yet
> > > figured out how to proceed.
> > >
> > > Thank you very much and have a nice day,
> > >
> > > Marco
> > >
> > > -----Original Message-----
> > > From: Tim Casey <tc...@gmail.com>
> > > Sent: giovedì 10 marzo 2022 19:51
> > > To: users@solr.apache.org
> > > Subject: Re: Question regarding the MoreLikeThis features
> > >
> > > Marco,
> > >
> > > Finding 'similar' documents will end up being weighted by document
> > length.
> > > I would recommend, at the point of indexing, also indexing an
> > > ordered token set of the first 256, 1024 up to around 5k tokens
> > > (depending on document lengths).  What this does is allow a vector
> > > to vector normalized comparison.  You could then query for similar
> > > possible documents directly and build a normalized vector with
> > > respect to the query document.
> > >
> > > Normalizing schemes in something like an inverted index will tend to
> > > weight the lower token count documents over higher token count
> > > documents.  So the above is an attempt to get at a normalized and
> > > comparable view between documents independent of size.  Next you end
> > > up normalizing by the inverse of a commonality.  That is, a more
> > > common token is weighted lower than a least common token.  (I would
> > > also discount tokens which have a raw frequency below 5.)  At the
> > > point you have a normalized vector, you can use that to find
> > > similarities weighted by more meaningful tokens.
> > >
> > > tim
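Tim's scheme above — normalized token vectors, down-weighted by how common each token is, with very rare tokens discounted — can be sketched roughly like this (an illustration of the idea only, not Solr code; the weighting choices are simplified):

```python
from collections import Counter
import math

def similarity(tokens_a, tokens_b, corpus_freq, rare_cutoff=5):
    """Cosine similarity of two token lists. Each token is weighted by the
    inverse of its corpus frequency (common tokens count less), and tokens
    with a raw corpus frequency below rare_cutoff are discounted entirely."""
    def vector(tokens):
        counts = Counter(tokens)
        return {
            t: c / corpus_freq[t]          # inverse-commonality weighting
            for t, c in counts.items()
            if corpus_freq.get(t, 0) >= rare_cutoff
        }
    va, vb = vector(tokens_a), vector(tokens_b)
    dot = sum(w * vb.get(t, 0.0) for t, w in va.items())
    norm_a = math.sqrt(sum(w * w for w in va.values()))
    norm_b = math.sqrt(sum(w * w for w in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

Truncating each document to its first N tokens before building the vectors, as suggested, keeps the comparison independent of document length.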

RE: Question regarding the MoreLikeThis features

Posted by Marco D'Ambra <m....@volocom.it>.
Hi Tim and Alessandro,

thank you very much for your helpful answers. I finally begin to understand how it works.

Following the advice of Alessandro, I did some tests using the !mlt query parser.
If I'm not mistaken, right now I am facing another bug.

To try to explain my situation, here is a query:

http://solrnode:8080/solr/my_collection/select?defType=lucene&q={!mlt&qf=Content}DOC_ID

And this is the response:

{
  "error":{
    "metadata":[
      "error-class","org.apache.solr.common.SolrException",
      "root-error-class","org.apache.solr.common.SolrException"],
    "msg":"Error completing MLT request. Could not fetch document with id [DOC_ID]",
    "code":400
  }
}

If I understand correctly, the problem is that I don't use the default compositeId router (I use a Time Routed Alias instead).
I also found an open issue on the matter:

https://issues.apache.org/jira/browse/SOLR-15615

Do you think I'm on the right track? Do you have any advice on this? 

Thank you so much and have a nice day.

Marco
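As a side note for readers of the archive: the standard local-params syntax separates the parser name from its arguments with a space rather than "&" (an "&" would terminate the q parameter), so the request above would normally be composed like this (a sketch; host, collection and DOC_ID are placeholders):

```python
from urllib.parse import urlencode

# {!mlt} query parser request; urlencode escapes the space and braces.
# "solrnode", "my_collection" and "DOC_ID" are placeholders.
params = {
    "defType": "lucene",
    "q": "{!mlt qf=Content}DOC_ID",
}
url = "http://solrnode:8080/solr/my_collection/select?" + urlencode(params)
print(url)
```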


Re: Question regarding the MoreLikeThis features

Posted by Alessandro Benedetti <a....@sease.io>.
Hi Marco,
I have been working for a long time on the Apache Lucene More Like This
component and its integration in Apache Solr as a committer.
Let me try to summarise a bit how it works to help you with your use case.
You may find it helpful to watch a presentation I gave in Tokyo at the Open
Source Summit in 2017 (
https://www.slideshare.net/SeaseLtd/how-the-lucene-more-like-this-works)
and a London Lucene/Solr meetup talk from 2019 (
https://www.youtube.com/watch?v=jkaj89XwHHw partial).

First of all, the More Like This implementation is at the Lucene level:

   - it takes as input a document id (fetched from the index) or a
   stream of text
   - it takes various parameters regarding the fields to use and the
   minimum frequencies to consider
   - it uses TF/IDF to assign a score to each term of the original
   document *(considering the whole local index as the corpus)*
   - it returns *a query with the list of terms*, potentially boosted by
   the term importance (this is the boost parameter)
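The steps above can be sketched in miniature (a toy illustration of the idea, not the actual Lucene code; the field name, parameter names and defaults are made up for the example):

```python
import math
from collections import Counter

def mlt_query(seed_tokens, doc_freq, num_docs, field="Content", max_terms=25):
    """Score each term of the seed document by TF/IDF against the corpus
    (doc_freq maps term -> number of documents containing it), keep the
    top-scoring terms, and return them as a boosted OR query string."""
    tf = Counter(seed_tokens)
    scored = []
    for term, freq in tf.items():
        df = doc_freq.get(term, 0)
        if df == 0:
            continue
        idf = math.log(num_docs / df)
        if idf <= 0:            # term occurs in every document: no signal
            continue
        scored.append((freq * idf, term))
    scored.sort(reverse=True)
    return " OR ".join(f"{field}:{term}^{score:.2f}"
                       for score, term in scored[:max_terms])
```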

In the past, I produced plenty of work to restructure the module, to
support BM25 for scoring the terms and make the component more
readable/maintainable but didn't get enough traction, and the contribution
stalled (I am open to resuscitating it if there's interest).

In Apache Solr you have the More Like This integrated in three ways (
https://issues.apache.org/jira/browse/SOLR-13172 adds some details):
1) query component (this is what happens when you add mlt=true as a request
parameter and include the mlt component in a request handler).
This calculates and runs an MLT query for each document in the search
results.
2) request handler
3) query parser -> just builds the MLT query
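As a rough sketch, the three integrations correspond to three request shapes (host, collection and field names are placeholders, and the /mlt handler only answers if it is registered in solrconfig.xml):

```python
from urllib.parse import urlencode

base = "http://solrnode:8080/solr/my_collection"  # placeholder

# 1) search component: MLT runs for each document of the normal result set
component = base + "/select?" + urlencode(
    {"q": "DataType:A", "mlt": "true", "mlt.fl": "Content"})

# 2) dedicated request handler
handler = base + "/mlt?" + urlencode(
    {"q": "UniqueReference:doc_id", "mlt.fl": "Content"})

# 3) query parser: just builds the MLT query, freely composable with filters
parser = base + "/select?" + urlencode(
    {"q": "{!mlt qf=Content}doc_id", "fq": "DataType:B"})
```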

Coming back to your question:

*1.  Which is the best Solr request to perform this task?*
If you want to find documents starting from a document ID, the best option
is the *!mlt query parser*; it is compatible with SolrCloud.
The request is processed by a Solr node, which fetches the document
(potentially from another shard) and then builds the MLT query *locally*.
Bear in mind that the entire local shard is used for the document frequency
calculations; if your shards are skewed, you should use global IDF.
Using this approach you can build your final query as you like, so you can
add additional boolean clauses and filters on top of the MLT query.

*  2.  Is there a parameter that allows me to restrict the corpus of
documents that are analyzed for the return of similar contents? it should
be noted that this corpus of documents may not contain the initial document
from which I am starting*
No, at the moment the entire corpus of the local core that is processing
the request is used to calculate the importance of the terms in the seed
document.
As I said before, it does not matter much whether the document is present
locally or not, as the More Like This query parser is SolrCloud compatible.
But if you want to limit the corpus for the term importance calculations,
some Lucene customizations are needed.

Hope this helps,

Cheers

--------------------------
Alessandro Benedetti
Apache Lucene/Solr PMC member and Committer
Director, R&D Software Engineer, Search Consultant

www.sease.io



Re: Question regarding the MoreLikeThis features

Posted by Tim Casey <tc...@gmail.com>.
Hi,

> Regarding the specific problem on the existence of a specific parameter
to restrict the corpus of documents that are analyzed for the return of
similar contents

If you can express this as a query, one that can be ordered in a useful
way, then you are very likely to see what you need in the top 500
results.  That would be enough for most usage.
The 'likely' would need to be computed and measured as you produce results.


In any event, to restrict the corpus you build a query bit set and use that
as a filter.  This is fairly easy to code, so you can see the results and
give yourself a way to experiment before deciding how to do any one
particular thing.
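As a toy illustration of that idea (hypothetical data, and a crude token-overlap score standing in for a real similarity function), the corpus is restricted first and only the surviving documents are ranked:

```python
# Toy sketch: restrict the candidate corpus with a filter predicate
# (the "bit set"), then rank only the surviving documents.

docs = {
    "d1": {"type": "B", "text": "solr similarity search"},
    "d2": {"type": "A", "text": "solr similarity search"},
    "d3": {"type": "B", "text": "unrelated content here"},
}

def similarity(a: str, b: str) -> float:
    """Crude token-overlap score; stands in for a real MLT score."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / max(len(ta | tb), 1)

seed_text = "solr similarity search"

# The "bit set": document IDs that pass the filter (DataType:B here).
candidates = {doc_id for doc_id, d in docs.items() if d["type"] == "B"}

ranked = sorted(
    ((similarity(seed_text, docs[i]["text"]), i) for i in candidates),
    reverse=True,
)
print(ranked)
```

In Solr itself the same restriction is what a filter query (fq) does for you; the sketch just makes the two stages explicit.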

Or, you query directly and let Solr do the needed computations within each
shard.  At that point, I would defer to people who are more versed in Solr
specifics for this kind of computation.


RE: Question regarding the MoreLikeThis features

Posted by Marco D'Ambra <m....@volocom.it>.
Hi Tim,

Thank you very much for the answer, which is full of useful advice.
I will try to put into practice what you suggested to improve the output of the calls.
Regarding the specific question of whether a parameter exists to restrict the corpus of documents that are analyzed for the return of similar contents, I must admit that I have not yet figured out how to proceed.

Thank you very much and have a nice day,

Marco


Re: Question regarding the MoreLikeThis features

Posted by Tim Casey <tc...@gmail.com>.
Marco,

Finding 'similar' documents will end up being weighted by document length.
I would recommend, at the point of indexing, also indexing an ordered token
set of the first 256, 1024, up to around 5k tokens (depending on document
lengths).  What this does is allow a normalized vector-to-vector
comparison.  You could then query for possible similar documents directly
and build a normalized vector with respect to the query document.

Normalizing schemes in something like an inverted index will tend to weight
the lower token count documents over higher token count documents.  So the
above is an attempt to get a normalized and comparable view between
documents independent of size.  Next you end up normalizing by the inverse
of a commonality.  That is, a more common token is weighted lower than a
less common token.  (I would also discount tokens which have a raw
frequency below 5.)  At the point you have a normalized vector, you can use
that to find similarities weighted by more meaningful tokens.
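A minimal sketch of that scheme (the document frequencies below are toy values, and log(N/df) is one possible reading of "inverse of a commonality"; the raw-frequency cutoff follows the description above):

```python
import math
from collections import Counter

def weighted_vector(tokens, doc_freq, n_docs, prefix=256, min_raw_freq=5):
    """Term vector over the first `prefix` tokens, weighted by inverse
    commonality and L2-normalized so documents of different lengths
    become comparable."""
    counts = Counter(tokens[:prefix])
    vec = {}
    for term, tf in counts.items():
        df = doc_freq.get(term, 0)
        if df < min_raw_freq:  # discount tokens with raw frequency below 5
            continue
        vec[term] = tf * math.log(n_docs / df)  # common terms weigh less
    norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
    return {t: w / norm for t, w in vec.items()}

def cosine(a, b):
    """Similarity of two already-normalized vectors."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())

# Toy corpus statistics (hypothetical).
doc_freq = {"solr": 100, "search": 200, "mlt": 10, "rare": 2}
n_docs = 1000

v1 = weighted_vector("solr search mlt".split(), doc_freq, n_docs)
v2 = weighted_vector("solr search query".split(), doc_freq, n_docs)
print(round(cosine(v1, v2), 3))
```

Because both vectors are unit-length, the score is independent of document size, which is the point of the normalization step described above.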

tim
