
Posted to users@solr.apache.org by Tom Tailor <al...@gmail.com> on 2023/04/12 10:21:43 UTC

How to use MorelikeThis with duplicates

Hi all,

I want to build a recommender using Solr MoreLikeThis. I work on
bibliographic data, i.e. books, and I have multiple records for different
editions of the same book. For a given book, MLT returns all the other
editions of that book, which is not new content from the user's point of view.
I cannot deduplicate the records because the different editions are
relevant for other applications.



Is it possible to circumvent this? I could use the book's title, which is the
same across all editions, to filter duplicates from the MLT results.
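
For illustration, the title-based filtering described here could be done on
the client after the MLT results come back. The following is only a sketch
under assumptions: the function names and the "id" and "title" keys are
hypothetical, and the documents are assumed to arrive as plain dicts in score
order.

```python
from typing import Dict, Iterable, List


def normalize_title(title: str) -> str:
    """Lowercase and collapse whitespace so trivially different spellings match."""
    return " ".join(title.lower().split())


def drop_duplicate_editions(source_title: str, mlt_docs: Iterable[Dict]) -> List[Dict]:
    """Keep at most one document per title and skip editions of the source book.

    Assumes every document has a "title" key (hypothetical field name).
    Input order is preserved, so the first (best-ranked) edition of a title wins.
    """
    seen = {normalize_title(source_title)}  # the source book counts as already seen
    unique = []
    for doc in mlt_docs:
        key = normalize_title(doc["title"])
        if key in seen:
            continue
        seen.add(key)
        unique.append(doc)
    return unique
```

Because the filtering happens after retrieval, the request would probably
need to ask for more rows than are finally shown, otherwise a page full of
editions of one book shrinks to a single hit.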



Thanks for your help

Re: How to use MorelikeThis with duplicates

Posted by Dave <ha...@gmail.com>.

The recent flag is super clever, and you can use it in other applications/situations as well. I would do that in a heartbeat, assuming you can reindex your data set quickly.


Re: How to use MorelikeThis with duplicates

Posted by Alessandro Benedetti <a....@sease.io>.

Following up on Mikhail's good insights,
I would probably recommend using the More Like This Query Parser followed
by grouping/field collapsing on a field whose value is shared by all editions
of a book (the title, for example).
It should solve your problem!
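
As a rough sketch of what such a request could look like, the snippet below
builds the parameters for an MLT query parser query combined with a collapse
filter. The helper name, the document id and the field names ("title",
"subject", "title_exact") are assumptions for illustration only; the collapse
field is assumed to be a single-valued string field in the schema.

```python
from urllib.parse import urlencode


def build_mlt_collapse_params(doc_id, similarity_fields, collapse_field, rows=10):
    """Build request parameters for an MLT query collapsed on one field.

    doc_id            -- uniqueKey value of the source book
    similarity_fields -- fields MLT should compare (passed to the parser as qf)
    collapse_field    -- single-valued field shared by all editions of a book
    """
    qf = ",".join(similarity_fields)
    return {
        # MLT query parser: find documents similar to the one with this id.
        "q": f"{{!mlt qf={qf}}}{doc_id}",
        # Collapse filter: keep only the top-scoring document per field value.
        "fq": f"{{!collapse field={collapse_field}}}",
        "rows": str(rows),
    }


# Hypothetical example values; the encoded string goes after ".../select?".
params = build_mlt_collapse_params("book-123", ["title", "subject"], "title_exact")
query_string = urlencode(params)
```

Collapsing leaves one record per title in the result list. Other editions of
the source book itself may still appear as one collapsed hit, in which case an
additional negative filter on the source title might be needed.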

If your requirements are more advanced, feel free to let us know!

Cheers
--------------------------
*Alessandro Benedetti*
Director @ Sease Ltd.
*Apache Lucene/Solr Committer*
*Apache Solr PMC Member*




Re: How to use MorelikeThis with duplicates

Posted by Mikhail Khludnev <mk...@apache.org>.

Hello Tom,
It's not clear which kind of MLT you are referring to: the handler, the query
parser, or the component.
Generally there are two options for deduplication:
- query time: field grouping or field collapsing
- index time:
  - the MLT query might be limited to parent documents carrying the titles,
    while child documents carry the editions with dates and so on
  - or the MLT query can be filtered to only the most recent edition of every
    title; a recent flag should then be set during indexing and used by a
    filter.
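
The recent-flag option could be prepared in the indexing pipeline along these
lines. This is a minimal sketch, not a definitive implementation: the function
name, the "id", "title" and "year" keys and the "is_latest_edition" field are
all hypothetical, and "most recent" is assumed to mean the highest year.

```python
def flag_latest_editions(records):
    """Return copies of the records with a boolean is_latest_edition field.

    Assumes each record is a dict with "id", "title" and "year" keys
    (hypothetical field names). On a tie in year, the first record seen wins.
    """
    latest = {}
    for rec in records:
        best = latest.get(rec["title"])
        if best is None or rec["year"] > best["year"]:
            latest[rec["title"]] = rec
    latest_ids = {rec["id"] for rec in latest.values()}
    # Copy each record so the caller's input is left untouched.
    return [dict(rec, is_latest_edition=rec["id"] in latest_ids) for rec in records]


# Filter query to add to the MLT request once the flag has been indexed.
LATEST_ONLY_FILTER = "is_latest_edition:true"
```

When a newer edition of a title is added, the previously flagged record has to
be reindexed with the flag cleared, so this approach depends on being able to
update the affected documents.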



-- 
Sincerely yours
Mikhail Khludnev
https://t.me/MUST_SEARCH
A caveat: Cyrillic!