Posted to users@solr.apache.org by Tom Tailor <al...@gmail.com> on 2023/04/12 10:21:43 UTC
How to use MorelikeThis with duplicates
Hi all,
I want to build a recommender using Solr MoreLikeThis. I work on
bibliographic data, i.e. books, and I have multiple records for different
editions of the same book. For a given book, MLT returns all the other
editions of that book, which is not new content from the user's point of view.
I cannot deduplicate the records because the different editions are
relevant for other applications.
Is it possible to work around this? I could use the book's title, which is the
same across all editions, to filter duplicates from the MLT results.
Thanks for your help
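As a rough illustration of the title-based filtering described above, here is a minimal client-side sketch in Python. It assumes the MLT results have already been fetched as a list of dicts ordered by score and that each one has a single-valued `title` field. The function name and the field name are hypothetical and not part of Solr.

```python
def dedupe_by_title(source_title, mlt_docs, title_field="title"):
    """Drop MLT hits that share a title with the source book or with an earlier hit."""

    def normalize(title):
        # Case- and whitespace-insensitive comparison (an assumption; real
        # catalogue titles may need stronger normalization).
        return " ".join(str(title).lower().split())

    seen = {normalize(source_title)}  # other editions of the source book are skipped
    unique = []
    for doc in mlt_docs:  # assumed to be ordered by MLT score, best first
        key = normalize(doc.get(title_field, ""))
        if not key:
            # No title to compare on: keep the hit rather than guess.
            unique.append(doc)
            continue
        if key in seen:
            continue
        seen.add(key)
        unique.append(doc)
    return unique


hits = [
    {"id": "2", "title": "Moby Dick"},    # another edition of the source book
    {"id": "3", "title": "Billy Budd"},
    {"id": "4", "title": "billy  budd"},  # second edition of the same title
    {"id": "5", "title": "Typee"},
]
print([d["id"] for d in dedupe_by_title("Moby Dick", hits)])  # prints ['3', '5']
```

Because the filtering happens after the query, a page can come back shorter than requested, so asking Solr for more rows than needed is one way to compensate.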
Re: How to use MorelikeThis with duplicates
Posted by Dave <ha...@gmail.com>.
The recent flag is super clever, and you can use it in other applications/situations as well. I would do that in a heartbeat, assuming you can reindex your data set quickly.
> On Apr 12, 2023, at 10:49 AM, Alessandro Benedetti <a....@sease.io> wrote:
>
> Following up on Mikhail's good insights,
> I would probably recommend using the More Like This Query Parser followed
> by grouping/field collapsing on a field.
> It should solve your problem!
>
> If your requirements are more advanced, feel free to let us know!
>
> Cheers
Re: How to use MorelikeThis with duplicates
Posted by Alessandro Benedetti <a....@sease.io>.
Following up on Mikhail's good insights,
I would probably recommend using the More Like This Query Parser followed
by grouping/field collapsing on a field.
It should solve your problem!
If your requirements are more advanced, feel free to let us know!
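A minimal sketch of what such a request could look like, built in Python so the parameters are easy to inspect. It assumes a single-valued string field holding the title (called `title_s` here), text fields `title` and `description` for the similarity computation, and a document id without spaces or special characters; all of these names are placeholders. The second filter, which excludes the source book's own title, is an addition beyond the suggestion above: collapsing alone may still leave one other edition of the source book in the results.

```python
from urllib.parse import urlencode


def mlt_collapse_params(doc_id, source_title, mlt_fields=("title", "description"),
                        title_field="title_s", rows=10):
    """Build Solr query parameters: MLT query parser plus collapsing on the title."""
    # Escape backslashes and double quotes so the title can sit inside a quoted term.
    escaped = source_title.replace("\\", "\\\\").replace('"', '\\"')
    return [
        # MLT query parser: find documents similar to the document with this id.
        ("q", "{!mlt qf=%s mintf=1 mindf=1}%s" % (",".join(mlt_fields), doc_id)),
        # Collapse the result set to one document per title value.
        ("fq", "{!collapse field=%s}" % title_field),
        # Exclude other editions of the source book itself (assumed extra step).
        ("fq", '-%s:"%s"' % (title_field, escaped)),
        ("rows", str(rows)),
    ]


params = mlt_collapse_params("book-42", "Moby Dick")
for name, value in params:
    print(name, "=", value)
# URL-encoded form, to be appended to the collection's /select endpoint.
print(urlencode(params))
```

Collapsing expects a single-valued field, which is why the sketch uses a separate string field for the title rather than the analyzed text field used for similarity.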
Cheers
--------------------------
*Alessandro Benedetti*
Director @ Sease Ltd.
*Apache Lucene/Solr Committer*
*Apache Solr PMC Member*
e-mail: a.benedetti@sease.io
*Sease* - Information Retrieval Applied
Consulting | Training | Open Source
Website: Sease.io <http://sease.io/>
LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
<https://twitter.com/seaseltd> | Youtube
<https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
<https://github.com/seaseltd>
On Wed, 12 Apr 2023 at 13:15, Mikhail Khludnev <mk...@apache.org> wrote:
> Hello Tom.
> It's not clear which kind of MLT you are referring to: handler, query parser,
> or component.
> Generally there are two options for deduplication:
> - query time: field grouping or field collapsing
> - index time:
>   - the MLT query might be limited to parents that carry the titles, while
>     children carry the editions with dates and so on
>   - or the MLT query can be filtered to only the most recent edition of
>     every title; for that, a recent-flag should be set during indexing and
>     then used by a filter.
Re: How to use MorelikeThis with duplicates
Posted by Mikhail Khludnev <mk...@apache.org>.
Hello Tom.
It's not clear which kind of MLT you are referring to: handler, query parser,
or component.
Generally there are two options for deduplication:
- query time: field grouping or field collapsing
- index time:
  - the MLT query might be limited to parents that carry the titles, while
    children carry the editions with dates and so on
  - or the MLT query can be filtered to only the most recent edition of
    every title; for that, a recent-flag should be set during indexing and
    then used by a filter.
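The recent-flag option could be prepared at indexing time roughly as follows. This is a sketch under assumptions: records are plain dicts with a `title` and a numeric `year`, and the flag field name `latest_edition_b` is made up for illustration.

```python
def flag_latest_editions(records, title_field="title", year_field="year",
                         flag_field="latest_edition_b"):
    """Return copies of the records with a boolean flag on the newest edition per title."""
    latest = {}  # title -> index of the newest record seen so far
    for i, rec in enumerate(records):
        best = latest.get(rec[title_field])
        # Strict comparison: on a tie, the first record seen keeps the flag.
        if best is None or rec[year_field] > records[best][year_field]:
            latest[rec[title_field]] = i
    winners = set(latest.values())
    # Build new dicts so the input records are left unmodified.
    return [{**rec, flag_field: i in winners} for i, rec in enumerate(records)]


records = [
    {"id": "1", "title": "Moby Dick", "year": 1851},
    {"id": "2", "title": "Moby Dick", "year": 1992},
    {"id": "3", "title": "Typee", "year": 1846},
]
for doc in flag_latest_editions(records):
    print(doc["id"], doc["latest_edition_b"])
```

At query time the MLT request would then carry a filter such as `fq=latest_edition_b:true`. When a newer edition is indexed, the previously flagged record has to be updated as well, so this approach depends on being able to reindex the affected records.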
On Wed, Apr 12, 2023 at 1:22 PM Tom Tailor <al...@gmail.com> wrote:
> Hi all,
>
> I want to build a recommender using Solr MoreLikeThis. I work on
> bibliographic data, i.e. books, and I have multiple records for different
> editions of the same book. For a given book, MLT returns all the other
> editions of that book, which is not new content from the user's point of view.
> I cannot deduplicate the records because the different editions are
> relevant for other applications.
>
> Is it possible to work around this? I could use the book's title, which is the
> same across all editions, to filter duplicates from the MLT results.
>
> Thanks for your help
--
Sincerely yours
Mikhail Khludnev
https://t.me/MUST_SEARCH
A caveat: Cyrillic!