Posted to solr-user@lucene.apache.org by Susheel Kumar <su...@gmail.com> on 2017/08/30 13:40:08 UTC

Different ideas for querying unique and non-unique records

Hello,

I am looking for ideas/suggestions to solve the use case I am
working on.

We have a couple of fields in the schema, business_email and
personal_email, along with id. We need to return records based on whether
their business and personal emails are unique.

A record is unique if neither its business nor its personal email appears
in any other record.
A record is non-unique if its business or personal email also appears in
another record; all records sharing that email are then non-unique.
For example, given the documents below:
- for unique records, only id=1 should be returned (since john.doe is
not present in any other record's personal or business email)
- for non-unique records, id=2,3 should be returned (since isabel.dora is
present in multiple records; it doesn't matter whether it appears in the
business or personal email)

Documents
===
{id:1,business_email_s:john.doe@abc.com,personal_email_s:john.doe@abc.com}
{id:2,business_email_s:isabel.dora@abc.com}
{id:3,personal_email_s:isabel.dora@abc.com}

I am able to solve this using a Streaming Expression query, but I am not
sure whether performance will become a bottleneck, as the streaming
expression is quite big. So I am looking for different ideas, like using
de-dupe or handling this during ingestion/pre-processing, without
impacting performance much.
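
As a rough illustration of the pre-processing route, here is a minimal
Python sketch that applies the criteria above to the three sample
documents before they are indexed. It is only a sketch under assumptions:
the names classify and EMAIL_FIELDS are hypothetical, emails are compared
case-insensitively, and a record with no email at all is counted as
unique. None of this comes from Solr itself.

```python
from collections import defaultdict

# Field names taken from the sample documents in this thread.
EMAIL_FIELDS = ("business_email_s", "personal_email_s")


def classify(docs):
    """Return (unique_ids, non_unique_ids) for a list of dict documents.

    A document is non-unique when any of its emails also appears in a
    different document, in either email field. The same email appearing
    in both fields of one document does not make it non-unique.
    """
    # Map each email to the set of document ids that contain it.
    ids_by_email = defaultdict(set)
    for doc in docs:
        for field in EMAIL_FIELDS:
            email = doc.get(field)
            if email:
                # Assumption: emails are matched case-insensitively.
                ids_by_email[email.lower()].add(doc["id"])

    unique_ids, non_unique_ids = [], []
    for doc in docs:
        emails = {doc[f].lower() for f in EMAIL_FIELDS if doc.get(f)}
        shared = any(len(ids_by_email[e]) > 1 for e in emails)
        (non_unique_ids if shared else unique_ids).append(doc["id"])
    return unique_ids, non_unique_ids


docs = [
    {"id": 1, "business_email_s": "john.doe@abc.com",
     "personal_email_s": "john.doe@abc.com"},
    {"id": 2, "business_email_s": "isabel.dora@abc.com"},
    {"id": 3, "personal_email_s": "isabel.dora@abc.com"},
]

unique_ids, non_unique_ids = classify(docs)
print(unique_ids, non_unique_ids)  # [1] [2, 3]
```

The result could be written to a flag field at ingestion time so that the
query becomes a simple filter, at the cost of recomputing the flag for
every record that shares an email whenever such a record is added or
removed.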

Thanks,
Susheel

Re: Different ideas for querying unique and non-unique records

Posted by Rick Leir <rl...@leirtech.com>.

Susheel, just a guess, but carrot2.org might be useful, though it might be overkill. Cheers -- Rick

-- 
Sorry for being brief. Alternate email is rickleir at yahoo dot com