Posted to solr-user@lucene.apache.org by Ali Nazemian <al...@gmail.com> on 2014/10/14 10:38:26 UTC

mark solr documents as duplicates on hashing the combination of some fields

Dear all,
Hi,
I was wondering how I can mark some documents as duplicates (just marking
them for future use, not deleting them) based on a hash of a combination of
some fields. Suppose I have two fields named "url" and "title". I want to
create a hash based on url+title and store it in another field named
"signature". If I do that using Solr dedup, the duplicate documents get
deleted, so it is not applicable to my situation. Thank you very much.
Best regards.

-- 
A.Nazemian

Re: mark solr documents as duplicates on hashing the combination of some fields

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
This is "dark art" knowledge. I've updated the Reference Guide
comment with a request to have this text included, but it would also
be nice to have it as part of the Javadoc for the Factory or the URP
itself. Maybe the wiki as well. I can see someone missing this part and
ending up with a lot of headaches.

Regards,
   Alex.

On 22 October 2014 14:17, Chris Hostetter <ho...@fucit.org> wrote:
> the atomic updates are processed as part of the
> DistributedUpdateProcessor (so they execute on the leader and work with
> optimistic concurrency) but that means if you have the
> SignatureUpdateProcessorFactory configured before the
> DistributedUpdateProcessorFactory it could compute a signature based on
> the raw doc you send (with the update commands) instead of the "real" doc
> with the updates applied.
>
> for a situation where you want the signatureField to *be* the uniqueKey,
> then you kind of have to put SignatureUpdateProcessorFactory before
> DistributedUpdateProcessorFactory

Re: mark solr documents as duplicates on hashing the combination of some fields

Posted by Chris Hostetter <ho...@fucit.org>.
: I meant that the signature gets broken. For example, suppose the
: destination field of the hash function for the signature fields is "sig".
: After each partial update it becomes "0000000000"!

details please.

how are you configuring your update processor chain? what does your schema 
look like? what types of atomic updates are you using?

in general atomic updates require that all source fields be stored - so 
you might be having problems if the fields you are trying to hash aren't 
stored.
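
To illustrate the stored-field requirement, a schema.xml fragment for this use case might look like the sketch below. The field names "url", "title" and "signature" come from the original question; the field types "string" and "text_general" are assumptions borrowed from Solr's example schema, not something the poster has shown.

```xml
<!-- schema.xml sketch; field names follow the thread, field types are assumed -->
<!-- source fields for the hash: stored="true" so atomic updates can rebuild the document -->
<field name="url"       type="string"       indexed="true" stored="true"/>
<field name="title"     type="text_general" indexed="true" stored="true"/>
<!-- destination of the hash; deliberately not the uniqueKey field -->
<field name="signature" type="string"       indexed="true" stored="true" multiValued="false"/>
```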

likewise, the atomic updates are processed as part of the 
DistributedUpdateProcessor (so they execute on the leader and work with 
optimistic concurrency) but that means if you have the 
SignatureUpdateProcessorFactory configured before the 
DistributedUpdateProcessorFactory it could compute a signature based on 
the raw doc you send (with the update commands) instead of the "real" doc 
with the updates applied.

for a situation where you want the signatureField to *be* the uniqueKey, 
then you kind of have to put SignatureUpdateProcessorFactory before 
DistributedUpdateProcessorFactory -- but for a situation like yours, you 
need to ensure that SignatureUpdateProcessorFactory comes *after* 
DistributedUpdateProcessorFactory and before the 
RunUpdateProcessorFactory.
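
The ordering described above could be sketched in solrconfig.xml roughly as follows. The chain name "mark-dupes" is hypothetical, and the "fields" and "signatureField" values are taken from the original question; treat this as an illustration of the processor order rather than a tested configuration.

```xml
<!-- solrconfig.xml sketch; the chain name "mark-dupes" is hypothetical -->
<updateRequestProcessorChain name="mark-dupes">
  <!-- atomic updates are applied to the stored document at this step -->
  <processor class="solr.DistributedUpdateProcessorFactory"/>
  <!-- placed after it, so the signature is computed from the merged document -->
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <str name="signatureField">signature</str>
    <str name="fields">url,title</str>
    <bool name="overwriteDupes">false</bool>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <!-- RunUpdateProcessorFactory stays last; it performs the actual index write -->
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```

The update request handler would also need to select this chain, for example through its "update.chain" parameter.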


-Hoss
http://www.lucidworks.com/

Re: mark solr documents as duplicates on hashing the combination of some fields

Posted by Ali Nazemian <al...@gmail.com>.
I meant that the signature gets broken. For example, suppose the
destination field of the hash function for the signature fields is "sig".
After each partial update it becomes "0000000000"!

On Wed, Oct 22, 2014 at 2:59 PM, Alexandre Rafalovitch <ar...@gmail.com>
wrote:

> What specifically do you mean by 'useless' at the business level?
>
> Regards,
>      Alex



-- 
A.Nazemian

Re: mark solr documents as duplicates on hashing the combination of some fields

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
What specifically do you mean by 'useless' at the business level?

Regards,
     Alex
On 22/10/2014 7:27 am, "Ali Nazemian" <al...@gmail.com> wrote:

> The problem is when I partially update some fields of a document: the
> signature becomes useless, even if the updated fields are not included in
> the signatureField!
> Regards.

Re: mark solr documents as duplicates on hashing the combination of some fields

Posted by Ali Nazemian <al...@gmail.com>.
The problem is when I partially update some fields of a document: the
signature becomes useless, even if the updated fields are not included in
the signatureField!
Regards.

On Wed, Oct 22, 2014 at 12:44 AM, Chris Hostetter <ho...@fucit.org>
wrote:

>
> you can still use the SignatureUpdateProcessorFactory for your use case,
> just don't configure the signatureField to be the same as your uniqueKey
> field.
>
> configure some other field name (e.g. "signature") instead.



-- 
A.Nazemian

Re: mark solr documents as duplicates on hashing the combination of some fields

Posted by Chris Hostetter <ho...@fucit.org>.
you can still use the SignatureUpdateProcessorFactory for your use case, 
just don't configure the signatureField to be the same as your uniqueKey 
field.

configure some other field name (e.g. "signature") instead.
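
A minimal sketch of such a processor definition is shown below, using the "url", "title" and "signature" names from the question. Setting "overwriteDupes" to false goes beyond what this reply states: it assumes, per the Reference Guide's de-duplication page, that the option defaults to true and would otherwise replace earlier documents carrying the same signature.

```xml
<!-- solrconfig.xml sketch of the processor only; it belongs inside an updateRequestProcessorChain -->
<processor class="solr.processor.SignatureUpdateProcessorFactory">
  <bool name="enabled">true</bool>
  <!-- write the hash to a dedicated field, not to the uniqueKey field -->
  <str name="signatureField">signature</str>
  <!-- fields that are combined to compute the hash -->
  <str name="fields">url,title</str>
  <!-- assumption: false keeps duplicates in the index so they are only marked -->
  <bool name="overwriteDupes">false</bool>
</processor>
```

Documents that share a "signature" value can then be located later by querying or faceting on that field.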


: Date: Tue, 14 Oct 2014 12:08:26 +0330
: From: Ali Nazemian <al...@gmail.com>
: Reply-To: solr-user@lucene.apache.org
: To: "solr-user@lucene.apache.org" <so...@lucene.apache.org>
: Subject: mark solr documents as duplicates on hashing the combination of some
:     fields
: 
: Dear all,
: Hi,
: I was wondering how I can mark some documents as duplicates (just marking
: them for future use, not deleting them) based on a hash of a combination
: of some fields. Suppose I have two fields named "url" and "title". I want
: to create a hash based on url+title and store it in another field named
: "signature". If I do that using Solr dedup, the duplicate documents get
: deleted, so it is not applicable to my situation. Thank you very much.
: Best regards.
: 
: -- 
: A.Nazemian
: 

-Hoss
http://www.lucidworks.com/