Posted to solr-user@lucene.apache.org by Ali Nazemian <al...@gmail.com> on 2014/10/14 10:38:26 UTC
mark solr documents as duplicates on hashing the combination of some fields
Dear all,
I was wondering how I can mark some documents as duplicates (just marking
them for future use, not deleting them) based on a hash of a combination of
some fields. Suppose I have two fields named "url" and "title". I want to
create a hash based on url+title and store it in another field named
"signature". If I do that using Solr dedup, the duplicate documents get
deleted, so it is not applicable to my situation. Thank you very much.
Best regards.
--
A.Nazemian
Re: mark solr documents as duplicates on hashing the combination of
some fields
Posted by Alexandre Rafalovitch <ar...@gmail.com>.
This is "dark art" knowledge. I've updated the Reference Guide
comment with a request to have this text included, but it would also
be nice to have it in the Javadoc for the Factory or the URP
itself, and maybe on the wiki as well. I can see missing this detail
causing somebody a lot of headaches.
Regards,
Alex.
On 22 October 2014 14:17, Chris Hostetter <ho...@fucit.org> wrote:
> the atomic updates are processed as part of the
> DistributedUpdateProcessor (so they execute on the leader and work with
> optimistic concurrency) but that means if you have the
> SignatureUpdateProcessorFactory configured before the
> DistributedUpdateProcessorFactory it could compute a signature based on
> the raw doc you send (with the updatecommands) instead of the "real" doc
> with the updates applied.
>
> for a situation where you want the signatureField to *be* the uniqueKey,
> then you kind of have to put SignatureUpdateProcessorFactory before
> DistributedUpdateProcessorFactory
Re: mark solr documents as duplicates on hashing the combination of
some fields
Posted by Chris Hostetter <ho...@fucit.org>.
: I meant signature will be broken. For example suppose the destination of
: hash function for signature fields are "sig". After each partial update it
: becomes: "0000000000"!
details please.
how are you configuring your update processor chain? what does your schema
look like? what types of atomic updates are you using?
in general atomic updates require that all source fields be stored - so
you might be having problems if the fields you are trying to hash aren't
stored.
likewise, the atomic updates are processed as part of the
DistributedUpdateProcessor (so they execute on the leader and work with
optimistic concurrency) but that means if you have the
SignatureUpdateProcessorFactory configured before the
DistributedUpdateProcessorFactory it could compute a signature based on
the raw doc you send (with the update commands) instead of the "real" doc
with the updates applied.
for a situation where you want the signatureField to *be* the uniqueKey,
then you kind of have to put SignatureUpdateProcessorFactory before
DistributedUpdateProcessorFactory -- but for a situation like yours, you
need to ensure that SignatureUpdateProcessorFactory comes *after*
DistributedUpdateProcessorFactory and before the
RunUpdateProcessorFactory.
-Hoss
http://www.lucidworks.com/
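
The ordering described in the message above (signature processor after
DistributedUpdateProcessorFactory and before RunUpdateProcessorFactory) can
be sketched as a small Python check over a solrconfig.xml fragment. This is
only an illustration under assumptions: the chain name "mark-duplicates" and
the two helper functions are hypothetical, the field names come from the
thread, and overwriteDupes=false is assumed to be the setting that keeps
duplicate documents instead of deleting them.

```python
import xml.etree.ElementTree as ET

# Hypothetical solrconfig.xml fragment. The chain name is made up for this
# sketch; "url", "title" and "signature" are the field names from the thread.
CHAIN_XML = """
<updateRequestProcessorChain name="mark-duplicates">
  <processor class="solr.DistributedUpdateProcessorFactory"/>
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">signature</str>
    <bool name="overwriteDupes">false</bool>
    <str name="fields">url,title</str>
  </processor>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
"""


def processor_order(chain_xml):
    """Return the short class names of the processors, in chain order."""
    root = ET.fromstring(chain_xml.strip())
    return [p.get("class").rsplit(".", 1)[-1] for p in root.findall("processor")]


def signature_sees_merged_doc(chain_xml):
    """True if the signature processor sits after Distributed and before Run."""
    order = processor_order(chain_xml)
    distributed = order.index("DistributedUpdateProcessorFactory")
    signature = order.index("SignatureUpdateProcessorFactory")
    run = order.index("RunUpdateProcessorFactory")
    return distributed < signature < run


print(processor_order(CHAIN_XML))
print(signature_sees_merged_doc(CHAIN_XML))
```

The check is deliberately simple: it only looks at the relative position of
the three processors and says nothing about other processors a real chain
might contain.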
Re: mark solr documents as duplicates on hashing the combination of
some fields
Posted by Ali Nazemian <al...@gmail.com>.
I meant that the signature gets broken. For example, suppose the destination
field for the hash of the signature fields is "sig". After each partial
update it becomes "0000000000"!
On Wed, Oct 22, 2014 at 2:59 PM, Alexandre Rafalovitch <ar...@gmail.com>
wrote:
> What do you mean by 'useless' specifically on the business level?
>
> Regards,
> Alex
--
A.Nazemian
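
One plausible reading of the symptom described above, following the
explanation given elsewhere in this thread, is that the signature is computed
from the raw atomic-update document, which carries only the key and the
update commands and therefore none of the source fields. The Python sketch
below models that situation outside Solr. Everything in it is an assumption
for illustration: the helper names, the "views" field, the use of MD5, and
the simplified handling of only the "set" command. It does not reproduce
Solr's actual hashing or the exact "0000000000" value.

```python
import hashlib

SIGNATURE_FIELDS = ("url", "title")


def signature(doc):
    """Hash the signature source fields that are present in the document."""
    h = hashlib.md5()
    for name in SIGNATURE_FIELDS:
        value = doc.get(name)
        if value is not None:
            h.update(str(value).encode("utf-8"))
    return h.hexdigest()


def apply_atomic_update(stored_doc, update):
    """Merge a simplified atomic update (only "set") into the stored document."""
    merged = dict(stored_doc)
    for name, value in update.items():
        if isinstance(value, dict) and "set" in value:
            merged[name] = value["set"]
        else:
            merged[name] = value
    return merged


# The document as stored in the index, and a partial update that does not
# touch url or title at all.
stored = {"id": "1", "url": "http://example.com/a", "title": "A", "views": 1}
update = {"id": "1", "views": {"set": 2}}

# Signature computed before the update is merged: no source fields present,
# so this is just the hash of empty input.
sig_before_merge = signature(update)

# Signature computed after the merge: identical to the original document's.
sig_after_merge = signature(apply_atomic_update(stored, update))

print(sig_before_merge == signature(stored))
print(sig_after_merge == signature(stored))
```

The merge step also shows why the source fields need to be stored: without
the stored url and title there is nothing to rebuild the full document from.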
Re: mark solr documents as duplicates on hashing the combination of
some fields
Posted by Alexandre Rafalovitch <ar...@gmail.com>.
What do you mean by 'useless' specifically on the business level?
Regards,
Alex
On 22/10/2014 7:27 am, "Ali Nazemian" <al...@gmail.com> wrote:
> The problem is when I partially update some fields of document. The
> signature becomes useless! Even if the updated fields are not included in
> the signatureField!
> Regards.
Re: mark solr documents as duplicates on hashing the combination of
some fields
Posted by Ali Nazemian <al...@gmail.com>.
The problem occurs when I partially update some fields of a document: the
signature becomes useless, even if the updated fields are not among the
fields the signature is computed from!
Regards.
On Wed, Oct 22, 2014 at 12:44 AM, Chris Hostetter <ho...@fucit.org>
wrote:
>
> you can still use the SignatureUpdateProcessorFactory for your use case,
> just don't configure the signatureField to be the same as your uniqueKey
> field.
>
> configure some other fieldname (e.g. "signature") instead.
--
A.Nazemian
Re: mark solr documents as duplicates on hashing the combination of
some fields
Posted by Chris Hostetter <ho...@fucit.org>.
you can still use the SignatureUpdateProcessorFactory for your use case,
just don't configure the signatureField to be the same as your uniqueKey
field.
configure some other fieldname (e.g. "signature") instead.
: Date: Tue, 14 Oct 2014 12:08:26 +0330
: From: Ali Nazemian <al...@gmail.com>
: Reply-To: solr-user@lucene.apache.org
: To: "solr-user@lucene.apache.org" <so...@lucene.apache.org>
: Subject: mark solr documents as duplicates on hashing the combination of some
: fields
:
: Dear all,
: Hi,
: I was wondering how can I mark some documents as duplicate (just marking
: for future usage not deleting) based on the hash combination of some
: fields? Suppose I have 2 fields name "url" and "title" I want to create
: hash based on url+title and send it to another field name "signature". If I
: do that using solr dedup, it will be resulted to deleting duplicate
: documents! So it is not applicable for my situation. Thank you very much.
: Best regards.
:
: --
: A.Nazemian
:
-Hoss
http://www.lucidworks.com/
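
To make the idea in the message above concrete, here is a minimal Python
sketch of what "marking" duplicates with a separate signature field amounts
to: every document keeps its own unique key, gets a hash of url+title in a
"signature" field, and documents sharing a signature can be found later.
This is not how Solr computes its signature; the helper names, the use of
MD5, and the separator byte between fields are assumptions made for the
example.

```python
import hashlib
from collections import defaultdict


def signature(doc, fields=("url", "title")):
    """Hash the chosen fields in order, with a separator after each value."""
    h = hashlib.md5()
    for name in fields:
        h.update(str(doc.get(name, "")).encode("utf-8"))
        h.update(b"\x00")  # keeps ("ab", "c") distinct from ("a", "bc")
    return h.hexdigest()


def mark_duplicates(docs):
    """Set doc["signature"] on every doc; return signatures shared by 2+ ids."""
    ids_by_signature = defaultdict(list)
    for doc in docs:
        doc["signature"] = signature(doc)
        ids_by_signature[doc["signature"]].append(doc["id"])
    return {sig: ids for sig, ids in ids_by_signature.items() if len(ids) > 1}


docs = [
    {"id": "1", "url": "http://example.com/a", "title": "A"},
    {"id": "2", "url": "http://example.com/a", "title": "A"},
    {"id": "3", "url": "http://example.com/b", "title": "B"},
]
duplicates = mark_duplicates(docs)
print(list(duplicates.values()))
print(len(docs))
```

Because the signature lives in its own field and not in the unique key, no
document replaces another; all three documents are still present after
marking.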