Posted to solr-user@lucene.apache.org by Bram Van Dam <br...@intix.eu> on 2015/05/19 11:02:13 UTC

Deduplication

Hi folks,

I'm looking for a way to have Solr reject documents if a certain field
value is duplicated (reject, not overwrite). There doesn't seem to be
any kind of unique option in schema fields.

The de-duplication feature seems to make this (somewhat) possible, but I
would like it to provide the unique value myself, without having the
deduplicator create a hash of field values.

Am I missing an obvious (or less obvious) way of accomplishing this?

Thanks,

 - Bram

Re: Deduplication

Posted by Shalin Shekhar Mangar <sh...@gmail.com>.
On Wed, May 20, 2015 at 12:59 PM, Bram Van Dam <br...@intix.eu> wrote:

> >> Write a custom update processor and include it in your update chain.
> >> You will then have the ability to do anything you want with the entire
> >> input document before it hits the code to actually do the indexing.
>
> This sounded like the perfect option ... until I read Jack's comment:
>
> >
> > My understanding was that the distributed update processor is near the end
> > of the chain, so that running of user update processors occurs before the
> > distribution step, but is that distribution to the leader, or distribution
> > from leader to replicas for a shard?
>
> That would pose some potential problems.
>
> Would a custom update processor make the solution "cloud-safe"?
>

Starting with Solr 5.1, you can specify update processors on the fly as
request parameters, and you can even control whether they are executed
before any distribution happens or just before the document is actually
indexed on the replica.

e.g. you can specify processor=xyz,MyCustomUpdateProc in the request to
have processor xyz run first and then MyCustomUpdateProc and then the
default update processor chain (which will also distribute the doc to the
leader or from the leader to a replica). This also means that such
processors will not be executed on the replicas at all.

You can also specify post-processor=xyz,MyCustomUpdateProc to have xyz and
MyCustomUpdateProc run on each replica (including the leader) right
before the doc is indexed (i.e. just before RunUpdateProcessor).
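
As a rough illustration of the two request parameters described above, the
following Java sketch builds update URLs that carry them. Only the parameter
names processor and post-processor and the processor names xyz and
MyCustomUpdateProc come from this message; the host, port, collection name
and the UpdateUrlSketch helper are assumptions made for the example.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Solr or SolrJ: it only assembles a URL string.
class UpdateUrlSketch {

    // Joins the processor names with commas and appends them as one request parameter.
    static String updateUrl(String collectionUrl, String paramName, String... processors) {
        String value = URLEncoder.encode(String.join(",", processors), StandardCharsets.UTF_8);
        return collectionUrl + "/update?" + paramName + "=" + value;
    }

    public static void main(String[] args) {
        // Assumed local Solr instance and collection name.
        String collection = "http://localhost:8983/solr/mycollection";

        // Per the message above: runs before the distribution step, not on the replicas.
        System.out.println(updateUrl(collection, "processor", "xyz", "MyCustomUpdateProc"));

        // Per the message above: runs on each replica (including the leader) before indexing.
        System.out.println(updateUrl(collection, "post-processor", "xyz", "MyCustomUpdateProc"));
    }
}
```

The comma between the processor names is percent-encoded as %2C by
URLEncoder, which is the standard encoding for a comma inside a query
parameter value.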

Unfortunately, due to an oversight, this feature hasn't been documented
well, which is something I'll fix. See
https://issues.apache.org/jira/browse/SOLR-6892 for more details.


>
> Thx,
>
>  - Bram
>
>


-- 
Regards,
Shalin Shekhar Mangar.

Re: Deduplication

Posted by Bram Van Dam <br...@intix.eu>.
>> Write a custom update processor and include it in your update chain.
>> You will then have the ability to do anything you want with the entire
>> input document before it hits the code to actually do the indexing.

This sounded like the perfect option ... until I read Jack's comment:

>
> My understanding was that the distributed update processor is near the end
> of the chain, so that running of user update processors occurs before the
> distribution step, but is that distribution to the leader, or distribution
> from leader to replicas for a shard?

That would pose some potential problems.

Would a custom update processor make the solution "cloud-safe"?

Thx,

 - Bram


Re: Deduplication

Posted by Jack Krupansky <ja...@gmail.com>.
Shawn, I was going to say the same thing, but... then I was thinking about
SolrCloud and the fact that update processors are invoked before the
document is sent to its target node, so there wouldn't be a reliable way to
tell if the input document field value exists on the target rather than
the current node.

Or does the update processing only occur on the leader node after being
forwarded from the originating node? Is the doc clear on this detail?

My understanding was that the distributed update processor is near the end
of the chain, so that running of user update processors occurs before the
distribution step, but is that distribution to the leader, or distribution
from leader to replicas for a shard?
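
The concern can be made concrete with a toy model. The sketch below is not
Solr code: the ToyCluster class, its simplified hash routing and the sample
values are all assumptions for illustration. It shows that a lookup limited
to the node that received the request can miss a field value that is already
indexed on another shard.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy stand-in for a sharded index; none of these names exist in Solr.
class ToyCluster {
    // One set of indexed field values per shard.
    private final List<Set<String>> shards = new ArrayList<>();

    ToyCluster(int shardCount) {
        for (int i = 0; i < shardCount; i++) {
            shards.add(new HashSet<>());
        }
    }

    // Simplified routing on the document id (a stand-in for real hash routing).
    int ownerOf(String docId) {
        return Math.floorMod(docId.hashCode(), shards.size());
    }

    // Stores the value of the "unique" field on the shard that owns the document.
    void index(String docId, String fieldValue) {
        shards.get(ownerOf(docId)).add(fieldValue);
    }

    // What a check limited to one node's local index can see.
    boolean seenOnShard(int shard, String fieldValue) {
        return shards.get(shard).contains(fieldValue);
    }

    // What a cluster-wide lookup would see.
    boolean seenAnywhere(String fieldValue) {
        for (Set<String> shard : shards) {
            if (shard.contains(fieldValue)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        ToyCluster cluster = new ToyCluster(2);
        cluster.index("doc-1", "INV-42");
        int other = 1 - cluster.ownerOf("doc-1");
        System.out.println("local check on the other shard: " + cluster.seenOnShard(other, "INV-42"));
        System.out.println("cluster-wide check: " + cluster.seenAnywhere("INV-42"));
    }
}
```

Because the routing here depends on the document id and not on the field
being checked, two documents with the same field value can land on
different shards in this model.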


-- Jack Krupansky

On Tue, May 19, 2015 at 9:01 AM, Shawn Heisey <ap...@elyograg.org> wrote:

> On 5/19/2015 3:02 AM, Bram Van Dam wrote:
> > I'm looking for a way to have Solr reject documents if a certain field
> > value is duplicated (reject, not overwrite). There doesn't seem to be
> > any kind of unique option in schema fields.
> >
> > The de-duplication feature seems to make this (somewhat) possible, but I
> > would like it to provide the unique value myself, without having the
> > deduplicator create a hash of field values.
> >
> > Am I missing an obvious (or less obvious) way of accomplishing this?
>
> Write a custom update processor and include it in your update chain.
> You will then have the ability to do anything you want with the entire
> input document before it hits the code to actually do the indexing.
>
> A script update processor included with Solr allows you to write your
> processor in a language other than Java, such as JavaScript.
>
>
> https://lucene.apache.org/solr/5_1_0/solr-core/org/apache/solr/update/processor/StatelessScriptUpdateProcessorFactory.html
>
> Here's how to discard a document in an update processor written in Java:
>
>
> http://stackoverflow.com/questions/27108200/how-to-cancel-indexing-of-a-solr-document-using-update-request-processor
>
> The javadoc that I linked above describes the ability to return "false"
> in other languages to discard the document.
>
> Thanks,
> Shawn
>
>

Re: Deduplication

Posted by Shawn Heisey <ap...@elyograg.org>.
On 5/19/2015 3:02 AM, Bram Van Dam wrote:
> I'm looking for a way to have Solr reject documents if a certain field
> value is duplicated (reject, not overwrite). There doesn't seem to be
> any kind of unique option in schema fields.
> 
> The de-duplication feature seems to make this (somewhat) possible, but I
> would like it to provide the unique value myself, without having the
> deduplicator create a hash of field values.
> 
> Am I missing an obvious (or less obvious) way of accomplishing this?

Write a custom update processor and include it in your update chain.
You will then have the ability to do anything you want with the entire
input document before it hits the code to actually do the indexing.

A script update processor included with Solr allows you to write your
processor in a language other than Java, such as JavaScript.

https://lucene.apache.org/solr/5_1_0/solr-core/org/apache/solr/update/processor/StatelessScriptUpdateProcessorFactory.html

Here's how to discard a document in an update processor written in Java:

http://stackoverflow.com/questions/27108200/how-to-cancel-indexing-of-a-solr-document-using-update-request-processor

The javadoc that I linked above describes the ability to return "false"
in other languages to discard the document.
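
The core of such a processor is a reject-instead-of-overwrite check. The
following stand-alone Java sketch models only that check, without any Solr
classes: UniqueFieldGuard, the in-memory set and the field name invoice_no
are hypothetical. In a real processor the same decision would presumably be
made in the processAdd method of an UpdateRequestProcessor subclass, with
the lookup going against the index instead of a set.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical model of a unique-constraint check; it does not use Solr's API.
class UniqueFieldGuard {
    private final String fieldName;
    // Values already accepted; stands in for a lookup against the index.
    private final Set<String> seen = ConcurrentHashMap.newKeySet();

    UniqueFieldGuard(String fieldName) {
        this.fieldName = fieldName;
    }

    // Accepts the document, or throws if the field value was seen before.
    void accept(Map<String, ?> doc) {
        Object value = doc.get(fieldName);
        if (value == null) {
            throw new IllegalArgumentException("missing field: " + fieldName);
        }
        // add() returns false when the value is already present: reject, never overwrite.
        if (!seen.add(value.toString())) {
            throw new IllegalStateException("duplicate " + fieldName + ": " + value);
        }
    }

    public static void main(String[] args) {
        UniqueFieldGuard guard = new UniqueFieldGuard("invoice_no");
        guard.accept(Map.of("id", "1", "invoice_no", "INV-42"));
        try {
            guard.accept(Map.of("id", "2", "invoice_no", "INV-42"));
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Throwing an exception, as opposed to silently dropping the document, is
what lets the client learn that a duplicate was sent.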

Thanks,
Shawn


Re: Deduplication

Posted by Alessandro Benedetti <be...@gmail.com>.
What Solr de-duplication offers you is the calculation of a hash (based
on a set of fields) for each input document.
You can then select one of two options:
 - index everything; documents with the same signature are treated as equal
 - avoid the overwriting of duplicates.

How the similarity hash is calculated is something you can play with and
customise if needed.
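
To make the idea of a signature concrete, here is a minimal Java sketch that
hashes a chosen set of fields with MD5. It is an illustration under stated
assumptions, not Solr's implementation: the SignatureSketch class, the field
names and the way the values are fed into the digest are invented for the
example.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;
import java.util.Map;

// Hypothetical signature calculation over a configurable list of fields.
class SignatureSketch {

    // Returns a 32-character hex MD5 digest of the selected fields, in the given order.
    static String signature(Map<String, String> doc, List<String> fields) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            for (String field : fields) {
                String value = doc.getOrDefault(field, "");
                md5.update(field.getBytes(StandardCharsets.UTF_8));
                md5.update((byte) 0); // separator, so adjacent values cannot run together
                md5.update(value.getBytes(StandardCharsets.UTF_8));
                md5.update((byte) 0);
            }
            return String.format("%032x", new BigInteger(1, md5.digest()));
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        List<String> fields = List.of("name", "address");
        Map<String, String> a = Map.of("id", "1", "name", "Ann", "address", "Main St");
        Map<String, String> b = Map.of("id", "2", "name", "Ann", "address", "Main St");
        // The id is not part of the signature, so both documents hash identically.
        System.out.println(signature(a, fields).equals(signature(b, fields)));
    }
}
```

Only the listed fields contribute to the signature, which is why two
documents that differ in other fields still count as duplicates here.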

With that clarified, do you think it can fit in some way, or are you
definitely not talking about dedupe?

2015-05-20 8:37 GMT+01:00 Bram Van Dam <br...@intix.eu>:

> On 19/05/15 14:47, Alessandro Benedetti wrote:
> > Hi Bram,
> > what do you mean with :
> > "  I
> > would like it to provide the unique value myself, without having the
> > deduplicator create a hash of field values " .
> >
> > This is not de-duplication, but simple document filtering based on a
> > constraint.
> > In the case you want de-duplication ( which seemed from your very first
> > part of the mail) here you can find a lot of info :
>
> Not sure whether de-duplication is the right word for what I'm after, I
> essentially want a unique constraint on an arbitrary field. Without
> overwrite semantics, because I want Solr to tell me if a duplicate is
> sent to Solr.
>
> I was thinking that the de-duplication feature could accomplish this
> somehow.
>
>
>  - Bram
>



-- 
--------------------------

Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England

Re: Deduplication

Posted by Bram Van Dam <br...@intix.eu>.
On 19/05/15 14:47, Alessandro Benedetti wrote:
> Hi Bram,
> what do you mean with :
> "  I
> would like it to provide the unique value myself, without having the
> deduplicator create a hash of field values " .
> 
> This is not de-duplication, but simple document filtering based on a
> constraint.
> In the case you want de-duplication ( which seemed from your very first
> part of the mail) here you can find a lot of info :

Not sure whether de-duplication is the right word for what I'm after, I
essentially want a unique constraint on an arbitrary field. Without
overwrite semantics, because I want Solr to tell me if a duplicate is
sent to Solr.

I was thinking that the de-duplication feature could accomplish this
somehow.


 - Bram

Re: Deduplication

Posted by Alessandro Benedetti <be...@gmail.com>.
Hi Bram,
what do you mean by:
"I would like it to provide the unique value myself, without having the
deduplicator create a hash of field values"?

This is not de-duplication, but simple document filtering based on a
constraint.
In case you want de-duplication (which the first part of your mail
seemed to suggest), you can find a lot of info here:

https://cwiki.apache.org/confluence/display/solr/De-Duplication

Let me know if you have more detailed requirements!

2015-05-19 10:02 GMT+01:00 Bram Van Dam <br...@intix.eu>:

> Hi folks,
>
> I'm looking for a way to have Solr reject documents if a certain field
> value is duplicated (reject, not overwrite). There doesn't seem to be
> any kind of unique option in schema fields.
>
> The de-duplication feature seems to make this (somewhat) possible, but I
> would like it to provide the unique value myself, without having the
> deduplicator create a hash of field values.
>
> Am I missing an obvious (or less obvious) way of accomplishing this?
>
> Thanks,
>
>  - Bram
>


