Posted to solr-user@lucene.apache.org by "Markus.Mirsberger" <ma...@gmx.de> on 2015/06/17 11:48:52 UTC

Dedupe in a SolrCloud

Hi,

I am trying to use the dedupe feature to detect and mark near-duplicate 
content in my collections.
I don't want to prevent duplicate content. I would like to detect it and 
keep it for further processing. That's why I'm using an extra field and 
not the document's unique field.

Here is how I added it to solrconfig.xml:

      <requestHandler name="/update" class="solr.UpdateRequestHandler">
            <lst name="defaults">
                  <str name="update.chain">fill_signature</str>
            </lst>
      </requestHandler>

      <updateRequestProcessorChain name="fill_signature" processor="signature">
            <processor class="solr.RunUpdateProcessorFactory" />
      </updateRequestProcessorChain>

      <updateProcessor class="solr.processor.SignatureUpdateProcessorFactory" name="signature">
            <bool name="enabled">true</bool>
            <str name="signatureField">signature</str>
            <bool name="overwriteDupes">false</bool>
            <str name="fields">content</str>
            <str name="signatureClass">solr.processor.TextProfileSignature</str>
            <str name="quantRate">.2</str>
            <str name="minTokenLen">3</str>
      </updateProcessor>

When I initially add the documents to the cloud, everything works as 
expected: the documents are added and the signature is created and 
added. Perfect :)
The problem occurs when I want to update an existing document. In that 
case the update.chain=fill_signature parameter is of course set too, 
and I get a bad request error.

I found this Solr issue: https://issues.apache.org/jira/browse/SOLR-3473

Is that the problem I am running into?
Is it somehow possible to add parameters or select a specific update 
handler when I'm adding documents to the cloud using SolrJ?
In that case I could either set update.chain manually and remove it 
from the request handler, or write a second request handler which I only 
use when I want to set the signature field.
I know I can do that manually when I'm using e.g. curl, but is it also 
possible with SolrJ? :)


Thanks,
Markus
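
The second option mentioned above (a separate request handler used only when the signature should be set) could look roughly like the sketch below. It reuses the chain name from the post; the handler path /update/signature is a hypothetical name chosen for illustration, not something from the thread.

```xml
<!-- Regular updates: no update.chain default, so no signature is computed -->
<requestHandler name="/update" class="solr.UpdateRequestHandler" />

<!-- Hypothetical second handler (the path is an assumption): send documents
     here only when the signature field should be filled -->
<requestHandler name="/update/signature" class="solr.UpdateRequestHandler">
      <lst name="defaults">
            <str name="update.chain">fill_signature</str>
      </lst>
</requestHandler>
```

With a setup like this, the initial indexing would go to the second handler and later updates to /update.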





Re: Dedupe in a SolrCloud

Posted by Markus Mirsberger <ma...@gmx.de>.
Thanks :)
That is exactly what I was looking for. As I only need to create the signature once, this works perfectly for me :)

Cheers,
Markus 


> On 17.06.2015, at 20:32, Shalin Shekhar Mangar <sh...@gmail.com> wrote:
> [...]

Re: Dedupe in a SolrCloud

Posted by Shalin Shekhar Mangar <sh...@gmail.com>.
Comments inline:

On Wed, Jun 17, 2015 at 3:18 PM, Markus.Mirsberger
<ma...@gmx.de> wrote:
> Hi,
>
> I am trying to use the dedupe feature to detect and mark near-duplicate
> content in my collections.
> I don't want to prevent duplicate content. I would like to detect it and keep
> it for further processing. That's why I'm using an extra field and not the
> document's unique field.
>
> Here is how I added it to solrconfig.xml:
>
>      <requestHandler name="/update" class="solr.UpdateRequestHandler">
>            <lst name="defaults">
>                  <str name="update.chain">fill_signature</str>
>            </lst>
>      </requestHandler>
>
>      <updateRequestProcessorChain name="fill_signature" processor="signature">
>            <processor class="solr.RunUpdateProcessorFactory" />
>      </updateRequestProcessorChain>
>
>      <updateProcessor class="solr.processor.SignatureUpdateProcessorFactory" name="signature">
>            <bool name="enabled">true</bool>
>            <str name="signatureField">signature</str>
>            <bool name="overwriteDupes">false</bool>
>            <str name="fields">content</str>
>            <str name="signatureClass">solr.processor.TextProfileSignature</str>
>            <str name="quantRate">.2</str>
>            <str name="minTokenLen">3</str>
>      </updateProcessor>
>
> When I initially add the documents to the cloud, everything works as expected:
> the documents are added and the signature is created and added. Perfect :)
> The problem occurs when I want to update an existing document. In that
> case the update.chain=fill_signature parameter is of course set too, and
> I get a bad request error.
>
> I found this Solr issue: https://issues.apache.org/jira/browse/SOLR-3473
>
> Is that the problem I am running into?

You haven't pasted the complete error response, so I am guessing a bit
here. It is possible that you are running into the same problem, i.e.
the "signature" is calculated again and, because the signature field is
not multi-valued, this causes an error.
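
To make that guess concrete: the schema was not posted in this thread, so the following is only an assumed declaration of a single-valued signature field in schema.xml. The field name comes from the signatureField setting above; the string type and the attributes are assumptions.

```xml
<!-- Assumed declaration; the actual schema was not posted in this thread -->
<field name="signature" type="string" indexed="true" stored="true" multiValued="false" />
```

If the processor ran a second time for the same document, it would try to add a second value, which a single-valued field like this would reject.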

> Is it somehow possible to add parameters or select a specific update handler
> when I'm adding documents to the cloud using SolrJ?

Yes, any custom parameter can be added to a SolrJ request. There is a
setParam(String param, String value) method available in
AbstractUpdateRequest which can be used to set a custom update.chain
for each SolrJ request.
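
On the config side, selecting the chain per request means the handler should no longer apply it by default. A minimal sketch, assuming the chain and processor names from the original post: a request that passes update.chain=fill_signature gets a signature, while a request without that parameter uses whatever chain Solr treats as the default.

```xml
<!-- No update.chain default here: the chain is chosen per request -->
<requestHandler name="/update" class="solr.UpdateRequestHandler" />

<!-- Applied only when a request sets update.chain=fill_signature -->
<updateRequestProcessorChain name="fill_signature" processor="signature">
      <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```

The updateProcessor definition named "signature" stays as posted.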

> In that case I could either set update.chain manually and remove it from
> the request handler, or write a second request handler which I only use when
> I want to set the signature field.
> I know I can do that manually when I'm using e.g. curl, but is it also possible
> with SolrJ? :)
>
>
> Thanks,
> Markus



-- 
Regards,
Shalin Shekhar Mangar.